[
https://issues.apache.org/jira/browse/PROTON-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Clifford Jansen reassigned PROTON-2928:
---------------------------------------
Assignee: Clifford Jansen
> Certain network conditions cause proton container to hang during connect phase
> ------------------------------------------------------------------------------
>
> Key: PROTON-2928
> URL: https://issues.apache.org/jira/browse/PROTON-2928
> Project: Qpid Proton
> Issue Type: Bug
> Components: cpp-binding, proton-c
> Affects Versions: proton-c-0.40.0
> Environment: Ubuntu 22.04
> Reporter: Bostjan Polanc
> Assignee: Clifford Jansen
> Priority: Critical
> Attachments: Dockerfile
>
>
> Running a proton container when there is no route to the host will randomly
> hang the container during the connection attempt. The container does not
> stop, no exception is thrown, and no callback is called. During the
> connection attempt, CPU usage spikes to 100%.
>
> Steps to reproduce:
> 1) Add a firewall rule so that access to the server is rejected with an ICMP
> error:
> iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable
> 2) Modify the C++ helloworld sample so that it makes several attempts to run
> the container (while true...).
> Sample cpp:
> [https://github.com/bospol624/qpid-proton/blob/main/cpp/examples/helloworld.cpp]
> 3) Run helloworld. The exception thrown on each iteration should be
> "proton:io: No route to host - disconnected".
> After a random number of iterations, the program hangs. Note that while it is
> still iterating, CPU load is at 100% whenever the proton container is running.
>
> I attached a Dockerfile that sets up this test (clean build + helloworld
> sample). The firewall rule still needs to be applied on the host machine.
>
> My comment on the issue from [email protected]:
>
> Using a debugger, I traced this to c/src/proactor/epoll.c.
> Inside the pconnection_process function there seems to be a race condition
> on which is called first: recv or send.
> If send is called first via write_flush (line 1298), it fails and the
> psocket_error function is called. Among other things, psocket_error adds a
> PN_TRANSPORT_CLOSED event to the collector.
> However, this event batch is never returned to the calling functions
> (next_event_batch/process), because pconnection_process no longer returns an
> event batch after the write_flush call.
> Because there is no new trigger from the FD and the event batch was not
> returned, next_event_batch just loops inside poller_do_epoll.
> For my project I made a temporary patch (epoll.c, pconnection_process, line
> 1298), adding a check right after the existing write_flush call:
> write_flush(pc);
> if (pconnection_has_event(pc)) {
>   pc->output_drained = false;
>   return &pc->batch;
> }
> In other words, it adds a check for any events that write_flush generated.
>
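The lost-event scenario described above can be illustrated with a small toy
model. This is only a sketch under stated assumptions, not the proactor code:
toy_conn, toy_write_flush, toy_process and toy_run are hypothetical names, and
the model assumes the FD produces exactly one wakeup for the failed connect
and never fires again. It does not model pc->output_drained.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not a proton-c symbol. */
typedef struct {
    int  pending_events;  /* events queued in the (modelled) collector */
    bool socket_broken;   /* true until the failed send has been reported */
} toy_conn;

/* Stand-in for write_flush(): a failed send queues one "closed" event,
 * the way psocket_error is described as adding PN_TRANSPORT_CLOSED. */
static void toy_write_flush(toy_conn *c) {
    if (c->socket_broken) {
        c->pending_events++;
        c->socket_broken = false;  /* the error is reported only once */
    }
}

/* Stand-in for pconnection_process(): returns true when an event batch is
 * handed back to the caller. recheck_after_flush models the proposed patch. */
static bool toy_process(toy_conn *c, bool recheck_after_flush) {
    if (c->pending_events > 0) return true;  /* events queued before any I/O */
    toy_write_flush(c);
    if (recheck_after_flush && c->pending_events > 0) return true;
    return false;  /* any event queued by the flush stays undelivered */
}

/* Runs a bounded poll loop. Returns the iteration at which the event was
 * delivered, or -1 if it was never delivered within max_spins iterations. */
static int toy_run(bool recheck_after_flush, int max_spins) {
    assert(max_spins > 0);
    toy_conn c = { .pending_events = 0, .socket_broken = true };
    int triggers = 1;  /* assumption: a single FD wakeup, then silence */
    for (int spin = 0; spin < max_spins; spin++) {
        if (triggers > 0) {
            triggers--;
            if (toy_process(&c, recheck_after_flush)) return spin;
        }
        /* no trigger: the modelled poller just keeps looping */
    }
    return -1;
}
```

In this model, toy_run(false, 1000) never delivers the queued event and
returns -1, while toy_run(true, 1000) delivers it on the first iteration. That
mirrors the reported behaviour only to the extent that the single-wakeup
assumption holds in the real proactor.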
--
This message was sent by Atlassian Jira
(v8.20.10#820010)