Bostjan Polanc created PROTON-2928:
--------------------------------------
Summary: Certain network conditions cause proton container to hang
during connect phase
Key: PROTON-2928
URL: https://issues.apache.org/jira/browse/PROTON-2928
Project: Qpid Proton
Issue Type: Bug
Components: cpp-binding, proton-c
Affects Versions: proton-c-0.40.0
Environment: Ubuntu 22.04
Reporter: Bostjan Polanc
Attachments: Dockerfile
Running a Proton container when there is no route to the host will
intermittently hang the container during the connection attempt. The container
does not stop, no exception is thrown, and no callback is called. During the
connection attempt, CPU usage spikes to 100%.
Steps to reproduce:
1) Add a firewall rule so that access to the server is rejected with an ICMP
error:
   iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable
2) Modify the C++ helloworld sample so that it makes several attempts to run
the container (while true...).
Sample cpp:
https://github.com/bospol624/qpid-proton/blob/main/cpp/examples/helloworld.cpp
3) Run helloworld. Each iteration should throw an exception: "proton:io: No
route to host - disconnected".
After a random number of iterations, the program stops iterating (the container
hangs). Note that while it is still iterating, the CPU load is at 100% whenever
the Proton container is running.
I attached a Dockerfile that sets up this test (clean build + helloworld
sample). The firewall rule still needs to be applied on the host machine.
My comment on the issue from [email protected]:
Using a debugger, I traced this to c/src/proactor/epoll.c.
Inside the pconnection_process function there seems to be a race condition
over which is called first: recv or send.
If send is called first via write_flush (line 1298), it fails and the
psocket_error function is called. Among other things, psocket_error adds a
PN_TRANSPORT_CLOSED event to the collector.
However, this event batch is never returned to the calling functions
(next_event_batch/process), because pconnection_process no longer returns
the event batch after the write_flush call.
Because there is no new trigger from the FD and the event batch was not
returned, next_event_batch just loops inside poller_do_epoll.
For my project I made a temporary patch (epoll.c, pconnection_process, line
1298):
    write_flush(pc);
    if (pconnection_has_event(pc)) {
      pc->output_drained = false;
      return &pc->batch;
    }
So basically it just adds one more check after write_flush, in case
write_flush generated any events.
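
To illustrate why the extra check matters, here is a minimal toy model in C.
It is not Proton code and does not link against Proton: toy_conn,
toy_write_flush, toy_has_event and toy_process are hypothetical stand-ins
for the connection state, write_flush, pconnection_has_event and
pconnection_process, and the control flow is heavily simplified. It only
assumes the behaviour described above, namely that a failed send queues an
event and that nothing is returned to the caller after the flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the connection state; not Proton's real type. */
typedef struct {
    int  pending_events;  /* events waiting in the "collector" */
    bool socket_broken;   /* true if the next send will fail */
    bool output_drained;  /* mirrors the flag reset by the patch */
} toy_conn;

/* Models write_flush(): a failed send queues one event, the way
 * psocket_error is described as queuing PN_TRANSPORT_CLOSED. */
static void toy_write_flush(toy_conn *c) {
    if (c->socket_broken) {
        c->pending_events++;
    }
}

/* Models pconnection_has_event(): is anything queued for the caller? */
static bool toy_has_event(const toy_conn *c) {
    return c->pending_events > 0;
}

/* Models one pass of pconnection_process(). Returns true if an event
 * batch is handed back to the caller, false if the caller is left to
 * wait for another FD trigger. */
static bool toy_process(toy_conn *c, bool recheck_after_flush) {
    assert(c != NULL);
    if (toy_has_event(c)) {
        return true;              /* events were already queued before the flush */
    }
    toy_write_flush(c);           /* may queue an event on failure */
    if (recheck_after_flush && toy_has_event(c)) {
        c->output_drained = false;
        return true;              /* patched path: hand the new event back */
    }
    return false;                 /* unpatched path: the queued event is stranded */
}
```

In this model, calling toy_process(&c, false) on a connection with
socket_broken set leaves one event queued but returns false, which
corresponds to the hang. Calling toy_process(&c, true) returns true, so the
caller gets to see the event.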