[ https://issues.apache.org/jira/browse/PROTON-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071655#comment-18071655 ]
Bostjan Polanc commented on PROTON-2928:
----------------------------------------
Hello,
Thank you for the quick response. I tested your fix and can verify that it
solves my issue.
I also investigated the cause of the high CPU usage:
In my test setup, the EPOLLERR event is triggered very frequently. Each time,
pconnection_process is called, and it simply loops until EPOLLIN, EPOLLOUT,
or both are eventually set. If EPOLLOUT was set first, the reported issue
was observed.
It is not yet clear to me why this behavior is reproducible on some systems
and not on others. In any case, I think that once EPOLLERR is received, the
FD/connection should be closed immediately.
Thank you, BR.
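The "close immediately on EPOLLERR" suggestion above can be illustrated with a minimal sketch. It is written against the plain Linux socket and epoll APIs, not against Proton's proactor, and it is not a description of how Proton's actual fix works. The helper names (pending_socket_error, connect_and_wait) are hypothetical. The sketch starts a non-blocking connect and waits once. If EPOLLERR or EPOLLHUP is reported, it reads the pending error with getsockopt(SO_ERROR) and closes the fd right away, so the socket is never polled again in a loop.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read (and clear) the error pending on a socket, e.g. ECONNREFUSED or
 * EHOSTUNREACH. Falls back to errno if getsockopt itself fails. */
static int pending_socket_error(int fd) {
    int err = 0;
    socklen_t len = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;
    return err;
}

/* Hypothetical helper (not Proton API): start a non-blocking connect and wait
 * for its outcome once. EPOLLERR/EPOLLHUP are treated as terminal: the error is
 * read and the fd is closed immediately instead of being re-polled.
 * Returns 0 and sets *fd_out on success; otherwise returns an errno value and
 * leaves *fd_out at -1. */
static int connect_and_wait(const struct sockaddr_in *addr, int timeout_ms, int *fd_out) {
    assert(addr != NULL && fd_out != NULL);
    *fd_out = -1;

    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0) return errno;

    if (connect(fd, (const struct sockaddr *)addr, sizeof *addr) == 0) {
        *fd_out = fd;                 /* connected immediately */
        return 0;
    }
    if (errno != EINPROGRESS) {       /* failure reported synchronously */
        int e = errno;
        close(fd);
        return e;
    }

    int epfd = epoll_create1(0);
    if (epfd < 0) { int e = errno; close(fd); return e; }

    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLOUT;             /* EPOLLERR/EPOLLHUP are always reported */
    ev.data.fd = fd;

    int err = 0;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        err = errno;
    } else {
        int n = epoll_wait(epfd, &ev, 1, timeout_ms);
        if (n < 0)
            err = errno;
        else if (n == 0)
            err = ETIMEDOUT;
        else if (ev.events & (EPOLLERR | EPOLLHUP)) {
            err = pending_socket_error(fd);    /* terminal: do not poll again */
            if (err == 0) err = ECONNABORTED;  /* hangup with no recorded error */
        } else {
            err = pending_socket_error(fd);    /* EPOLLOUT alone: 0 = connected */
        }
    }
    close(epfd);

    if (err != 0) {
        close(fd);                    /* close immediately on any failure */
        return err;
    }
    *fd_out = fd;
    return 0;
}
```

With a destination that rejects the connection, connect_and_wait should return a single error code with the fd already closed. That error could be a local port that nothing listens on (ECONNREFUSED), or an ICMP host-unreachable reply like the one produced by the firewall rule in the issue (EHOSTUNREACH, "No route to host").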
> Certain network conditions cause proton container to hang during connect phase
> ------------------------------------------------------------------------------
>
> Key: PROTON-2928
> URL: https://issues.apache.org/jira/browse/PROTON-2928
> Project: Qpid Proton
> Issue Type: Bug
> Components: cpp-binding, proton-c
> Affects Versions: proton-c-0.40.0
> Environment: Ubuntu 22.04
> Reporter: Bostjan Polanc
> Assignee: Clifford Jansen
> Priority: Critical
> Attachments: Dockerfile
>
>
> Running a proton container when the route to the host is not known will
> randomly hang the container during the connection attempt. The container does
> not stop, no exception is thrown, and no callback is called. During the
> connection attempt, CPU usage spikes to 100%.
>
> Steps to reproduce:
> 1) Add a firewall rule so that access to the server is blocked with an ICMP error:
> iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable
> 2) Modify the C++ helloworld sample so that it runs the container repeatedly
> (e.g. in a while (true) loop).
> Sample cpp:
> [https://github.com/bospol624/qpid-proton/blob/main/cpp/examples/helloworld.cpp]
> 3) Run helloworld. The exception thrown on each iteration should be
> "proton:io: No route to host - disconnected".
> After a random number of iterations, the program stops (hangs). Note that
> while it is still iterating, CPU load is at 100% whenever the proton
> container is running.
>
> I attached a Dockerfile that sets up this test (clean build + helloworld
> sample). The firewall rule on the host machine still needs to be applied.
>
> My comment on the issue from [email protected]:
>
> Using a debugger, I traced this to c/src/proactor/epoll.c.
> Inside the pconnection_process function there seems to be a race condition
> on which one is called first: recv or send.
> If send is called first via write_flush (line 1298), it fails and the
> psocket_error function is called. Among other things, psocket_error adds a
> PN_TRANSPORT_CLOSED event to the collector.
> However, this event batch is never returned to the calling functions
> (next_event_batch/process), because the code path after the write_flush
> call no longer returns the event batch.
> Because there is no new trigger from the FD and the event batch was not
> returned, next_event_batch just loops inside poller_do_epoll.
> For my project I made a temporary patch (epoll.c, pconnection_process, line 1298):
>     write_flush(pc);
>     if (pconnection_has_event(pc)) {  /* added check */
>       pc->output_drained = false;
>       return &pc->batch;
>     }
> So the patch simply adds a check for any events that write_flush may have
> generated, and returns the batch if there are any.
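The control flow described in the issue can be illustrated with a toy model. Everything below (toy_pconn, toy_write_flush, toy_process_unpatched, toy_process_patched) is hypothetical and only mirrors the names used in this report. It is not Proton's code, and the real pconnection_process does much more. The model shows two cases. In the first, an event queued by a failed flush is lost when the function does not return the batch afterwards. In the second, the check from the temporary patch hands that event back to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical, NOT Proton's real types) of the PROTON-2928 flow. */
#define TOY_MAX_EVENTS 4

enum toy_event { TOY_TRANSPORT_CLOSED = 1 };

struct toy_batch {
    enum toy_event events[TOY_MAX_EVENTS];
    size_t count;
};

struct toy_pconn {
    struct toy_batch batch;   /* events waiting to be handed to the caller */
    bool output_drained;
    bool send_fails;          /* simulates send() failing, e.g. EHOSTUNREACH */
};

static bool toy_has_event(const struct toy_pconn *pc) {
    return pc->batch.count > 0;
}

/* Stand-in for psocket_error(): queues a transport-closed event. */
static void toy_socket_error(struct toy_pconn *pc) {
    assert(pc->batch.count < TOY_MAX_EVENTS);
    pc->batch.events[pc->batch.count++] = TOY_TRANSPORT_CLOSED;
}

/* Stand-in for write_flush(): a failed send is reported only via an event. */
static void toy_write_flush(struct toy_pconn *pc) {
    if (pc->send_fails)
        toy_socket_error(pc);
    else
        pc->output_drained = true;
}

/* Without the check: the event queued by the flush is never returned, so the
 * caller goes back to waiting on an fd that will not wake it again. */
static struct toy_batch *toy_process_unpatched(struct toy_pconn *pc) {
    toy_write_flush(pc);
    return NULL;
}

/* With the reporter's check: hand back the batch if the flush produced events. */
static struct toy_batch *toy_process_patched(struct toy_pconn *pc) {
    toy_write_flush(pc);
    if (toy_has_event(pc)) {
        pc->output_drained = false;
        return &pc->batch;
    }
    return NULL;
}
```

In the unpatched variant the TRANSPORT_CLOSED event sits in the batch but the caller receives nothing. In the patched variant the same event is delivered, which is what lets the container report "No route to host" instead of hanging.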
--
This message was sent by Atlassian Jira
(v8.20.10#820010)