[
https://issues.apache.org/jira/browse/PROTON-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071617#comment-18071617
]
Clifford Jansen commented on PROTON-2928:
-----------------------------------------
Thank you for the excellent reproducer.
I think the "one line fix" I would prefer is very similar to your proposal,
just a few lines later:
    if (pconnection_has_event(pc) || pconnection_work_pending(pc)) {
        goto retry; // TODO: get rid of goto without adding more locking
    }
This matches the similar "keep working" test in pconnection_done():
    bool has_event = pconnection_has_event(pc);
    if (has_event || pconnection_work_pending(pc)) {
        self_sched = true;
    }
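
To illustrate why the re-check after write_flush matters, here is a toy C
model of the control flow. It is not the Proton source: every toy_* name and
field is hypothetical, and the real pconnection_process does far more. It
only shows that an event queued by a failing write is stranded unless the
function loops back to the point where queued events are returned.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a pconnection; NOT the real Proton struct. */
typedef struct {
    int  queued_events;   /* events waiting to be handed to the caller */
    bool work_pending;    /* other scheduled work (assumed flag) */
    bool write_will_fail; /* simulate one failing write, e.g. no route to host */
} toy_conn;

static bool toy_has_event(const toy_conn *c)    { return c->queued_events > 0; }
static bool toy_work_pending(const toy_conn *c) { return c->work_pending; }

/* Simulated write_flush: a failed write queues one "transport closed" event. */
static void toy_write_flush(toy_conn *c) {
    if (c->write_will_fail) {
        c->write_will_fail = false;
        c->queued_events++;
    }
}

/* Returns true if a batch with events is handed back to the caller,
 * false if the function returns empty-handed (NULL in the real code). */
static bool toy_process(toy_conn *c, bool recheck_after_flush) {
    assert(c);
retry:
    if (toy_has_event(c)) {
        return true;              /* caller gets the event batch */
    }
    c->work_pending = false;      /* pending work is consumed on this pass */
    toy_write_flush(c);           /* may queue a new event on failure */
    if (recheck_after_flush &&
        (toy_has_event(c) || toy_work_pending(c))) {
        goto retry;               /* mirrors the proposed one-line fix */
    }
    return false;                 /* without the re-check, the event is stranded */
}
```

With recheck_after_flush set to false, a connection whose first write fails
returns empty-handed while an event is still queued, which is the hang
described in the report. With it set to true, the same connection loops back
and hands the event to the caller.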
However, I would prefer even more an approach that refactors the code a
little to remove the false duality between work_pending and having an
event. Having an unconsumed event should always mean there is work
pending, as the concept is used in the code.
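
One possible shape for such a refactoring, purely as a sketch: a single
predicate in which an unconsumed event is just one more kind of pending work.
The struct, its fields, and the function name below are all hypothetical and
do not reflect the actual Proton data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical connection state; field names are illustrative only. */
typedef struct {
    int  unconsumed_events; /* events queued but not yet handed to the app */
    bool wake_pending;      /* an explicit wake was requested */
    bool tick_pending;      /* a timer tick is due */
} sketch_conn;

/* Single predicate: an unconsumed event always counts as pending work,
 * so callers never need to test the two conditions separately. */
static bool sketch_work_pending(const sketch_conn *c) {
    assert(c);
    return c->unconsumed_events > 0 || c->wake_pending || c->tick_pending;
}
```

With a predicate like this, both the "goto retry" test and the self_sched
test would reduce to one call, and a forgotten has_event check could not
strand an event.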
I have also verified that the test case hangs on the main branch. I will try
to get a fix in right away, in time for the next release.
> Certain network conditions cause proton container to hang during connect phase
> ------------------------------------------------------------------------------
>
> Key: PROTON-2928
> URL: https://issues.apache.org/jira/browse/PROTON-2928
> Project: Qpid Proton
> Issue Type: Bug
> Components: cpp-binding, proton-c
> Affects Versions: proton-c-0.40.0
> Environment: Ubuntu 22.04
> Reporter: Bostjan Polanc
> Priority: Critical
> Attachments: Dockerfile
>
>
> Running a proton container when there is no route to the host will randomly
> hang the container during the connection attempt. The container does not
> stop, no exception is thrown, and no callback is called. During the
> connection attempt, CPU usage spikes to 100%.
>
> Steps to reproduce:
> 1) Add a firewall rule so that access to the server is rejected with an
> ICMP error:
>    iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable
> 2) Modify the C++ helloworld sample so that it makes several attempts to run
> the container (while true...).
> Sample cpp:
> [https://github.com/bospol624/qpid-proton/blob/main/cpp/examples/helloworld.cpp]
> 3) Run helloworld. The exception thrown on each iteration should be
> "proton:io: No route to host - disconnected".
> After a random number of iterations, the program stops iterating (hangs).
> Note that while it is still iterating, CPU load is at 100% whenever the
> proton container is running.
>
> I attached a Dockerfile that sets up this test (clean build + helloworld
> sample). Firewall rule on host machine still needs to be applied.
>
> My comment on the issue from [email protected]:
>
> Using a debugger I traced this to c/src/proactor/epoll.c.
> Inside the pconnection_process function there seems to be a race condition
> on which is called first: recv or send.
> If send is called first via write_flush (line 1298), it fails and
> psocket_error function is called. psocket_error function will among other
> things add PN_TRANSPORT_CLOSED event to collector.
> This event batch will however never be returned to calling functions
> (next_event_batch/process), because after write_flush call, event batch is
> no longer returned.
> Because there is no new trigger from FD and the event batch was not
> returned, next_event_batch will just loop inside poller_do_epoll.
> For my project I made a temporary patch (epoll.c/pconnection_process:1298):
>     write_flush(pc);
>     if (pconnection_has_event(pc)) {
>         pc->output_drained = false;
>         return &pc->batch;
>     }
> So basically this just adds an additional check in case write_flush
> generated any events.
>