[jira] [Commented] (PROTON-2928) Certain network conditions cause proton container to hang during connect phase

Clifford Jansen (Jira) Mon, 06 Apr 2026 22:52:12 -0700


    [ 
https://issues.apache.org/jira/browse/PROTON-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071617#comment-18071617
 ]


Clifford Jansen commented on PROTON-2928:
-----------------------------------------

Thank you for the excellent reproducer.

I think the "one line fix" I would prefer is very similar to your proposal, 
just a few lines later:

  if (pconnection_has_event(pc) || pconnection_work_pending(pc)) {
    goto retry;  // TODO: get rid of goto without adding more locking
  }

This matches the similar "keep working" test in pconnection_done():

  bool has_event = pconnection_has_event(pc);

  if (has_event || pconnection_work_pending(pc)) {
    self_sched = true;
  }
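
As a rough illustration of why the re-check matters, here is a toy model in C. 
It is not the epoll proactor code: toy_conn, its fields, and the toy_* 
functions are hypothetical stand-ins, and the "recheck" flag exists only so 
both behaviours can be compared. It assumes, as the report describes, that a 
failed flush queues an event after the point where events were last checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a connection context; NOT the real pconnection_t. */
typedef struct {
    int  queued_events;  /* events collected but not yet handed to the caller */
    bool write_pending;  /* output still waiting to be flushed */
    bool socket_broken;  /* the next write fails, e.g. "No route to host" */
} toy_conn;

static bool toy_has_event(const toy_conn *c)    { return c->queued_events > 0; }
static bool toy_work_pending(const toy_conn *c) { return c->write_pending; }

/* A failed flush queues a "transport closed" style event. */
static void toy_write_flush(toy_conn *c) {
    assert(c != NULL);
    c->write_pending = false;
    if (c->socket_broken) c->queued_events++;
}

/* Returns true when a batch is handed back to the caller and false when the
 * connection goes idle. With recheck == false, an event queued by the flush
 * is left stranded in the collector. */
static bool toy_process(toy_conn *c, bool recheck) {
    for (;;) {
        if (toy_has_event(c)) return true;   /* deliver the batch */
        toy_write_flush(c);
        if (recheck && (toy_has_event(c) || toy_work_pending(c)))
            continue;                        /* plays the role of "goto retry" */
        return false;                        /* idle until the next wakeup */
    }
}
```

With a broken socket and recheck disabled, toy_process() reports idle while 
one event is still queued; with recheck enabled the same event is delivered 
on the second pass through the loop.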

However, I would prefer even more an approach that refactors the code a 
little to remove the false duality between work_pending and having an event.  
As the concept is used in the code, an unconsumed event should always mean 
there is work pending.
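
One possible shape for such a refactoring, sketched under assumptions: the 
struct and field names below are invented for illustration and do not mirror 
the real pconnection_t, and this is not the actual fix. The idea is only that 
a single predicate treats an unconsumed event as one more kind of pending 
work, so callers cannot test one condition and forget the other.

```c
#include <stdbool.h>

/* Hypothetical connection state; field names are illustrative only. */
typedef struct {
    int  unconsumed_events;  /* events collected but not yet handed out */
    bool wake_requested;     /* some other source of work, e.g. a wakeup */
    bool output_ready;       /* buffered output that could be written */
} sketch_conn;

/* Single predicate: an unconsumed event is itself pending work. */
static bool sketch_work_pending(const sketch_conn *c) {
    return c->unconsumed_events > 0 || c->wake_requested || c->output_ready;
}
```

Both the retry test and the self-scheduling test could then call the one 
predicate instead of repeating the two-part condition.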

I have also verified that the test case hangs on the main branch.  I will try 
to get a fix in right away, in time for the next release.

> Certain network conditions cause proton container to hang during connect phase
> ------------------------------------------------------------------------------
>
>                 Key: PROTON-2928
>                 URL: https://issues.apache.org/jira/browse/PROTON-2928
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: cpp-binding, proton-c
>    Affects Versions: proton-c-0.40.0
>         Environment: Ubuntu 22.04
>            Reporter: Bostjan Polanc
>            Priority: Critical
>         Attachments: Dockerfile
>
>
> Running a proton container when there is no route to the host will randomly 
> hang the container during the connection attempt. The container does not 
> stop, no exception is thrown, and no callback is called. During the 
> connection attempt, CPU usage spikes to 100%.
>  
> Steps to reproduce:
> 1) Add a firewall rule so that access to server is blocked with ICMP error
> iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable
> 2) Modify the C++ helloworld sample so that it makes several attempts to run 
> the container (while true...).
> Sample cpp: 
> [https://github.com/bospol624/qpid-proton/blob/main/cpp/examples/helloworld.cpp]
> 3) Run helloworld. The exception thrown on each iteration should be 
> "proton:io: No route to host - disconnected".
> After a random number of iterations, the program stops. Note that while it is 
> still iterating, CPU load is at 100% whenever the proton container is running.
>  
> I attached a Dockerfile that sets up this test (clean build + helloworld 
> sample). Firewall rule on host machine still needs to be applied.
>  
> My comment on the issue from [email protected]:
>  
> Using a debugger I traced this to c/src/proactor/epoll.c.
> Inside the pconnection_process function there seems to be a race condition
> on which is called first: recv or send.
> If send is called first via write_flush (line 1298), it fails and
> psocket_error function is called. psocket_error function will among other
> things add PN_TRANSPORT_CLOSED event to collector.
> This event batch will however never be returned to calling functions
> (next_event_batch/process), because after write_flush call, event batch is
> no longer returned.
> Because there is no new trigger from FD and the event batch was not
> returned, next_event_batch will just loop inside poller_do_epoll.
> For my project I made a temporary patch (epoll.c/pconnection_process:1298):
> write_flush(pc);
> if (pconnection_has_event(pc)) { pc->output_drained = false; return &pc->batch; }
> So basically just add an additional check, in case if write_flush generated
> any events.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

