[ 
https://issues.apache.org/jira/browse/PROTON-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bostjan Polanc updated PROTON-2928:
-----------------------------------
    Description: 
Running a proton container when the route to the host is not known will randomly hang 
the container during the connection attempt. The container does not stop, no exception 
is thrown and no callback is called. During the connection attempt CPU usage spikes to 
100%.

 

Steps to reproduce:

1) Add a firewall rule so that access to server is blocked with ICMP error

iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable

2) Modify the C++ helloworld sample so that it makes repeated attempts to run 
the container (while (true) ...).

Sample cpp: 
[https://github.com/bospol624/qpid-proton/blob/main/cpp/examples/helloworld.cpp]

3) Run helloworld. The exception thrown on each iteration should be "proton:io: No 
route to host - disconnected".

After a random number of iterations the program hangs. Note that while it is still 
iterating, CPU load is at 100% whenever the proton container is running.

 

I attached a Dockerfile that sets up this test (clean build + helloworld 
sample). The firewall rule still needs to be applied on the host machine.

 

My comment on the issue from [email protected]:

 
Using a debugger I traced this to c/src/proactor/epoll.c.

Inside the pconnection_process function there seems to be a race condition
on whether recv or send is called first.

If send is called first via write_flush (line 1298), it fails and the
psocket_error function is called. psocket_error will, among other
things, add a PN_TRANSPORT_CLOSED event to the collector.
This event batch, however, is never returned to the calling functions
(next_event_batch/process), because no check for pending events is made
after the write_flush call.

Because there is no new trigger from the FD and the event batch was not
returned, next_event_batch just loops inside poller_do_epoll.

For my project I made a temporary patch (epoll.c, pconnection_process, line 1298):

write_flush(pc);
if (pconnection_has_event(pc)) {
    pc->output_drained = false;
    return &pc->batch;
}

So this just adds an additional check, in case write_flush generated
any events.

> Certain network conditions cause proton container to hang during connect phase
> ------------------------------------------------------------------------------
>
>                 Key: PROTON-2928
>                 URL: https://issues.apache.org/jira/browse/PROTON-2928
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: cpp-binding, proton-c
>    Affects Versions: proton-c-0.40.0
>         Environment: Ubuntu 22.04
>            Reporter: Bostjan Polanc
>            Priority: Critical
>         Attachments: Dockerfile
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
