Bostjan Polanc created PROTON-2928:
--------------------------------------

             Summary: Certain network conditions cause proton container to hang 
during connect phase
                 Key: PROTON-2928
                 URL: https://issues.apache.org/jira/browse/PROTON-2928
             Project: Qpid Proton
          Issue Type: Bug
          Components: cpp-binding, proton-c
    Affects Versions: proton-c-0.40.0
         Environment: Ubuntu 22.04
            Reporter: Bostjan Polanc
         Attachments: Dockerfile

Running a Proton container when there is no route to the host intermittently 
hangs the container during the connection attempt. The container does not stop, 
no exception is thrown, and no callback is called. During the connection 
attempt, CPU usage spikes to 100%.

 

Steps to reproduce:

1) Add a firewall rule so that access to the server is rejected with an ICMP 
error:

iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable

2) Modify the C++ helloworld sample so that it makes several attempts to run 
the container (while true...).

Sample cpp: 
https://github.com/bospol624/qpid-proton/blob/main/cpp/examples/helloworld.cpp

3) Run helloworld. The exception thrown on each iteration should be "proton:io: 
No route to host - disconnected".

After a random number of iterations, the program stops iterating. Note that even 
while it is still iterating, CPU load is at 100% whenever the Proton container 
is running.
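
The retry loop from step 2 could be sketched roughly as follows. This is only an illustration, not the code from the linked sample: run_repeatedly and its attempt parameter are hypothetical names, and attempt stands in for whatever constructs the handler and calls run() on a proton::container. The loop is bounded here so that it terminates; the reproduction itself uses an endless loop.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: calls `attempt` up to max_attempts times and records
// the message of every exception it throws. In the real reproduction,
// `attempt` would build the helloworld handler and run a proton::container
// with it; here it is just an arbitrary callable.
std::vector<std::string> run_repeatedly(const std::function<void()>& attempt,
                                        int max_attempts) {
    assert(max_attempts > 0);
    if (!attempt) {
        throw std::invalid_argument("attempt must be callable");
    }
    std::vector<std::string> errors;
    for (int i = 0; i < max_attempts; ++i) {
        try {
            attempt();  // expected to throw while the route is blocked
        } catch (const std::exception& e) {
            std::cerr << "iteration " << i << ": " << e.what() << std::endl;
            errors.push_back(e.what());
        }
    }
    return errors;
}
```

With the firewall rule in place, each iteration would be expected to log the "No route to host" error. In terms of this sketch, the hang described above would show up as a call to attempt that never returns.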

 

I have attached a Dockerfile that sets up this test (clean build plus the 
helloworld sample). The firewall rule still needs to be applied on the host 
machine.

 

My comment on the issue from [email protected]:

 
Using a debugger, I traced this to c/src/proactor/epoll.c.

Inside the pconnection_process function there seems to be a race over which
is called first: recv or send.

If send is called first via write_flush (line 1298), it fails and
psocket_error is called. Among other things, psocket_error adds a
PN_TRANSPORT_CLOSED event to the collector.
However, this event batch is never returned to the calling functions
(next_event_batch/process), because pconnection_process no longer returns
the event batch after the write_flush call.

Because there is no new trigger from the FD and the event batch was not
returned, next_event_batch just keeps looping inside poller_do_epoll.

For my project I made a temporary patch (epoll.c, pconnection_process, line 1298):

write_flush(pc);
if (pconnection_has_event(pc)) { pc->output_drained = false; return &pc->batch; }

In other words, the patch adds one extra check for the case where write_flush
itself generated events.
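
A toy model may make the effect of the extra check easier to see. It is not Proton's code: ToyConnection, toy_write_flush, toy_has_event, toy_process and toy_next_event are hypothetical stand-ins that only mimic the behaviour described above, where a failed send queues PN_TRANSPORT_CLOSED and the caller sees that event only if the batch is returned.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy stand-in for a connection; not Proton's real pconnection_t.
struct ToyConnection {
    std::deque<std::string> events;  // stand-in for the event collector
    bool socket_broken = false;      // e.g. send fails with "No route to host"
    bool output_drained = true;
};

// Stand-in for write_flush: a failed send queues a close event, as the
// report describes psocket_error doing.
void toy_write_flush(ToyConnection& c) {
    if (c.socket_broken) {
        c.events.push_back("PN_TRANSPORT_CLOSED");
    }
}

bool toy_has_event(const ToyConnection& c) { return !c.events.empty(); }

// Returns the batch (here simply the connection) when the caller should
// dispatch events, or nullptr when there is nothing to hand back.
// recheck_after_flush == true models the temporary patch.
ToyConnection* toy_process(ToyConnection& c, bool recheck_after_flush) {
    toy_write_flush(c);
    if (recheck_after_flush && toy_has_event(c)) {
        c.output_drained = false;
        return &c;  // the event produced by the flush reaches the caller
    }
    return nullptr;  // without the recheck, a queued event stays stranded
}

// Models the caller consuming one event from a returned batch.
std::string toy_next_event(ToyConnection* batch) {
    assert(batch != nullptr && !batch->events.empty());
    std::string ev = batch->events.front();
    batch->events.pop_front();
    return ev;
}
```

In this model, calling toy_process on a broken connection without the recheck returns nullptr even though an event is queued, which corresponds to the stranded PN_TRANSPORT_CLOSED. With the recheck, the batch is returned and the caller can consume the event.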

