Although you provided what looks like a simple sample program, I am unable
to reproduce the hang after more than 200 iterations.

I believe the iptables command you supplied was successfully invoked
as the connection error changes from

  proton:io: Connection refused - disconnected 127.0.0.1:5672

to

  proton:io: No route to host - disconnected 127.0.0.1:5672

with a noticeable difference in time lag.  Tested on Fedora 43.

Perhaps you can post the actual full source code of the reproducer and
review the procedure used to force the error, perhaps starting from a
freshly rebooted system.  I might have more luck on a second try.

I would point out that the reproducer uses a container per connection.
This is valid but is at odds with the stated purpose of using reconnect
options which only make sense within a container that manages the
reconnecting.  That doesn't mean that you haven't found a bug in the
epoll proactor code needing attention.

In any event, you may find the following reconnect "torture test" code helpful:

  https://github.com/cliffjansen/senderciser

Cliff

On Mon, Mar 30, 2026 at 5:09 AM Boštjan Polanc <[email protected]> wrote:
>
> Hello,
>
> I am developing a Qpid Proton C++ client (0.40.0, Linux), that requires
> server reconnect functionality. I am using reconnection options and most of
> the time it is working OK.
> There however appears to be some timing related issue, so that on some
> reconnect attempts, proton container "hangs". That is, no callback function
> is called, no exception thrown and thread does not stop.
>
> The simplest setup to reproduce it is with a modified hello world client
> which is trying to connect to a non existing server:
>
> int main(int argc, char **argv) {
>     std::string conn_url = argc > 1 ? argv[1] : "//127.0.0.1:5672";
>     std::string addr = argc > 2 ? argv[2] : "examples";
>     hello_world hw(conn_url, addr);
>     unsigned int iteration = 0;
>     while (true)
>     {
>         try {
>             printf("iteration %u\n", iteration);
>             iteration++;
>             proton::container(hw).run();
>
>         } catch (const std::exception& e) {
>             std::cerr << e.what() << std::endl;
>         }
>     }
>     return 1;
> }
>
> And by adding the following rule to firewall:
> iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable
>
> Iptables rule is essential, as it adds some delay to TCP level. With this
> setup, the issue is reproduced relatively quickly (Ubuntu and RHEL).
>
> Using a debugger I traced this to c/src/proactor/epoll.c.
>
> Inside the pconnection_process function there seems to be a race condition
> on which is called first: recv or send.
>
> If send is called first via write_flush (line 1298), it fails and
> psocket_error function is called. psocket_error function will among other
> things add PN_TRANSPORT_CLOSED event to collector.
> This event batch will however never be returned to calling functions
> (next_event_batch/process), because after write_flush call, event batch is
> no longer returned.
>
> Because there is no new trigger from FD and the event batch was not
> returned, next_event_batch will just loop inside poller_do_epoll.
>
> For my project I made a temporary patch (epoll.c/pconnection_process:1298):
>
> write_flush(pc);
> if (pconnection_has_event(pc)) {
>     pc->output_drained = false;
>     return &pc->batch;
> }
>
> So basically just add an additional check, in case if write_flush generated
> any events.
>
> This seems to fix my issues, but obviously another view/option would be
> helpful.
>
> Thank you, BR.
