Hello,

I am developing a Qpid Proton C++ client (0.40.0, Linux) that requires
server reconnect functionality. I am using the reconnect options, and most of
the time this works fine.
However, there appears to be a timing-related issue: on some reconnect
attempts the Proton container "hangs". That is, no callback function is
called, no exception is thrown, and the thread does not stop.

The simplest way to reproduce it is with a modified hello world client
that tries to connect to a non-existent server:

int main(int argc, char **argv) {
    std::string conn_url = argc > 1 ? argv[1] : "//127.0.0.1:5672";
    std::string addr = argc > 2 ? argv[2] : "examples";
    hello_world hw(conn_url, addr);
    unsigned int iteration = 0;
    while (true)
    {
        try {
            printf("iteration %u\n", iteration);
            iteration++;
            proton::container(hw).run();

        } catch (const std::exception& e) {
            std::cerr << e.what() << std::endl;
        }
    }
    return 1;
}

And by adding the following rule to the firewall:
iptables -A INPUT -p tcp --dport 5672 -j REJECT --reject-with icmp-host-unreachable

The iptables rule is essential, as it adds some delay at the TCP level. With
this setup, the issue reproduces relatively quickly (on Ubuntu and RHEL).

Using a debugger, I traced this to c/src/proactor/epoll.c.

Inside the pconnection_process function there seems to be a race over
which is called first: recv or send.

If send is called first via write_flush (line 1298), it fails and the
psocket_error function is called. Among other things, psocket_error adds a
PN_TRANSPORT_CLOSED event to the collector.
This event batch, however, is never returned to the calling functions
(next_event_batch/process), because the code path after the write_flush
call no longer returns the event batch.

Because there is no new trigger from the file descriptor and the event
batch was not returned, next_event_batch just keeps looping inside
poller_do_epoll.

For my project I made a temporary patch (epoll.c, pconnection_process, line 1298):

write_flush(pc);
if (pconnection_has_event(pc)) {
    pc->output_drained = false;
    return &pc->batch;
}

So basically it just adds one more check, in case write_flush generated
any events.

This seems to fix my issue, but a second opinion or an alternative
approach would obviously be helpful.

Thank you, BR.
