I have recently encountered a problem with the Qpid C++ broker failing due
to a segfault.

I have done some analysis and believe I have found the cause.  I have a
proposed workaround,  but I also have a question about the desired
behaviour and wanted to see what others thought before submitting a
ticket/pull request.

In essence, the problem occurs when a session has 'pending' responses and
then a link error occurs.  When a link is re-established an attempt is made
to send the pending responses but, by that stage, they are invalid and a
NULL link pointer within a delivery object causes the segfault.

Here is some more detail. (Feel free to skip to the "Questions" below if
you don't want to wade through the detail yet)

The problem was initially found in broker version 1.39.0 running on CentOS
7 though I don't think the problem is platform specific. I have reproduced
it using the latest source from git.

The problem was shown up by a new AMQP 1.0 client program that I have been
developing.  There are two particular features of the client that I believe
contributed to the problem.  Firstly, it uses AMQP 1.0, and secondly it
sends messages "asynchronously" i.e. it sends multiple messages without
necessarily waiting for each one to be individually acknowledged.

The scenario being tested is sending messages to a broker.  The messages
are routed to a destination queue.  The queue has a qpid.max_size specified
(in this case 10,000 bytes) and a qpid.policy_type of 'reject'.

There are no subscribers to the queue so the messages will build up until
the limit is reached and should then be rejected.

The client program sends multiple messages with 1,000 bytes of payload
(1,087 bytes total size when on a queue)

The first nine messages get routed to the destination queue but the tenth
one won't fit.  At this point the broker returns a "Link Error" to the
client (with 'amqp:precondition-failed' / 'resource-limit-exceeded') and,
crucially, closes the link.

When the link error occurs, the client program tries to re-establish the
link and to re-send any unacknowledged messages.  If nothing has changed at
the broker end then the same link error occurs.  The client repeatedly
tries to re-connect (with a BackOff delay).

The next stage of the test is to remove some messages from the destination
queue (using the purge facility on the QMF console).  The hope being that
the client will successfully reconnect and send some more messages.

What actually happened was that, once the link is re-established, and data
starts to flow again, then the broker crashes with a SegFault.

The actual SegFault is in the proton library at pni_add_tpwork(pn_delivery_t
*delivery)
<https://github.com/apache/qpid-proton/blob/main/c/src/core/engine.c#L740> and
is caused by the 'link' pointer, within the passed 'delivery' being NULL.

The reason it is NULL is due, I think, to a logic error in the Session
handling in the AMQP 1.0 plugin Session.cpp
<https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp>

The session handling has some complex tasks to deal with, including
ensuring that messages are sent and received on the correct threads.

After tracing / debugging I think the following happens:


   - When an incoming message has been processed, and an 'accepted' state
   needs to be sent to the originator, a pointer to the relevant
   'delivery' object is placed in the 'pending' set. pending_accept()
   
<https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L649>
   - The 'delivery' pointer is later removed from the pending set when
Session::accepted()

   
<https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L677>is
   called with 'sync=True' and the delivery is accepted with a call to
   pn_delivery_update().
   - However, when the link error occurs there are still some delivery
   object pointers in the pending set.
   - After the queue is purged (and possibly after the link has been
   re-established - I'm not sure) then Session::accepted()
   
<https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L677>
is
   called for those deliveries in the pending set, but this time with
   'sync=False' which causes the pointers to be moved to the 'completed'
   deque.  The problem is that the *delivery objects are no longer valid*.
   It looks like they have been re-initialised to 'empty' values. Specifically
   the 'link' pointer is NULL.  I am guessing that they have been cleared by
   some C proton processing related to the link being cleared and
   re-established.  Since the 'pending' set holds bare pointers to the
   delivery objects the proton library doesn't 'know' they are in use - and
   anyway the link they were referring to is no longer there.
   - With the invalid delivery object having being moved from the 'pending'
   set to the 'completed' deque, the stage is set for the final act when,
   during processing of the 'completed' deque within Session::dispatch()
   
<https://github.com/apache/qpid-cpp/blob/main/src/qpid/broker/amqp/Session.cpp#L748>
an
   'empty' delivery object is passed to Session::accepted() and results in the
   Segfault at pni_add_tpwork()


*Questions:*

Firstly, a simple work around is to check the link pointer when moving
delivery pointers from 'pending' to completed and discarding them if the
'link' is NULL . Does this suffice?

Secondly, does the link need to be cleared just because a message is too
big for the queue? Can it not be signalled back to the sender in a less
drastic manner? (Dropping the link means it is likely for numerous messages
to be repeatedly delivered when using "at least once" delivery)

Thoughts and help would be very much appreciated.

Thanks

Pete

Reply via email to