Hi Qpid gurus,

We are using the 0.16 Java broker and client on 0-10, and we are running into
deadlock issues on the client that involve AMQSession's
_messageDeliveryLock and AMQConnection's _failoverMutex, where different
threads acquire them in different orders. This is leading to major
production headaches for us and has us very worried.

(I've been looking at 0.16 code mostly but also skimmed the relevant parts
in 0.32, which seem largely the same in those places.)

Deadlock Variety 1

This is an example of a deadlock we see, where the IOReceiver thread
deadlocks with the session dispatcher thread (we have listeners that call
session rollback or commit in onMessage()):

IOReceiver
org.apache.qpid.client.AMQSession.closed(AMQSession.java:818)  ---------->
waiting for session's messageDeliveryLock
org.apache.qpid.client.AMQConnection.closeAllSessions(AMQConnection.java:938)
org.apache.qpid.client.AMQConnection.exceptionReceived(AMQConnection.java:1282)
 ----------> acquires connection's failoverMutex
org.apache.qpid.client.AMQSession_0_10.setCurrentException(AMQSession_0_10.java:1057)
org.apache.qpid.client.AMQSession_0_10.exception(AMQSession_0_10.java:907)
org.apache.qpid.transport.SessionDelegate.executionException(SessionDelegate.java:182)
org.apache.qpid.transport.SessionDelegate.executionException(SessionDelegate.java:32)
org.apache.qpid.transport.ExecutionException.dispatch(ExecutionException.java:103)
org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:55)
org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:50)
org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:32)
org.apache.qpid.transport.Method.delegate(Method.java:159)
org.apache.qpid.transport.Session.received(Session.java:585)
org.apache.qpid.transport.Connection.dispatch(Connection.java:412)
org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:64)
org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:40)
org.apache.qpid.transport.MethodDelegate.executionException(MethodDelegate.java:110)
org.apache.qpid.transport.ExecutionException.dispatch(ExecutionException.java:103)
org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:54)
org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:40)
org.apache.qpid.transport.Method.delegate(Method.java:159)
org.apache.qpid.transport.Connection.received(Connection.java:367)
org.apache.qpid.transport.Connection.received(Connection.java:65)
org.apache.qpid.transport.network.Assembler.emit(Assembler.java:97)
org.apache.qpid.transport.network.Assembler.assemble(Assembler.java:198)
org.apache.qpid.transport.network.Assembler.frame(Assembler.java:131)
org.apache.qpid.transport.network.Frame.delegate(Frame.java:128)
org.apache.qpid.transport.network.Assembler.received(Assembler.java:102)
org.apache.qpid.transport.network.Assembler.received(Assembler.java:44)
org.apache.qpid.transport.network.InputHandler.next(InputHandler.java:189)
org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:105)
org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:44)
org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:152)
java.lang.Thread.run(Thread.java:745)

Session dispatcher thread
org.apache.qpid.client.AMQConnection.exceptionReceived(AMQConnection.java:1255)
 ---------> waiting for connection's failoverMutex
org.apache.qpid.client.AMQSession_0_10.setCurrentException(AMQSession_0_10.java:1057)
org.apache.qpid.client.AMQSession_0_10.sync(AMQSession_0_10.java:1034)
org.apache.qpid.client.AMQSession_0_10.sendSuspendChannel(AMQSession_0_10.java:812)
org.apache.qpid.client.AMQSession.suspendChannel(AMQSession.java:3075)
org.apache.qpid.client.AMQSession.rollback(AMQSession.java:1837)
common.messaging.QpidSession.rollback(QpidSession.java:211)
common.messaging.QpidMessageHandler.rollbackSession(QpidMessageHandler.java:284)
common.messaging.QpidMessageHandler.onMessage(QpidMessageHandler.java:113)
org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:748)
org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:141)
org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:722)
org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:186)
org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:54)
org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3454)
org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3393)
  -----------> acquires session's messageDeliveryLock
org.apache.qpid.client.AMQSession$Dispatcher.access$1000(AMQSession.java:3180)
org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3173)
org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3316)
java.lang.Thread.run(Thread.java:745)

The problem is that the IOReceiver thread acquires failoverMutex before
messageDeliveryLock (for each session), whereas the dispatcher thread
acquires the two locks in the opposite order. We also see potential problems
where other threads (instead of IOReceiver) can deadlock with the dispatcher
thread, as long as they acquire failoverMutex before messageDeliveryLock.
Examples we can think of:

A) Another thread calling AMQSession.close()
B) Another thread calling BasicMessageConsumer.close()
C) Same connection, different session's dispatcher thread, calling
rollback() or commit() -> sync() -> setCurrentException() ->
AMQConnection.exceptionReceived() -> AMQConnection.closeAllSessions(),
which can try to acquire the messageDeliveryLock of another session and
deadlock with the other session's dispatcher thread
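
To make the ordering problem concrete, here is a minimal sketch of the
inversion. This is not Qpid code: the class name and the two ReentrantLock
fields are hypothetical stand-ins for the connection's failoverMutex and a
session's messageDeliveryLock, and the second acquisition uses a timed
tryLock (our assumption, purely so the demo terminates instead of hanging).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderSketch {
    // Hypothetical stand-ins for the two client locks; not the actual Qpid fields.
    static final ReentrantLock failoverMutex = new ReentrantLock();
    static final ReentrantLock messageDeliveryLock = new ReentrantLock();

    /** Takes 'first', waits until the peer holds its own first lock, then tries 'second'. */
    static boolean takeInOrder(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothTried)
            throws InterruptedException {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();            // both threads now hold their first lock
            boolean got = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (got) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();                // keep 'first' held until the peer has tried too
            return got;
        } finally {
            first.unlock();
        }
    }

    /** Returns whether each thread managed to take its second lock. */
    static boolean[] run() throws InterruptedException {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];

        // Same order as the IOReceiver stack: failoverMutex, then messageDeliveryLock.
        Thread receiverLike = new Thread(() -> {
            try {
                acquired[0] = takeInOrder(failoverMutex, messageDeliveryLock,
                                          bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "receiver-like");

        // Same order as the dispatcher stack: messageDeliveryLock, then failoverMutex.
        Thread dispatcherLike = new Thread(() -> {
            try {
                acquired[1] = takeInOrder(messageDeliveryLock, failoverMutex,
                                          bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "dispatcher-like");

        receiverLike.start();
        dispatcherLike.start();
        receiverLike.join();
        dispatcherLike.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] acquired = run();
        // Neither thread gets its second lock: both timed attempts fail.
        System.out.println("receiver-like got second lock: " + acquired[0]);
        System.out.println("dispatcher-like got second lock: " + acquired[1]);
    }
}
```

With a plain blocking acquisition in place of the timed tryLock, the two
threads would wait on each other indefinitely, which is the shape of the
two stacks above.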


Deadlock Variety 2

From code inspection, it also appears that AMQConnection.close() can
deadlock with either AMQSession.close() or BasicMessageConsumer.close()
(where the session / consumer is on the same connection). This is because
AMQConnection.close() first acquires the messageDeliveryLock of all its
sessions in the recursive doClose(), before trying to acquire the
connection's failoverMutex. But the Session / consumer's close() acquires
the failoverMutex before messageDeliveryLock. We haven't seen this happen
but would like to know if this is possible.
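
Since we have not caught this one in a thread dump, one way we could check
for it at runtime is the JDK's own deadlock detection. The following is a
rough sketch only: the class and method names are hypothetical, and it uses
just the standard java.lang.management API, nothing from Qpid.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

class DeadlockProbe {
    /** Returns one description per deadlocked thread, or an empty list if none are found. */
    static List<String> deadlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // findDeadlockedThreads() also covers java.util.concurrent locks, but it is
        // only available when the JVM supports synchronizer usage monitoring;
        // otherwise fall back to the monitor-only check.
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();

        List<String> result = new ArrayList<>();
        if (ids == null) {
            return result;                    // null means no deadlock was detected
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) {               // a thread may have exited in the meantime
                result.add(info.getThreadName()
                        + " waiting for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = deadlockedThreads();
        if (found.isEmpty()) {
            System.out.println("no deadlocked threads detected");
        } else {
            for (String line : found) {
                System.out.println(line);
            }
        }
    }
}
```

Calling something like this periodically from a watchdog thread in the
client process would at least tell us whether the close() paths ever lock
up in the way described.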


We'd really appreciate your help on this. Assuming these can be fixed in
0.32, we are also wondering if clients are backward compatible -- i.e., can
we upgrade only our client to 0.32 while continuing to use the 0.16 broker?

Thanks,
Helen
