Hi Qpid gurus,

We are using the 0.16 Java broker and client over AMQP 0-10, and we are running into deadlocks on the client that involve AMQSession's _messageDeliveryLock and AMQConnection's _failoverMutex, where different threads acquire the two locks in different orders. This is causing major production headaches for us and has us very worried.
(I've been looking mostly at the 0.16 code, but I also skimmed the relevant parts of 0.32, which seem largely the same in those places.)

Deadlock Variety 1:

This is an example of a deadlock we see, where the IOReceiver thread deadlocks with the session dispatcher thread (we have listeners that call session rollback or commit in onMessage()):

IOReceiver thread:

  org.apache.qpid.client.AMQSession.closed(AMQSession.java:818) ----------> waiting for session's messageDeliveryLock
  org.apache.qpid.client.AMQConnection.closeAllSessions(AMQConnection.java:938)
  org.apache.qpid.client.AMQConnection.exceptionReceived(AMQConnection.java:1282) ----------> acquires connection's failoverMutex
  org.apache.qpid.client.AMQSession_0_10.setCurrentException(AMQSession_0_10.java:1057)
  org.apache.qpid.client.AMQSession_0_10.exception(AMQSession_0_10.java:907)
  org.apache.qpid.transport.SessionDelegate.executionException(SessionDelegate.java:182)
  org.apache.qpid.transport.SessionDelegate.executionException(SessionDelegate.java:32)
  org.apache.qpid.transport.ExecutionException.dispatch(ExecutionException.java:103)
  org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:55)
  org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:50)
  org.apache.qpid.transport.SessionDelegate.command(SessionDelegate.java:32)
  org.apache.qpid.transport.Method.delegate(Method.java:159)
  org.apache.qpid.transport.Session.received(Session.java:585)
  org.apache.qpid.transport.Connection.dispatch(Connection.java:412)
  org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:64)
  org.apache.qpid.transport.ConnectionDelegate.handle(ConnectionDelegate.java:40)
  org.apache.qpid.transport.MethodDelegate.executionException(MethodDelegate.java:110)
  org.apache.qpid.transport.ExecutionException.dispatch(ExecutionException.java:103)
  org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:54)
  org.apache.qpid.transport.ConnectionDelegate.command(ConnectionDelegate.java:40)
  org.apache.qpid.transport.Method.delegate(Method.java:159)
  org.apache.qpid.transport.Connection.received(Connection.java:367)
  org.apache.qpid.transport.Connection.received(Connection.java:65)
  org.apache.qpid.transport.network.Assembler.emit(Assembler.java:97)
  org.apache.qpid.transport.network.Assembler.assemble(Assembler.java:198)
  org.apache.qpid.transport.network.Assembler.frame(Assembler.java:131)
  org.apache.qpid.transport.network.Frame.delegate(Frame.java:128)
  org.apache.qpid.transport.network.Assembler.received(Assembler.java:102)
  org.apache.qpid.transport.network.Assembler.received(Assembler.java:44)
  org.apache.qpid.transport.network.InputHandler.next(InputHandler.java:189)
  org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:105)
  org.apache.qpid.transport.network.InputHandler.received(InputHandler.java:44)
  org.apache.qpid.transport.network.io.IoReceiver.run(IoReceiver.java:152)
  java.lang.Thread.run(Thread.java:745)

Session dispatcher thread:

  org.apache.qpid.client.AMQConnection.exceptionReceived(AMQConnection.java:1255) ----------> waiting for connection's failoverMutex
  org.apache.qpid.client.AMQSession_0_10.setCurrentException(AMQSession_0_10.java:1057)
  org.apache.qpid.client.AMQSession_0_10.sync(AMQSession_0_10.java:1034)
  org.apache.qpid.client.AMQSession_0_10.sendSuspendChannel(AMQSession_0_10.java:812)
  org.apache.qpid.client.AMQSession.suspendChannel(AMQSession.java:3075)
  org.apache.qpid.client.AMQSession.rollback(AMQSession.java:1837)
  common.messaging.QpidSession.rollback(QpidSession.java:211)
  common.messaging.QpidMessageHandler.rollbackSession(QpidMessageHandler.java:284)
  common.messaging.QpidMessageHandler.onMessage(QpidMessageHandler.java:113)
  org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:748)
  org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:141)
  org.apache.qpid.client.BasicMessageConsumer.notifyMessage(BasicMessageConsumer.java:722)
  org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:186)
  org.apache.qpid.client.BasicMessageConsumer_0_10.notifyMessage(BasicMessageConsumer_0_10.java:54)
  org.apache.qpid.client.AMQSession$Dispatcher.notifyConsumer(AMQSession.java:3454)
  org.apache.qpid.client.AMQSession$Dispatcher.dispatchMessage(AMQSession.java:3393) ----------> acquires session's messageDeliveryLock
  org.apache.qpid.client.AMQSession$Dispatcher.access$1000(AMQSession.java:3180)
  org.apache.qpid.client.AMQSession.dispatch(AMQSession.java:3173)
  org.apache.qpid.client.message.UnprocessedMessage.dispatch(UnprocessedMessage.java:54)
  org.apache.qpid.client.AMQSession$Dispatcher.run(AMQSession.java:3316)
  java.lang.Thread.run(Thread.java:745)

The problem is that the IOReceiver thread acquires the failoverMutex before the messageDeliveryLock (for each session), whereas the dispatcher thread acquires them in the opposite order.

We also see potential problems where other threads (instead of IOReceiver) can deadlock with the dispatcher thread, as long as they acquire the failoverMutex before the messageDeliveryLock. Examples we can think of:

A) Another thread calling AMQSession.close()
B) Another thread calling BasicMessageConsumer.close()
C) The dispatcher thread of a different session on the same connection calling rollback() or commit() -> sync() -> setCurrentException() -> AMQConnection.exceptionReceived() -> AMQConnection.closeAllSessions(), which can try to acquire the messageDeliveryLock of another session and deadlock with that session's dispatcher thread

Deadlock Variety 2:

From code inspection, it also appears that AMQConnection.close() can deadlock with either AMQSession.close() or BasicMessageConsumer.close() (where the session / consumer is on the same connection). This is because AMQConnection.close() first acquires the messageDeliveryLock of all its sessions in the recursive doClose(), before trying to acquire the connection's failoverMutex.
But the session's / consumer's close() acquires the failoverMutex before the messageDeliveryLock. We haven't seen this happen, but we would like to know whether it is possible.

We'd really appreciate your help on this. Assuming these can be fixed in 0.32, we are also wondering whether the clients are backward compatible, i.e., can we upgrade only our client to 0.32 while continuing to use the 0.16 broker?

Thanks,
Helen
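
P.S. In case it helps to make the ordering problem concrete, here is a minimal, self-contained Java sketch of the inversion described above. It is not Qpid code. The class LockOrderSketch, its methods, and the 200 ms timeout are made up for illustration, and the two lock fields only borrow the names failoverMutex and messageDeliveryLock. I use ReentrantLock here (an assumption, the real client may well use plain monitors) only so that tryLock with a timeout can report the inversion instead of hanging.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the lock ordering problem; not Qpid code.
final class LockOrderSketch {
    // Named after the Qpid fields for illustration only.
    static final ReentrantLock failoverMutex = new ReentrantLock();
    static final ReentrantLock messageDeliveryLock = new ReentrantLock();

    // Takes 'first', waits until the peer thread also holds its first lock,
    // then tries 'second' with a timeout. Returns true if 'second' was acquired.
    static boolean lockInOrder(ReentrantLock first, ReentrantLock second,
                               CyclicBarrier barrier) throws Exception {
        first.lock();
        try {
            barrier.await(); // both threads now hold their first lock
            boolean gotSecond = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (gotSecond) {
                second.unlock();
            }
            barrier.await(); // keep 'first' held until the peer has finished trying
            return gotSecond;
        } finally {
            first.unlock();
        }
    }

    // Runs two threads that take the locks in opposite orders.
    // Returns { receiverGotBothLocks, dispatcherGotBothLocks }.
    static boolean[] runInverted() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CyclicBarrier barrier = new CyclicBarrier(2);
            // Like the IOReceiver path: failoverMutex first, then messageDeliveryLock.
            Callable<Boolean> receiverTask =
                    () -> lockInOrder(failoverMutex, messageDeliveryLock, barrier);
            // Like the dispatcher path: messageDeliveryLock first, then failoverMutex.
            Callable<Boolean> dispatcherTask =
                    () -> lockInOrder(messageDeliveryLock, failoverMutex, barrier);
            Future<Boolean> receiver = pool.submit(receiverTask);
            Future<Boolean> dispatcher = pool.submit(dispatcherTask);
            return new boolean[] { receiver.get(), dispatcher.get() };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = runInverted();
        System.out.println("receiver got both locks: " + result[0]);
        System.out.println("dispatcher got both locks: " + result[1]);
    }
}
```

Both lines print false: each thread holds the lock the other one needs, so both tryLock calls time out. With untimed locking (or synchronized blocks) the same interleaving would hang forever, which matches what we see in the two stack traces.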