Hello all,

We run Artemis embedded in our Spring service and occasionally hit an issue where all our producer threads become blocked, so no messages can be produced to the queue (this has happened 3 times in 2 weeks). We produce both regular and large messages to the queue. The only thing we get from Artemis during this time is timeout exceptions on the producer client side:
nested exception is javax.jms.JMSException: AMQ219014: Timed out after waiting 30,000 ms for response when sending packet 71

We took a thread dump of the service while the issue was in effect (attached). From what we can see, Thread t@11854 and Thread t@202 are permanently locked, and all other BLOCKED threads are blocked on session creation:

org.apache.activemq.artemis.jms.client.ActiveMQConnection.createSession(ActiveMQConnection.java:234)

There seems to be a race condition that can cause a thread deadlock during large message processing when the following sequence occurs:

1. A large message is produced.
2. *Thread-1*: JournalImpl.appendAddRecord() -> appendExecutor (JournalImpl:946) is delayed for some reason.
3. *Thread-2*: JournalImpl.appendDeleteRecord() is triggered (because appendAdd is async), takes the lock on the LargeMessage object (PostOfficeImpl:1305), and gets stuck in the appendExecutor queue behind Thread-1 (JournalImpl:1058).
4. *Thread-1*: JournalImpl.appendAddRecord() -> appendExecutor reaches the point where it needs the lock on the same LargeMessage object (LargeServerMessageImpl:173), but it cannot acquire it because Thread-2 holds it.

So the deadlock is: Thread-1 is waiting for the lock on the LargeMessage object, which Thread-2 will never release, because Thread-2 is itself waiting for its task to be processed on the appendExecutor queue behind Thread-1.

We are still having trouble actually reproducing this deadlock, because when Thread-2 processes JournalImpl.appendDeleteRecord() it checks records/pendingRecords for the recordId before reaching the appendExecutor, and in every case we have managed to reproduce, the recordId is always present there (JournalImpl:1051).

The service runs within a Docker container, and the folder containing the journal is mapped to the host machine. Metrics for the node on which the service was running show no disk I/O issues at that time.
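The sequence in steps 1-4 is a lock-ordering cycle between an object monitor and an ordered executor queue. The self-contained Java sketch below reproduces that pattern in isolation; it is not Artemis code. It assumes, as the report implies, that appendExecutor runs tasks one at a time in submission order, so a single-threaded executor stands in for it and a plain Object stands in for the LargeServerMessageImpl monitor. The class and method names (LargeMessageDeadlockSketch, deadlocks, finishes) are hypothetical. The waiting thread uses a timeout only so the sketch can detect and escape the hang; in the report the real threads stay stuck.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LargeMessageDeadlockSketch {

    /**
     * Replays steps 1-4 with hypothetical stand-ins.
     * Returns true if the "delete" task did not finish within timeoutMs, i.e. the threads deadlocked.
     */
    public static boolean deadlocks(boolean holdLockWhileWaiting, long timeoutMs) throws Exception {
        final Object largeMessage = new Object(); // stand-in for the LargeServerMessageImpl monitor
        ExecutorService appendExecutor = Executors.newSingleThreadExecutor(); // stand-in for the ordered appendExecutor
        CountDownLatch addStarted = new CountDownLatch(1);
        CountDownLatch deleteQueued = new CountDownLatch(1);
        try {
            // Steps 1-2: the async "add" task is queued first and is delayed until the "delete" task is queued.
            appendExecutor.execute(() -> {
                addStarted.countDown();
                try {
                    deleteQueued.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // Step 4: the add task now needs the message monitor (cf. LargeServerMessageImpl:173).
                synchronized (largeMessage) {
                    // ... write the add record ...
                }
            });
            addStarted.await();

            Future<?> delete;
            // Step 3: the caller locks the message (cf. PostOfficeImpl:1305) and queues the delete behind the add.
            synchronized (largeMessage) {
                delete = appendExecutor.submit(() -> {
                    // ... write the delete record ...
                });
                deleteQueued.countDown();
                if (holdLockWhileWaiting) {
                    // Waiting while still holding the monitor: add waits for us, delete waits behind add.
                    return !finishes(delete, timeoutMs);
                }
            }
            // Hypothetical contrast case: the monitor is released before waiting, so no cycle forms.
            return !finishes(delete, timeoutMs);
        } finally {
            // By now the caller has left the synchronized block, so the add task can run and the executor drains.
            appendExecutor.shutdown();
            appendExecutor.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    /** Returns true if the task completes within timeoutMs. */
    private static boolean finishes(Future<?> task, long timeoutMs) throws Exception {
        try {
            task.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("lock held while waiting -> deadlock: " + deadlocks(true, 500));
        System.out.println("lock released before waiting -> deadlock: " + deadlocks(false, 500));
    }
}
```

Running main should report a deadlock only for the first call. The contrast case is just an illustration of what breaks the cycle in this toy model, not a proposed change to Artemis.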
Artemis version: 2.6.4
Spring Boot version: 2.1.5.RELEASE

Relevant Artemis settings (the rest are default):
durable: true
max-size-bytes: 1GB
address-full-policy: FAIL
journal-sync-non-transactional: false
journal-sync-transactional: false

We are in the process of analyzing the issue further, but wanted to report it as soon as possible so that someone else can also take a look. If you need any additional info, we will provide it.

Kind regards,
Mario