[ https://issues.apache.org/jira/browse/ARTEMIS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531770#comment-17531770 ]
David Bennion commented on ARTEMIS-3809:
----------------------------------------

I have still been thinking about this scenario, and your explanation of a resolution to make this system more robust makes total sense. A robust delivery system that times out is really important, and the possibility of sending a single packet and then vanishing seems like a plausible edge case. So that is all good.

The piece of my situation that still doesn't make complete sense to me, though, is that all of these messages occur within a single JVM using the InVM transporter. I don't (that I know of) have any error in the log that indicates something went wrong. So how did I arrive at the point where a single packet of a large message made it through and the send of the rest of it was abandoned without a trace?

With the fix that you are proposing (which I believe is a correct and valuable fix), would it not be true that my situation would simply get a failed message delivery for that message and continue on past it?

> LargeMessageControllerImpl hangs the message consume
> -----------------------------------------------------
>
>                 Key: ARTEMIS-3809
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3809
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.21.0
>        Environment: OS: Windows Server 2019
>                      JVM: OpenJDK 64-Bit Server VM Temurin-17.0.1+12
>                      Max Memory (-Xmx): 6GB
>                      Allocated to JVM: 4.168GB
>                      Currently in use: 3.398GB (heap 3.391GB, non-heap 0.123GB)
>            Reporter: David Bennion
>            Priority: Major
>              Labels: test-stability
>         Attachments: image-2022-05-03-10-51-46-872.png
>
> I wondered if this might be a recurrence of issue ARTEMIS-2293, but this happens on 2.21.0 and I can see the code change in LargeMessageControllerImpl.
>
> Using the default min-large-message-size of 100K.
>
> Many messages are passing through the broker when this happens. I would anticipate that most of the messages are smaller than 100K, but clearly some of them must exceed it. After some number of messages, a particular consumer ceases to consume messages.
>
> After the system became "hung" I was able to get a stack trace, and I could identify that the system is stuck in an Object.wait() for a notify that appears to never come.
>
> Here is the trace I was able to capture:
> {code:java}
> Thread-2 (ActiveMQ-client-global-threads) id=78 state=TIMED_WAITING
>     - waiting on <0x43523a75> (a org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
>     - locked <0x43523a75> (a org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
>     at java.base@17.0.1/java.lang.Object.wait(Native Method)
>     at org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.waitCompletion(LargeMessageControllerImpl.java:294)
>     at org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.saveBuffer(LargeMessageControllerImpl.java:268)
>     at org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.checkBuffer(ClientLargeMessageImpl.java:157)
>     at org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.getBodyBuffer(ClientLargeMessageImpl.java:89)
>     at mypackage.MessageListener.handleMessage(MessageListener.java:46)
> {code}
>
> The app can run either as a single node using the InVM transporter or as a cluster using TCP. To my knowledge, I have only seen this issue occur on InVM.
> I am not an expert in this code, but I can tell from the call stack that 0 must be the value of timeWait passed into waitCompletion(). But from what I can discern of the code changes in 2.21.0, it should be adjusting the readTimeout to the timeout of the message (I think?) so that the read eventually gives up rather than remaining blocked forever.
>
> We have persistenceEnabled = false, which leads me to believe that the only disk activity for messages should be related to large messages(?).
>
> On a machine and context where this was consistently happening, I adjusted the min-large-message-size upwards and the problem went away (a sketch of that client-side adjustment is included below). This makes sense for my application, but ultimately, if a message crosses the threshold to become large, it appears to hang the consumer indefinitely.
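As a sketch of the workaround described in the report (raising the large-message threshold on the client and bounding the receive so a consumer thread is never parked indefinitely), the following uses the Artemis core client API. The InVM URL vm://0, the queue name example.queue, the 1 MiB threshold, and the 30s receive timeout are assumptions for illustration only, not values taken from the report:

{code:java}
import org.apache.activemq.artemis.api.core.client.ActiveMQClient;
import org.apache.activemq.artemis.api.core.client.ClientConsumer;
import org.apache.activemq.artemis.api.core.client.ClientMessage;
import org.apache.activemq.artemis.api.core.client.ClientSession;
import org.apache.activemq.artemis.api.core.client.ClientSessionFactory;
import org.apache.activemq.artemis.api.core.client.ServerLocator;

public class LargeMessageWorkaroundSketch {

   public static void main(String[] args) throws Exception {
      // Raise the large-message threshold so payloads that previously crossed the
      // default 100K boundary are sent as regular messages (a workaround, not a fix).
      ServerLocator locator = ActiveMQClient.createServerLocator("vm://0")
            .setMinLargeMessageSize(1024 * 1024); // example value: 1 MiB

      ClientSessionFactory factory = locator.createSessionFactory();
      ClientSession session = factory.createSession();
      try {
         session.start();
         ClientConsumer consumer = session.createConsumer("example.queue"); // hypothetical queue
         // Bounded receive: the calling thread gives up after 30s instead of waiting forever.
         ClientMessage message = consumer.receive(30_000);
         if (message != null) {
            int bodyBytes = message.getBodyBuffer().readableBytes();
            System.out.println("received " + bodyBytes + " body bytes");
            message.acknowledge();
         }
      } finally {
         session.close();
         factory.close();
         locator.close();
      }
   }
}
{code}

Note that the receive timeout only bounds how long the consumer waits for a message to arrive; the hang reported here happens later, inside getBodyBuffer() on a large message, which is why raising the threshold merely sidesteps the problem for payloads below the new limit.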
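For the fix being discussed (having the pending read give up after a timeout rather than blocking forever), the shape of the behavior is a deadline-bounded wait. The following is a plain-Java illustration of that pattern only; it is not the actual LargeMessageControllerImpl code, and the class and method names are made up:

{code:java}
// Illustrative only: a deadline-bounded wait, the behavior the reporter
// expected from waitCompletion(). All names here are hypothetical.
public class BoundedCompletionWait {

   private boolean complete;

   public synchronized void markComplete() {
      complete = true;
      notifyAll();
   }

   /**
    * Waits up to readTimeoutMillis for completion instead of waiting forever.
    * Returns false if the deadline passes, so the caller can fail the message
    * delivery and move on rather than hanging the consumer thread.
    */
   public synchronized boolean waitCompletion(long readTimeoutMillis) throws InterruptedException {
      long deadline = System.currentTimeMillis() + readTimeoutMillis;
      while (!complete) {
         long remaining = deadline - System.currentTimeMillis();
         if (remaining <= 0) {
            return false;        // give up: the remaining packets never arrived in time
         }
         wait(remaining);        // woken early by markComplete(), or spuriously
      }
      return true;
   }
}
{code}

Under that pattern, a consumer in the situation described above would get a timed-out (failed) delivery for the one message whose remaining packets never arrived, instead of an indefinitely parked thread, which is what the comment above is asking about.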