[ 
https://issues.apache.org/jira/browse/ARTEMIS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531018#comment-17531018
 ] 

Justin Bertram commented on ARTEMIS-3809:
-----------------------------------------

bq. And from reading the code, it looks like a NEW instance of 
LargeMessageControllerImpl is instantiated for each large message as it is 
being received.  I therefore conclude that this thread remained blocked on the 
same message well beyond any kind of reasonable timeout.

Since the {{wait}} happens in a {{while}} loop, the thread could be looping on the same 
{{LargeMessageControllerImpl}} instance. So the lock ID by itself is not 
sufficient to say that execution of that thread has not proceeded in any way 
between thread dumps.
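
To illustrate what I mean, here is a minimal sketch of that guarded-wait pattern (illustrative only, not the actual {{LargeMessageControllerImpl}} source): the thread re-enters {{Object.wait()}} on the same monitor every iteration, so consecutive thread dumps can show the same lock ID even if the thread woke up and looped in between.

{code:java}
// Minimal sketch of a guarded wait loop (illustrative only, not the real
// LargeMessageControllerImpl): the waiter re-enters wait() on the same monitor
// until the condition is satisfied or the overall timeout elapses.
final class GuardedWaitSketch {

   private boolean complete;        // set once all packets have arrived

   synchronized void markComplete() {
      complete = true;
      notifyAll();                  // wakes the thread blocked in waitCompletion()
   }

   synchronized boolean waitCompletion(long timeoutMillis) throws InterruptedException {
      long deadline = System.currentTimeMillis() + timeoutMillis;
      while (!complete) {
         long remaining = deadline - System.currentTimeMillis();
         if (remaining <= 0) {
            return false;           // overall timeout exceeded: give up instead of waiting forever
         }
         wait(remaining);           // same instance, same monitor, every iteration
      }
      return true;
   }
}
{code}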

bq. The "blocking call timeout" settable via the serverLocator.setCallTimeout() 
method is at its default of 0 (at least I didn't set any value prior to this 
problem).

The default call timeout on the {{ServerLocator}} should be {{30000}} 
milliseconds. Therefore, if you are not setting it to {{0}}, I don't see how a 
value of {{0}} could be used here.
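
For reference, this is roughly how an explicit, non-zero call timeout is set on the locator. It's only a sketch: the {{vm://0}} URL and the class name are placeholders matching the InVM transport mentioned in the report.

{code:java}
import org.apache.activemq.artemis.api.core.client.ActiveMQClient;
import org.apache.activemq.artemis.api.core.client.ServerLocator;

public class CallTimeoutConfig {
   public static void main(String[] args) throws Exception {
      // "vm://0" assumes the InVM connector described in the report; use the
      // application's real connector URL here.
      ServerLocator locator = ActiveMQClient.createServerLocator("vm://0");

      // 30000 ms matches the default; setting it explicitly rules out an
      // accidental 0 coming from somewhere else in the application's config.
      locator.setCallTimeout(30000);

      System.out.println("callTimeout = " + locator.getCallTimeout());
      locator.close();
   }
}
{code}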

bq. I adjusted the org.apache.activemq.artemis to TRACE and did this run.  It 
produced a LOT of logging.

The main thing I'm interested in at this point is {{TRACE}} logging on 
{{org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl}}. 
This will show what packets are arriving at the client. Specifically, I want to 
see packets of the type {{SESS_RECEIVE_LARGE_MSG}} and 
{{SESS_RECEIVE_CONTINUATION}}.
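
One way to narrow the trace output to just that class is sketched below. It assumes the client routes JBoss Logging to {{java.util.logging}} (the fallback when no other logging backend is on the classpath); adjust to whatever logging framework your application actually uses, and note the class name here is made up.

{code:java}
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public final class RemotingTraceLogging {

   // Hold a strong reference so the JUL level configuration is not garbage collected.
   private static final Logger REMOTING_LOGGER = Logger.getLogger(
         "org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl");

   private RemotingTraceLogging() {
   }

   public static void enable() {
      ConsoleHandler handler = new ConsoleHandler();
      handler.setLevel(Level.FINEST);                 // FINEST is JUL's equivalent of TRACE
      REMOTING_LOGGER.setLevel(Level.FINEST);
      REMOTING_LOGGER.addHandler(handler);
      REMOTING_LOGGER.setUseParentHandlers(false);    // avoid duplicate lines from the root handler
   }
}
{code}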

Ultimately the best way to proceed is for you to provide a way to reproduce the 
problem. That way I can put a debugger on it and really see what's happening.
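
In case it helps, here is the shape of reproducer I have in mind. It's only a sketch: an embedded broker with just an InVM acceptor and persistence disabled, and core-API clients pushing payloads slightly above the default 100K min-large-message-size. The class and queue names are placeholders, and I have not verified that this exact loop triggers the hang.

{code:java}
import org.apache.activemq.artemis.api.core.QueueConfiguration;
import org.apache.activemq.artemis.api.core.RoutingType;
import org.apache.activemq.artemis.api.core.client.ActiveMQClient;
import org.apache.activemq.artemis.api.core.client.ClientConsumer;
import org.apache.activemq.artemis.api.core.client.ClientMessage;
import org.apache.activemq.artemis.api.core.client.ClientProducer;
import org.apache.activemq.artemis.api.core.client.ClientSession;
import org.apache.activemq.artemis.api.core.client.ClientSessionFactory;
import org.apache.activemq.artemis.api.core.client.ServerLocator;
import org.apache.activemq.artemis.core.config.impl.ConfigurationImpl;
import org.apache.activemq.artemis.core.server.embedded.EmbeddedActiveMQ;

public class LargeMessageRepro {
   public static void main(String[] args) throws Exception {
      // Embedded broker with only an InVM acceptor and persistence disabled,
      // mirroring the environment described in the report.
      EmbeddedActiveMQ broker = new EmbeddedActiveMQ();
      broker.setConfiguration(new ConfigurationImpl()
            .setPersistenceEnabled(false)
            .setSecurityEnabled(false)
            .addAcceptorConfiguration("invm", "vm://0"));
      broker.start();

      ServerLocator locator = ActiveMQClient.createServerLocator("vm://0");
      ClientSessionFactory factory = locator.createSessionFactory();
      ClientSession session = factory.createSession();
      session.createQueue(new QueueConfiguration("repro.queue")
            .setRoutingType(RoutingType.ANYCAST)
            .setDurable(false));

      ClientProducer producer = session.createProducer("repro.queue");
      ClientConsumer consumer = session.createConsumer("repro.queue");
      session.start();

      // Payload just above the default 100K min-large-message-size so every
      // message is received through LargeMessageControllerImpl.
      byte[] payload = new byte[150 * 1024];

      for (int i = 0; i < 100_000; i++) {
         ClientMessage message = session.createMessage(false);
         message.setRoutingType(RoutingType.ANYCAST);
         message.getBodyBuffer().writeBytes(payload);
         producer.send(message);

         ClientMessage received = consumer.receive(30_000);
         if (received == null) {
            System.out.println("Receive timed out at message " + i);
            break;
         }
         // getBodyBuffer() on a large message is the call that leads into
         // saveBuffer()/waitCompletion() in the reported stack trace.
         received.getBodyBuffer().readBytes(new byte[received.getBodyBuffer().readableBytes()]);
         received.acknowledge();
      }

      session.close();
      factory.close();
      locator.close();
      broker.stop();
   }
}
{code}

If something along these lines (or a trimmed-down version of your actual listener) reliably reproduces the hang, please attach it to the issue.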

> LargeMessageControllerImpl hangs the message consume
> ----------------------------------------------------
>
>                 Key: ARTEMIS-3809
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3809
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.21.0
>         Environment: OS: Windows Server 2019
> JVM: OpenJDK 64-Bit Server VM Temurin-17.0.1+12
> Max Memory (-Xmx): 6GB
> Allocated to JVM: 4.168GB
> Currently in use: 3.398GB  (heap 3.391GB, non-heap 0.123GB)
>            Reporter: David Bennion
>            Priority: Major
>              Labels: test-stability
>
> I wondered if this might be a recurrence of issue ARTEMIS-2293 but this 
> happens on 2.21.0 and I can see the code change in 
> LargeMessageControllerImpl.  
> Using the default min-large-message-size of 100K.
> Many messages are passing through the broker when this happens. I would 
> anticipate that most of the messages are smaller than 100K, but clearly some 
> of them must exceed it. After some number of messages, a particular consumer 
> ceases to consume messages.
> After the system became "hung" I was able to capture a stack trace and 
> identify that the system is stuck in an Object.wait() for a notify that 
> appears never to come.
> Here is the trace I was able to capture:
> {code:java}
> Thread-2 (ActiveMQ-client-global-threads) id=78 state=TIMED_WAITING
>     - waiting on <0x43523a75> (a org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
>     - locked <0x43523a75> (a org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
>     at java.base@17.0.1/java.lang.Object.wait(Native Method)
>     at org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.waitCompletion(LargeMessageControllerImpl.java:294)
>     at org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.saveBuffer(LargeMessageControllerImpl.java:268)
>     at org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.checkBuffer(ClientLargeMessageImpl.java:157)
>     at org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.getBodyBuffer(ClientLargeMessageImpl.java:89)
>     at mypackage.MessageListener.handleMessage(MessageListener.java:46)
> {code}
>  
> The app can run either as a single node using the InVM transport or as a 
> cluster using TCP. To my knowledge, I have only seen this issue occur with 
> InVM. 
> I am not an expert in this code, but I can tell from the call stack that 0 must 
> be the value of timeWait passed into waitCompletion(). But from what I can 
> discern of the code changes in 2.21.0, it should be adjusting the 
> readTimeout to the timeout of the message (I think?) such that the 
> read eventually gives up rather than remaining blocked forever.
> We have persistenceEnabled = false, which leads me to believe that the only 
> disk activity for messages should be related to large messages(?).
> On a machine and context where this was consistently happening, I adjusted 
> the min-large-message-size upwards and the problem went away. This makes 
> sense for my application, but ultimately if a message crosses the 
> threshold to become large, it appears to hang the consumer indefinitely. 


