[jira] [Updated] (ARTEMIS-4522) Slow-Consumer-Handling may get delayed significantly on occupied pool

Rico Neubauer (Jira) Fri, 01 Dec 2023 02:19:08 -0800


     [ 
https://issues.apache.org/jira/browse/ARTEMIS-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rico Neubauer updated ARTEMIS-4522:
-----------------------------------
    Description: 
Slack discussion with [~jbertram] see here:
[https://the-asf.slack.com/archives/CFL910J30/p1699520764243169]
h2. Initial Description

consumers-window-size set to 0 and there some queues, that are heavily used and 
have consumers attached, that allow for a concurrency of 500 messages (larger 
than thread-pool-client-size).
After some time - unclear what triggers it - the behavior becomes bad in the 
way that from those queues only 1-5 messages are fetched and immediately 
processed, but then there is a pause of 10 seconds before the next 1-5 messages 
are fetched. This continues.
In the thread-dumps, several like the following can be seen:
{noformat}
"Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 
cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 runnable 
 [0x00007f8fdbae5000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method)
        - parking to wait for  <0x00000003f6c9d1d0> (a 
org.apache.activemq.artemis.utils.AbstractLatch$CountSync)
        at 
java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074)
        at 
org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at 
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
        at 
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown
 Source)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635)
        at 
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
{noformat}
which corresponds to code location with a 10 seconds sleep here:
[https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854|https://github.com/seeburger-ag/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854]
It seems it always needs to wait until the 10 seconds are reached, which is 
unexpected, since the latch should count-down before, after having sent the 
credit to the server. We could assume some bug, that maybe increments the 
latch, but misses to decrement, so it is not able to reach 0 ever again, but 
that's speculative.
h2. Evaluation

See slack for details, it boils down to the thread-pool in use being out of 
free threads, so the slow-consumer-handling is not able to run in good time.

Affected versions: all since at least 2.22.0 up to current - probably since 
"ever", involved code derives from HornetQ.
h2. Possible Solutions

We experimented a bit and found a solution for us - or actually 2 variants.
 * No. 1 just uses an own single-threaded executor for flow-control, whose only 
job is to guarantee the credit was sent before continuing with other operations 
when starting/resetting slow-consumers.
This looks simple and works perfectly in the test-case as well as in the real 
product in stress-tests
[https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e]
 * No. 2 has a similar idea, but goes the full way of adding an own thread-pool 
for flow-control and adding it like the others.
This is more fitting into the style of the existing thread-pools and allows its 
configuration.
[https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0]

and [~jbertram] said:
{quote}I think either one would be fine. Create a Jira and send a PR.
{quote}
So here is the Jira and I will push No. 2 as a pull-request and update here.

  was:
Slack discussion with [~jbertram] see here:
[https://the-asf.slack.com/archives/CFL910J30/p1699520764243169]
h2. Initial Description

consumers-window-size set to 0 and there some queues, that are heavily used and 
have consumers attached, that allow for a concurrency of 500 messages (larger 
than thread-pool-client-size).
After some time - unclear what triggers it - the behavior becomes bad in the 
way that from those queues only 1-5 messages are fetched and immediately 
processed, but then there is a pause of 10 seconds before the next 1-5 messages 
are fetched. This continues.
In the thread-dumps, several like the following can be seen:
{noformat}
"Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 
cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 runnable 
 [0x00007f8fdbae5000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method)
        - parking to wait for  <0x00000003f6c9d1d0> (a 
org.apache.activemq.artemis.utils.AbstractLatch$CountSync)
        at 
java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074)
        at 
org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at 
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
        at 
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown
 Source)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635)
        at 
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
{noformat}
which corresponds to code location with a 10 seconds sleep here:
[https://github.com/seeburger-ag/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854]
It seems it always needs to wait until the 10 seconds are reached, which is 
unexpected, since the latch should count-down before, after having sent the 
credit to the server. We could assume some bug, that maybe increments the 
latch, but misses to decrement, so it is not able to reach 0 ever again, but 
that's speculative.
h2. Evaluation

See slack for details, it boils down to the thread-pool in use being out of 
free threads, so the slow-consumer-handling is not able to run in good time.

Affected versions: all since at least 2.22.0 up to current - probably since 
"ever", involved code derives from HornetQ.
h2. Possible Solutions

We experimented a bit and found a solution for us - or actually 2 variants.
 * No. 1 just uses an own single-threaded executor for flow-control, whose only 
job is to guarantee the credit was sent before continuing with other operations 
when starting/resetting slow-consumers.
This looks simple and works perfectly in the test-case as well as in the real 
product in stress-tests
[https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e]
 * No. 2 has a similar idea, but goes the full way of adding an own thread-pool 
for flow-control and adding it like the others.
This is more fitting into the style of the existing thread-pools and allows its 
configuration.
[https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0]

and [~jbertram] said:
{quote}I think either one would be fine. Create a Jira and send a PR.
{quote}
So here is the Jira and I will push No. 2 as a pull-request and update here.


> Slow-Consumer-Handling may get delayed significantly on occupied pool
> ---------------------------------------------------------------------
>
>                 Key: ARTEMIS-4522
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4522
>             Project: ActiveMQ Artemis
>          Issue Type: Improvement
>            Reporter: Rico Neubauer
>            Priority: Major
>
> Slack discussion with [~jbertram] see here:
> [https://the-asf.slack.com/archives/CFL910J30/p1699520764243169]
> h2. Initial Description
> consumers-window-size set to 0 and there some queues, that are heavily used 
> and have consumers attached, that allow for a concurrency of 500 messages 
> (larger than thread-pool-client-size).
> After some time - unclear what triggers it - the behavior becomes bad in the 
> way that from those queues only 1-5 messages are fetched and immediately 
> processed, but then there is a pause of 10 seconds before the next 1-5 
> messages are fetched. This continues.
> In the thread-dumps, several like the following can be seen:
> {noformat}
> "Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 
> cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 
> runnable  [0x00007f8fdbae5000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>       at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method)
>       - parking to wait for  <0x00000003f6c9d1d0> (a 
> org.apache.activemq.artemis.utils.AbstractLatch$CountSync)
>       at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074)
>       at 
> org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115)
>       at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869)
>       at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025)
>       at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154)
>       at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>       at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>       at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>       at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown
>  Source)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635)
>       at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> {noformat}
> which corresponds to code location with a 10 seconds sleep here:
> [https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854|https://github.com/seeburger-ag/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854]
> It seems it always needs to wait until the 10 seconds are reached, which is 
> unexpected, since the latch should count-down before, after having sent the 
> credit to the server. We could assume some bug, that maybe increments the 
> latch, but misses to decrement, so it is not able to reach 0 ever again, but 
> that's speculative.
> h2. Evaluation
> See slack for details, it boils down to the thread-pool in use being out of 
> free threads, so the slow-consumer-handling is not able to run in good time.
> Affected versions: all since at least 2.22.0 up to current - probably since 
> "ever", involved code derives from HornetQ.
> h2. Possible Solutions
> We experimented a bit and found a solution for us - or actually 2 variants.
>  * No. 1 just uses an own single-threaded executor for flow-control, whose 
> only job is to guarantee the credit was sent before continuing with other 
> operations when starting/resetting slow-consumers.
> This looks simple and works perfectly in the test-case as well as in the real 
> product in stress-tests
> [https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e]
>  * No. 2 has a similar idea, but goes the full way of adding an own 
> thread-pool for flow-control and adding it like the others.
> This is more fitting into the style of the existing thread-pools and allows 
> its configuration.
> [https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0]
> and [~jbertram] said:
> {quote}I think either one would be fine. Create a Jira and send a PR.
> {quote}
> So here is the Jira and I will push No. 2 as a pull-request and update here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (ARTEMIS-4522) Slow-Consumer-Handling may get delayed significantly on occupied pool

Reply via email to