[ https://issues.apache.org/jira/browse/ARTEMIS-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rico Neubauer updated ARTEMIS-4522: ----------------------------------- Description: Slack discussion with [~jbertram] see here: [https://the-asf.slack.com/archives/CFL910J30/p1699520764243169] h2. Initial Description consumers-window-size set to 0 and there some queues, that are heavily used and have consumers attached, that allow for a concurrency of 500 messages (larger than thread-pool-client-size). After some time - unclear what triggers it - the behavior becomes bad in the way that from those queues only 1-5 messages are fetched and immediately processed, but then there is a pause of 10 seconds before the next 1-5 messages are fetched. This continues. In the thread-dumps, several like the following can be seen: {noformat} "Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 runnable [0x00007f8fdbae5000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method) - parking to wait for <0x00000003f6c9d1d0> (a org.apache.activemq.artemis.utils.AbstractLatch$CountSync) at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074) at org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635) at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) {noformat} which corresponds to code location with a 10 seconds sleep here: [https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854|https://github.com/seeburger-ag/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854] It seems it always needs to wait until the 10 seconds are reached, which is unexpected, since the latch should count-down before, after having sent the credit to the server. We could assume some bug, that maybe increments the latch, but misses to decrement, so it is not able to reach 0 ever again, but that's speculative. h2. Evaluation See slack for details, it boils down to the thread-pool in use being out of free threads, so the slow-consumer-handling is not able to run in good time. Affected versions: all since at least 2.22.0 up to current - probably since "ever", involved code derives from HornetQ. h2. Possible Solutions We experimented a bit and found a solution for us - or actually 2 variants. * No. 1 just uses an own single-threaded executor for flow-control, whose only job is to guarantee the credit was sent before continuing with other operations when starting/resetting slow-consumers. This looks simple and works perfectly in the test-case as well as in the real product in stress-tests [https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e] * No. 2 has a similar idea, but goes the full way of adding an own thread-pool for flow-control and adding it like the others. This is more fitting into the style of the existing thread-pools and allows its configuration. [https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0] and [~jbertram] said: {quote}I think either one would be fine. Create a Jira and send a PR. {quote} So here is the Jira and I will push No. 2 as a pull-request and update here. was: Slack discussion with [~jbertram] see here: [https://the-asf.slack.com/archives/CFL910J30/p1699520764243169] h2. Initial Description consumers-window-size set to 0 and there some queues, that are heavily used and have consumers attached, that allow for a concurrency of 500 messages (larger than thread-pool-client-size). After some time - unclear what triggers it - the behavior becomes bad in the way that from those queues only 1-5 messages are fetched and immediately processed, but then there is a pause of 10 seconds before the next 1-5 messages are fetched. This continues. In the thread-dumps, several like the following can be seen: {noformat} "Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 runnable [0x00007f8fdbae5000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method) - parking to wait for <0x00000003f6c9d1d0> (a org.apache.activemq.artemis.utils.AbstractLatch$CountSync) at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074) at org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635) at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) {noformat} which corresponds to code location with a 10 seconds sleep here: [https://github.com/seeburger-ag/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854] It seems it always needs to wait until the 10 seconds are reached, which is unexpected, since the latch should count-down before, after having sent the credit to the server. We could assume some bug, that maybe increments the latch, but misses to decrement, so it is not able to reach 0 ever again, but that's speculative. h2. Evaluation See slack for details, it boils down to the thread-pool in use being out of free threads, so the slow-consumer-handling is not able to run in good time. Affected versions: all since at least 2.22.0 up to current - probably since "ever", involved code derives from HornetQ. h2. Possible Solutions We experimented a bit and found a solution for us - or actually 2 variants. * No. 1 just uses an own single-threaded executor for flow-control, whose only job is to guarantee the credit was sent before continuing with other operations when starting/resetting slow-consumers. This looks simple and works perfectly in the test-case as well as in the real product in stress-tests [https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e] * No. 2 has a similar idea, but goes the full way of adding an own thread-pool for flow-control and adding it like the others. This is more fitting into the style of the existing thread-pools and allows its configuration. [https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0] and [~jbertram] said: {quote}I think either one would be fine. Create a Jira and send a PR. {quote} So here is the Jira and I will push No. 2 as a pull-request and update here. > Slow-Consumer-Handling may get delayed significantly on occupied pool > --------------------------------------------------------------------- > > Key: ARTEMIS-4522 > URL: https://issues.apache.org/jira/browse/ARTEMIS-4522 > Project: ActiveMQ Artemis > Issue Type: Improvement > Reporter: Rico Neubauer > Priority: Major > > Slack discussion with [~jbertram] see here: > [https://the-asf.slack.com/archives/CFL910J30/p1699520764243169] > h2. Initial Description > consumers-window-size set to 0 and there some queues, that are heavily used > and have consumers attached, that allow for a concurrency of 500 messages > (larger than thread-pool-client-size). > After some time - unclear what triggers it - the behavior becomes bad in the > way that from those queues only 1-5 messages are fetched and immediately > processed, but then there is a pause of 10 seconds before the next 1-5 > messages are fetched. This continues. > In the thread-dumps, several like the following can be seen: > {noformat} > "Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 > cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 > runnable [0x00007f8fdbae5000] > java.lang.Thread.State: TIMED_WAITING (parking) > at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method) > - parking to wait for <0x00000003f6c9d1d0> (a > org.apache.activemq.artemis.utils.AbstractLatch$CountSync) > at > java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074) > at > org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115) > at > org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869) > at > org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025) > at > org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154) > at > org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) > at > org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) > at > org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) > at > org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown > Source) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635) > at > org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) > {noformat} > which corresponds to code location with a 10 seconds sleep here: > [https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854|https://github.com/seeburger-ag/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854] > It seems it always needs to wait until the 10 seconds are reached, which is > unexpected, since the latch should count-down before, after having sent the > credit to the server. We could assume some bug, that maybe increments the > latch, but misses to decrement, so it is not able to reach 0 ever again, but > that's speculative. > h2. Evaluation > See slack for details, it boils down to the thread-pool in use being out of > free threads, so the slow-consumer-handling is not able to run in good time. > Affected versions: all since at least 2.22.0 up to current - probably since > "ever", involved code derives from HornetQ. > h2. Possible Solutions > We experimented a bit and found a solution for us - or actually 2 variants. > * No. 1 just uses an own single-threaded executor for flow-control, whose > only job is to guarantee the credit was sent before continuing with other > operations when starting/resetting slow-consumers. > This looks simple and works perfectly in the test-case as well as in the real > product in stress-tests > [https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e] > * No. 2 has a similar idea, but goes the full way of adding an own > thread-pool for flow-control and adding it like the others. > This is more fitting into the style of the existing thread-pools and allows > its configuration. > [https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0] > and [~jbertram] said: > {quote}I think either one would be fine. Create a Jira and send a PR. > {quote} > So here is the Jira and I will push No. 2 as a pull-request and update here. -- This message was sent by Atlassian Jira (v8.20.10#820010)