[ https://issues.apache.org/jira/browse/AMQ-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900078#comment-17900078 ]
ASF subversion and git services commented on AMQ-9625:
------------------------------------------------------

Commit e33f44a7b967dcf7001988b7b15c7ff567170b5b in activemq's branch refs/heads/activemq-5.18.x from Christopher L. Shannon
[ https://gitbox.apache.org/repos/asf?p=activemq.git;h=e33f44a7b9 ]

AMQ-9625 - Prevent queue messages from becoming stuck

Fixes a race condition bug that can lead to a message being missed on dispatch and stuck on a Queue until restart when caching and concurrentStoreAndDispatch are enabled on a Queue and the cache becomes disabled.

(cherry picked from commit 7f218fe05d67117573329593a712ec420532810d)

> Messages can become stuck on Queues
> -----------------------------------
>
>                 Key: AMQ-9625
>                 URL: https://issues.apache.org/jira/browse/AMQ-9625
>             Project: ActiveMQ Classic
>          Issue Type: Bug
>    Affects Versions: 5.18.6, 6.1.4
>            Reporter: Christopher L. Shannon
>            Assignee: Christopher L. Shannon
>            Priority: Major
>             Fix For: 6.2.0, 5.18.7, 6.1.5
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> For the last several years I have occasionally seen "stuck" messages on queues that will not be dispatched until after a broker restart. The bug looks to be the same as described in https://issues.apache.org/jira/browse/AMQ-2955 (the root cause was never figured out; the issue was simply closed because it couldn't be reproduced). The resulting behavior is that KahaDB has the batch cursor set to a point after a message that is stored, so that message will never dispatch. There has been some other work done previously to help with this, notably AMQ-3149 (more details in that issue and below).
>
> I recently figured out how to reproduce it in a test environment and finally tracked down the root cause and a fix. Not surprisingly, there are a few things at play here, and the bug is a race condition, so it won't be seen unless several things hold true (and the broker is configured a certain way).
>
> h3. Background:
>
> There are 2 optimizations in the broker that play into this, and both must be enabled for the issue to happen (see the configuration sketch after this list):
> # {{useCache=true}} - The normal flow for incoming messages is that they get written to the store and then paged off disk (on the same thread or another thread) to be dispatched to consumers. However, there is also a message cache: if it is enabled and there is free memory, the message is added to the cache after being sent to disk so we don't need to re-read it off disk later when dispatching.
> # {{concurrentStoreAndDispatchQueues=true}} - The broker also has an optimization for queues where it will try to dispatch incoming messages to consumers concurrently while also writing them to disk (async message writes). If the consumers are fast enough to ack, we can cancel the disk write, which saves disk IO and is an obvious benefit on slow disks. This requires the cache to be enabled, as described in AMQ-3149, otherwise we run into problems because messages won't be dispatched until the write is finished, so this mode is really only useful with the cache enabled. Furthermore, if the cache is not enabled, messages could get stuck if no new messages come in for a while, which is another reason this mode has no effect when the cache is off.
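For reference, a minimal embedded-broker sketch of how these two options are typically switched on, using the standard BrokerService, PolicyEntry and KahaDBPersistenceAdapter APIs (directory and connector values here are placeholders, not taken from the issue):

{code:java}
import org.apache.activemq.broker.BrokerService;
import org.apache.activemq.broker.region.policy.PolicyEntry;
import org.apache.activemq.broker.region.policy.PolicyMap;
import org.apache.activemq.store.kahadb.KahaDBPersistenceAdapter;

import java.io.File;

public class ConcurrentStoreAndDispatchExample {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();

        // Optimization 1: keep the message cache enabled for all queues.
        PolicyEntry queuePolicy = new PolicyEntry();
        queuePolicy.setQueue(">");          // wildcard: applies to every queue
        queuePolicy.setUseCache(true);
        PolicyMap policyMap = new PolicyMap();
        policyMap.setDefaultEntry(queuePolicy);
        broker.setDestinationPolicy(policyMap);

        // Optimization 2: concurrent (async) store and dispatch for queues in KahaDB.
        KahaDBPersistenceAdapter kahaDB = new KahaDBPersistenceAdapter();
        kahaDB.setDirectory(new File("target/kahadb"));
        kahaDB.setConcurrentStoreAndDispatchQueues(true);
        broker.setPersistenceAdapter(kahaDB);

        broker.addConnector("tcp://localhost:61616");
        broker.start();
    }
}
{code}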
> The two settings work together, and in practice the flow ends up being that the message is submitted to the store as part of an async task that is queued up in the background by the store. While the task is in the queue, the message is concurrently added to the in-memory cache and the broker proceeds to dispatch it to consumers, who may or may not acknowledge it before the disk write is finished if they are fast and keeping up. Messages that were already written are removed as normal, but if the async task has not finished yet it gets cancelled, which saves a disk write.
>
> h3. Bug description:
>
> When the broker runs out of memory to cache messages, the cache has to be [disabled|https://github.com/apache/activemq/blob/3400983a22284a28a8989d4b0aaf762090b0911a/activemq-broker/src/main/java/org/apache/activemq/broker/region/cursors/AbstractStoreCursor.java#L258]. As part of this process the cache has to [tell|https://github.com/apache/activemq/blob/3400983a22284a28a8989d4b0aaf762090b0911a/activemq-broker/src/main/java/org/apache/activemq/broker/region/cursors/AbstractStoreCursor.java#L336] the store what the last cached message was, so that when the cache is exhausted we can resume paging off disk and dispatching at the correct spot.
>
> The process for disabling the cache starts when a new incoming message is about to be added to the cache and it is found that memory is full. When this happens, the disabling and syncing to the store begins: the cache makes sure any previously cached messages whose writes may still be pending are completed (either acked and cancelled, or written to disk and completed), and after that it tells the store where to resume, which is after the last cached message. Once the cache is disabled, new writes should [no longer|https://github.com/apache/activemq/blob/3400983a22284a28a8989d4b0aaf762090b0911a/activemq-broker/src/main/java/org/apache/activemq/broker/region/Queue.java#L895] be async (AMQ-3149), because we need the messages written to disk before they can be dispatched, to prevent stuck/delayed delivery.
>
> In theory, because the store was told the last cached message, the new incoming message that triggered the disabling/sync would eventually be paged off disk and dispatched. However, there is a race condition: sometimes the new incoming message has not completed its write to the store by the time the queue goes to fill the next batch to dispatch (because the write was still async), so it gets missed while still pending. Normally, as described earlier, once the cache is disabled all new store writes should be sync rather than async, but in this case the message that triggered the disabling/sync was still submitted as async and yet never cached. Furthermore, if there are several producers, there can be multiple in-flight async messages that were submitted before the cache was disabled; by the time they reach the cache it is already seen as disabled, so they are not added either.
>
> The end result is that if the consumers and producers are very fast, you might have one or more pending async writes that are skipped by the cache because it was disabled, so you get into a situation similar to the one described in AMQ-3149: when the cursor goes to page off the next messages it may not see the pending write(s) that haven't finished (and were not added to the cache), so those messages are skipped. By the time the incoming message has finished writing to the store, the disk cursor has moved past it, and the message is skipped and stays stuck until the broker restarts and resets the batch.
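To make the interleaving concrete, here is a small self-contained toy model of the race. This is not ActiveMQ code; the "disk" map, sequence numbers and batch pointer are invented stand-ins for the KahaDB store and cursor batch:

{code:java}
import java.util.SortedMap;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical toy model of the race described above -- not ActiveMQ classes.
public class StuckMessageRace {
    static final SortedMap<Long, String> disk = new ConcurrentSkipListMap<>(); // the "store"
    static final AtomicLong batch = new AtomicLong();                          // cursor resume point

    public static void main(String[] args) throws Exception {
        ExecutorService asyncStoreTasks = Executors.newSingleThreadExecutor();

        // seq 1: async write completes and the message is cached.
        asyncStoreTasks.submit(() -> disk.put(1L, "msg-1")).get();

        // seq 2: submitted async *before* the cache is disabled; write still in flight.
        Future<?> pendingWrite = asyncStoreTasks.submit(() -> {
            sleep(100);                      // simulate a slow disk write
            disk.put(2L, "msg-2");
        });

        // Cache fills up: it is disabled and the store is told to resume after the
        // last cached message (seq 1), while msg-2's write is still pending.
        batch.set(1L);

        // seq 3: arrives after the cache is disabled, so it is written synchronously.
        disk.put(3L, "msg-3");

        // The cursor fills the next batch: msg-2 is not on disk yet, so only msg-3
        // is seen and the batch pointer advances past seq 2.
        SortedMap<Long, String> nextBatch = disk.tailMap(batch.get() + 1);
        System.out.println("dispatched: " + nextBatch.values());   // [msg-3]
        batch.set(nextBatch.lastKey());                            // batch is now 3

        pendingWrite.get();                  // msg-2 finally lands on disk...
        // ...but the cursor will only ask for seq > 3 from now on, so msg-2 is "stuck".
        System.out.println("next batch: " + disk.tailMap(batch.get() + 1).values()); // []
        asyncStoreTasks.shutdown();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}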
> This can happen each time the cache is disabled and re-enabled, which is why you might see one stuck message, or several if it happens repeatedly.
>
> h3. Solution:
>
> The solution is actually rather simple and only a couple of lines of code. When the cache is disabled, we just need to ensure that any messages that are in flight are treated as sync rather than async, so they do not get missed, since they will no longer be added to the in-memory cache.
>
> When the message being added to the cache triggers the cache to disable, we just need to wait for its async task to finish before we set the batch on the store, so that it will be visible when reading in the next batch and paging off disk. Also, lots of fast producers could cause other messages to be submitted as async just before the cache was disabled; by the time they reach the cache they get skipped, so we need to wait on those as well.
>
> Writing a unit test for this would be very difficult due to the race conditions and async work involved. However, these small changes are very simple and low risk, as the only thing happening is that we wait on the async writes to finish (converting them to sync) during the small window when the cache is toggled off, which prevents the bug.
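The actual change is in the commit linked above; as an illustration only (invented names, not the real AbstractStoreCursor/Queue code), the ordering the fix enforces looks roughly like this:

{code:java}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the ordering the fix enforces -- not the real patch.
public class CacheDisableSync {
    /** Wait for any in-flight async store write before recording where the cursor resumes. */
    static void disableCacheAndSetBatch(Future<?> inFlightAsyncWrite, AtomicLong batch, long lastCachedSeq)
            throws ExecutionException, InterruptedException {
        inFlightAsyncWrite.get();   // effectively converts the pending async write into a sync write
        batch.set(lastCachedSeq);   // only now tell the store where to resume paging off disk
    }
}
{code}

In terms of the toy model earlier, waiting on the pending write before setting the batch means msg-2 is already on disk when the next batch is filled, so it is dispatched instead of being skipped.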