Thanks for the detailed response. I generally agree that it's not flawed and that it is likely my configuration. I'm taking steps to track down the cause, but of course it is behaving now.
I do think some defensive behaviour to protect the broker from shutting down, regardless of configuration, would be a good idea. Is there some system in place that would alert me to potentially catastrophic issues before they happen?

On 21 October 2017 at 14:08, Tim Bain <tb...@alumni.duke.edu> wrote:
> Responses inline.
>
> On Fri, Oct 20, 2017 at 5:46 AM, Lionel van den Berg <lion...@gmail.com>
> wrote:
>
> > Hi, thanks for the response.
> >
> > Some questions on these points from the troubleshooting page.
> >
> > 1. *It contains a pending message for a destination or durable topic
> > subscription*
> >
> > This seems a little flawed: if a consumer that I have little control
> > over is misbehaving, then my ActiveMQ broker can end up shut down and
> > unrecoverable. Is there some way of timing this out, or similar?
>
> There are multiple ways of discarding messages that are not being
> consumed, which are detailed at
> http://activemq.apache.org/slow-consumer-handling.html (several of which
> it sounds like you're already using). Keep in mind that unconsumed DLQ
> messages are unconsumed messages, so you'll want to make sure you address
> those as well;
> http://activemq.apache.org/message-redelivery-and-dlq-handling.html
> contains additional information about handling messages in the context
> of the DLQ. And no, I wouldn't say it's flawed; it just means there's
> some configuration work that you haven't yet done.
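> As a rough illustration (not a drop-in recommendation: the limit, the
> expiration, and the catch-all topic selector below are placeholders, so
> check the two pages above for the options that actually fit your setup),
> a policy along these lines inside the <broker> element of activemq.xml
> discards messages for slow non-durable topic subscribers and puts an
> expiry on DLQ messages so that they can't pin KahaDB data files forever:
>
>   <destinationPolicy>
>     <policyMap>
>       <policyEntries>
>         <policyEntry topic=">">
>           <!-- Keep at most 1000 pending messages per slow non-durable
>                topic subscriber; older messages are discarded. -->
>           <pendingMessageLimitStrategy>
>             <constantPendingMessageLimitStrategy limit="1000"/>
>           </pendingMessageLimitStrategy>
>           <deadLetterStrategy>
>             <!-- Don't send already-expired messages to the DLQ, and
>                  expire DLQ'ed messages after 7 days (in ms). -->
>             <sharedDeadLetterStrategy processExpired="false"
>                                       expiration="604800000"/>
>           </deadLetterStrategy>
>         </policyEntry>
>       </policyEntries>
>     </policyMap>
>   </destinationPolicy>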
> > *2. It contains an ack for a message which is in an in-use data file -
> > the ack cannot be removed as a recovery would then mark the message
> > for redelivery*
> >
> > Same comment as 1.
>
> Same response as for #1. There's one additional wrinkle (KahaDB keeps an
> entire data file alive because of a single live message, which in turn
> means the acks for later messages in later data files have to be kept as
> well), but that's been partially mitigated by the addition of the
> ability to compact acks by replaying them into the current data file,
> which should allow any data file that contains no live non-ack messages
> to be GC'ed (a sketch of the relevant KahaDB attributes is at the end of
> this message). So there's a small portion of this that's purely the
> result of KahaDB's design as a non-compacting data store, but it's a
> problem only when there's an old unacknowledged message, which takes us
> back to #1.
>
> > *3. The journal references a pending transaction*
> >
> > I'm not using transactions, but are there transactions under the hood?
>
> No, this would apply only if you were directly using transactions, so it
> doesn't apply to you.
>
> > *4. It is a journal file, and there may be a pending write to it*
> >
> > Why would this be the case?
>
> This happens if we haven't finished flushing the file, since we use a
> buffer-then-flush paradigm. It will be an infrequent situation and
> should affect only a small number of data files, so if you're having a
> problem with the number of files kept, it's not because of this. It's
> just included in the list for completeness.
>
> > I'll see if I can change the logging settings, since on the first
> > occurrence the number of log files does not seem to have been an
> > issue. I have the broker configured to keep messages for 7 days, so
> > regardless of the above conditions I would have thought that at that
> > expiry the log would be cleaned up, so we don't end up in a situation
> > where the system stops and cannot restart.
>
> If you are indeed configured as you describe, I would expect log cleanup
> to happen the way you expect, which means that either there's an
> undiscovered bug in our code or you're not configured the way you think
> you are.
>
> The page I linked to originally has instructions for determining which
> destinations have messages that are preventing the KahaDB data files
> from being deleted, which might let you investigate further (for
> example, by looking at those messages and their attributes to see
> whether timestamps are being set correctly).
>
> Tim
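> P.S. For reference, the determination technique on that page boils down
> to enabling TRACE logging for KahaDB's cleanup code. A minimal sketch,
> assuming the stock conf/log4j.properties that ships with ActiveMQ 5.x:
>
>   # Log, at each cleanup cycle, which data files can't be deleted and
>   # which destinations are responsible for keeping them alive.
>   log4j.logger.org.apache.activemq.store.kahadb.MessageDatabase=TRACE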
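> And here is the ack-compaction sketch I mentioned under #2. Ack
> compaction shipped in ActiveMQ 5.14, and the attribute names below come
> from the KahaDB configuration reference, so treat the values as
> illustrative and check them against your broker version:
>
>   <persistenceAdapter>
>     <!-- Replay old acks forward into the current journal file so that
>          data files containing nothing but acks can be GC'ed. The
>          counter is the number of cleanup cycles with no file GC'ed
>          before compaction kicks in. -->
>     <kahaDB directory="${activemq.data}/kahadb"
>             enableAckCompaction="true"
>             compactAcksAfterNoGC="10"/>
>   </persistenceAdapter>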