I gave the system the weekend to see if it could recover automatically... alas, that was not the case.
However, I was able to recover manually: I stopped all brokers, changed one broker's configuration to replicas="1" (rather than 3), started it (effectively in single-node mode), then started the other brokers one by one so they could catch up and stopped them again. Finally I restarted the master broker with replicas set back to 3 and started the backup brokers as well. The brokers are now back online in a consistent state.

There are some issues, though. Since the broker that is now primary is the one that apparently failed earlier, there are now messages on the queues that (I think) should already have been consumed. So, would it be possible for the brokers to try to heal themselves? (With warnings and errors in the logs, so the administrator knows to validate the situation; remaining messages whose 'exactly once' status is uncertain should be moved to the DLQ.)

Going through the logs I noticed a few things. For simplicity I will call the HA brokers 'broker1', 'broker2' and 'broker3', although they are all named 'broker1'. :-) I see that broker1 was online and accepting messages when the test started; however, it missed some ZooKeeper pings and got demoted to slave, and broker2 became the new master. After the test, all brokers were restarted, raising the issue described in my previous post.

This morning I "recovered" broker1 (I now realize I should have done this on the last known master, broker2; alas...). There are now many messages that I _think_ have already been processed (there are a bit too many of them, and the test is too generic, to determine this easily).

--
View this message in context: http://activemq.2283324.n4.nabble.com/ActiveMQ-ReplicatedLevelDB-corruption-tp4716831p4716912.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
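For reference, the replicas setting I toggled lives in the replicatedLevelDB persistence adapter in activemq.xml. A minimal sketch of what that section looks like (the ZooKeeper addresses, hostname and directory here are placeholders, not my actual config):

```xml
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="broker1">
  <persistenceAdapter>
    <!-- replicas="3" is the normal HA setting; I temporarily set it to
         "1" on one broker so it could become master on its own, let the
         others sync, then restored "3" before the final restart. -->
    <replicatedLevelDB
        directory="activemq-data"
        replicas="3"
        bind="tcp://0.0.0.0:0"
        zkAddress="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"
        zkPath="/activemq/leveldb-stores"
        hostname="broker1.example.com"/>
  </persistenceAdapter>
</broker>
```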