I gave the system the weekend to see if it could recover automatically...
alas, that was not the case.

However, I then stopped all brokers, changed one broker's configuration to
'replicas="1"' (rather than 3), and started it (effectively in single-node
mode). I started the other brokers one by one so they could catch up, then
stopped them again. Finally I restarted the master broker, now with replicas
set back to 3, and the backup brokers started up as well...
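For reference, the change I made was in the broker's persistence-adapter
configuration in activemq.xml. This is a sketch only; the directory,
bind address, and ZooKeeper addresses below are placeholders for whatever
your actual setup uses:

```xml
<persistenceAdapter>
  <!-- temporarily set replicas="1" so a single broker can become master
       without waiting for a quorum; restore replicas="3" afterwards -->
  <replicatedLevelDB
      directory="${activemq.data}/leveldb"
      replicas="1"
      bind="tcp://0.0.0.0:61619"
      zkAddress="zk1:2181,zk2:2181,zk3:2181"
      zkPath="/activemq/leveldb-stores"/>
</persistenceAdapter>
```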

Now the brokers are back online in a consistent state. There are some issues
though...

As the broker that is now primary is one that apparently failed earlier,
there are now messages on the queues that should have been consumed (I
think). So, would it be possible for the brokers to try to heal themselves
(with warnings and errors in the logs, so the administrator knows to
validate the situation)? Remaining messages whose delivery is uncertain
with respect to 'exactly once' semantics should be moved to the DLQ.

Going through the logs I noticed a few things; for simplicity I will call
the HA brokers 'broker1, broker2 and broker3', although they're all
'broker1' :-)

I see that broker1 was online and accepting messages when the test started.
However, it missed some ZooKeeper pings and got demoted to slave, and
broker2 became the new master. After the test, all brokers were restarted,
which raised the issue described in my previous post. This morning I
"recovered" broker1 (I now realize I should have done this on the last
known master, broker2; alas...). There are now many messages that I _think_
have already been processed (there are a bit too many of them, and the test
is too generic, to determine this easily).



--
View this message in context: 
http://activemq.2283324.n4.nabble.com/ActiveMQ-ReplicatedLevelDB-corruption-tp4716831p4716912.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
