I suppose the problem could be in zkHelper.copyQueuesFromRSUsingMulti(rsZnode) 
as called from ReplicationSourceManager.NodeFailoverWorker.run().
copyQueuesFromRSUsingMulti will return the queues it read even when the multi 
operation failed (because another RS managed to execute it first).
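
A rough sketch of the suspected race, using a hypothetical in-memory stand-in 
rather than the real ZooKeeper or HBase classes. QueueClaimSketch, claim, 
raceWindow and honorFailure are made-up names for illustration only; this is 
not the actual copyQueuesFromRSUsingMulti code, just the pattern described 
above (read the queues, attempt the atomic move, return what was read):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model: dead RS znode -> (queue id -> HLog names).
final class QueueClaimSketch {
    private final Map<String, SortedMap<String, SortedSet<String>>> queuesByRs =
        new HashMap<>();

    void addQueue(String rs, String queueId, String... hlogs) {
        SortedSet<String> logs = new TreeSet<>();
        for (String h : hlogs) {
            logs.add(h);
        }
        queuesByRs.computeIfAbsent(rs, k -> new TreeMap<>()).put(queueId, logs);
    }

    // Snapshot of the dead RS's queues (empty if nothing is there).
    private SortedMap<String, SortedSet<String>> read(String deadRs) {
        SortedMap<String, SortedSet<String>> q = queuesByRs.get(deadRs);
        SortedMap<String, SortedSet<String>> copy = new TreeMap<>();
        if (q != null) {
            copy.putAll(q);
        }
        return copy;
    }

    // Stand-in for the multi: succeeds only for the first caller.
    private boolean moveAtomically(String deadRs) {
        return queuesByRs.remove(deadRs) != null;
    }

    // raceWindow simulates another RS acting between our read and our move.
    // honorFailure = false models the suspected bug; true models the fix.
    SortedMap<String, SortedSet<String>> claim(
            String deadRs, Runnable raceWindow, boolean honorFailure) {
        SortedMap<String, SortedSet<String>> snapshot = read(deadRs);
        raceWindow.run();
        boolean moved = moveAtomically(deadRs);
        if (!moved && honorFailure) {
            return new TreeMap<>(); // lost the race: claim nothing
        }
        return snapshot; // suspected bug: returned even when the move failed
    }
}
```

With honorFailure set to false, an RS that loses the race still gets a 
non-empty result and would go on to process queues it never claimed; with 
true, it gets an empty map.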

-- Lars



________________________________
 From: lars hofhansl <la...@apache.org>
To: hbase-dev <dev@hbase.apache.org> 
Sent: Wednesday, March 13, 2013 6:12 PM
Subject: Replication hosed after simple cluster restart
 
We just ran into an interesting scenario. We restarted a cluster that was set up 
as a replication source.
The stop went cleanly.

Upon restart *all* regionservers aborted within a few seconds with variations 
of these errors:
http://pastebin.com/3iQVuBqS

This is scary!

-- Lars
