Also, what version of HBase are you running?
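
While comparing the ZK listing against the exception, it may help to split the failing znode path from the quoted stack trace into its parts. Below is a minimal sketch, not HBase code. It assumes the layout /hbase/replication/rs/<owner>/<queue>/<hlog>, and it assumes the queue name is the peer id followed by the names of the region servers the queue was previously recovered from. The function name parse_replication_znode and the returned keys are hypothetical, chosen only for illustration.

```python
import re
from urllib.parse import unquote

# One "host,port,startcode" server name; the host part may itself contain '-'.
SERVER_RE = re.compile(r"([^,/]+?),(\d+),(\d+)(?:-|$)")


def parse_replication_znode(path):
    """Split a replication znode path into owner, peer id, prior owners and HLog.

    Assumed layout (not confirmed against any HBase version):
    /hbase/replication/rs/<owner>/<peer>-<server>-<server>.../<hlog>
    """
    parts = path.strip("/").split("/")
    owner, queue, hlog = parts[3], parts[4], parts[5]
    # Everything before the first '-' is taken to be the peer id.
    peer_id, _, rest = queue.partition("-")
    recovered_from = [",".join(m.groups()) for m in SERVER_RE.finditer(rest)]
    return {
        "owner": owner,
        "peer_id": peer_id,
        "recovered_from": recovered_from,
        # The HLog name is URL-encoded in the znode name (%2C is a comma).
        "hlog": unquote(hlog),
    }


if __name__ == "__main__":
    failing = (
        "/hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/"
        "1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/"
        "va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719"
    )
    for key, value in parse_replication_znode(failing).items():
        print(key, "=", value)
```

Under those assumptions, the path in the exception names a queue for peer 1 that is currently held by the 01 server started at 1369233253404 and carries the names of an earlier 01 instance and of 02.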

On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <va...@pinterest.com> wrote:

> Basically,
>
> You had va-p-hbase-02 crash - that caused all the replication-related data
> in ZooKeeper to be moved to va-p-hbase-01, which then took over
> replicating 02's logs. Now, each region server also maintains an in-memory
> copy of what's in ZK. It seems that when you start up 01, it's trying to
> replicate the 02 logs underneath but failing because that data is
> not in ZK. This is somewhat weird...
>
> Can you open the ZooKeeper shell and run
>
> ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>
> and give the output?
>
>
> On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com <
> amit.mor.m...@gmail.com> wrote:
>
>> Hi,
>>
>> This is bad ... and it has happened twice: I had taken my replication-slave
>> cluster offline. I performed quite a massive Merge operation on it, and
>> after a couple of hours it finished and I brought it back online. At the
>> same time, the replication-master RS machines crashed (see first crash
>> http://pastebin.com/1msNZ2tH), with the first exception being:
>>
>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for
>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
>>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
>>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
>>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
>>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
>>
>> Before restarting the crashed RSs, I issued a 'stop_replication'
>> command, then fired up the RSs again. They started OK, but once I ran
>> 'start_replication' they crashed again. The second crash log
>> http://pastebin.com/8Nb5epJJ has the same initial exception
>> (org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode). I've started the crashed region servers again
>> without replication and currently all is well, but I need to start
>> replication ASAP.
>>
>> Does anyone have an idea what's going on and how I can solve it?
>>
>> Thanks,
>> Amit