Basically,

You had va-p-hbase-02 crash. That caused all of its replication-related
data in ZooKeeper to be moved to va-p-hbase-01, which took over
replicating 02's logs. Each region server also maintains an in-memory
copy of what's in ZK. It seems that when you start up 01, it tries to
replicate the 02 logs underneath, but fails because that data is not in
ZK. This is somewhat weird...

Can you open the ZooKeeper shell and run

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379

and send the output?
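
In case it helps when reading that output, here is a rough Python sketch
that splits the znode path from your exception into its parts. It assumes
(my reading of the path, not something checked against the HBase source)
that a recovered queue is named "<peer id>-<dead server>-<dead server>..."
and that every server name has the form host,port,startcode. The function
name parse_replication_znode is made up for this example.

```python
import re
from urllib.parse import unquote

# Znode path copied from the NoNodeException in the quoted message.
PATH = (
    "/hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/"
    "1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/"
    "va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719"
)


def parse_replication_znode(path):
    """Split .../rs/<owner>/<queue>/<hlog> into readable parts.

    Assumption: <queue> is "<peer id>-<server>-<server>...", where each
    server is "host,port,startcode". Host names may contain '-', so the
    ",digits,digits" suffix is used to find where each server name ends.
    """
    owner, queue, hlog = path.strip("/").split("/")[-3:]
    peer_id, _, rest = queue.partition("-")
    servers = re.findall(r"(.+?,\d+,\d+)(?:-|$)", rest)
    return {
        "owner": owner,             # region server that currently holds the queue
        "peer_id": peer_id,         # replication peer the queue ships to
        "previous_owners": servers, # servers named in the queue, in path order
        "hlog": unquote(hlog),      # log file name with %2C decoded to ','
    }


if __name__ == "__main__":
    for key, value in parse_replication_znode(PATH).items():
        print(key, "=", value)
```

The owner it prints carries start code 1369233253404, which differs from
the start code in the ls command above, so the two refer to different
incarnations of va-p-hbase-01.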


On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com <
amit.mor.m...@gmail.com> wrote:

> Hi,
>
> This is bad ... and it has happened twice: I took my replication-slave
> cluster offline and performed quite a massive Merge operation on it. After
> a couple of hours it finished and I brought the cluster back online. At the
> same time, the replication-master RS machines crashed (see first crash
> http://pastebin.com/1msNZ2tH), with the first exception being:
>
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
>
> Before restarting the crashed RSs, I issued a 'stop_replication'
> command, then fired up the RSs again. They started OK, but once I ran
> 'start_replication' they crashed again. The second crash log
> http://pastebin.com/8Nb5epJJ has the same initial exception
> (org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode). I've started the crashed region servers again
> without replication and currently all is well, but I need to start
> replication asap.
>
> Does anyone have an idea what's going on and how I can solve it?
>
> Thanks,
> Amit
>
