I found this:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 17] ls /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401
[1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-d,60020,1369042382584-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1,
 1-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-e,60020,1369233254969-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-02-d,60020,1369042368330-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-e,60020,1369042368595-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-c,60020,1369233253404-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-d,60020,1369233257617-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-02-c,60020,1369233268385-va-p-hbase-02-d,60020,1369233252475]
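
The queue names in the listing above are hard to read by eye, so here is a rough Python sketch that splits one name into the peer id and the chain of region servers the queue appears to have been inherited from. It assumes the znode name has the form <peerId>-<host,port,startcode>-<host,port,startcode>..., that the peer id itself contains no "-", and that the chain is ordered oldest owner first; that is my reading of the 0.94 naming, not something I have checked against the source. The function name parse_queue_znode is made up for this example.

```python
import re

# One server name looks like "host,port,startcode". Hostnames here contain
# hyphens, so we anchor on the commas instead of splitting on "-".
SERVER = re.compile(r"([^,]+),(\d+),(\d+)(?:-|$)")


def parse_queue_znode(name):
    """Split a replication queue znode name into (peer_id, [servers]).

    Assumption: the name is "<peerId>" for a server's own queue, or
    "<peerId>-<server>-<server>..." for a queue taken over from other servers.
    """
    peer_id, _, rest = name.partition("-")
    servers = [(host, int(port), int(start)) for host, port, start in SERVER.findall(rest)]
    return peer_id, servers


if __name__ == "__main__":
    sample = ("1-va-p-hbase-01-c,60020,1369042378287"
              "-va-p-hbase-02-c,60020,1369042377731"
              "-va-p-hbase-02-d,60020,1369233252475")
    peer, chain = parse_queue_znode(sample)
    print("peer:", peer)
    for host, port, start in chain:
        print("  inherited from", host, port, start)
```

Run against the sample name, this reports peer 1 and three previous owners, the last being the earlier va-p-hbase-02-d instance (startcode 1369233252475), which would fit a queue that has been handed over several times.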

On Thu, May 23, 2013 at 12:09 AM, Amit Mor <amit.mor.m...@gmail.com> wrote:

> empty return:
>
> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> []
>
> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com> wrote:
>
>> Do an "ls", not a "get", here and give the output?
>>
>> ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>
>> On Wed, May 22, 2013 at 1:53 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>>
>> > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> >
>> > cZxid = 0x60281c1de
>> > ctime = Wed May 22 15:11:17 EDT 2013
>> > mZxid = 0x60281c1de
>> > mtime = Wed May 22 15:11:17 EDT 2013
>> > pZxid = 0x60281c1de
>> > cversion = 0
>> > dataVersion = 0
>> > aclVersion = 0
>> > ephemeralOwner = 0x0
>> > dataLength = 0
>> > numChildren = 0
>> >
>> > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >
>> > > What does this command show you?
>> > >
>> > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > >
>> > > Cheers
>> > >
>> > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>> > >
>> > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>> > > > [1]
>> > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > > []
>> > > >
>> > > > I'm on hbase-0.94.2-cdh4.2.1
>> > > >
>> > > > Thanks
>> > > >
>> > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > >
>> > > > > Also, what version of HBase are you running?
>> > > > >
>> > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > > >
>> > > > > > Basically,
>> > > > > >
>> > > > > > You had va-p-hbase-02 crash - that caused all the replication-related
>> > > > > > data in ZooKeeper to be moved to va-p-hbase-01, which then took over
>> > > > > > replicating 02's logs. Now, each region server also maintains an
>> > > > > > in-memory state of what's in ZK. It seems that when you start up 01,
>> > > > > > it's trying to replicate the 02 logs underneath, but it's failing
>> > > > > > because that data is not in ZK. This is somewhat weird...
>> > > > > >
>> > > > > > Can you open the ZooKeeper shell and do
>> > > > > >
>> > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>> > > > > >
>> > > > > > and give the output?
>> > > > > >
>> > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>> > > > > >
>> > > > > >> Hi,
>> > > > > >>
>> > > > > >> This is bad ... and it has happened twice: I had my replication-slave
>> > > > > >> cluster offlined. I performed quite a massive Merge operation on it,
>> > > > > >> and after a couple of hours it had finished and I brought it back
>> > > > > >> online. At the same time, the replication-master RS machines crashed
>> > > > > >> (see first crash http://pastebin.com/1msNZ2tH), with the first
>> > > > > >> exception being:
>> > > > > >>
>> > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for
>> > > > > >> /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>> > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>> > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> > > > > >>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>> > > > > >>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
>> > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
>> > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
>> > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
>> > > > > >>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
>> > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
>> > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
>> > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
>> > > > > >>
>> > > > > >> Before restarting the crashed RS's, I applied a 'stop_replication'
>> > > > > >> cmd, then fired up the RS's again. They started o.k., but once I
>> > > > > >> hit 'start_replication' they crashed once again. The second crash
>> > > > > >> log http://pastebin.com/8Nb5epJJ has the same initial exception
>> > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
>> > > > > >> KeeperErrorCode = NoNode). I've started the crashed region servers
>> > > > > >> again without replication and currently all is well, but I need to
>> > > > > >> start replication asap.
>> > > > > >>
>> > > > > >> Does anyone have an idea what's going on and how I can solve it?
>> > > > > >>
>> > > > > >> Thanks,
>> > > > > >> Amit