I found this:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 17] ls
/hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401
[1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-d,60020,1369042382584-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1,
1-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-e,60020,1369233254969-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-02-d,60020,1369042368330-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-e,60020,1369042368595-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-c,60020,1369233253404-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-d,60020,1369233257617-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-02-c,60020,1369233268385-va-p-hbase-02-d,60020,1369233252475]
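
To make the listing above easier to read, here is a minimal Python sketch that splits each queue znode name into its peer id and the chain of region servers it names. It assumes the name has the form "<peerId>-<server>-<server>...", that each server is "host,port,startcode", and that the peer id itself contains no hyphen. As I understand it, the first server is the original owner of the queue and each later one is a server that claimed it and then died, but treat that reading as an assumption. parse_queue is a made-up helper, not part of HBase.

```python
import re

# Assumed layout: "<peerId>-<server>-<server>...", each server "host,port,startcode".
# Hostnames here contain hyphens, so servers are found by their comma-separated
# shape rather than by splitting the whole name on "-".
SERVER = re.compile(r"-([^,]+),(\d+),(\d+)")

def parse_queue(name):
    """Return (peer_id, [(host, port, startcode), ...]) for one queue znode name."""
    # Assumption: the peer id has no "-" in it, so the first "-" ends it.
    peer_id, sep, rest = name.partition("-")
    servers = [(host, int(port), int(start))
               for host, port, start in SERVER.findall(sep + rest)]
    return peer_id, servers

if __name__ == "__main__":
    sample = ("1-va-p-hbase-02-e,60020,1369042377129"
              "-va-p-hbase-02-c,60020,1369042377731"
              "-va-p-hbase-02-d,60020,1369233252475")
    peer, chain = parse_queue(sample)
    print("peer:", peer)
    for host, port, start in chain:
        print("  ", host, port, start)
```

A plain "1" entry parses to peer "1" with an empty chain, which I take to be the server's own (non-recovered) queue.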



On Thu, May 23, 2013 at 12:09 AM, Amit Mor <amit.mor.m...@gmail.com> wrote:

> empty return:
>
> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls
> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> []
>
>
>
> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com> wrote:
>
>> Do an "ls", not a "get", here and share the output?
>>
>> ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>>
>>
>> On Wed, May 22, 2013 at 1:53 PM, amit.mor.m...@gmail.com <
>> amit.mor.m...@gmail.com> wrote:
>>
>> > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get
>> > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> >
>> > cZxid = 0x60281c1de
>> > ctime = Wed May 22 15:11:17 EDT 2013
>> > mZxid = 0x60281c1de
>> > mtime = Wed May 22 15:11:17 EDT 2013
>> > pZxid = 0x60281c1de
>> > cversion = 0
>> > dataVersion = 0
>> > aclVersion = 0
>> > ephemeralOwner = 0x0
>> > dataLength = 0
>> > numChildren = 0
>> >
>> >
>> >
>> > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >
>> > > What does this command show you ?
>> > >
>> > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > >
>> > > Cheers
>> > >
>> > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com <
>> > > amit.mor.m...@gmail.com> wrote:
>> > >
>> > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>> > > > [1]
>> > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls
>> > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > > []
>> > > >
>> > > > I'm on hbase-0.94.2-cdh4.2.1
>> > > >
>> > > > Thanks
>> > > >
>> > > >
>> > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > >
>> > > > > Also what version of HBase are you running ?
>> > > > >
>> > > > >
>> > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > > >
>> > > > > > Basically,
>> > > > > >
>> > > > > > You had va-p-hbase-02 crash. That caused all the replication-related
>> > > > > > data in ZooKeeper to be moved to va-p-hbase-01, which took over
>> > > > > > replicating 02's logs. Each region server also maintains an in-memory
>> > > > > > copy of what's in ZK. It seems that when you start up 01, it tries to
>> > > > > > replicate the 02 logs underneath, but fails because that data is
>> > > > > > not in ZK. This is somewhat weird...
>> > > > > >
>> > > > > > Can you open the ZooKeeper shell and run
>> > > > > >
>> > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>> > > > > >
>> > > > > > and share the output?
>> > > > > >
>> > > > > >
>> > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com <
>> > > > > > amit.mor.m...@gmail.com> wrote:
>> > > > > >
>> > > > > >> Hi,
>> > > > > >>
>> > > > > >> This is bad ... and it has happened twice: I had taken my
>> > > > > >> replication-slave cluster offline. I performed quite a massive Merge
>> > > > > >> operation on it, and after a couple of hours it finished and I brought
>> > > > > >> it back online. At the same time, the replication-master RS machines
>> > > > > >> crashed (see first crash http://pastebin.com/1msNZ2tH) with the first
>> > > > > >> exception being:
>> > > > > >>
>> > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>> > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>> > > > > >>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> > > > > >>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>> > > > > >>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
>> > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
>> > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
>> > > > > >>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
>> > > > > >>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
>> > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
>> > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
>> > > > > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
>> > > > > >>
>> > > > > >> Before restarting the crashed RSs, I issued a 'stop_replication'
>> > > > > >> command, then fired up the RSs again. They started OK, but once I
>> > > > > >> ran 'start_replication' they crashed again. The second crash log
>> > > > > >> http://pastebin.com/8Nb5epJJ has the same initial exception
>> > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
>> > > > > >> KeeperErrorCode = NoNode). I've started the crashed region servers
>> > > > > >> again without replication and currently all is well, but I need to
>> > > > > >> start replication ASAP.
>> > > > > >>
>> > > > > >> Does anyone have an idea what's going on and how I can solve it?
>> > > > > >>
>> > > > > >> Thanks,
>> > > > > >> Amit
>> > > > > >>
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
