Yes, I have checked the source files of the 0.94.2-cdh4.2.1 jar, and the HBASE-8207 bug is present in that source code, namely:

    String[] parts = peerClusterZnode.split("-");
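To see the failure mode, here is a minimal, self-contained sketch (my own illustration, not the HBase source; the znode name is shortened from the paths quoted below):

    // Sketch: how split("-") mangles a recovered-queue znode name when
    // the region server hostnames themselves contain hyphens.
    public class SplitDemo {
        public static void main(String[] args) {
            // Queue znode name: peer id "1" plus the dead servers' names,
            // joined with "-" (server names taken from the paths below).
            String peerClusterZnode =
                "1-va-p-hbase-02-e,60020,1369042377129"
                + "-va-p-hbase-02-c,60020,1369042377731";
            String[] parts = peerClusterZnode.split("-");
            // Intended: ["1", "va-p-hbase-02-e,...", "va-p-hbase-02-c,..."]
            // Actual: every hyphen inside the hostnames becomes a split
            // point, so the peer id and server names cannot be recovered.
            for (String part : parts) {
                System.out.println(part); // 1, va, p, hbase, 02, e,60020,...
            }
        }
    }

So every hyphen in our hostnames becomes a split point, which fits the mangled queue paths quoted below. (A cleanup sketch with rmr and copyTable is at the bottom of this mail.)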
On Thu, May 23, 2013 at 12:42 AM, Amit Mor <amit.mor.m...@gmail.com> wrote:

> yes, indeed - hyphens are part of the host name (annoying legacy stuff
> in my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was
> backported by Cloudera into their flavor of 0.94.2, but the mysterious
> occurrence of the percent sign in zkcli (ls
> /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
> might be a sign of such a problem. How deep should my rmr in zkcli be
> (an example would be most welcome :) ? I have no serious problem running
> copyTable with a time period corresponding to the outage and then
> starting the sync back again. One question though: how did it cause a
> crash ?
>
> On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <va...@pinterest.com> wrote:
>
>> I believe there were cascading failures which got these deep nodes
>> containing still-to-be-replicated WAL(s) - I suspect there is either
>> some parsing bug or something which is causing the replication source
>> to not work - also, which version are you using - does it have
>> https://issues.apache.org/jira/browse/HBASE-8207 - since you use
>> hyphens in your paths. One way to get back up is to delete these
>> nodes, but then you lose the data in these WAL(s)...
>>
>> On Wed, May 22, 2013 at 2:22 PM, Amit Mor <amit.mor.m...@gmail.com> wrote:
>>
>> > va-p-hbase-02-d,60020,1369249862401
>> >
>> > On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <va...@pinterest.com> wrote:
>> >
>> > > Basically
>> > >
>> > > ls /hbase/rs and what do you see for va-p-02-d ?
>> > >
>> > > On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > >
>> > > > Can you do ls /hbase/rs and see what you get for 02-d - instead
>> > > > of looking in /replication/, could you look in
>> > > > /hbase/replication/rs - I want to see if the timestamps are
>> > > > matching or not ?
>> > > >
>> > > > Varun
>> > > >
>> > > > On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > >
>> > > >> I see - so it looks okay - there's just a lot of deep nesting in
>> > > >> there - if you look into these nodes by doing ls, you should see
>> > > >> a bunch of WAL(s) which still need to be replicated...
>> > > >>
>> > > >> Varun
>> > > >>
>> > > >> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > >>
>> > > >>> 2013-05-22 15:31:25,929 WARN
>> > > >>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>> > > >>> transient ZooKeeper exception:
>> > > >>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >>> KeeperErrorCode = Session expired for
>> > > >>> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>> > > >>>
>> > > >>> 01->[01->02->02]->01
>> > > >>>
>> > > >>> Looks like a bunch of cascading failures causing this deep
>> > > >>> nesting...
>> > > >>>
>> > > >>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <amit.mor.m...@gmail.com> wrote:
>> > > >>>
>> > > >>>> empty return:
>> > > >>>>
>> > > >>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> []
>> > > >>>>
>> > > >>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com> wrote:
>> > > >>>>
>> > > >>>> > Do an "ls", not a "get", here and give the output ?
>> > > >>>> >
>> > > >>>> > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> >
>> > > >>>> > On Wed, May 22, 2013 at 1:53 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>> > > >>>> >
>> > > >>>> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> > >
>> > > >>>> > > cZxid = 0x60281c1de
>> > > >>>> > > ctime = Wed May 22 15:11:17 EDT 2013
>> > > >>>> > > mZxid = 0x60281c1de
>> > > >>>> > > mtime = Wed May 22 15:11:17 EDT 2013
>> > > >>>> > > pZxid = 0x60281c1de
>> > > >>>> > > cversion = 0
>> > > >>>> > > dataVersion = 0
>> > > >>>> > > aclVersion = 0
>> > > >>>> > > ephemeralOwner = 0x0
>> > > >>>> > > dataLength = 0
>> > > >>>> > > numChildren = 0
>> > > >>>> > >
>> > > >>>> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > > >>>> > >
>> > > >>>> > > > What does this command show you ?
>> > > >>>> > > >
>> > > >>>> > > > get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> > > >
>> > > >>>> > > > Cheers
>> > > >>>> > > >
>> > > >>>> > > > On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>> > > >>>> > > >
>> > > >>>> > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>> > > >>>> > > > > [1]
>> > > >>>> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
>> > > >>>> > > > > []
>> > > >>>> > > > >
>> > > >>>> > > > > I'm on hbase-0.94.2-cdh4.2.1
>> > > >>>> > > > >
>> > > >>>> > > > > Thanks
>> > > >>>> > > > >
>> > > >>>> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > >>>> > > > >
>> > > >>>> > > > > > Also, what version of HBase are you running ?
>> > > >>>> > > > > >
>> > > >>>> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun Sharma <va...@pinterest.com> wrote:
>> > > >>>> > > > > >
>> > > >>>> > > > > > > Basically,
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > You had va-p-hbase-02 crash - that caused all the
>> > > >>>> > > > > > > replication-related data in zookeeper to be moved to
>> > > >>>> > > > > > > va-p-hbase-01, having it take over replicating 02's
>> > > >>>> > > > > > > logs. Now, each region server also maintains an
>> > > >>>> > > > > > > in-memory state of what's in ZK; it seems like when
>> > > >>>> > > > > > > you start up 01, it's trying to replicate the 02
>> > > >>>> > > > > > > logs underneath but failing because that data is not
>> > > >>>> > > > > > > in ZK.
>> > > >>>> > > > > > > This is somewhat weird...
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > Can you open the zookeeper shell and do
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > and give the output ?
>> > > >>>> > > > > > >
>> > > >>>> > > > > > > On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com <amit.mor.m...@gmail.com> wrote:
>> > > >>>> > > > > > >
>> > > >>>> > > > > > >> Hi,
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> This is bad ... and it happened twice: I had my
>> > > >>>> > > > > > >> replication-slave cluster offlined. I performed
>> > > >>>> > > > > > >> quite a massive Merge operation on it, and after a
>> > > >>>> > > > > > >> couple of hours it finished and I brought the
>> > > >>>> > > > > > >> cluster back online. At the same time, the
>> > > >>>> > > > > > >> replication-master RS machines crashed (see the
>> > > >>>> > > > > > >> first crash: http://pastebin.com/1msNZ2tH) with the
>> > > >>>> > > > > > >> first exception being:
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
>> > > >>>> > > > > > >>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>> > > >>>> > > > > > >>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> > > >>>> > > > > > >>     at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
>> > > >>>> > > > > > >>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
>> > > >>>> > > > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
>> > > >>>> > > > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
>> > > >>>> > > > > > >>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
>> > > >>>> > > > > > >>     at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
>> > > >>>> > > > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
>> > > >>>> > > > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
>> > > >>>> > > > > > >>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> Before restarting the crashed RS's, I applied a
>> > > >>>> > > > > > >> 'stop_replication' command, then fired up the RS's
>> > > >>>> > > > > > >> again. They started OK, but once I hit
>> > > >>>> > > > > > >> 'start_replication' they crashed once again. The
>> > > >>>> > > > > > >> second crash log (http://pastebin.com/8Nb5epJJ) has
>> > > >>>> > > > > > >> the same initial exception
>> > > >>>> > > > > > >> (org.apache.zookeeper.KeeperException$NoNodeException:
>> > > >>>> > > > > > >> KeeperErrorCode = NoNode). I've started the crashed
>> > > >>>> > > > > > >> region servers again without replication and
>> > > >>>> > > > > > >> currently all is well, but I need to start
>> > > >>>> > > > > > >> replication asap.
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> Does anyone have an idea what's going on and how I
>> > > >>>> > > > > > >> can solve it ?
>> > > >>>> > > > > > >>
>> > > >>>> > > > > > >> Thanks,
>> > > >>>> > > > > > >> Amit
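Following up on my own rmr question quoted above, I assume (untested sketch, happy to be corrected on the depth) the deletion would target the whole recovered-queue znode, i.e. the "1-..." child of the region server's node, rather than the individual WAL entries under it:

    [zk: va-p-zookeeper-01-c:2181(CONNECTED) 0] rmr /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475

That discards whatever WALs are still queued under that node, so the lost window would then be back-filled with copyTable over the outage period. Something like the following (the timestamps and table name are placeholders, and the quorum address is the one from the zkcli prompts above; substitute the slave cluster's actual quorum):

    hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=<outage_start_ms> --endtime=<outage_end_ms> --peer.adr=va-p-zookeeper-01-c:2181:/hbase <table_name>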