Re: RS crash upon replication
Actually, it seems like something else was wrong here - the servers came up just fine on trying again - so I could not really reproduce the issue. Amit: did you try patching 8207?

Varun

On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha hv.cs...@gmail.com wrote:

That sounds like a bug for sure. Could you create a JIRA with logs/znode dump/steps to reproduce it? Thanks, Himanshu

On Wed, May 22, 2013 at 5:01 PM, Varun Sharma va...@pinterest.com wrote:

It seems I can reproduce this - I did a few rolling restarts and got screwed with NoNode exceptions. I am running 0.94.7, which has the fix, but my nodes don't contain hyphens - nodes are no longer coming back up... Thanks, Varun

On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha hv.cs...@gmail.com wrote:

I'd suggest patching the code with 8207; cdh4.2.1 doesn't have it. With hyphens in the name, ReplicationSource gets confused and tries to set data in a znode which doesn't exist. Thanks, Himanshu

On Wed, May 22, 2013 at 2:42 PM, Amit Mor amit.mor.m...@gmail.com wrote:

Yes, indeed - hyphens are part of the host name (annoying legacy stuff in my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was backported by Cloudera into their flavor of 0.94.2, but the mysterious occurrence of the percent sign in zkcli (ls /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895) might be a sign of such a problem. How deep should my rmr in zkcli be (an example would be most welcome :))? I have no serious problem running copyTable with a time period corresponding to the outage and then starting the sync back up again. One question though: how did it cause a crash?
On Thu, May 23, 2013 at 12:32 AM, Varun Sharma va...@pinterest.com wrote:

I believe there were cascading failures which left these deep nodes containing still-to-be-replicated WAL(s). I suspect there is some parsing bug or something which is causing the replication source to not work. Also, which version are you using - does it have https://issues.apache.org/jira/browse/HBASE-8207 - since you use hyphens in your paths? One way to get back up is to delete these nodes, but then you lose data in these WAL(s)...

On Wed, May 22, 2013 at 2:22 PM, Amit Mor amit.mor.m...@gmail.com wrote:

va-p-hbase-02-d,60020,1369249862401

On Thu, May 23, 2013 at 12:20 AM, Varun Sharma va...@pinterest.com wrote:

Basically, ls /hbase/rs - what do you see for va-p-02-d?

On Wed, May 22, 2013 at 2:19 PM, Varun Sharma va...@pinterest.com wrote:

Can you do ls /hbase/rs and see what you get for 02-d? Instead of looking in /replication/, could you look in /hbase/replication/rs - I want to see whether the timestamps are matching or not.

Varun

On Wed, May 22, 2013 at 2:17 PM, Varun Sharma va...@pinterest.com wrote:

I see - so this looks okay - there's just a lot of deep nesting in there. If you look into these nodes by doing ls, you should see a bunch of WAL(s) which still need to be replicated...

Varun

On Wed, May 22, 2013 at 2:16 PM, Varun Sharma va...@pinterest.com wrote:

2013-05-22 15:31:25,929 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719

01-[01-02-02]-01

Looks like a bunch of cascading failures causing this deep nesting...
On Wed, May 22, 2013 at 2:09 PM, Amit Mor amit.mor.m...@gmail.com wrote:

Empty return:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]

On Thu, May 23, 2013 at 12:05 AM, Varun Sharma va...@pinterest.com wrote:

Do an ls, not a get, here and give the output:

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
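For what it's worth, the two oddities discussed above - the percent signs and the "01-[01-02-02]-01" nesting - both come from how 0.94 names its replication queue znodes: commas in the WAL file name get %-encoded, and each time a dead server's queue is adopted, the adopter's queue id grows by one more "-host,port,startcode" segment. Splitting that id on "-" is exactly what breaks with hyphenated hostnames (the HBASE-8207 bug). A rough sketch of pulling the queue path from this thread apart - this is an illustrative regex, not HBase's actual parser:

```python
import re
from urllib.parse import unquote

# The queue znode Amit pasted, verbatim from the thread.
znode = ("/hbase/replication/rs/"
         "va-p-hbase-02-d,60020,1369249862401/"
         "1-va-p-hbase-02-e,60020,1369042377129"
         "-va-p-hbase-02-c,60020,1369042377731"
         "-va-p-hbase-02-d,60020,1369233252475/"
         "va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895")

# Layout: /hbase/replication/rs/<server>/<queueId>/<wal>
_, _, _, _, rs, queue_id, wal = znode.split("/")

# Queue id = peer id, then the chain of dead servers whose queues were
# adopted (one more segment per cascading failure).
peer_id, chain = queue_id.split("-", 1)

# A server name is "<host>,<port>,<startcode>"; hosts contain hyphens but
# never commas, so anchor the split on the ",port,startcode" tail.
dead_servers = re.findall(r"([^,]+,\d+,\d+)(?:-|$)", chain)

# Commas in WAL names are %-encoded inside znode names, hence the %2C.
wal_name = unquote(wal)
```

Splitting the chain naively on "-" would shatter "va-p-hbase-02-e" into pieces, which is the confusion Himanshu describes in ReplicationSource.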
Re: RS crash upon replication
No, the servers came out fine only because, after the crash (the RS's - the masters were still running), I immediately pulled the brakes with stop_replication. Then I started the RS's and they came back fine (not replicating). Once I hit 'start_replication' again they crashed for the second time. Eventually I deleted the heavily nested replication znodes and the 'start_replication' succeeded. I didn't patch 8207 because I'm on CDH with the Cloudera Manager Parcels thing and I'm still trying to figure out how to replace their jars with mine in a clean and non-intrusive way.
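For the record, the cleanup Amit describes can be done from zkCli.sh. The rmr has to target the whole adopted-queue child (the long "1-..." znode under the region server), not individual WAL entries. A sketch using a path from earlier in the thread - note the trade-off Varun pointed out: this abandons the un-shipped WALs tracked under that node, so the slave misses those edits until a CopyTable backfill:

```shell
# In zkCli.sh connected to the HBase ZooKeeper quorum; rmr is recursive.
# WARNING: deleting a queue znode means the WALs queued under it will
# never be replicated -- backfill the slave cluster afterwards.
rmr /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475
```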
Re: RS crash upon replication
FWIW, stop_replication is a kill switch, not a general way to start and stop replicating, and start_replication may put you in an inconsistent state:

hbase(main):001:0> help 'stop_replication'
Stops all the replication features. The state in which each stream stops in is undetermined.
WARNING: start/stop replication is only meant to be used in critical load situations.
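A gentler alternative to the global kill switch is per-peer control, which stops shipping to one peer while the WALs stay queued, so nothing is lost when shipping resumes. Whether these commands exist depends on the exact 0.94/CDH build, so treat this as a sketch rather than a guarantee:

```shell
# hbase shell -- per-peer control instead of stop_replication
list_peers           # show configured peers and their state
disable_peer '1'     # stop shipping to peer 1; WALs keep accumulating
# ... slave-side maintenance happens here ...
enable_peer '1'      # resume shipping from where it left off
```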
Re: RS crash upon replication
But wouldn't a copy table between timestamps bring you back? Since the mutations are all timestamp based, we should be okay - basically doing a copy table which supersedes the downtime interval?
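Concretely, Varun's suggestion works because CopyTable can restrict itself to a timestamp window and write to a remote peer. A sketch run on the master cluster - the table name and ZK quorum below are placeholders for this example, and it's safest to widen the window past the outage on both ends:

```shell
# Run on the master cluster: copies cells whose timestamps fall in
# [--starttime, --endtime) into the same-named table on the slave.
# 'my_table' and 'slave-zk-01' are placeholders, not values from the thread.
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --starttime=1369220000000 \
  --endtime=1369260000000 \
  --peer.adr=slave-zk-01:2181:/hbase \
  my_table
```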
Re: RS crash upon replication
Thanks for the helpful comments. I will certainly dig deeper now that everything has stabilized.

Regarding J-D's comment: once my slave cluster was started, after about 4 hours of downtime (it's for offline stuff), at the very moment it came back online, 5 RS of my master-replication cluster crashed. Since I had no time to figure out what went wrong with the replication, I submitted the 'stop_replication' knowing it's a last resort, since I had to get those production RS's online ASAP. I think renaming that command to something like 'abort_replication' would be more fitting. On the other hand, remove_peer(1) at a time of crisis feels like a developer's solution to a DBA's problem ;)

Regarding copyTable, it's all good and well, but one has to consider that I'm on EC2 and the cluster is already stretched by 'online' service requests, and copyTable would hit its resources quite badly.

I'll be glad to update. Thanks again, Amit
Re: RS crash upon replication
I have pasted most of the RS's logs just prior to, and including, their FATAL. I would be very thankful if someone could take a look: http://pastebin.com/qFzycXNS . Interestingly, some RS's experience an IOException for not finding an .oldlogs/ file; the rest get KeeperException$NoNodeException without the IOE. Thanks
Re: RS crash upon replication
Basically, you had va-p-hbase-02 crash - that caused all the replication-related data in ZooKeeper to be moved to va-p-hbase-01, which took over replicating 02's logs. Now, each region server also maintains an in-memory state of what's in ZK. It seems like when you start up 01, it's trying to replicate the 02 logs underneath, but it's failing because that data is not in ZK. This is somewhat weird... Can you open the zookeeper shell and do

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379

and give the output?

On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com wrote:

Hi, this is bad... and happened twice: I had my replication-slave cluster offlined. I performed quite a massive Merge operation on it, and after a couple of hours it had finished and I returned it back online. At the same time, the replication-master RS machines crashed (see the first crash: http://pastebin.com/1msNZ2tH) with the first exception being:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
    at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)

Before restarting the crashed RS's, I applied a 'stop_replication' command, then fired up the RS's again. They started OK, but once I hit 'start_replication' they crashed once again. The second crash log (http://pastebin.com/8Nb5epJJ) has the same initial exception (org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode). I've started the crashed region servers again without replication, and currently all is well, but I need to start replication ASAP. Does anyone have an idea what's going on and how I can solve it? Thanks, Amit
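The "in-memory state vs. ZK" mismatch Varun describes can be illustrated with a toy model (this is an illustration, not HBase code): the source caches the queue layout it read at startup, but ships each log-position update to a ZK path it reconstructs; if that znode is gone, or the path was mis-assembled as with the hyphen bug, the setData fails with NoNode, and in 0.94 that error path aborts the region server, matching the writeReplicationStatus frame in the stack trace above.

```python
# Toy model of the mismatch: a fake ZK store, plus a cached view of it.
class FakeZK:
    def __init__(self):
        self.nodes = {}  # znode path -> data

    def create(self, path, data=b""):
        self.nodes[path] = data

    def set_data(self, path, data):
        # Analogue of KeeperException$NoNodeException on setData.
        if path not in self.nodes:
            raise KeyError("NoNode for " + path)
        self.nodes[path] = data

zk = FakeZK()
queue = "/hbase/replication/rs/rs1,60020,111/1"
zk.create(queue + "/wal.123", b"0")

# In-memory view the replication source built at startup:
cached_wals = [queue + "/wal.123"]

# Something else removes the znode (failover, manual rmr, a parsing bug)...
del zk.nodes[queue + "/wal.123"]

# ...so the next position update on the cached path blows up:
try:
    zk.set_data(cached_wals[0], b"4096")
    crashed = False
except KeyError:
    crashed = True  # in 0.94 this surfaces as the RS aborting
```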
Re: RS crash upon replication
Also, what version of HBase are you running?
Re: RS crash upon replication
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
[1]
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]

I'm on hbase-0.94.2-cdh4.2.1. Thanks
Re: RS crash upon replication
What does this command show you ? get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1 Cheers On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com amit.mor.m...@gmail.com wrote: ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379 [1] [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1 [] I'm on hbase-0.94.2-cdh4.2.1 Thanks On Wed, May 22, 2013 at 11:40 PM, Varun Sharma va...@pinterest.com wrote: Also what version of HBase are you running ? On Wed, May 22, 2013 at 1:38 PM, Varun Sharma va...@pinterest.com wrote: Basically, You had va-p-hbase-02 crash - that caused all the replication related data in zookeeper to be moved to va-p-hbase-01 and have it take over for replicating 02's logs. Now each region server also maintains an in-memory state of whats in ZK, it seems like when you start up 01, its trying to replicate the 02 logs underneath but its failing to because that data is not in ZK. This is somewhat weird... Can you open the zookeepeer shell and do ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379 And give the output ? On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com amit.mor.m...@gmail.com wrote: Hi, This is bad ... and happened twice: I had my replication-slave cluster offlined. I performed quite a massive Merge operation on it and after a couple of hours it had finished and I returned it back online. 
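Varun's description of the failover can be mocked up locally - directories standing in for znodes, with server names taken from this thread - just to make the layout concrete (a sketch of the naming convention, not HBase's actual code path):

```shell
# Mock the /hbase/replication/rs layout with directories (znodes): when a
# server dies, a survivor acquires its queue for peer "1" under a recovered
# node named "<peerId>-<dead server>" next to its own "1" queue.
tmp=$(mktemp -d)
live='va-p-hbase-01-c,60020,1369249873379'
dead='va-p-hbase-02-c,60020,1369042377731'
mkdir -p "$tmp/hbase/replication/rs/$live/1"          # the survivor's own queue
mkdir -p "$tmp/hbase/replication/rs/$live/1-$dead"    # queue recovered from the dead RS
( cd "$tmp" && find hbase -mindepth 4 -maxdepth 4 | sort )
rm -rf "$tmp"
```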
Re: RS crash upon replication
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
cZxid = 0x60281c1de
ctime = Wed May 22 15:11:17 EDT 2013
mZxid = 0x60281c1de
mtime = Wed May 22 15:11:17 EDT 2013
pZxid = 0x60281c1de
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0
Re: RS crash upon replication
Do an ls, not a get, here and give the output ?

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
Re: RS crash upon replication
Empty return:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]
Re: RS crash upon replication
I found this:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 17] ls /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401
[1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-d,60020,1369042382584-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1,
 1-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-e,60020,1369233254969-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-02-d,60020,1369042368330-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-e,60020,1369042368595-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-c,60020,1369233253404-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-d,60020,1369233257617-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
 1-va-p-hbase-02-c,60020,1369233268385-va-p-hbase-02-d,60020,1369233252475]
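Those long child names encode the failover chain. Assuming the naming convention visible in this thread - a peer id, then each server's host,port,startcode joined with hyphens - the chain can be split apart in shell without tripping over the hyphens inside the hostnames (the HBASE-8207 pitfall), by keying on the ",port,startcode" boundaries instead of the hyphens:

```shell
# One recovered-queue znode name from the ls output (assumed format:
# "<peerId>-<host,port,startcode>-<host,port,startcode>-...").
queue='1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475'
# Strip the leading peer id, then match each "host,port,startcode" unit;
# requiring a leading letter keeps the separator hyphens out of the match.
printf '%s\n' "${queue#1-}" | grep -oE '[a-z][a-z0-9-]*,[0-9]+,[0-9]+'
```

This prints the three servers the queue passed through, one per line, which matches Varun's 01-[01-02-02]-01 shorthand for a different queue.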
Re: RS crash upon replication
2013-05-22 15:31:25,929 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719

01-[01-02-02]-01

Looks like a bunch of cascading failures causing this deep nesting...
Re: RS crash upon replication
I see - so that looks okay - there's just a lot of deep nesting in there. If you look into these nodes by doing ls, you should see a bunch of WAL(s) which still need to be replicated...

Varun
Re: RS crash upon replication
Can you do ls /hbase/rs and see what you get for 02-d - instead of looking in /replication/, could you look in /hbase/replication/rs - I want to see if the timestamps are matching or not ?

Varun
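For that timestamp comparison: the third comma-separated field of a server znode name is its startcode (the timestamp from when the RS registered), so matching an entry under /hbase/rs against one under /hbase/replication/rs boils down to comparing that field (a sketch, using a server name from this thread):

```shell
# Extract the startcode from a server znode name: host,port,startcode.
printf '%s\n' 'va-p-hbase-02-d,60020,1369249862401' | cut -d, -f3   # prints 1369249862401
```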
Re: RS crash upon replication
Basically ls /hbase/rs and what do you see for va-p-02-d ?

On Wed, May 22, 2013 at 2:19 PM, Varun Sharma va...@pinterest.com wrote:
Can you do ls /hbase/rs and see what you get for 02-d - instead of looking in /replication/, could you look in /hbase/replication/rs - I want to see if the timestamps are matching or not ?
Varun

On Wed, May 22, 2013 at 2:17 PM, Varun Sharma va...@pinterest.com wrote:
I see - so looks okay - there's just a lot of deep nesting in there. If you look into these nodes by doing ls, you should see a bunch of WAL(s) which still need to be replicated...
Varun

On Wed, May 22, 2013 at 2:16 PM, Varun Sharma va...@pinterest.com wrote:
2013-05-22 15:31:25,929 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719

01-[01-02-02]-01

Looks like a bunch of cascading failures causing this deep nesting...

On Wed, May 22, 2013 at 2:09 PM, Amit Mor amit.mor.m...@gmail.com wrote:
empty return:
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]

On Thu, May 23, 2013 at 12:05 AM, Varun Sharma va...@pinterest.com wrote:
Do an ls not a get here and give the output ?
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1

On Wed, May 22, 2013 at 1:53 PM, amit.mor.m...@gmail.com wrote:
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
cZxid = 0x60281c1de
ctime = Wed May 22 15:11:17 EDT 2013
mZxid = 0x60281c1de
mtime = Wed May 22 15:11:17 EDT 2013
pZxid = 0x60281c1de
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0

On Wed, May 22, 2013 at 11:49 PM, Ted Yu yuzhih...@gmail.com wrote:
What does this command show you ?
get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
Cheers

On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com wrote:
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
[1]
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]
I'm on hbase-0.94.2-cdh4.2.1
Thanks

On Wed, May 22, 2013 at 11:40 PM, Varun Sharma va...@pinterest.com wrote:
Also what version of HBase are you running ?

On Wed, May 22, 2013 at 1:38 PM, Varun Sharma va...@pinterest.com wrote:
Basically, you had va-p-hbase-02 crash - that caused all the replication-related data in ZooKeeper to be moved to va-p-hbase-01, which took over replicating 02's logs. Now, each region server also maintains an in-memory state of what's in ZK; it seems like when you start up 01, it's trying to replicate the 02 logs underneath but failing because that data is not in ZK. This is somewhat weird...
Can you open the zookeeper shell and do
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
And give the output ?

On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com wrote:
Hi, this is bad ... and happened twice: I had my replication-slave cluster offlined. I performed quite a massive Merge operation on it, and after a couple of hours it had finished and I returned it back online.
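The deeply nested queue names in the ZooKeeper paths above grow one server at a time as failures cascade. A minimal sketch of the naming pattern only (not the actual HBase ReplicationSource code; the helper `claimQueue` is a made-up name): on failover, the surviving region server adopts the dead server's replication queue, and the dead server's name gets appended to the queue znode name.

```java
public class QueueNesting {
    // Hypothetical helper: each cascading failure appends the dead server's
    // name to the recovered queue's znode name, one segment per failure.
    static String claimQueue(String queueName, String deadServer) {
        return queueName + "-" + deadServer;
    }

    public static void main(String[] args) {
        String q = "1"; // replication peer id
        // Cascade taken from this thread: 01-c dies, then 02-c, then 02-d.
        q = claimQueue(q, "va-p-hbase-01-c,60020,1369042378287");
        q = claimQueue(q, "va-p-hbase-02-c,60020,1369042377731");
        q = claimQueue(q, "va-p-hbase-02-d,60020,1369233252475");
        // Reproduces the deep queue name seen in the session-expired warning.
        System.out.println(q);
    }
}
```

Note that the separator between these appended segments is a hyphen, which is why hostnames that themselves contain hyphens are a hazard here.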
Re: RS crash upon replication
va-p-hbase-02-d,60020,1369249862401

On Thu, May 23, 2013 at 12:20 AM, Varun Sharma va...@pinterest.com wrote:
Basically ls /hbase/rs and what do you see for va-p-02-d ?
Re: RS crash upon replication
I believe there were cascading failures which got these deep nodes containing still-to-be-replicated WAL(s) - I suspect there is some parsing bug or something which is causing the replication source to not work. Also, which version are you using - does it have https://issues.apache.org/jira/browse/HBASE-8207 - since you use hyphens in your paths? One way to get back up is to delete these nodes, but then you lose data in these WAL(s)...

On Wed, May 22, 2013 at 2:22 PM, Amit Mor amit.mor.m...@gmail.com wrote:
va-p-hbase-02-d,60020,1369249862401
Re: RS crash upon replication
yes, indeed - hyphens are part of the host name (annoying legacy stuff in my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was backported by Cloudera into their flavor of 0.94.2, but the mysterious occurrence of the percent sign in zkcli (ls /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895) might be a sign of such a problem. How deep should my rmr in zkcli be (an example would be most welcomed :) ? I have no serious problem running copyTable with a time period corresponding to the outage and then starting the sync back again. One question though: how did it cause a crash ?

On Thu, May 23, 2013 at 12:32 AM, Varun Sharma va...@pinterest.com wrote:
I believe there were cascading failures which got these deep nodes containing still-to-be-replicated WAL(s)...
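On the "how deep should my rmr be" question: the znode to remove is the recovered queue itself - the long, hyphen-joined child directly under the dead region server's znode. A hedged sketch using the path pasted earlier in this thread (verify the exact znode on your own cluster first; rmr is recursive, so it also deletes the WAL entries beneath the queue, and with them any not-yet-replicated edits):

```
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 0] rmr /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475
```

Going one level shallower (rmr on the region server znode itself) would also discard that server's own live queues, not just the orphaned recovered one.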
Re: RS crash upon replication
Yes, I have checked the source files of the 0.94.2-cdh4.2.1 jar, and the HBASE-8207 issue is present in the source, namely:

String[] parts = peerClusterZnode.split("-");

On Thu, May 23, 2013 at 12:42 AM, Amit Mor amit.mor.m...@gmail.com wrote:
yes, indeed - hyphens are part of the host name (annoying legacy stuff in my company). It's hbase-0.94.2-cdh4.2.1.
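That split("-") line is exactly where hyphenated hostnames go wrong: the separator between the queue name's fields is the same character that appears inside the hostnames. A minimal, self-contained illustration of the failure mode (not the actual HBase code path, and the limited split shown at the end is just one way to recover the two fields, not necessarily the HBASE-8207 patch itself):

```java
public class SplitDemo {
    public static void main(String[] args) {
        // A recovered-queue znode name of the form "<peer id>-<dead server name>"
        String peerClusterZnode = "1-va-p-hbase-02-e,60020,1369042377129";

        // Naive split: hyphens inside the hostname fragment the name.
        String[] parts = peerClusterZnode.split("-");
        System.out.println(parts.length); // 6 pieces instead of the intended 2
        System.out.println(parts[1]);     // "va" - not a server name at all

        // Limiting the split to two fields recovers the intended structure.
        String[] fixed = peerClusterZnode.split("-", 2);
        System.out.println(fixed[0]);     // "1" (the peer id)
        System.out.println(fixed[1]);     // the full dead-server name
    }
}
```

With the naive parse, ReplicationSource ends up reconstructing a znode path that never existed, which matches the NoNode setData failures in the pastebin above.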
Re: RS crash upon replication
I'd suggest patching the code with 8207; cdh4.2.1 doesn't have it. With hyphens in the name, ReplicationSource gets confused and tries to set data in a znode which doesn't exist.
Thanks,
Himanshu

On Wed, May 22, 2013 at 2:42 PM, Amit Mor amit.mor.m...@gmail.com wrote:
yes, indeed - hyphens are part of the host name (annoying legacy stuff in my company). It's hbase-0.94.2-cdh4.2.1.
Re: RS crash upon replication
It seems I can reproduce this - I did a few rolling restarts and got screwed with NoNode exceptions - I am running 0.94.7 which has the fix but my nodes don't contain hyphens - nodes are no longer coming back up...
Thanks
Varun

On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha hv.cs...@gmail.com wrote:
I'd suggest to please patch the code with 8207; cdh4.2.1 doesn't have it.
Re: RS crash upon replication
That sounds like a bug for sure. Could you create a jira with logs/znode dump/steps to reproduce it?
Thanks,
himanshu

On Wed, May 22, 2013 at 5:01 PM, Varun Sharma va...@pinterest.com wrote:
It seems I can reproduce this - I did a few rolling restarts and got screwed with NoNode exceptions - I am running 0.94.7 which has the fix but my nodes don't contain hyphens - nodes are no longer coming back up...