Re: RS crash upon replication

2013-05-23 Thread Varun Sharma
Actually, it seems like something else was wrong here - the servers came up
just fine on trying again - so I could not really reproduce the issue.

Amit: Did you try patching 8207 ?

Varun


On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha hv.cs...@gmail.com wrote:

 That sounds like a bug for sure. Could you create a jira with logs/znode
 dump/steps to reproduce it?

 Thanks,
 Himanshu


 On Wed, May 22, 2013 at 5:01 PM, Varun Sharma va...@pinterest.com wrote:

  It seems I can reproduce this - I did a few rolling restarts and got
  screwed with NoNode exceptions - I am running 0.94.7, which has the fix,
  but my nodes don't contain hyphens - nodes are no longer coming back up...

  Thanks
  Varun


  On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha hv.cs...@gmail.com wrote:

   I'd suggest to please patch the code with 8207; cdh4.2.1 doesn't have it.

   With hyphens in the name, ReplicationSource gets confused and tries to
   set data in a znode which doesn't exist.

   Thanks,
   Himanshu


   On Wed, May 22, 2013 at 2:42 PM, Amit Mor amit.mor.m...@gmail.com wrote:

    yes, indeed - hyphens are part of the host name (annoying legacy stuff
    in my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was
    backported by Cloudera into their flavor of 0.94.2, but the mysterious
    occurrence of the percent sign in zkcli (ls
    /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
    might be a sign of such a problem. How deep should my rmr in zkcli be
    (an example would be most welcome :) ? I have no serious problem
    running copyTable with a time period corresponding to the outage and
    then starting the sync back again. One question though: how did it
    cause a crash?


    On Thu, May 23, 2013 at 12:32 AM, Varun Sharma va...@pinterest.com wrote:

     I believe there were cascading failures which got these deep nodes
     containing still-to-be-replicated WAL(s) - I suspect there is either
     some parsing bug or something which is causing the replication source
     to not work - also, which version are you using - does it have
     https://issues.apache.org/jira/browse/HBASE-8207 - since you use
     hyphens in your paths. One way to get back up is to delete these
     nodes, but then you lose the data in these WAL(s)...


     On Wed, May 22, 2013 at 2:22 PM, Amit Mor amit.mor.m...@gmail.com wrote:

      va-p-hbase-02-d,60020,1369249862401


      On Thu, May 23, 2013 at 12:20 AM, Varun Sharma va...@pinterest.com wrote:

       Basically

       ls /hbase/rs and what do you see for va-p-02-d ?


       On Wed, May 22, 2013 at 2:19 PM, Varun Sharma va...@pinterest.com wrote:

        Can you do ls /hbase/rs and see what you get for 02-d - instead of
        looking in /replication/, could you look in /hbase/replication/rs -
        I want to see if the timestamps are matching or not ?

        Varun

Re: RS crash upon replication

2013-05-23 Thread Amit Mor
No, the servers came out fine just because, after the crash (the RS's - the
masters were still running), I immediately pulled the brakes with
stop_replication. Then I started the RS's and they came back fine (not
replicating). Once I hit 'start_replication' again, they crashed for the
second time. Eventually I deleted the heavily nested replication znodes and
the 'start_replication' succeeded. I didn't patch 8207 because I'm on CDH
with the Cloudera Manager parcels setup, and I'm still trying to figure out
how to replace their jars with mine in a clean and non-intrusive way.
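
For the record, the cleanup was along these lines - a sketch rather than
the exact commands, using one of the nested queue znodes from earlier in
the thread; note that rmr is recursive, so any WALs still queued under the
node are discarded:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 0] rmr /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475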



Re: RS crash upon replication

2013-05-23 Thread Jean-Daniel Cryans
fwiw stop_replication is a kill switch, not a general way to start and
stop replicating, and start_replication may put you in an inconsistent
state:

hbase(main):001:0> help 'stop_replication'
Stops all the replication features. The state in which each
stream stops in is undetermined.
WARNING:
start/stop replication is only meant to be used in critical load situations.
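
If the goal is just to pause shipping to one peer, the per-peer commands
are the safer route (a sketch, assuming peer id '1' as elsewhere in this
thread, and a shell recent enough to have these commands):

hbase(main):002:0> disable_peer '1'
hbase(main):003:0> enable_peer '1'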


Re: RS crash upon replication

2013-05-23 Thread Varun Sharma
But wouldn't a copyTable between timestamps bring you back? Since the
mutations are all timestamp based, we should be okay. Basically, doing a
copyTable which supersedes the downtime interval?
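
Something along these lines, roughly (a sketch - the table name, slave
quorum and epoch-millis window here are made up, and the window would need
to cover the outage with some margin):

hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --starttime=1369230000000 --endtime=1369270000000 \
  --peer.adr=slave-zk-01,slave-zk-02,slave-zk-03:2181:/hbase \
  my_table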



Re: RS crash upon replication

2013-05-23 Thread Amit Mor
Thanks for the helpful comments. I would certainly dig deeper now that
everything has stabilized. Regarding J-D's comment - once my slave cluster
was started, after about 4 hours of downtime (it's for offline stuff), at
the very moment it came back online, 5 RS of my replication-master cluster
crashed. Since I had no time to figure out what went wrong with the
replication, I submitted the 'stop_replication' knowing that's a last
resort, since I had to get those production RS's online asap. I think
renaming that cmd to something like 'abort_replication' would be more
fitting. On the other hand, remove_peer(1) at a time of crisis feels like a
developer's solution to a DBA's problem ;)
Regarding copyTable, it's all good and well, but one has to consider that
I'm on ec2 and the cluster is already stretched out by 'online' service
requests, and copyTable would hit its resources quite badly. I'll be glad
to update.
Thanks again,
Amit


Re: RS crash upon replication

2013-05-23 Thread Amit Mor
I have pasted most of the RS's logs, just prior to and including their
FATAL. Would be very thankful if someone can take a look:
http://pastebin.com/qFzycXNS . Interestingly, some RS's experience an
IOException for not finding an .oldlogs/ file; the rest get the
KeeperException$NoNodeException w/o the IOE.

Thanks


Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Basically,

You had va-p-hbase-02 crash - that caused all the replication-related data
in zookeeper to be moved to va-p-hbase-01, which took over replicating
02's logs. Now, each region server also maintains an in-memory state of
what's in ZK; it seems like when you start up 01, it's trying to replicate
the 02 logs underneath but failing because that data is not in ZK. This is
somewhat weird...
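
For context, the layout involved is roughly this (a sketch - server names
follow this thread, timestamps elided, and the exact format is version
dependent):

/hbase/replication/rs/<server>/<queue-id>/<wal-name>   (znode data = log position)

  # a server's own queue for peer 1:
  /hbase/replication/rs/va-p-hbase-02-c,60020,.../1/va-p-hbase-02-c%2C60020%2C...
  # after 02-c dies and 01-c claims the queue, the dead server's name is
  # appended to the queue id, so repeated failovers keep nesting deeper:
  /hbase/replication/rs/va-p-hbase-01-c,60020,.../1-va-p-hbase-02-c,60020,.../va-p-hbase-02-c%2C60020%2C...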

Can you open the zookeeper shell and do

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379

And give the output ?


On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com wrote:

 Hi,

 This is bad ... and happened twice: I had my replication-slave cluster
 offlined. I performed quite a massive Merge operation on it, and after a
 couple of hours it had finished and I returned it back online. At the same
 time, the replication-master RS machines crashed (see the first crash:
 http://pastebin.com/1msNZ2tH), with the first exception being:

 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
 NoNode for
 /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
         at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)

 Before restarting the crashed RS's, I applied a 'stop_replication' cmd,
 then fired up the RS's again. They started o.k., but once I hit
 'start_replication' they crashed once again. The second crash log
 (http://pastebin.com/8Nb5epJJ) has the same initial exception
 (org.apache.zookeeper.KeeperException$NoNodeException:
 KeeperErrorCode = NoNode). I've started the crashed region servers again
 without replication and currently all is well, but I need to start
 replication asap.

 Does anyone have an idea what's going on and how can I solve it ?

 Thanks,
 Amit



Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Also what version of HBase are you running ?




Re: RS crash upon replication

2013-05-22 Thread amit.mor.m...@gmail.com
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
[1]
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]

I'm on hbase-0.94.2-cdh4.2.1

Thanks




Re: RS crash upon replication

2013-05-22 Thread Ted Yu
What does this command show you ?

get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1

Cheers



Re: RS crash upon replication

2013-05-22 Thread amit.mor.m...@gmail.com
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1

cZxid = 0x60281c1de
ctime = Wed May 22 15:11:17 EDT 2013
mZxid = 0x60281c1de
mtime = Wed May 22 15:11:17 EDT 2013
pZxid = 0x60281c1de
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0





Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Do an ls, not a get, here and give the output ?

ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
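
(get returns a znode's own data and stat - hence the dataLength = 0 /
numChildren = 0 block above - while ls returns its children, which is
where the queues and WALs actually show up.)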




Re: RS crash upon replication

2013-05-22 Thread Amit Mor
empty return:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
[]




Re: RS crash upon replication

2013-05-22 Thread Amit Mor
I found this:

[zk: va-p-zookeeper-01-c:2181(CONNECTED) 17] ls
/hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401
[1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-d,60020,1369042382584-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1,
1-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-e,60020,1369233254969-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-02-d,60020,1369042368330-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-e,60020,1369042368595-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-c,60020,1369233253404-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-d,60020,1369233257617-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475,
1-va-p-hbase-02-c,60020,1369233268385-va-p-hbase-02-d,60020,1369233252475]
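
If I'm reading these queue names right (my understanding of the format,
not verified against the code): the leading 1 is the replication peer id,
the first server after it is the one whose WALs sit in the queue, and each
further server is a previous owner that died before draining it. For
example:

1-va-p-hbase-01-c,60020,1369233253404-va-p-hbase-02-e,60020,1369233253407-va-p-hbase-02-d,60020,1369233252475
  1                                    -> peer id
  va-p-hbase-01-c,60020,1369233253404  -> server whose WALs these are
  va-p-hbase-02-e,60020,1369233253407  -> claimed the queue, then died
  va-p-hbase-02-d,60020,1369233252475  -> claimed it next, then died (the
                                          previous incarnation of the
                                          current holder)

HBASE-8207 exists precisely because the code used to recover these pieces
by splitting the name on '-', which goes wrong when the hostnames
themselves contain hyphens.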




Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
2013-05-22 15:31:25,929 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for
/hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719

01-[01-02-02]-01

Looks like a bunch of cascading failures causing this deep nesting...


On Wed, May 22, 2013 at 2:09 PM, Amit Mor amit.mor.m...@gmail.com wrote:

 empty return:

 [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls
 /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
 []



 On Thu, May 23, 2013 at 12:05 AM, Varun Sharma va...@pinterest.com
 wrote:

  Do an ls not a get here and give the output ?
 
  ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
 
 
  On Wed, May 22, 2013 at 1:53 PM, amit.mor.m...@gmail.com 
  amit.mor.m...@gmail.com wrote:
 
   [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get
   /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
  
   cZxid = 0x60281c1de
   ctime = Wed May 22 15:11:17 EDT 2013
   mZxid = 0x60281c1de
   mtime = Wed May 22 15:11:17 EDT 2013
   pZxid = 0x60281c1de
   cversion = 0
   dataVersion = 0
   aclVersion = 0
   ephemeralOwner = 0x0
   dataLength = 0
   numChildren = 0
  
  
  
   On Wed, May 22, 2013 at 11:49 PM, Ted Yu yuzhih...@gmail.com wrote:
  
What does this command show you ?
   
get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
   
Cheers
   
On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com 
amit.mor.m...@gmail.com wrote:
   
 ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
 [1]
 [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls
 /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
 []

 I'm on hbase-0.94.2-cdh4.2.1

 Thanks


 On Wed, May 22, 2013 at 11:40 PM, Varun Sharma 
 va...@pinterest.com
 wrote:

  Also what version of HBase are you running ?
 
 
  On Wed, May 22, 2013 at 1:38 PM, Varun Sharma 
 va...@pinterest.com
  
 wrote:
 
   Basically,
  
   You had va-p-hbase-02 crash - that caused all the replication related
   data in zookeeper to be moved to va-p-hbase-01 and have it take over
   for replicating 02's logs. Now each region server also maintains an
   in-memory state of what's in ZK; it seems like when you start up 01,
   it's trying to replicate the 02 logs underneath but it's failing
   because that data is not in ZK. This is somewhat weird...
  
   Can you open the zookeeper shell and do
  
   ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
  
   And give the output ?
  
  
   On Wed, May 22, 2013 at 1:27 PM, amit.mor.m...@gmail.com
   amit.mor.m...@gmail.com wrote:
  
   Hi,
  
   This is bad ... and happened twice: I had my replication-slave cluster
   offlined. I performed quite a massive Merge operation on it and after a
   couple of hours it had finished and I returned it back online. At the
   same time, the replication-master RS machines crashed (see first crash
   http://pastebin.com/1msNZ2tH) with the first exception being:
  
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
   NoNode for
   /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
   at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
   at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
   at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
   at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
   at ...
 

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
I see - so looks okay - there's just a lot of deep nesting in there - if
you look into these nodes by doing ls - you should see a bunch of
WAL(s) which still need to be replicated...

Varun



Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Can you do ls /hbase/rs and see what you get for 02-d - instead of looking
in /replication/, could you look in /hbase/replication/rs - I want to see
if the timestamps are matching or not ?

Varun



Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Basically

ls /hbase/rs and what do you see for va-p-02-d ?



Re: RS crash upon replication

2013-05-22 Thread Amit Mor
 va-p-hbase-02-d,60020,1369249862401
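
Note the startcode mismatch: /hbase/rs reports the live incarnation
va-p-hbase-02-d,60020,1369249862401, while the replication queue path above
still carries va-p-hbase-02-d,60020,1369233252475 - presumably a queue
inherited from an earlier incarnation of 02-d, which is what the failover
naming would imply.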


Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
I believe there were cascading failures which got these deep nodes
containing still-to-be-replicated WAL(s) - I suspect there is either some
parsing bug or something which is causing the replication source to not
work - also, which version are you using - does it have
https://issues.apache.org/jira/browse/HBASE-8207 - since you use hyphens in
your paths. One way to get back up is to delete these nodes, but then you
lose the data in these WAL(s)...
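
Concretely, dropping one of these queues from zkcli would look like this
(path taken from the session-expired message earlier in the thread; rmr is
recursive, and every WAL tracked under the znode is abandoned, i.e. never
replicated to the slave):

  rmr /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475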


Re: RS crash upon replication

2013-05-22 Thread Amit Mor
yes, indeed - hyphens are part of the host name (annoying legacy stuff in
my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was
backported by Cloudera into their flavor of 0.94.2, but the mysterious
occurrence of the percent sign in zkcli (ls
/hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
might be a sign of such a problem. How deep should my rmr in zkcli be (an
example would be most welcomed :) ? I have no serious problem running
copyTable with a time period corresponding to the outage and then starting
the sync back again. One question though, how did it cause a crash ?
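
For the copyTable pass over the outage window, the stock CopyTable tool takes
millisecond timestamps; a sketch (the table name, times, and slave quorum
address are placeholders, not values from this thread):

  hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --starttime=1369230000000 --endtime=1369260000000 \
    --peer.adr=slave-zk-quorum:2181:/hbase my_table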



Re: RS crash upon replication

2013-05-22 Thread Amit Mor
Yes, I have checked the source files of the 0.94.2-cdh4.2.1 jar and the
HBASE-8207 issue is present in the source, namely:

String[] parts = peerClusterZnode.split("-");
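
A minimal, self-contained sketch (not the actual ReplicationSource code) of
why that split mis-parses failover queues once hostnames contain hyphens:

import java.util.Arrays;

public class SplitBugDemo {
    public static void main(String[] args) {
        // Failover queue znode name: peer id "1" joined with "-" to the
        // dead server's ServerName (structure as in the paths above).
        String peerClusterZnode = "1-va-p-hbase-02-e,60020,1369042377129";

        // 0.94.2 splits on every hyphen, including those in the hostname,
        String[] parts = peerClusterZnode.split("-");

        // so instead of ["1", "va-p-hbase-02-e,60020,1369042377129"] this
        // prints [1, va, p, hbase, 02, e,60020,1369042377129]; a znode path
        // rebuilt from these pieces doesn't exist, hence NoNodeException.
        System.out.println(Arrays.toString(parts));
    }
}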



Re: RS crash upon replication

2013-05-22 Thread Himanshu Vashishtha
I'd suggest patching the code with 8207; cdh4.2.1 doesn't have it.

With hyphens in the name, ReplicationSource gets confused and tries to set
data in a znode which doesn't exist.

Thanks,
Himanshu



Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
It seems I can reproduce this - I did a few rolling restarts and got
screwed with NoNode exceptions - I am running 0.94.7 which has the fix but
my nodes don't contain hyphens - nodes are no longer coming back up...

Thanks
Varun



Re: RS crash upon replication

2013-05-22 Thread Himanshu Vashishtha
That sounds like a bug for sure. Could you create a jira with logs/znode
dump/steps to reproduce it?

Thanks,
himanshu

