Re: RS crash upon replication

2013-05-23 Thread Amit Mor
I have pasted most of the RS's logs just prior to their FATAL and including. Would be very thankful if someone can take a look: http://pastebin.com/qFzycXNS . Interestingly, some RS's experience an IOException for not finding an .oldlogs/ file. The rest get KeeperException$NoNodeException w/o the I

Re: RS crash upon replication

2013-05-23 Thread Amit Mor
t by 'online' service requests and copyTable would hit its resources quite badly. I'll be glad to update. Thanks again, Amit Original message ---- From: Varun Sharma Date: To: user@hbase.apache.org Subject: Re: RS crash upon replication But wouldn't a

Re: RS crash upon replication

2013-05-23 Thread Varun Sharma
But wouldn't a copy table between timestamps bring you back? Since the mutations are all timestamp-based, we should be okay. Basically, doing a copy table which supersedes the downtime interval? On Thu, May 23, 2013 at 9:48 AM, Jean-Daniel Cryans wrote: > fwiw stop_replication is a kill switch, not a ge

Re: RS crash upon replication

2013-05-23 Thread Jean-Daniel Cryans
fwiw stop_replication is a kill switch, not a general way to start and stop replicating, and start_replication may put you in an inconsistent state: hbase(main):001:0> help 'stop_replication' Stops all the replication features. The state in which each stream stops in is undetermined. WARNING: star

Re: RS crash upon replication

2013-05-23 Thread Amit Mor
No, the servers came up fine only because, after the crash (of the RS's - the masters were still running), I immediately pulled the brakes with stop_replication. Then I started the RS's and they came back fine (not replicating). Once I hit 'start_replication' again they crashed for the second time. Eventu

Re: RS crash upon replication

2013-05-23 Thread Varun Sharma
Actually, it seems like something else was wrong here - the servers came up just fine on trying again - so could not really reproduce the issue. Amit: Did you try patching 8207 ? Varun On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha wrote: > That sounds like a bug for sure. Could you crea

Re: RS crash upon replication

2013-05-22 Thread Himanshu Vashishtha
That sounds like a bug for sure. Could you create a jira with logs/znode dump/steps to reproduce it? Thanks, himanshu On Wed, May 22, 2013 at 5:01 PM, Varun Sharma wrote: > It seems I can reproduce this - I did a few rolling restarts and got > screwed with NoNode exceptions - I am running 0.94

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
It seems I can reproduce this - I did a few rolling restarts and got screwed with NoNode exceptions - I am running 0.94.7 which has the fix but my nodes don't contain hyphens - nodes are no longer coming back up... Thanks Varun On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha wrote: > I'd s

Re: RS crash upon replication

2013-05-22 Thread Himanshu Vashishtha
I'd suggest patching the code with HBASE-8207; cdh4.2.1 doesn't have it. With hyphens in the name, ReplicationSource gets confused and tries to set data in a znode which doesn't exist. Thanks, Himanshu On Wed, May 22, 2013 at 2:42 PM, Amit Mor wrote: > yes, indeed - hyphens are part of the

Re: RS crash upon replication

2013-05-22 Thread Amit Mor
Yes, I have checked the source files of the 0.94.2-cdh4.2.1 jar and the HBASE-8207 issue is present in the source code, namely: String[] parts = peerClusterZnode.split("-"); On Thu, May 23, 2013 at 12:42 AM, Amit Mor wrote: > yes, indeed - hyphens are part of the host name (annoying legacy stuf
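
To see why the split quoted above misbehaves with hyphenated host names, here is a minimal, self-contained sketch. It is not the HBase source: the class name `NaiveSplitDemo`, the method `naiveSplit`, and the hyphen-free host `host1` are made up for illustration. Only the `split("-")` call and the `va-p-hbase-02-e` server name come from this thread.

```java
import java.util.Arrays;

public class NaiveSplitDemo {
    // Hypothetical helper that mimics the split quoted in the thread.
    // It assumes "-" only ever separates the peer id from a server name.
    static String[] naiveSplit(String peerClusterZnode) {
        return peerClusterZnode.split("-");
    }

    public static void main(String[] args) {
        // Hyphen-free host name (invented): splits into peer id + server name.
        String plain = "1-host1,60020,1369042377129";
        // Host name taken from the thread: the host itself contains hyphens.
        String hyphenated = "1-va-p-hbase-02-e,60020,1369042377129";

        System.out.println(Arrays.toString(naiveSplit(plain)));
        System.out.println(Arrays.toString(naiveSplit(hyphenated)));
    }
}
```

With the hyphen-free name the split yields two parts (peer id and server name). With the hyphenated name it yields six fragments, none of which is a complete server name, so any code that treats `parts[1]` as a server would presumably end up looking for a znode that does not exist.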

Re: RS crash upon replication

2013-05-22 Thread Amit Mor
yes, indeed - hyphens are part of the host name (annoying legacy stuff in my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if 0.94.6 was backported by Cloudera into their flavor of 0.94.2, but the mysterious occurrence of the percent sign in zkcli (ls /hbase/replication/rs/va-p-hbase-02-d,60

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
I believe there were cascading failures which got these deep nodes containing still to be replicated WAL(s) - I suspect there is either some parsing bug or something which is causing the replication source to not work - also which version are you using - does it have https://issues.apache.org/jira/

Re: RS crash upon replication

2013-05-22 Thread Amit Mor
va-p-hbase-02-d,60020,1369249862401 On Thu, May 23, 2013 at 12:20 AM, Varun Sharma wrote: > Basically > > ls /hbase/rs and what do you see for va-p-02-d ? > > > On Wed, May 22, 2013 at 2:19 PM, Varun Sharma wrote: > > > Can you do ls /hbase/rs and see what you get for 02-d - instead of > look

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Basically ls /hbase/rs and what do you see for va-p-02-d ? On Wed, May 22, 2013 at 2:19 PM, Varun Sharma wrote: > Can you do ls /hbase/rs and see what you get for 02-d - instead of looking > in /replication/, could you look in /hbase/replication/rs - I want to see > if the timestamps are match

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Can you do ls /hbase/rs and see what you get for 02-d - instead of looking in /replication/, could you look in /hbase/replication/rs - I want to see if the timestamps are matching or not ? Varun On Wed, May 22, 2013 at 2:17 PM, Varun Sharma wrote: > I see - so looks okay - there's just a lot o

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
I see - so looks okay - there's just a lot of deep nesting in there - if you look into these nodes by doing ls - you should see a bunch of WAL(s) which still need to be replicated... Varun On Wed, May 22, 2013 at 2:16 PM, Varun Sharma wrote: > 2013-05-22 15:31:25,929 WARN > org.apache.hado

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
2013-05-22 15:31:25,929 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for * /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-0

Re: RS crash upon replication

2013-05-22 Thread Amit Mor
I found this: [zk: va-p-zookeeper-01-c:2181(CONNECTED) 17] ls /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401 [1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475, 1-va-p-hbase-01-d,60020,1369042382584-va-p-hbase-02-c,60020,136904
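
The nested znode names in the listing above can still be taken apart without splitting on every hyphen. The sketch below is one possible approach, not the HBASE-8207 patch. It assumes that the peer id contains no hyphen, that host names contain no commas, and that each server name has the form `host,port,startcode`. The class `QueueNameParser` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueueNameParser {
    // A "-" followed by host,port,startcode. The host may contain hyphens
    // but (by assumption) no commas; port and startcode are digits only.
    private static final Pattern SERVER = Pattern.compile("-([^,]+,\\d+,\\d+)");

    // Everything before the first "-" is treated as the peer id (assumption).
    static String peerId(String queueName) {
        int dash = queueName.indexOf('-');
        return dash < 0 ? queueName : queueName.substring(0, dash);
    }

    // Returns the chain of server names embedded after the peer id, in order.
    static List<String> servers(String queueName) {
        List<String> result = new ArrayList<>();
        int dash = queueName.indexOf('-');
        if (dash < 0) {
            return result; // plain peer id, no servers appended
        }
        Matcher m = SERVER.matcher(queueName.substring(dash));
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // First child znode name from the ls output in this thread.
        String name = "1-va-p-hbase-02-e,60020,1369042377129"
                + "-va-p-hbase-02-c,60020,1369042377731"
                + "-va-p-hbase-02-d,60020,1369233252475";
        System.out.println("peer id: " + peerId(name));
        for (String server : servers(name)) {
            System.out.println("server:  " + server);
        }
    }
}
```

The commas act as anchors here: each server name ends in two all-digit fields, so the hyphens inside `va-p-hbase-02-e` are never mistaken for separators.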

Re: RS crash upon replication

2013-05-22 Thread Amit Mor
empty return: [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1 [] On Thu, May 23, 2013 at 12:05 AM, Varun Sharma wrote: > Do an "ls" not a get here and give the output ? > > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Do an "ls" not a get here and give the output ? ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1 On Wed, May 22, 2013 at 1:53 PM, amit.mor.m...@gmail.com < amit.mor.m...@gmail.com> wrote: > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get > /hbase/replication/rs/va-p-hbase-01-c,600

Re: RS crash upon replication

2013-05-22 Thread amit.mor.m...@gmail.com
[zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1 cZxid = 0x60281c1de ctime = Wed May 22 15:11:17 EDT 2013 mZxid = 0x60281c1de mtime = Wed May 22 15:11:17 EDT 2013 pZxid = 0x60281c1de cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwn

Re: RS crash upon replication

2013-05-22 Thread Ted Yu
What does this command show you ? get /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1 Cheers On Wed, May 22, 2013 at 1:46 PM, amit.mor.m...@gmail.com < amit.mor.m...@gmail.com> wrote: > ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379 > [1] > [zk: va-p-zookeeper-01-c:218

Re: RS crash upon replication

2013-05-22 Thread amit.mor.m...@gmail.com
ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379 [1] [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2] ls /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1 [] I'm on hbase-0.94.2-cdh4.2.1 Thanks On Wed, May 22, 2013 at 11:40 PM, Varun Sharma wrote: > Also what version of HBase

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Also what version of HBase are you running ? On Wed, May 22, 2013 at 1:38 PM, Varun Sharma wrote: > Basically, > > You had va-p-hbase-02 crash - that caused all the replication related data > in zookeeper to be moved to va-p-hbase-01 and have it take over for > replicating 02's logs. Now each r

Re: RS crash upon replication

2013-05-22 Thread Varun Sharma
Basically, You had va-p-hbase-02 crash - that caused all the replication-related data in zookeeper to be moved to va-p-hbase-01 and have it take over for replicating 02's logs. Now each region server also maintains an in-memory state of what's in ZK, it seems like when you start up 01, it's trying t