Thanks for the helpful comments. I will certainly dig deeper now that 
everything has stabilized. Regarding J-D's comment - my slave cluster had been 
down for about 4 hours (it's used for offline stuff), and at the very moment it 
came back online, 5 RS's of my replication-master cluster crashed. Since I had 
no time to figure out what went wrong with the replication, I issued 
'stop_replication', knowing it's a last resort, since I had to get those 
production RS's back online asap. I think renaming that command to something 
like 'abort_replication' would be more fitting. On the other hand, 
remove_peer("1") at a time of crisis feels like a developer's solution to a 
DBA's problem ;) 
Regarding copyTable, it's all well and good, but one has to consider that I'm 
on EC2 and the cluster is already stretched by 'online' service requests, so 
copyTable would hit its resources quite badly. I'll be glad to post an update. 
Thanks again,
Amit

-------- Original message --------
From: Varun Sharma <va...@pinterest.com> 
Date:  
To: user@hbase.apache.org 
Subject: Re: RS crash upon replication 
 
But wouldn't a copy table between timestamps bring you back? Since the mutations
are all timestamp based, we should be okay. Basically, doing a copy table that
spans the downtime interval?


On Thu, May 23, 2013 at 9:48 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> fwiw stop_replication is a kill switch, not a general way to start and
> stop replicating, and start_replication may put you in an inconsistent
> state:
>
> hbase(main):001:0> help 'stop_replication'
> Stops all the replication features. The state in which each
> stream stops in is undetermined.
> WARNING:
> start/stop replication is only meant to be used in critical load
> situations.
>
> On Thu, May 23, 2013 at 1:17 AM, Amit Mor <amit.mor.m...@gmail.com> wrote:
> > No, the servers came out fine just because after the crash (only the RS's -
> > the masters were still running), I immediately pulled the brakes with
> > stop_replication. Then I started the RS's and they came back fine (not
> > replicating). Once I hit 'start_replication' again they crashed for the
> > second time. Eventually I deleted the heavily nested replication znodes and
> > the 'start_replication' succeeded. I didn't patch 8207 because I'm on CDH
> > with the Cloudera Manager Parcels thing and I'm still trying to figure out
> > how to replace their jars with mine in a clean and non-intrusive way.
> >
> >
> > On Thu, May 23, 2013 at 10:33 AM, Varun Sharma <va...@pinterest.com>
> wrote:
> >
> >> Actually, it seems like something else was wrong here - the servers
> came up
> >> just fine on trying again - so could not really reproduce the issue.
> >>
> >> Amit: Did you try patching 8207 ?
> >>
> >> Varun
> >>
> >>
> >> On Wed, May 22, 2013 at 5:40 PM, Himanshu Vashishtha <
> hv.cs...@gmail.com
> >> >wrote:
> >>
> >> > That sounds like a bug for sure. Could you create a jira with
> logs/znode
> >> > dump/steps to reproduce it?
> >> >
> >> > Thanks,
> >> > himanshu
> >> >
> >> >
> >> > On Wed, May 22, 2013 at 5:01 PM, Varun Sharma <va...@pinterest.com>
> >> wrote:
> >> >
> >> > > It seems I can reproduce this - I did a few rolling restarts and got
> >> > > screwed with NoNode exceptions - I am running 0.94.7 which has the
> fix
> >> > but
> >> > > my nodes don't contain hyphens - nodes are no longer coming back
> up...
> >> > >
> >> > > Thanks
> >> > > Varun
> >> > >
> >> > >
> >> > > On Wed, May 22, 2013 at 3:02 PM, Himanshu Vashishtha <
> >> hv.cs...@gmail.com
> >> > > >wrote:
> >> > >
> >> > > > I'd suggest to please patch the code with 8207;  cdh4.2.1 doesn't
> >> have
> >> > > it.
> >> > > >
> >> > > > With hyphens in the name, ReplicationSource gets confused and
> tried
> >> to
> >> > > set
> >> > > > data in a znode which doesn't exist.
> >> > > >
> >> > > > Thanks,
> >> > > > Himanshu
> >> > > >
> >> > > >
> >> > > > On Wed, May 22, 2013 at 2:42 PM, Amit Mor <
> amit.mor.m...@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > yes, indeed - hyphens are part of the host name (annoying legacy
> >> > stuff
> >> > > in
> >> > > > > my company). It's hbase-0.94.2-cdh4.2.1. I have no idea if
> 0.94.6
> >> was
> >> > > > > backported by Cloudera into their flavor of 0.94.2, but
> >> > > > > the mysterious occurrence of the percent sign in zkcli (ls
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> /hbase/replication/rs/va-p-hbase-02-d,60020,1369249862401/1-va-p-hbase-02-e,60020,1369042377129-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-02-e%2C60020%2C1369042377129.1369227474895)
> >> > > > > might be a sign for such problem. How deep should my rmr in
> zkcli
> >> (an
> >> > > > > example would be most welcomed :) be ? I have no serious problem
> >> > > running
> >> > > > > copyTable with a time period corresponding to the outage and
> then
> >> to
> >> > > > start
> >> > > > > the sync back again. One question though, how did it cause a
> crash
> >> ?
> >> > > > >
> >> > > > >
> >> > > > > On Thu, May 23, 2013 at 12:32 AM, Varun Sharma <
> >> va...@pinterest.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > I believe there were cascading failures which got these deep
> >> nodes
> >> > > > > > containing still to be replicated WAL(s) - I suspect there is
> >> > either
> >> > > > some
> >> > > > > > parsing bug or something which is causing the replication
> source
> >> to
> >> > > not
> >> > > > > > work - also which version are you using - does it have
> >> > > > > > https://issues.apache.org/jira/browse/HBASE-8207 - since you
> use
> >> > > > hyphens
> >> > > > > > in
> >> > > > > > our paths. One way to get back up is to delete these nodes but
> >> then
> >> > > you
> >> > > > > > lose data in these WAL(s)...
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, May 22, 2013 at 2:22 PM, Amit Mor <
> >> amit.mor.m...@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > >  va-p-hbase-02-d,60020,1369249862401
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, May 23, 2013 at 12:20 AM, Varun Sharma <
> >> > > va...@pinterest.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Basically
> >> > > > > > > >
> >> > > > > > > > ls /hbase/rs and what do you see for va-p-02-d ?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Wed, May 22, 2013 at 2:19 PM, Varun Sharma <
> >> > > va...@pinterest.com
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Can you do ls /hbase/rs and see what you get for 02-d -
> >> > instead
> >> > > > of
> >> > > > > > > > looking
> >> > > > > > > > > in /replication/, could you look in
> /hbase/replication/rs
> >> - I
> >> > > > want
> >> > > > > to
> >> > > > > > > see
> >> > > > > > > > > if the timestamps are matching or not ?
> >> > > > > > > > >
> >> > > > > > > > > Varun
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Wed, May 22, 2013 at 2:17 PM, Varun Sharma <
> >> > > > va...@pinterest.com
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > >> I see - so looks okay - there's just a lot of deep
> nesting
> >> > in
> >> > > > > there
> >> > > > > > -
> >> > > > > > > if
> >> > > > > > > > >> you look into these you nodes by doing ls - you should
> >> see a
> >> > > > bunch
> >> > > > > > of
> >> > > > > > > > >> WAL(s) which still need to be replicated...
> >> > > > > > > > >>
> >> > > > > > > > >> Varun
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> On Wed, May 22, 2013 at 2:16 PM, Varun Sharma <
> >> > > > > va...@pinterest.com
> >> > > > > > > > >wrote:
> >> > > > > > > > >>
> >> > > > > > > > >>> 2013-05-22 15:31:25,929 WARN
> >> > > > > > > > >>>
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > > Possibly
> >> > > > > > > > transient
> >> > > > > > > > >>> ZooKeeper exception:
> >> > > > > > > > >>>
> >> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > >>> KeeperErrorCode = Session expired for *
> >> > > > > > > > >>>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731-va-p-hbase-02-d,60020,1369233252475/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> >> > > > > > > > >>> *
> >> > > > > > > > >>> *
> >> > > > > > > > >>> *
> >> > > > > > > > >>> *01->[01->02->02]->01*
> >> > > > > > > > >>>
> >> > > > > > > > >>> *Looks like a bunch of cascading failures causing this
> >> deep
> >> > > > > > > nesting...
> >> > > > > > > > *
> >> > > > > > > > >>>
> >> > > > > > > > >>>
> >> > > > > > > > >>> On Wed, May 22, 2013 at 2:09 PM, Amit Mor <
> >> > > > > amit.mor.m...@gmail.com
> >> > > > > > > > >wrote:
> >> > > > > > > > >>>
> >> > > > > > > > >>>> empty return:
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> [zk: va-p-zookeeper-01-c:2181(CONNECTED) 10] ls
> >> > > > > > > > >>>>
> >> > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> []
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> On Thu, May 23, 2013 at 12:05 AM, Varun Sharma <
> >> > > > > > va...@pinterest.com
> >> > > > > > > >
> >> > > > > > > > >>>> wrote:
> >> > > > > > > > >>>>
> >> > > > > > > > >>>> > Do an "ls" not a get here and give the output ?
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>> > ls
> >> > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>> > On Wed, May 22, 2013 at 1:53 PM,
> >> > amit.mor.m...@gmail.com<
> >> > > > > > > > >>>> > amit.mor.m...@gmail.com> wrote:
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>> > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 3] get
> >> > > > > > > > >>>> > >
> >> > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > > cZxid = 0x60281c1de
> >> > > > > > > > >>>> > > ctime = Wed May 22 15:11:17 EDT 2013
> >> > > > > > > > >>>> > > mZxid = 0x60281c1de
> >> > > > > > > > >>>> > > mtime = Wed May 22 15:11:17 EDT 2013
> >> > > > > > > > >>>> > > pZxid = 0x60281c1de
> >> > > > > > > > >>>> > > cversion = 0
> >> > > > > > > > >>>> > > dataVersion = 0
> >> > > > > > > > >>>> > > aclVersion = 0
> >> > > > > > > > >>>> > > ephemeralOwner = 0x0
> >> > > > > > > > >>>> > > dataLength = 0
> >> > > > > > > > >>>> > > numChildren = 0
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > > On Wed, May 22, 2013 at 11:49 PM, Ted Yu <
> >> > > > > yuzhih...@gmail.com
> >> > > > > > >
> >> > > > > > > > >>>> wrote:
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > > > What does this command show you ?
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > > > get
> >> > > > > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > > > Cheers
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > > > On Wed, May 22, 2013 at 1:46 PM,
> >> > > > amit.mor.m...@gmail.com<
> >> > > > > > > > >>>> > > > amit.mor.m...@gmail.com> wrote:
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > > > > ls
> >> > > > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> >> > > > > > > > >>>> > > > > [1]
> >> > > > > > > > >>>> > > > > [zk: va-p-zookeeper-01-c:2181(CONNECTED) 2]
> ls
> >> > > > > > > > >>>> > > > >
> >> > > > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379/1
> >> > > > > > > > >>>> > > > > []
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > > I'm on hbase-0.94.2-cdh4.2.1
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > > Thanks
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > > On Wed, May 22, 2013 at 11:40 PM, Varun
> Sharma <
> >> > > > > > > > >>>> va...@pinterest.com>
> >> > > > > > > > >>>> > > > > wrote:
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > > > > Also what version of HBase are you running
> ?
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > > > On Wed, May 22, 2013 at 1:38 PM, Varun
> Sharma
> >> <
> >> > > > > > > > >>>> va...@pinterest.com
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> > > > > wrote:
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > > > > Basically,
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > You had va-p-hbase-02 crash - that caused
> >> all
> >> > > the
> >> > > > > > > > >>>> replication
> >> > > > > > > > >>>> > > related
> >> > > > > > > > >>>> > > > > > data
> >> > > > > > > > >>>> > > > > > > in zookeeper to be moved to va-p-hbase-01
> >> and
> >> > > have
> >> > > > > it
> >> > > > > > > take
> >> > > > > > > > >>>> over
> >> > > > > > > > >>>> > for
> >> > > > > > > > >>>> > > > > > > replicating 02's logs. Now each region
> >> server
> >> > > also
> >> > > > > > > > >>>> maintains an
> >> > > > > > > > >>>> > > > > in-memory
> >> > > > > > > > >>>> > > > > > > state of whats in ZK, it seems like when
> you
> >> > > start
> >> > > > > up
> >> > > > > > > 01,
> >> > > > > > > > >>>> its
> >> > > > > > > > >>>> > > trying
> >> > > > > > > > >>>> > > > to
> >> > > > > > > > >>>> > > > > > > replicate the 02 logs underneath but its
> >> > failing
> >> > > > to
> >> > > > > > > > because
> >> > > > > > > > >>>> that
> >> > > > > > > > >>>> > > data
> >> > > > > > > > >>>> > > > > is
> >> > > > > > > > >>>> > > > > > > not in ZK. This is somewhat weird...
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > Can you open the zookeepeer shell and do
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > ls
> >> > > > > > > > /hbase/replication/rs/va-p-hbase-01-c,60020,1369249873379
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > And give the output ?
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > > On Wed, May 22, 2013 at 1:27 PM,
> >> > > > > > > amit.mor.m...@gmail.com<
> >> > > > > > > > >>>> > > > > > > amit.mor.m...@gmail.com> wrote:
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > >> Hi,
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> This is bad ... and happened twice: I
> had
> >> my
> >> > > > > > > > >>>> replication-slave
> >> > > > > > > > >>>> > > > cluster
> >> > > > > > > > >>>> > > > > > >> offlined. I performed quite a massive
> Merge
> >> > > > > operation
> >> > > > > > > on
> >> > > > > > > > >>>> it and
> >> > > > > > > > >>>> > > > after
> >> > > > > > > > >>>> > > > > a
> >> > > > > > > > >>>> > > > > > >> couple of hours it had finished and I
> >> > returned
> >> > > it
> >> > > > > > back
> >> > > > > > > > >>>> online.
> >> > > > > > > > >>>> > At
> >> > > > > > > > >>>> > > > the
> >> > > > > > > > >>>> > > > > > same
> >> > > > > > > > >>>> > > > > > >> time, the replication-master RS machines
> >> > > crashed
> >> > > > > (see
> >> > > > > > > > first
> >> > > > > > > > >>>> > crash
> >> > > > > > > > >>>> > > > > > >> http://pastebin.com/1msNZ2tH) with the
> >> first
> >> > > > > > exception
> >> > > > > > > > >>>> being:
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > org.apache.zookeeper.KeeperException$NoNodeException:
> >> > > > > > > > >>>> > > > KeeperErrorCode
> >> > > > > > > > >>>> > > > > =
> >> > > > > > > > >>>> > > > > > >> NoNode for
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> /hbase/replication/rs/va-p-hbase-01-c,60020,1369233253404/1-va-p-hbase-01-c,60020,1369042378287-va-p-hbase-02-c,60020,1369042377731/va-p-hbase-01-c%2C60020%2C1369042378287.1369220050719
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>>
> >> > > > > > >
> >> > >
> org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > >
> >> > > > > > > >
> >> > > >
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > >
> >> > > > > org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:354)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> >
> >> > > > > >
> org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:846)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> >
> >> > > > > >
> org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:898)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> >
> >> > > > > >
> org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:892)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:154)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:638)
> >> > > > > > > > >>>> > > > > > >>         at
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:387)
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> Before restarting the crashed RS's, I
> have
> >> > > > applied
> >> > > > > a
> >> > > > > > > > >>>> > > > > 'stop_replication'
> >> > > > > > > > >>>> > > > > > >> cmd. Then fired up the RS's again.
> They've
> >> > > > started
> >> > > > > > o.k.
> >> > > > > > > > >>>> but once
> >> > > > > > > > >>>> > > > I've
> >> > > > > > > > >>>> > > > > > hit
> >> > > > > > > > >>>> > > > > > >> 'start_replication' they have crashed
> once
> >> > > again.
> >> > > > > The
> >> > > > > > > > >>>> second
> >> > > > > > > > >>>> > crash
> >> > > > > > > > >>>> > > > log
> >> > > > > > > > >>>> > > > > > >> http://pastebin.com/8Nb5epJJ has the
> same
> >> > > > initial
> >> > > > > > > > >>>> exception
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > (org.apache.zookeeper.KeeperException$NoNodeException:
> >> > > > > > > > >>>> > > > > > >> KeeperErrorCode = NoNode). I've started
> the
> >> > > crash
> >> > > > > > > region
> >> > > > > > > > >>>> servers
> >> > > > > > > > >>>> > > > again
> >> > > > > > > > >>>> > > > > > >> without replication and currently all is
> >> > well,
> >> > > > but
> >> > > > > I
> >> > > > > > > need
> >> > > > > > > > >>>> to
> >> > > > > > > > >>>> > start
> >> > > > > > > > >>>> > > > > > >> replication asap.
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> Does anyone have an idea what's going on
> >> and
> >> > > how
> >> > > > > can
> >> > > > > > I
> >> > > > > > > > >>>> solve it
> >> > > > > > > > >>>> > ?
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >> Thanks,
> >> > > > > > > > >>>> > > > > > >> Amit
> >> > > > > > > > >>>> > > > > > >>
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > > >
> >> > > > > > > > >>>> > > > > >
> >> > > > > > > > >>>> > > > >
> >> > > > > > > > >>>> > > >
> >> > > > > > > > >>>> > >
> >> > > > > > > > >>>> >
> >> > > > > > > > >>>>
> >> > > > > > > > >>>
> >> > > > > > > > >>>
> >> > > > > > > > >>
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
