I agree that we should try to add a test for this. It's lucky that you found it, Lars, but it will be even better if the test suite can detect such things.
Has anyone opened a JIRA for the test part?

JM

2013/3/14 lars hofhansl <la...@apache.org>:
> I have proposed some minor changes (including adding the jitter) on
> HBASE-8099.
> Turns out there already is a wait-time to give the cluster a chance to
> shut down. It defaults to 2s, which was not enough in our case.
>
> Let's do a test (if we think that can be done) in a different jira.
>
> -- Lars
> ________________________________
> From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> Sent: Wednesday, March 13, 2013 8:59 PM
> Subject: Re: Replication hosed after simple cluster restart
>
> On Wed, Mar 13, 2013 at 8:48 PM, lars hofhansl <la...@apache.org> wrote:
>> Yeah, lemme sink the RC... We do have a fix.
>>
>> Consider it sunk.
>>
>> In the end there are some more issues to discuss anyway.
>> - Can we avoid RSs taking over queues during a clean shutdown/restart?
>> Without multi we can actually lose data to replicate this way (one RS is
>> shut down, another takes over and is itself shut down) - unless I
>> misunderstand.
>
> I agree, because even if the queues do move, we are not using locality:
> the regionserver that eventually takes a queue over will read the log
> files remotely. One way I can think of is to do a scan of the available
> regionservers in the /hbase/rs znodes and then decide whether to start
> the failover processing.
>
>> - Should we stagger the attempts to move the queues, for example with a
>> random wait between 0 and 10s, so that not all RSs try at the same time?
>> - A test for this scenario? (That's probably tricky)
>
> How about adding a jitter (a random sleep in (0-10] sec) in the run method
> of the NodeFailoverWorker before it actually starts the failover
> processing? I will try to come up with a test case.
>
>> -- Lars
>>
>> ________________________________
>> From: Andrew Purtell <apurt...@apache.org>
>> To: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> Sent: Wednesday, March 13, 2013 8:22 PM
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> If Himanshu (?) can fix it quickly we should try to get it in here IMHO.
>>
>> On Wednesday, March 13, 2013, Ted Yu wrote:
>>
>>> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
>>> HBASE-2611 Handle RS that fails while processing the failure of another
>>> one (Himanshu Vashishtha)
>>>
>>> It went into 0.94.5, and the feature is off by default:
>>>
>>>   <name>hbase.zookeeper.useMulti</name>
>>>   <value>false</value>
>>>
>>> The fact that Lars was the first to report the following problem suggests
>>> that no other user has tried this feature yet.
>>>
>>> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>>>
>>> Cheers
>>>
>>> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <la...@apache.org> wrote:
>>>
>>> > Hey, no problem. It's cool that we found it in a test env. It's probably
>>> > quite hard to reproduce.
>>> > This is in 0.94.5, but the feature is off by default.
>>> >
>>> > What's the general thought here, should I kill the current 0.94.6 RC
>>> > for this?
>>> > My gut says: Yes.
>>> >
>>> > I'm also a bit worried about these:
>>> >
>>> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>>> > java.io.EOFException
>>> >     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>> >     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>> >     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>> >     at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>>> >     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>>> >     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>>> >     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>> >     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>> >     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>>> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>>> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>>> > 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>>> >
>>> > This happens after bouncing the cluster a second time, and these
>>> > messages have repeated every 10s for hours now. This is a separate
>>> > problem, I think.
>>> >
>>> > -- Lars
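
By the way, the jitter Himanshu suggests above could look something like this
in NodeFailoverWorker.run(). Just a sketch, not the actual HBASE-8099 patch;
isRsStillAlive() is a made-up helper that would list the live ephemeral znodes
under /hbase/rs:

    // Hypothetical sketch of NodeFailoverWorker.run() with the proposed jitter.
    public void run() {
      try {
        // Stagger the workers: sleep a random (0-10] seconds so all live
        // region servers don't race for the dead server's queues at once.
        Thread.sleep(1 + new Random().nextInt(10000));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // On a clean restart the "failed" server may already be back: check
      // the live ephemeral znodes under /hbase/rs before adopting its queues.
      if (isRsStillAlive(this.rsZnode)) {  // invented helper, for illustration
        return; // nothing to fail over
      }
      // ... existing queue-adoption logic follows (copyQueuesFromRSUsingMulti
      // when hbase.zookeeper.useMulti is true, the older copy path otherwise) ...
    }
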
>>> > ________________________________
>>> > From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
>>> > To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> > Cc: Ted Yu <yuzhih...@gmail.com>
>>> > Sent: Wednesday, March 13, 2013 6:38 PM
>>> > Subject: Re: Replication hosed after simple cluster restart
>>> >
>>> > This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>>> > might not be able to move later on, resulting in bogus znodes.
>>> > I'll fix this ASAP. Weird that it didn't happen in my earlier testing.
>>> > Sorry about this.
>>> >
>>> > On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
>>> > > Sorry, 0.94.6 RC1.
>>> > > (I complain about folks not reporting the version all the time, and
>>> > > then I do it too.)
>>> > >
>>> > > ________________________________
>>> > > From: Ted Yu <yuzhih...@gmail.com>
>>> > > To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> > > Sent: Wednesday, March 13, 2013 6:17 PM
>>> > > Subject: Re: Replication hosed after simple cluster restart
>>> > >
>>> > > Did this happen on 0.94.5?
>>> > >
>>> > > Thanks
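
For anyone following along, the failure mode Himanshu describes above, reduced
to its shape. This is a simplification, not the real 0.94 ReplicationZookeeper
code; buildQueueList() and buildMoveOps() are invented names:

    // The read phase builds the list of queues to adopt, then one atomic
    // ZooKeeper multi deletes the dead RS's znodes and recreates them under
    // the new owner. If the multi fails (say, another RS won the race), the
    // buggy version returned the list anyway, so the caller replicated from
    // queues whose znodes it never owned -- the "bogus znodes" above.
    SortedMap<String, SortedSet<String>> copyQueuesFromRSUsingMulti(String znode) {
      SortedMap<String, SortedSet<String>> queues = buildQueueList(znode);
      List<ZKUtilOp> ops = buildMoveOps(znode, queues);
      try {
        ZKUtil.multiOrSequential(this.zookeeper, ops, false); // all or nothing
      } catch (KeeperException e) {
        return null; // a failed multi must mean we adopted nothing
      }
      return queues;
    }
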
>>> > > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
>>> > >
>>> > >> We just ran into an interesting scenario. We restarted a cluster that
>>> > >> was set up as a replication source. The stop went cleanly.
>>> > >>
>>> > >> Upon restart *all* regionservers aborted within a few seconds with
>>> > >> variations of these errors:
>>> > >> http://pastebin.com/3iQVuBqS
>>> > >>
>>> > >> This is scary!
>>> > >>
>>> > >> -- Lars
>>
>> --
>> Best regards,
>>
>>   - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)