I agree that we should try to add a test for this. It's lucky that you found it, Lars, but it will be even better if the test suite can detect such things.
Has anyone opened a JIRA for the test part?

JM

2013/3/14 lars hofhansl <la...@apache.org>:
> I have proposed some minor changes (including adding the jitter) on
> HBASE-8099.
> Turns out there already is a wait-time to give the cluster a chance to
> shut down. It defaults to 2s, which was not enough in our case.
>
> Let's do a test (if we think that can be done) in a different jira.
>
> -- Lars
> ________________________________
> From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> Sent: Wednesday, March 13, 2013 8:59 PM
> Subject: Re: Replication hosed after simple cluster restart
>
> On Wed, Mar 13, 2013 at 8:48 PM, lars hofhansl <la...@apache.org> wrote:
>> Yeah, lemme sink the RC... We do have a fix.
>>
>> Consider it sunk.
>>
>> In the end there are some more issues to discuss anyway.
>> - Can we avoid RSs taking over queues during a clean shutdown/restart?
>> Without multi we can actually lose data to replicate this way (one RS is
>> shut down, another takes over and is itself shut down) - unless I
>> misunderstand.
>
> I agree, because even if the queues do move, we are not using locality:
> the regionserver that eventually takes a queue over will read the log
> files remotely. One way I can think of is to do a scan of the available
> regionservers in the /hbase/rs znodes and then decide whether to start
> the failover processing.
>
>> - Should we stagger the attempts to move the queues, for example with a
>> random wait between 0 and 10s, so that not all RSs try at the same time?
>> - A test for this scenario? (That's probably tricky)
>
> How about adding a jitter (a random sleep in (0-10] sec) in the run method
> of the NodeFailoverWorker before it actually starts the failover
> processing? I will try to come up with a test case.
>
>> -- Lars
>>
>> ________________________________
>> From: Andrew Purtell <apurt...@apache.org>
>> To: "dev@hbase.apache.org" <dev@hbase.apache.org>
>> Sent: Wednesday, March 13, 2013 8:22 PM
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> If Himanshu (?) can fix it quickly we should try to get it in here IMHO.
>>
>> On Wednesday, March 13, 2013, Ted Yu wrote:
>>
>>> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
>>> HBASE-2611 Handle RS that fails while processing the failure of another
>>> one (Himanshu Vashishtha)
>>>
>>> It went into 0.94.5, and the feature is off by default:
>>>
>>>   <name>hbase.zookeeper.useMulti</name>
>>>   <value>false</value>
>>>
>>> The fact that Lars was the first to report the following problem suggests
>>> that no other user has tried this feature yet.
>>>
>>> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>>>
>>> Cheers
>>>
>>> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <la...@apache.org> wrote:
>>>
>>> > Hey, no problem. It's cool that we found it in a test env. It's probably
>>> > quite hard to reproduce.
>>> > This is in 0.94.5, but the feature is off by default.
>>> >
>>> > What's the general thought here, should I kill the current 0.94.6 RC
>>> > for this?
>>> > My gut says: Yes.
>>> >
>>> > I'm also a bit worried about these:
>>> >
>>> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>>> > java.io.EOFException
>>> >     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>> >     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>> >     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>> >     at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>>> >     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>>> >     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>>> >     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>> >     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>> >     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>>> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>>> >     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>>> > 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>>> >
>>> > This happens after bouncing the cluster a second time, and these
>>> > messages have repeated every 10s for hours now. This is a separate
>>> > problem, I think.
>>> >
>>> > -- Lars
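
By the way, the jitter Himanshu suggests above could look something like this
in NodeFailoverWorker.run(). Just a sketch, not the actual HBASE-8099 patch;
isRsStillAlive() is a made-up helper that would list the live ephemeral znodes
under /hbase/rs:

    // Hypothetical sketch of NodeFailoverWorker.run() with the proposed jitter.
    public void run() {
      try {
        // Stagger the workers: sleep a random (0-10] seconds so all live
        // region servers don't race for the dead server's queues at once.
        Thread.sleep(1 + new Random().nextInt(10000));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // On a clean restart the "failed" server may already be back: check
      // the live ephemeral znodes under /hbase/rs before adopting its queues.
      if (isRsStillAlive(this.rsZnode)) {  // invented helper, for illustration
        return; // nothing to fail over
      }
      // ... existing queue-adoption logic follows (copyQueuesFromRSUsingMulti
      // when hbase.zookeeper.useMulti is true, the older copy path otherwise) ...
    }
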
>>> > ________________________________
>>> > From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
>>> > To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> > Cc: Ted Yu <yuzhih...@gmail.com>
>>> > Sent: Wednesday, March 13, 2013 6:38 PM
>>> > Subject: Re: Replication hosed after simple cluster restart
>>> >
>>> > This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>>> > might not be able to move later on, resulting in bogus znodes.
>>> > I'll fix this ASAP. Weird that it didn't happen in my earlier testing.
>>> > Sorry about this.
>>> >
>>> > On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
>>> > > Sorry, 0.94.6 RC1.
>>> > > (I complain about folks not reporting the version all the time, and
>>> > > then I do it too.)
>>> > >
>>> > > ________________________________
>>> > > From: Ted Yu <yuzhih...@gmail.com>
>>> > > To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> > > Sent: Wednesday, March 13, 2013 6:17 PM
>>> > > Subject: Re: Replication hosed after simple cluster restart
>>> > >
>>> > > Did this happen on 0.94.5?
>>> > >
>>> > > Thanks
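
For anyone following along, the failure mode Himanshu describes above, reduced
to its shape. This is a simplification, not the real 0.94 ReplicationZookeeper
code; buildQueueList() and buildMoveOps() are invented names:

    // The read phase builds the list of queues to adopt, then one atomic
    // ZooKeeper multi deletes the dead RS's znodes and recreates them under
    // the new owner. If the multi fails (say, another RS won the race), the
    // buggy version returned the list anyway, so the caller replicated from
    // queues whose znodes it never owned -- the "bogus znodes" above.
    SortedMap<String, SortedSet<String>> copyQueuesFromRSUsingMulti(String znode) {
      SortedMap<String, SortedSet<String>> queues = buildQueueList(znode);
      List<ZKUtilOp> ops = buildMoveOps(znode, queues);
      try {
        ZKUtil.multiOrSequential(this.zookeeper, ops, false); // all or nothing
      } catch (KeeperException e) {
        return null; // a failed multi must mean we adopted nothing
      }
      return queues;
    }
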
>>> > > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
>>> > >
>>> > >> We just ran into an interesting scenario. We restarted a cluster that
>>> > >> was set up as a replication source. The stop went cleanly.
>>> > >>
>>> > >> Upon restart *all* regionservers aborted within a few seconds with
>>> > >> variations of these errors:
>>> > >> http://pastebin.com/3iQVuBqS
>>> > >>
>>> > >> This is scary!
>>> > >>
>>> > >> -- Lars
>>
>> --
>> Best regards,
>>
>>   - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)