On Wed, Mar 13, 2013 at 8:48 PM, lars hofhansl <la...@apache.org> wrote:
> Yeah, lemme sink the RC... We do have a fix.
>
>
> Consider it sunk.
>
> In the end there are some more issues to discuss anyway.
> - Can we avoid RSs taking over queues during a clean shutdown/restart?
> Without multi we can actually lose data to replicate this way (one RS is
> shut down, another takes over and is itself shut down) - unless I
> misunderstand.

I agree, because even if the queues do move, there is no locality benefit: the regionserver that eventually takes a queue over will read the log files remotely. One way I can think of is to scan the available regionservers under the /hbase/rs znode and then decide whether failover processing should start at all.

>
> - Should we stagger the attempts to move the queues for example with a random
> wait between 0 and 10s, so that not all RSs try at the same time?
> - A test for this scenario? (That's probably tricky)

How about adding a jitter (a random sleep of up to 10 seconds) in the run method of the NodeFailoverWorker, before it actually starts the failover processing? I will try to come up with a test case.

>
>
> -- Lars
>
>
>
> ________________________________
> From: Andrew Purtell <apurt...@apache.org>
> To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> Sent: Wednesday, March 13, 2013 8:22 PM
> Subject: Re: Replication hosed after simple cluster restart
>
> If Himanshu (?) can fix it quickly we should try to get it in here IMHO.
>
> On Wednesday, March 13, 2013, Ted Yu wrote:
>
>> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
>> HBASE-2611 Handle RS that fails while processing the failure of another one
>> (Himanshu Vashishtha)
>>
>> It went into 0.94.5
>> And the feature is off by default:
>>
>> <name>hbase.zookeeper.useMulti</name>
>> <value>false</value>
>>
>> The fact that Lars first reported the following problem meant that no other
>> user tried this feature.
>>
>> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>>
>> Cheers
>>
>> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <la...@apache.org> wrote:
>>
>> > Hey no problem. It's cool that we found it in a test env. It's probably
>> > quite hard to reproduce.
>> > This is in 0.94.5 but this feature is off by default.
>> >
>> > What's the general thought here, should I kill the current 0.94.6 rc for
>> > this?
>> > My gut says: Yes.
>> >
>> >
>> > I'm also a bit worried about these:
>> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>> > java.io.EOFException
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>> >         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>> >         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>> >         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>> >         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>> >         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>> > 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>> >
>> > This happens after bouncing the cluster a 2nd time and these messages
>> > repeat every 10s (for hours now). This is a separate problem I think.
>> >
>> > -- Lars
>> >
>> > ------------------------------
>> > *From:* Himanshu Vashishtha <hvash...@cs.ualberta.ca>
>> > *To:* dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> > *Cc:* Ted Yu <yuzhih...@gmail.com>
>> > *Sent:* Wednesday, March 13, 2013 6:38 PM
>> > *Subject:* Re: Replication hosed after simple cluster restart
>> >
>> > This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>> > might not be able to move later on, resulting in bogus znodes.
>> > I'll fix this asap. Weird it didn't happen in my testing earlier.
>> > Sorry about this.
>> >
>> >
>> > On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
>> > > Sorry 0.94.6RC1
>> > > (I complain about folks not reporting the version all the time, and then
>> > > I do it too)
>> > >
>> > >
>> > >
>> > > ________________________________
>> > > From: Ted Yu <yuzhih...@gmail.com>
>> > > To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> > > Sent: Wednesday, March 13, 2013 6:17 PM
>> > > Subject: Re: Replication hosed after simple cluster restart
>> > >
>> > >
>> > > Did this happen on 0.94.5 ?
>> > >
>> > > Thanks
>> > >
>> > >
>> > > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
>> > >
>> > >> We just ran into an interesting scenario. We restarted a cluster that
>> > >> was setup as a replication source.
>> > >> The stop went cleanly.
>> > >>
>> > >> Upon restart *all* regionservers aborted within a few seconds with
>> > >> variations of these errors:
>> > >> http://pastebin.com/3iQVuBqS
>> > >>
>> > >> This is scary!
>> > >>
>> > >> -- Lars
>> >
>> >
>> >
>
>
> --
> Best regards,
>
> - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
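
To make the jitter proposal from the reply above concrete, here is a rough sketch of what "sleep a random amount before starting failover" could look like. It is only an illustration under stated assumptions: FailoverJitter, pickDelayMillis and withJitter are hypothetical names and are not part of HBase, and the real NodeFailoverWorker.run() would have to be adapted by hand. The 10000 ms bound is the 10 seconds floated in the thread.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not an HBase class): delays a task by a random amount. */
final class FailoverJitter {
    private FailoverJitter() {}

    /** Returns a delay in the range [1, maxMillis] milliseconds. */
    static long pickDelayMillis(Random rng, int maxMillis) {
        if (maxMillis <= 0) {
            throw new IllegalArgumentException("maxMillis must be positive");
        }
        // nextInt(bound) is in [0, bound), so adding 1 yields [1, maxMillis].
        return 1L + rng.nextInt(maxMillis);
    }

    /** Wraps a task so that it sleeps for a random delay before running. */
    static Runnable withJitter(Runnable task, Random rng, int maxMillis) {
        return () -> {
            try {
                TimeUnit.MILLISECONDS.sleep(pickDelayMillis(rng, maxMillis));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and skip the task entirely,
                // e.g. when the server is itself shutting down.
                Thread.currentThread().interrupt();
                return;
            }
            task.run();
        };
    }
}
```

Usage would be along the lines of FailoverJitter.withJitter(failoverTask, new Random(), 10000).run(), where failoverTask stands in for the existing queue takeover logic. Skipping the task on interrupt is a design choice of this sketch: a regionserver that is being stopped during its sleep never attempts the takeover, which fits the clean-shutdown concern raised earlier in the thread.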