If Himanshu (?) can fix it quickly we should try to get it in here IMHO.

On Wednesday, March 13, 2013, Ted Yu wrote:
> This was the JIRA that introduced copyQueuesFromRSUsingMulti():
> HBASE-2611 Handle RS that fails while processing the failure of another one
> (Himanshu Vashishtha)
>
> It went into 0.94.5, and the feature is off by default:
>
>   <name>hbase.zookeeper.useMulti</name>
>   <value>false</value>
>
> The fact that Lars was the first to report the problem below suggests that
> no other user has tried this feature.
>
> Hence I think 0.94.6 RC1 doesn't need to be sunk.
>
> Cheers
>
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Hey, no problem. It's cool that we found it in a test env. It's probably
> > quite hard to reproduce.
> > This is in 0.94.5, but the feature is off by default.
> >
> > What's the general thought here, should I kill the current 0.94.6 RC for
> > this? My gut says: yes.
> >
> > I'm also a bit worried about these:
> >
> > 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> >   Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
> > java.io.EOFException
> >   at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >   at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
> >   at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
> >   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
> >   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
> >   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
> >   at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
> >   at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
> >   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
> >   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
> > 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> >   Waited too long for this file, considering dumping
> > 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> >   Unable to open a reader, sleeping 1000 times 10
> >
> > This happens after bouncing the cluster a 2nd time, and these messages
> > repeat every 10s (for hours now). This is a separate problem, I think.
> >
> > -- Lars
> >
> > ------------------------------
> > *From:* Himanshu Vashishtha <hvash...@cs.ualberta.ca>
> > *To:* dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> > *Cc:* Ted Yu <yuzhih...@gmail.com>
> > *Sent:* Wednesday, March 13, 2013 6:38 PM
> > *Subject:* Re: Replication hosed after simple cluster restart
> >
> > This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
> > might not be able to move later on, resulting in bogus znodes.
> > I'll fix this ASAP. Weird that it didn't happen in my testing earlier.
> > Sorry about this.
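Himanshu's diagnosis above describes a classic non-atomic copy-then-move hazard: the queues of a dead regionserver are first copied under the claiming regionserver, and if the subsequent move/cleanup step fails, the copies survive as orphan ("bogus") znodes. A toy sketch of that failure mode, using a plain dict in place of ZooKeeper znodes; all paths and names here are illustrative, not the actual HBase implementation:

```python
def claim_queues(znodes: dict, dead_rs: str, new_rs: str,
                 fail_on_move: bool = False) -> list:
    """Toy model: copy a dead regionserver's replication queues to the
    claiming regionserver, then delete the originals.  If the second
    step fails partway (simulated by fail_on_move), the copies remain
    alongside the originals as orphan 'bogus' entries."""
    prefix = f"/replication/rs/{dead_rs}/"
    claimed = []
    # Step 1: copy every queue owned by the dead RS under the new RS.
    for path, data in list(znodes.items()):
        if path.startswith(prefix):
            queue = path.rsplit("/", 1)[-1]
            znodes[f"/replication/rs/{new_rs}/{dead_rs}-{queue}"] = data
            claimed.append(queue)
    # Step 2: remove the originals.  A crash here strands the copies.
    if fail_on_move:
        return claimed  # simulated failure: originals never removed
    for queue in claimed:
        del znodes[prefix + queue]
    return claimed
```

Wrapping both steps in a single ZooKeeper multi() transaction (what hbase.zookeeper.useMulti enables) is meant to make the claim all-or-nothing, which is why a bug in that path leaves inconsistent state.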
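The "sleeping 1000 times 10" line in the log Lars pasted reflects a capped multiplicative backoff: the base sleep interval (1000 ms) is multiplied by a retry counter that stops growing at a maximum (10), which is why the failed reader-open settles into the steady 10-second cadence he observed. A minimal sketch of that pattern; the function and parameter names are illustrative, not ReplicationSource's actual code:

```python
def backoff_ms(base_ms: int, failed_attempts: int, cap: int = 10) -> int:
    """Capped multiplicative backoff: wait base_ms times the number of
    failed attempts so far, but never more than base_ms * cap.
    Illustrative only, not the actual HBase implementation."""
    return base_ms * min(failed_attempts, cap)

# From the 10th failure onward every retry waits the same 10 s,
# matching the "repeat every 10s (for hours now)" in the log.
```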
> > On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > > Sorry, 0.94.6RC1.
> > > (I complain about folks not reporting the version all the time, and
> > > then I do it too.)
> > >
> > > ________________________________
> > > From: Ted Yu <yuzhih...@gmail.com>
> > > To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> > > Sent: Wednesday, March 13, 2013 6:17 PM
> > > Subject: Re: Replication hosed after simple cluster restart
> > >
> > > Did this happen on 0.94.5?
> > >
> > > Thanks
> > >
> > > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
> > >
> > > > We just ran into an interesting scenario. We restarted a cluster that
> > > > was set up as a replication source.
> > > > The stop went cleanly.
> > > >
> > > > Upon restart *all* regionservers aborted within a few seconds with
> > > > variations of these errors:
> > > > http://pastebin.com/3iQVuBqS
> > > >
> > > > This is scary!
> > > >
> > > > -- Lars

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)