Re: Replication hosed after simple cluster restart

Ted Yu Wed, 13 Mar 2013 20:06:48 -0700

This was the JIRA that introduced copyQueuesFromRSUsingMulti():
HBASE-2611 Handle RS that fails while processing the failure of another one
(Himanshu Vashishtha)


It went into 0.94.5, and the feature is off by default:

    <name>hbase.zookeeper.useMulti</name>
    <value>false</value>
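
Spelled out in full, the entry in hbase-site.xml would look roughly like
this. It is only a sketch: the surrounding <property> element is the standard
Hadoop configuration wrapper, and the value shown keeps the feature off.

    <property>
      <name>hbase.zookeeper.useMulti</name>
      <value>false</value>
    </property>

Setting the value to true is what turns the multi-based code path on, and
that assumes every ZooKeeper server in the quorum runs 3.4 or later.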

The fact that Lars is the first to report the problem quoted below suggests
that no other user has tried this feature.

Hence I think 0.94.6 RC1 doesn't need to be sunk.

Cheers

On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <[email protected]> wrote:

> Hey no problem. It's cool that we found it in a test env. It's probably
> quite hard to reproduce.
> This is in 0.94.5 but this feature is off by default.
>
> What's the general thought here, should I kill the current 0.94.6 rc for
> this?
> My gut says: Yes.
>
>
> I'm also a bit worried about these:
> 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
> 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>
> This happens after bouncing the cluster a 2nd time and these messages
> repeat every 10s (for hours now). This is a separate problem I think.
>
> -- Lars
>
>   ------------------------------
> *From:* Himanshu Vashishtha <[email protected]>
>
> *To:* [email protected]; lars hofhansl <[email protected]>
> *Cc:* Ted Yu <[email protected]>
> *Sent:* Wednesday, March 13, 2013 6:38 PM
>
> *Subject:* Re: Replication hosed after simple cluster restart
>
> This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
> might not be able to move later on, resulting in bogus znodes.
> I'll fix this asap. Weird it didn't happen in my testing earlier.
> Sorry about this.
>
>
> On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <[email protected]> wrote:
> > Sorry 0.94.6RC1
> > (I complain about folks not reporting the version all the time, and then I do it too)
> >
> >
> >
> > ________________________________
> >  From: Ted Yu <[email protected]>
> > To: [email protected]; lars hofhansl <[email protected]>
> > Sent: Wednesday, March 13, 2013 6:17 PM
> > Subject: Re: Replication hosed after simple cluster restart
> >
> >
> > Did this happen on 0.94.5 ?
> >
> > Thanks
> >
> >
> > On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <[email protected]> wrote:
> >
> >> We just ran into an interesting scenario. We restarted a cluster that was set up as a replication source.
> >> The stop went cleanly.
> >>
> >> Upon restart *all* regionservers aborted within a few seconds with variations of these errors:
> >> http://pastebin.com/3iQVuBqS
> >>
> >> This is scary!
> >>
> >> -- Lars
>
>
>
