Hey no problem. It's cool that we found it in a test env. It's probably quite 
hard to reproduce.
This is in 0.94.5 but this feature is off by default.

What's the general thought here, should I kill the current 0.94.6 rc for this?
My gut says: Yes.



I'm also a bit worried about these:
2013-03-14 01:42:42,271 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log 
for replication 
shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
2013-03-14 01:42:42,358 WARN 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got: 
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
        at 
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
        at 
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
        at 
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
        at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
        at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
        at 
org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
2013-03-14 01:42:42,358 WARN 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too 
long for this file, considering dumping
2013-03-14 01:42:42,358 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to 
open a reader, sleeping 1000 times 10


This happens after bouncing the cluster a second time, and these messages have 
been repeating every 10s (for hours now). I think this is a separate problem.
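For context, here's a rough sketch of what the "sleeping 1000 times 10" retry loop presumably does: sleep sleepForRetries * multiplier, with the multiplier capped, so once it saturates the source wakes up every 10s forever. The constants below are the defaults I believe we ship (replication.source.sleepforretries = 1000, replication.source.maxretriesmultiplier = 10); treat them as assumptions, this is not the actual ReplicationSource code.

```java
// Hypothetical model of the retry backoff behind the repeating
// "Unable to open a reader, sleeping 1000 times 10" log line.
public class RetryBackoffSketch {
    // Assumed defaults for replication.source.sleepforretries and
    // replication.source.maxretriesmultiplier.
    static final long SLEEP_FOR_RETRIES_MS = 1000;
    static final int MAX_RETRIES_MULTIPLIER = 10;

    // Sleep time grows with the failure count but is capped, so a
    // permanently unreadable log means retrying every 10s forever.
    static long sleepMs(int sleepMultiplier) {
        return SLEEP_FOR_RETRIES_MS * Math.min(sleepMultiplier, MAX_RETRIES_MULTIPLIER);
    }

    public static void main(String[] args) {
        System.out.println("first retry: " + sleepMs(1) + "ms");
        System.out.println("saturated:   " + sleepMs(50) + "ms");
    }
}
```

That would explain the steady 10s cadence, but not why the EOFException never clears.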


-- Lars



________________________________
 From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
To: dev@hbase.apache.org; lars hofhansl <la...@apache.org> 
Cc: Ted Yu <yuzhih...@gmail.com> 
Sent: Wednesday, March 13, 2013 6:38 PM
Subject: Re: Replication hosed after simple cluster restart
 
This is bad. Yes, copyQueuesFromRSUsingMulti returns a list of queues that it
might not actually be able to move later on, resulting in bogus znodes.
I'll fix this asap. Weird that it didn't happen in my earlier testing.
Sorry about this.
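To make the failure mode concrete, here is a simplified model (not the real HBase code, names are illustrative): the method builds the list of queues to claim and issues the ZooKeeper multi to move the znodes, but hands the list back even when the multi failed, so the caller operates on queues it never actually owned.

```java
import java.util.Collections;
import java.util.List;

// Toy model of the bug: the claimed-queue list must be conditional on the
// atomic multi actually succeeding (e.g. another RS may have claimed the
// znodes first), otherwise the caller works off bogus znodes.
public class ClaimQueuesSketch {

    // Buggy shape: list is returned regardless of the multi outcome.
    static List<String> copyQueuesBuggy(List<String> queues, boolean multiSucceeded) {
        return queues;
    }

    // Fixed shape: only report queues that were really moved.
    static List<String> copyQueuesFixed(List<String> queues, boolean multiSucceeded) {
        return multiSucceeded ? queues : Collections.emptyList();
    }

    public static void main(String[] args) {
        List<String> queues = List.of("peer1-deadserver-queue");
        boolean multiSucceeded = false; // simulate the multi op failing
        System.out.println("buggy claims: " + copyQueuesBuggy(queues, multiSucceeded).size());
        System.out.println("fixed claims: " + copyQueuesFixed(queues, multiSucceeded).size());
    }
}
```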

On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
> Sorry 0.94.6RC1
> (I complain all the time about folks not reporting the version, and then I 
> go and do it myself)
>
>
>
> ________________________________
>  From: Ted Yu <yuzhih...@gmail.com>
> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
> Sent: Wednesday, March 13, 2013 6:17 PM
> Subject: Re: Replication hosed after simple cluster restart
>
>
> Did this happen on 0.94.5 ?
>
> Thanks
>
>
> On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
>
>>We just ran into an interesting scenario. We restarted a cluster that was 
>>set up as a replication source.
>>The stop went cleanly.
>>
>>Upon restart *all* regionservers aborted within a few seconds with variations 
>>of these errors:
>>http://pastebin.com/3iQVuBqS
>>
>>This is scary!
>>
>>-- Lars
