Re: Replication hosed after simple cluster restart

Himanshu Vashishtha Wed, 13 Mar 2013 19:43:56 -0700

Err.. The jira is https://issues.apache.org/jira/browse/HBASE-7122


On Wed, Mar 13, 2013 at 6:51 PM, Himanshu Vashishtha
<hvash...@cs.ualberta.ca> wrote:
> The log message you are seeing has been there for a long time, as far as I
> remember (it is a debug-level message).
> I had a patch a long time ago,
> https://issues.apache.org/jira/browse/HBASE-7937, which became stale.
>
> Stack: it is not the fault of the multi command; the way the code
> uses it is wrong. There is a race between reading and moving the
> znodes. Basically, if a regionserver fails to move the znodes, it
> should return an empty list, and NOT what it read earlier. This is
> because another regionserver might have moved the znodes in the
> meantime.
>
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <la...@apache.org> wrote:
>> Hey no problem. It's cool that we found it in a test env. It's probably
>> quite hard to reproduce.
>> This is in 0.94.5 but this feature is off by default.
>>
>> What's the general thought here, should I kill the current 0.94.6 rc for
>> this?
>> My gut says: Yes.
>>
>>
>> I'm also a bit worried about these:
>> 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>> java.io.EOFException
>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>> 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>>
>> This happens after bouncing the cluster a 2nd time and these messages repeat
>> every 10s (for hours now). This is a separate problem I think.
>>
>> -- Lars
>>
>> ________________________________
>> From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
>>
>> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> Cc: Ted Yu <yuzhih...@gmail.com>
>> Sent: Wednesday, March 13, 2013 6:38 PM
>>
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>> might not be able to move later on, resulting in bogus znodes.
>> I'll fix this asap. Weird it didn't happen in my testing earlier.
>> Sorry about this.
>>
>> On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
>>> Sorry 0.94.6RC1
>>> (I complain about folks not reporting the version all the time, and then I
>>> do it too)
>>>
>>>
>>>
>>> ________________________________
>>> From: Ted Yu <yuzhih...@gmail.com>
>>> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> Sent: Wednesday, March 13, 2013 6:17 PM
>>> Subject: Re: Replication hosed after simple cluster restart
>>>
>>>
>>> Did this happen on 0.94.5 ?
>>>
>>> Thanks
>>>
>>>
>>> On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
>>>
>>>> We just ran into an interesting scenario. We restarted a cluster that was
>>>> set up as a replication source.
>>>> The stop went cleanly.
>>>>
>>>> Upon restart *all* regionservers aborted within a few seconds with
>>>> variations of these errors:
>>>> http://pastebin.com/3iQVuBqS
>>>>
>>>> This is scary!
>>>>
>>>> -- Lars
>>
>>
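
The race discussed in this thread (one regionserver reads a dead server's queue znodes, and another regionserver moves them before the first one tries) can be illustrated with a minimal Java sketch. It does not use HBase or ZooKeeper: `QueueClaimSketch`, `readQueues`, `moveAll`, and `claimAfterRead` are hypothetical names, and an in-memory map stands in for the znode tree. It only shows the rule proposed above, under those assumptions: if the move fails, return an empty list rather than the list that was read earlier.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the replication queue znodes; not HBase code.
final class QueueClaimSketch {
    // server name -> names of the replication queues it currently owns
    private final Map<String, List<String>> queuesByServer = new ConcurrentHashMap<>();

    void addQueues(String server, List<String> queues) {
        queuesByServer.put(server, new ArrayList<>(queues));
    }

    // Step 1: read the queues of a server. Nothing is locked between
    // this read and the later move, which is where the race comes from.
    List<String> readQueues(String server) {
        List<String> queues = queuesByServer.get(server);
        return queues == null
                ? Collections.<String>emptyList()
                : new ArrayList<>(queues);
    }

    // Step 2: all-or-nothing move, standing in for the atomic multi command.
    // Returns false if the queues are no longer where the caller read them.
    synchronized boolean moveAll(String deadServer, String claimer, List<String> expected) {
        List<String> current = queuesByServer.get(deadServer);
        if (current == null || !current.equals(expected)) {
            return false; // someone else already moved them
        }
        queuesByServer.remove(deadServer);
        queuesByServer.computeIfAbsent(claimer, k -> new ArrayList<>()).addAll(expected);
        return true;
    }

    // The rule from the thread: report queues as claimed only if the move
    // succeeded; on failure return an empty list, NOT the stale read.
    List<String> claimAfterRead(String deadServer, String claimer, List<String> previouslyRead) {
        if (previouslyRead.isEmpty()) {
            return Collections.<String>emptyList();
        }
        boolean moved = moveAll(deadServer, claimer, previouslyRead);
        return moved ? previouslyRead : Collections.<String>emptyList();
    }
}
```

If two servers both call `readQueues` for the same dead server and then both call `claimAfterRead`, only the first call moves the queues; the second gets an empty list and so has nothing to start replicating from. Returning the stale read instead would hand the loser queue names that now live under the winner.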
