Err.. The jira is https://issues.apache.org/jira/browse/HBASE-7122
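For anyone following along, the fix described in the quoted message below (return an empty list when the move fails, not the list read earlier) can be sketched roughly as follows. This is only an in-memory illustration under stated assumptions, not the HBase code: QueueFailoverSketch, readQueues, moveAll and claimQueues are made-up names, and a ConcurrentHashMap stands in for the queue znodes and for the all-or-nothing ZooKeeper multi.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the replication queue znodes. */
class QueueFailoverSketch {
    // dead regionserver name -> its replication queues (stand-in for znodes)
    private final Map<String, List<String>> queuesByServer = new ConcurrentHashMap<>();

    void addQueues(String server, List<String> queues) {
        queuesByServer.put(server, new ArrayList<>(queues));
    }

    /** Step 1: read the dead server's queues. Not atomic with the move below. */
    List<String> readQueues(String deadServer) {
        List<String> queues = queuesByServer.get(deadServer);
        return queues == null ? Collections.<String>emptyList() : new ArrayList<>(queues);
    }

    /**
     * Step 2: all-or-nothing move (assumption: stands in for a multi).
     * Succeeds only if the queues are still exactly what the caller read.
     */
    boolean moveAll(String deadServer, List<String> expected) {
        return queuesByServer.remove(deadServer, expected);
    }

    /** Read, then move; on a failed move report nothing as claimed. */
    List<String> claimQueues(String deadServer) {
        List<String> read = readQueues(deadServer);
        if (read.isEmpty()) {
            return Collections.emptyList();
        }
        if (!moveAll(deadServer, read)) {
            // Another regionserver won the race. Returning 'read' here
            // would hand back queues this server does not own.
            return Collections.emptyList();
        }
        return read;
    }

    public static void main(String[] args) {
        QueueFailoverSketch store = new QueueFailoverSketch();
        store.addQueues("rs1", Arrays.asList("peer1"));
        System.out.println("first claimer:  " + store.claimQueues("rs1"));
        System.out.println("second claimer: " + store.claimQueues("rs1"));
    }
}
```

The point of the sketch is only the failure branch: the caller treats whatever claimQueues returns as queues it now owns, so a failed move must not leak the stale read.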
On Wed, Mar 13, 2013 at 6:51 PM, Himanshu Vashishtha <hvash...@cs.ualberta.ca> wrote:
> The log message you are seeing has been there for a long time, as far as I
> remember (it is a debug-level message).
> I had a patch long back,
> https://issues.apache.org/jira/browse/HBASE-7937, which became stale.
>
> Stack: it is not the fault of the multi command; the way the
> code uses it is wrong. There is a race b/w reading and moving the
> znodes. Basically, what should be done is: in case a regionserver fails
> to move the znodes, it should return an empty list, and NOT what it
> has read earlier. This is because another regionserver might have moved
> the znodes.
>
> On Wed, Mar 13, 2013 at 6:45 PM, lars hofhansl <la...@apache.org> wrote:
>> Hey no problem. It's cool that we found it in a test env. It's probably
>> quite hard to reproduce.
>> This is in 0.94.5 but this feature is off by default.
>>
>> What's the general thought here, should I kill the current 0.94.6 rc for
>> this?
>> My gut says: Yes.
>>
>>
>> I'm also a bit worried about these:
>> 2013-03-14 01:42:42,271 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication shared-dnds1-12-sfm.ops.sfdc.net%2C60020%2C1363220608780.1363220609572 at 0
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got:
>> java.io.EOFException
>>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
>>     at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:728)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:507)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
>> 2013-03-14 01:42:42,358 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>> 2013-03-14 01:42:42,358 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>>
>> This happens after bouncing the cluster a 2nd time and these messages repeat
>> every 10s (for hours now). This is a separate problem I think.
>>
>> -- Lars
>>
>> ________________________________
>> From: Himanshu Vashishtha <hvash...@cs.ualberta.ca>
>> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>> Cc: Ted Yu <yuzhih...@gmail.com>
>> Sent: Wednesday, March 13, 2013 6:38 PM
>> Subject: Re: Replication hosed after simple cluster restart
>>
>> This is bad. Yes, copyQueuesFromRSUsingMulti returns a list which it
>> might not be able to move later on, resulting in bogus znodes.
>> I'll fix this asap. Weird it didn't happen in my testing earlier.
>> Sorry about this.
>>
>> On Wed, Mar 13, 2013 at 6:27 PM, lars hofhansl <la...@apache.org> wrote:
>>> Sorry 0.94.6RC1
>>> (I complain about folks not reporting the version all the time, and then I
>>> do it too)
>>>
>>> ________________________________
>>> From: Ted Yu <yuzhih...@gmail.com>
>>> To: dev@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> Sent: Wednesday, March 13, 2013 6:17 PM
>>> Subject: Re: Replication hosed after simple cluster restart
>>>
>>> Did this happen on 0.94.5 ?
>>>
>>> Thanks
>>>
>>> On Wed, Mar 13, 2013 at 6:12 PM, lars hofhansl <la...@apache.org> wrote:
>>>
>>>> We just ran into an interesting scenario. We restarted a cluster that was
>>>> set up as a replication source.
>>>> The stop went cleanly.
>>>>
>>>> Upon restart *all* regionservers aborted within a few seconds with
>>>> variations of these errors:
>>>> http://pastebin.com/3iQVuBqS
>>>>
>>>> This is scary!
>>>>
>>>> -- Lars