[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965680#comment-13965680 ]

Cosmin Lehene commented on HBASE-10829:

I can't find this issue in the 0.98.1 release notes. Perhaps fix version should be 0.98.2?

Flush is skipped after log replay if the last recovered edits file is skipped
Key: HBASE-10829
URL: https://issues.apache.org/jira/browse/HBASE-10829
Project: HBase
Issue Type: Bug
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Priority: Critical
Fix For: 0.98.1, 0.99.0, 0.96.3
Attachments: hbase-10829_v1.patch, hbase-10829_v2.patch, hbase-10829_v3.patch

We caught this in an extended test run where IntegrationTestBigLinkedList failed with some missing keys. The problem is that HRegion.replayRecoveredEdits() would return -1 if all the edits in the log file are skipped, which happens, for example, if the log file contains only a single compaction record (HBASE-2231) or the edits cannot be applied (column family deleted, etc.). The caller, HRegion.replayRecoveredEditsIfAny(), looks only at the last returned seqId to decide whether a flush is necessary before opening the region and discarding the replayed recovered edits files. Therefore, if the last recovered edits file is skipped but some edits from earlier recovered edits files are applied, the mandatory flush before opening the region is skipped. If the region server dies after this point and before a flush, the edits are lost.

This is important to fix, though the sequence of events is very rare on a production cluster.

-- This message was sent by Atlassian JIRA (v6.2#6252)
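The failure mode in the description can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the HBase code and not the attached patch: the class `ReplayFlushSketch`, both method names, and the sample sequence ids are hypothetical, and each array element stands in for the value that one call to replayRecoveredEdits() would return for one recovered edits file (-1 meaning every edit in that file was skipped). Taking the maximum across files is one plausible way to avoid the problem.

```java
// Hypothetical sketch; names and values are illustrative assumptions, not HBase code.
final class ReplayFlushSketch {

    // Mirrors the reported behavior: only the result of the LAST file is
    // remembered, so a trailing -1 hides edits applied from earlier files.
    static boolean needsFlushLastOnly(long[] perFileSeqIds, long minSeqId) {
        long seqId = minSeqId;
        for (long result : perFileSeqIds) {
            seqId = result; // overwritten on every iteration
        }
        return seqId > minSeqId;
    }

    // One plausible remedy: keep the highest seqId seen across all files,
    // so a skipped file (-1) can never lower the running value.
    static boolean needsFlushMax(long[] perFileSeqIds, long minSeqId) {
        long seqId = minSeqId;
        for (long result : perFileSeqIds) {
            seqId = Math.max(seqId, result);
        }
        return seqId > minSeqId;
    }

    public static void main(String[] args) {
        long minSeqId = 100L;             // assumed highest seqId already persisted
        long[] perFile = {150L, -1L};     // first file applied edits, last file fully skipped
        System.out.println(needsFlushLastOnly(perFile, minSeqId)); // prints "false": flush wrongly skipped
        System.out.println(needsFlushMax(perFile, minSeqId));      // prints "true": flush happens
    }
}
```

With the last-file-only check, the region would be opened and the recovered edits files discarded while the replayed edits exist only in memory, which is the data-loss window the description warns about.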
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949262#comment-13949262 ]

Hudson commented on HBASE-10829:

SUCCESS: Integrated in hbase-0.96 #369 (See [https://builds.apache.org/job/hbase-0.96/369/])
HBASE-10829 Flush is skipped after log replay if the last recovered edits file is skipped (enis: rev 1581957)
* /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13949704#comment-13949704 ]

Hudson commented on HBASE-10829:

SUCCESS: Integrated in HBase-0.98 #253 (See [https://builds.apache.org/job/HBase-0.98/253/])
HBASE-10829 Flush is skipped after log replay if the last recovered edits file is skipped (enis: rev 1581954)
* /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948378#comment-13948378 ]

Hudson commented on HBASE-10829:

SUCCESS: Integrated in HBase-TRUNK #5042 (See [https://builds.apache.org/job/HBase-TRUNK/5042/])
HBASE-10829 Flush is skipped after log replay if the last recovered edits file is skipped (enis: rev 1581947)
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948447#comment-13948447 ]

Hudson commented on HBASE-10829:

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #236 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/236/])
HBASE-10829 Flush is skipped after log replay if the last recovered edits file is skipped (enis: rev 1581954)
* /hbase/branches/0.98/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.98/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13948789#comment-13948789 ]

Hudson commented on HBASE-10829:

FAILURE: Integrated in hbase-0.96-hadoop2 #253 (See [https://builds.apache.org/job/hbase-0.96-hadoop2/253/])
HBASE-10829 Flush is skipped after log replay if the last recovered edits file is skipped (enis: rev 1581957)
* /hbase/branches/0.96/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.96/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946911#comment-13946911 ]

Enis Soztutar commented on HBASE-10829:

Here is a log of events in case you are interested.

The region (9514935e6a659bd90faa21bf458a842e) was happily hosted by some region server. After the writes settled down, the region had some un-flushed data. The last flush happened and, some time later, the write tasks finished, so no more data was coming in:
{code}
2014-03-24 20:54:30,924 INFO [Thread-22] regionserver.HRegion: Finished memstore flush of ~128.2 M/134443296, currentsize=12.7 M/13270608 for region IntegrationTestBigLinkedList,\x07\xFE\xDA\x1Chv\xF9\x7F\x18s\xEE\x0C\x85X\xFCU,1395690539958.9514935e6a659bd90faa21bf458a842e. in 7324ms, sequenceid=119978, compaction requested=true
{code}
After some more time, the region decided to do a compaction. At this point no writes were coming in.
{code}
compaction
2014-03-24 20:55:52,764 INFO [regionserver60020-smallCompactions-1395694311085] regionserver.HStore: Starting compaction of 5 file(s) in meta of IntegrationTestBigLinkedList,\x07\xFE\xDA\x1Chv\xF9\x7F\x18s\xEE\x0C\x85X\xFCU,1395690539958.9514935e6a659bd90faa21bf458a842e. into tmpdir=hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/.tmp, totalSize=212.2 M
{code}
After this compaction, but before any further flush, the region server got killed around:
{code}
2014-03-24 20:56:44,466 DEBUG [regionserver60020-EventThread] regionserver.SplitLogWorker: tasks arrived or departed
{code}
Because the region server got killed, the cluster performed a log split, which completed without any issues (logs are not necessary). Seven log files were split, producing 7 files in recovered.edits under the region dir.
Then, some other region server opens the region and applies the recovered edits in memory:
{code}
Open region:
2014-03-24 20:57:28,196 DEBUG [StoreOpener-9514935e6a659bd90faa21bf458a842e-1] regionserver.HStore: loaded hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/meta/02f7152afee34b07b40fa31e0de5a3de, isReference=false, isBulkLoadResult=false, seqid=119978, majorCompaction=false
2014-03-24 20:57:28,240 DEBUG [StoreOpener-9514935e6a659bd90faa21bf458a842e-1] regionserver.HStore: loaded hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/meta/69260a6d4ffc45a1806dd501204b73ce, isReference=false, isBulkLoadResult=false, seqid=88532, majorCompaction=true
2014-03-24 20:57:28,264 INFO [RS_OPEN_REGION-hor9n08:60020-2] regionserver.HRegion: Replaying edits from hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/recovered.edits/0118699
2014-03-24 20:57:28,457 DEBUG [RS_OPEN_REGION-hor9n08:60020-2] regionserver.HRegion: Applied 0, skipped 187830, firstSequenceidInLog=118084, maxSequenceidInLog=-1, path=hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/recovered.edits/0118699
2014-03-24 20:57:28,460 INFO [RS_OPEN_REGION-hor9n08:60020-2] regionserver.HRegion: Replaying edits from hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/recovered.edits/0119351
2014-03-24 20:57:28,630 DEBUG [RS_OPEN_REGION-hor9n08:60020-2] regionserver.HRegion: Applied 0, skipped 199401, firstSequenceidInLog=118700, maxSequenceidInLog=-1, path=hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/recovered.edits/0119351
2014-03-24 20:57:28,632 INFO [RS_OPEN_REGION-hor9n08:60020-2] regionserver.HRegion: Replaying edits from hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/recovered.edits/0120086
2014-03-24 20:57:28,873 DEBUG [RS_OPEN_REGION-hor9n08:60020-2] regionserver.HRegion: Applied 37938, skipped 148185, firstSequenceidInLog=119352, maxSequenceidInLog=120086, path=hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/recovered.edits/0120086
2014-03-24 20:57:28,876 INFO [RS_OPEN_REGION-hor9n08:60020-2] regionserver.HRegion: Replaying edits from hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/9514935e6a659bd90faa21bf458a842e/recovered.edits/0120806
2014-03-24 20:57:30,130 DEBUG [RS_OPEN_REGION-hor9n08:60020-2] regionserver.HRegion:
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947004#comment-13947004 ]

Ted Yu commented on HBASE-10829:

lgtm
{code}
+  public void testSkipRecoveredEditsReplayTheLastFileIgnored() throws Exception {
+    String method = "testSkipRecoveredEditsReplaySomeIgnored";
{code}
nit: method name should match test name.

Fix For: 0.99.0, 0.98.2, 0.96.3
Attachments: hbase-10829_v1.patch
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947036#comment-13947036 ]

Ted Yu commented on HBASE-10829:

+1

Fix For: 0.99.0, 0.98.2, 0.96.3
Attachments: hbase-10829_v1.patch, hbase-10829_v2.patch
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947053#comment-13947053 ]

Ted Yu commented on HBASE-10829:

Spoke too soon :-)
{code}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:testCompile (default-testCompile) on project hbase-server: Compilation failure: Compilation failure:
[ERROR] /homes/hortonzy/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java:[583,49] error: cannot find symbol
[ERROR] symbol: variable conf
[ERROR] location: class TestHRegion
[ERROR] /homes/hortonzy/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java:[600,88] error: cannot find symbol
{code}

Fix For: 0.99.0, 0.98.2, 0.96.3
Attachments: hbase-10829_v1.patch, hbase-10829_v2.patch
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947217#comment-13947217 ]

Hadoop QA commented on HBASE-10829:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12636761/hbase-10829_v2.patch
against trunk revision .
ATTACHMENT ID: 12636761

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.
{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
{color:red}-1 javac{color}. The patch appears to cause mvn compile goal to fail.
{color:red}-1 findbugs{color}. The patch appears to cause Findbugs (version 1.3.9) to fail.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9093//testReport/
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9093//console

This message is automatically generated.

Fix For: 0.99.0, 0.98.2, 0.96.3
Attachments: hbase-10829_v1.patch, hbase-10829_v2.patch
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947222#comment-13947222 ]

stack commented on HBASE-10829:

Nice debugging lads. Patch lgtm

Fix For: 0.99.0, 0.98.2, 0.96.3
Attachments: hbase-10829_v1.patch, hbase-10829_v2.patch
[jira] [Commented] (HBASE-10829) Flush is skipped after log replay if the last recovered edits file is skipped
[ https://issues.apache.org/jira/browse/HBASE-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947407#comment-13947407 ]

Hadoop QA commented on HBASE-10829:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12636804/hbase-10829_v3.patch
against trunk revision .
ATTACHMENT ID: 12636804

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.
{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 6 warning messages.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100
{color:green}+1 site{color}. The mvn site goal succeeds with this patch.
{color:red}-1 core tests{color}. The patch failed these unit tests:
{color:red}-1 core zombie tests{color}. There are 1 zombie test(s): at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:368)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/9094//console

This message is automatically generated.
Fix For: 0.99.0, 0.98.2, 0.96.3
Attachments: hbase-10829_v1.patch, hbase-10829_v2.patch, hbase-10829_v3.patch