[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696562#comment-14696562 ]

Hiroshi Ikeda commented on HBASE-11902:
---------------------------------------

This HBASE-11902 appears to be a duplicate of HBASE-13592, which has been resolved. I have created a new issue, HBASE-14222, about DrainBarrier.

RegionServer was blocked while aborting
---------------------------------------

                 Key: HBASE-11902
                 URL: https://issues.apache.org/jira/browse/HBASE-11902
             Project: HBase
          Issue Type: Bug
          Components: regionserver, wal
    Affects Versions: 0.98.4
         Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
            Reporter: Victor Xu
            Assignee: Qiang Tian
         Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, hbase11902-master_v2.patch, hbase11902-master_v3.patch, jstack_hadoop461.cm6.log

Generally, a regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack output and logs, and found that it was caused by datanode failures: the regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen frequently, but when it does, it always leads to a huge number of failed requests, and the only remedy is kill -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695194#comment-14695194 ]

Hiroshi Ikeda commented on HBASE-11902:
---------------------------------------

{code}
if (getValue(oldValAndFlags) == 1) return; // There were no operations outstanding.
synchronized (this) {
  this.wait();
}
{code}

If DrainBarrier#endOp calls notifyAll just before the synchronized block is entered, this wait may never return. (BTW, some of the tests for DrainBarrier also need to be fixed, because they catch the AssertionError thrown by JUnit.)
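The lost-wakeup window described above can be closed by doing the zero-check and the wait under the same monitor, in a loop. A minimal sketch of that pattern, with invented names (this is not the actual DrainBarrier code):

```java
// Hypothetical sketch: checking the condition and waiting under one lock,
// in a loop, so a notifyAll issued just before we enter the monitor
// cannot be lost.
public class SafeDrain {
    private int outstanding = 0; // count of in-flight operations

    public synchronized void beginOp() {
        outstanding++;
    }

    public synchronized void endOp() {
        outstanding--;
        if (outstanding == 0) {
            notifyAll(); // notification now happens under the same lock
        }
    }

    public synchronized void awaitDrain() throws InterruptedException {
        // Re-check the condition after every wakeup; this also guards
        // against spurious wakeups.
        while (outstanding > 0) {
            wait();
        }
    }

    public synchronized int outstanding() {
        return outstanding;
    }
}
```

The key difference from the quoted snippet is that there is no window between observing the counter and calling wait() in which an endOp's notification can slip by unobserved.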
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227542#comment-14227542 ]

Hadoop QA commented on HBASE-11902:
-----------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12684027/hbase11902-master_v3.patch
  against master branch at commit aa0bd50fd40d8090c5a98cbde063621eadd988f8.
  ATTACHMENT ID: 12684027

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

    {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.

    {color:green}+1 site{color}. The mvn site goal succeeds with this patch.

    {color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11852//console

This message is automatically generated.
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227976#comment-14227976 ]

Victor Xu commented on HBASE-11902:
-----------------------------------

Thanks, Qiang Tian. I guess you're right. I'll apply this patch in my cluster and see whether this problem happens again.
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225962#comment-14225962 ]

Qiang Tian commented on HBASE-11902:
------------------------------------

TestLogRolling creates an error scenario similar to this case. The testcase failure is caused by the code below:

{code}
// verify the written rows are there
assertTrue(loggedRows.contains("row1002"));
assertTrue(loggedRows.contains("row1003"));
assertTrue(loggedRows.contains("row1004"));
assertTrue(loggedRows.contains("row1005"));

// flush all regions
List<HRegion> regions = new ArrayList<HRegion>(server.getOnlineRegionsLocalContext());
for (HRegion r : regions) {
  r.flushcache(); // <=== the re-thrown exception will end the testcase
}
{code}

Adding a try/catch around the flushcache call makes it pass.
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227272#comment-14227272 ]

Hadoop QA commented on HBASE-11902:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12683991/hbase11902-master_v2.patch
  against master branch at commit f0d95e7f11403d67b4fc3f1fd4ef048047b6842a.
  ATTACHMENT ID: 12683991

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

    {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.

    {color:green}+1 site{color}. The mvn site goal succeeds with this patch.

    {color:red}-1 core tests{color}.
The patch failed these unit tests:
  org.apache.hadoop.hbase.regionserver.wal.TestLogRolling

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11851//console

This message is
automatically generated.
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227322#comment-14227322 ]

Qiang Tian commented on HBASE-11902:
------------------------------------

Ok. The latest failure happens because, in the testcase, only the WAL write fails. If we hide the exception (just decrement the counter) and continue, the data flush will succeed, so completeCacheFlush decrements the counter again! To preserve the counter semantics, simple is best: return right away (the original patch).
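The double decrement described above can be replayed with plain arithmetic. This is an illustrative toy (the class and method names are invented); it mirrors only the +2/-2 bookkeeping of {{valueAndFlags}} described in the analysis below:

```java
// Toy model of the barrier counter: the floor is BASE (2), beginOp adds 2,
// endOp subtracts 2. Decrementing once in the WAL-failure handler and again
// in completeCacheFlush drives the counter below its floor.
public class BarrierCounterDemo {
    static final int BASE = 2;          // analogue of the initial valueAndFlags
    private int valueAndFlags = BASE;

    void beginOp() { valueAndFlags += 2; }
    void endOp()   { valueAndFlags -= 2; }

    /** Replays the double-decrement path: returns the final counter value. */
    int replayDoubleDecrement() {
        beginOp();  // startCacheFlush
        endOp();    // error handler "hides" the WAL exception and decrements
        endOp();    // flush still succeeds, so completeCacheFlush decrements again
        return valueAndFlags; // ends up at 0, below the BASE floor of 2
    }
}
```

A correct run should end back at BASE; ending below it means a later beginOp/endOp pair can make stopAndDrainOps misjudge how many operations are outstanding.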
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223976#comment-14223976 ]

Qiang Tian commented on HBASE-11902:
------------------------------------

From the stacktrace, DrainBarrier.stopAndDrainOps is waiting, so DrainBarrier#endOp never notified it.

Looking at class DrainBarrier, beginOp and endOp are expected to be called in pairs. The initial value of {{valueAndFlags}} is 2; it is incremented by 2 in beginOp and decremented by 2 in endOp. In stopAndDrainOps, if getValue(oldValAndFlags) == 1 (meaning oldValAndFlags == 2), all ops have completed in pairs; otherwise it needs to wait for the last endOp to notify it:

{code}
if (getValue(oldValAndFlags) == 1) return; // There were no operations outstanding.
synchronized (this) {
  this.wait();
}
{code}

So the problem could be that beginOp/endOp are not called in pairs. The hole looks to be here, in HRegion#internalFlushcache:

{code}
// sync unflushed WAL changes when deferred log sync is enabled
// see HBASE-8208 for details
if (wal != null && !shouldSyncLog()) {
  wal.sync();
}
{code}

At that point, wal.startCacheFlush -> closeBarrier.beginOp has already been called, but completeCacheFlush -> closeBarrier.endOp() is not protected by a try block, so if the WAL/HDFS layer throws an exception, endOp will not be called. Related info in the log:

{quote}
2014-09-03 13:38:03,789 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncWriter write, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncSyncer sync, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.  // <=== MemStoreFlusher#flushRegion
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
{quote}

The exception thrown to the caller should be here, in FSHLog#syncer:

{code}
if (txid <= this.failedTxid.get()) {
  assert asyncIOE != null :
    "current txid is among(under) failed txids, but asyncIOE is null!";
  throw asyncIOE;
}
{code}

The master branch can catch the hdfs exception, but it just ignores it, which looks incorrect:

{code}
if (wal != null) {
  try {
    wal.sync(); // ensure that flush marker is sync'ed
  } catch (IOException ioe) {
    LOG.warn("Unexpected exception while wal.sync(), ignoring. Exception: "
        + StringUtils.stringifyException(ioe));
  }
}
{code}

Personally I think the exception should not be ignored, since it is a severe hdfs error.
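The pairing discipline this analysis calls for amounts to wrapping the failure-prone sync in try/finally so that endOp runs on every path. A hedged sketch with stand-in types (Barrier and flushWithBarrier are invented names for illustration, not HBase APIs):

```java
// Sketch of the beginOp/endOp pairing contract: the finally block is what
// the analysis says internalFlushcache was missing. If walSync throws and
// endOp never runs, a later stopAndDrainOps waits forever.
public class FlushGuard {

    /** Minimal stand-in for DrainBarrier's counting behavior. */
    public static class Barrier {
        private int ops;
        public synchronized void beginOp() { ops++; }
        public synchronized void endOp()   { ops--; if (ops == 0) notifyAll(); }
        public synchronized int outstanding() { return ops; }
    }

    /** Runs a flush step, guaranteeing endOp on both normal and error paths. */
    public static void flushWithBarrier(Barrier barrier, Runnable walSync) {
        barrier.beginOp();     // analogous to startCacheFlush -> beginOp
        try {
            walSync.run();     // may throw, e.g. on a datanode failure
        } finally {
            barrier.endOp();   // still runs when walSync throws, so the
        }                      // barrier's counter returns to its floor
    }
}
```

Even when walSync throws, the counter returns to zero, so a drain on the barrier can complete; the exception still propagates to the caller instead of being swallowed.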
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223989#comment-14223989 ]

Qiang Tian commented on HBASE-11902:
------------------------------------

Proposed fix for 0.98:

{code}
--- hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
+++ hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
@@ -1760,7 +1760,13 @@ public class HRegion implements HeapSize { // , Writable{
       // sync unflushed WAL changes when deferred log sync is enabled
       // see HBASE-8208 for details
       if (wal != null && !shouldSyncLog()) {
-        wal.sync();
+        try {
+          wal.sync();
+        } catch (IOException e) {
+          wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
+          LOG.warn("Unexpected exception while wal.sync(), re-throw");
+          throw e;
+        }
       }
{code}

The master branch code writes an ABORT_FLUSH log entry before we call wal.abortCacheFlush, so is that also needed if wal.sync aborts? I am also wondering whether we could build an error-injection test for this kind of failure, which could easily happen in a real environment but would not happen in a UT.
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224070#comment-14224070 ]

Hadoop QA commented on HBASE-11902:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12683469/hbase11902-master.patch
  against master branch at commit e83082a88816684714d8a563967046e582f9b8c7.
  ATTACHMENT ID: 12683469

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

    {color:green}+1 checkstyle{color}. The applied patch does not increase the total number of checkstyle errors.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100.

    {color:green}+1 site{color}. The mvn site goal succeeds with this patch.

    {color:red}-1 core tests{color}.
The patch failed these unit tests:
  org.apache.hadoop.hbase.regionserver.wal.TestLogRolling

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/checkstyle-aggregate.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11821//console

This message is
automatically generated.
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122682#comment-14122682 ]

Victor Xu commented on HBASE-11902:
-----------------------------------

Yes, stack. The rs main thread is waiting at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps, but the root cause of the abort is the DataNodes. You can find the details in the log:

2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncSyncer sync, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,801 ERROR org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException while writing trailer
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,802 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Failed close of HLog writer
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,802 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Riding over HLog close failure! error count=1
2014-09-03 13:38:03,804 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Rolled WAL /hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409722420708 with entries=32565, filesize=118.6 M; new WAL /hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409722683780
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707475254
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707722202
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707946159
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: log file is ready for archiving hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409708155788
2014-09-03 13:38:03,839 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:03,839 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c., current region memstore size 218.5 M
2014-09-03 13:38:03,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:03,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c., current region memstore size 218.5 M
2014-09-03 13:38:03,897 DEBUG
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122467#comment-14122467 ]

stack commented on HBASE-11902:
-------------------------------

You mean here:

{code}
"regionserver60020" prio=10 tid=0x7f85011ca800 nid=0x74d0 in Object.wait() [0x4405f000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps(DrainBarrier.java:115)
        - locked <0x0002bb325248> (a org.apache.hadoop.hbase.util.DrainBarrier)
        at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps(DrainBarrier.java:85)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.close(FSHLog.java:923)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.closeWAL(HRegionServer.java:1208)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1001)
        at java.lang.Thread.run(Thread.java:744)
{code}

Doesn't seem to be an HDFS issue, just waiting on flushes to complete. Do you see issues flushing, Victor? (I've not looked at the log.)
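One defensive variant (an assumption for illustration, not the fix adopted for this issue) is to drain with a timeout, so an aborting server can make progress even if an endOp was lost. A minimal sketch with invented names:

```java
// Hypothetical timed drain: instead of an unbounded this.wait() that can
// hang a kill -9-only abort, give up after a deadline and let the caller
// decide whether to proceed with shutdown anyway.
public class TimedDrain {
    private int outstanding;

    public synchronized void beginOp() {
        outstanding++;
    }

    public synchronized void endOp() {
        outstanding--;
        if (outstanding == 0) {
            notifyAll();
        }
    }

    /** Returns true if all ops drained, false if the timeout expired first. */
    public synchronized boolean awaitDrain(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (outstanding > 0) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false; // timed out; caller may abort anyway
            }
            wait(remaining); // bounded wait, re-check condition on wakeup
        }
        return true;
    }
}
```

With a lost endOp, awaitDrain returns false after the timeout instead of parking the regionserver thread forever in Object.wait(), as seen in the jstack above.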