[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2015-08-14 Thread Hiroshi Ikeda (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696562#comment-14696562
 ] 

Hiroshi Ikeda commented on HBASE-11902:
---

This issue seems to be a duplicate of HBASE-13592, which has been resolved.
I have created a new issue, HBASE-14222, about DrainBarrier.


 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, hbase11902-master_v2.patch, 
 hbase11902-master_v3.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealthy() returns false, 
 but it sometimes gets blocked while aborting. I saved the jstack output and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very often, but when it does, it always leads 
 to a huge amount of request failures. The only way to recover is kill -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?





[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2015-08-13 Thread Hiroshi Ikeda (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695194#comment-14695194
 ] 

Hiroshi Ikeda commented on HBASE-11902:
---

{code}
if (getValue(oldValAndFlags) == 1) return; // There were no operations outstanding.
synchronized (this) { this.wait(); }
{code}

If DrainBarrier#endOp calls notifyAll just before the synchronized block is entered, this thread may wait forever.
(BTW, some of the tests for DrainBarrier also need to be fixed because they catch the AssertionError thrown by JUnit.)
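
To make the race concrete, here is a minimal, self-contained sketch (the class and field names are illustrative, not the actual DrainBarrier code): if the waiter checks the counter outside the monitor, the final endOp can fire notifyAll before the waiter enters the synchronized block, and the notification is lost. The usual remedy is to re-check the condition inside the monitor and wait in a loop.

{code}
// Minimal sketch of the lost-notification race and a guarded-wait fix.
// Names (SimpleBarrier, outstanding) are illustrative, not HBase code.
public class SimpleBarrier {
  private int outstanding; // number of operations still in flight

  public synchronized void beginOp() {
    outstanding++;
  }

  public synchronized void endOp() {
    outstanding--;
    if (outstanding == 0) {
      notifyAll(); // wakes waiters only if they are already waiting
    }
  }

  // Race-prone: the check happens outside the monitor, so the last endOp
  // may call notifyAll() between the check and the wait(), and the waiter
  // then blocks forever.
  public void drainRacy() throws InterruptedException {
    if (outstandingIsZero()) return;
    synchronized (this) { wait(); }
  }

  // Safe: re-check the condition inside the monitor and wait in a loop,
  // which also guards against spurious wakeups.
  public synchronized void drainSafe() throws InterruptedException {
    while (outstanding > 0) {
      wait();
    }
  }

  private synchronized boolean outstandingIsZero() {
    return outstanding == 0;
  }
}
{code}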



[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227542#comment-14227542
 ] 

Hadoop QA commented on HBASE-11902:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12684027/hbase11902-master_v3.patch
  against master branch at commit aa0bd50fd40d8090c5a98cbde063621eadd988f8.
  ATTACHMENT ID: 12684027

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11852//console

This message is automatically generated.



[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-27 Thread Victor Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227976#comment-14227976
 ] 

Victor Xu commented on HBASE-11902:
---

Thanks, Qiang Tian. I guess you're right. I'll use this patch in my cluster and 
see if this problem happens again.



[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-26 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225962#comment-14225962
 ] 

Qiang Tian commented on HBASE-11902:


TestLogRolling creates an error scenario similar to this case.

The test case failure is caused by the code below:
{code}
  // verify the written rows are there
  assertTrue(loggedRows.contains("row1002"));
  assertTrue(loggedRows.contains("row1003"));
  assertTrue(loggedRows.contains("row1004"));
  assertTrue(loggedRows.contains("row1005"));
  // flush all regions
  List<HRegion> regions = new ArrayList<HRegion>(server.getOnlineRegionsLocalContext());
  for (HRegion r: regions) {
    r.flushcache(); // <=== the re-thrown exception will end the testcase
  }
{code}

Adding a try/catch around the flushcache call makes it pass.
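
A minimal sketch of what that change could look like (illustrative only, assuming the test's existing LOG field; the actual test fix may differ):

{code}
  // flush all regions; in this scenario the IOException re-thrown from the
  // WAL is expected, so swallow it instead of letting it end the test case
  List<HRegion> regions = new ArrayList<HRegion>(server.getOnlineRegionsLocalContext());
  for (HRegion r : regions) {
    try {
      r.flushcache();
    } catch (IOException ioe) {
      LOG.info("Expected flush failure after WAL error", ioe);
    }
  }
{code}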




[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227272#comment-14227272
 ] 

Hadoop QA commented on HBASE-11902:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12683991/hbase11902-master_v2.patch
  against master branch at commit f0d95e7f11403d67b4fc3f1fd4ef048047b6842a.
  ATTACHMENT ID: 12683991

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.regionserver.wal.TestLogRolling

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11851//console

This message is automatically generated.


[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-26 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227322#comment-14227322
 ] 

Qiang Tian commented on HBASE-11902:


OK. The latest failure is because, in the test case, only the WAL write fails. If we hide the exception (just decrement the counter) and continue, the data flush will succeed, so the completeCacheFlush call decrements the counter again!
To preserve the counter semantics, simple is best: return right away (the original patch).
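
To spell out the imbalance (a toy illustration with made-up names, not HBase code): one beginOp followed by two endOp calls leaves the barrier count negative, which is exactly what happens if the hidden-exception path decrements the counter and completeCacheFlush then decrements it again.

{code}
// Toy counter demonstrating the unbalanced begin/end pairing described above.
public class CounterDemo {
  private int count;

  void beginOp() { count += 1; } // startCacheFlush -> closeBarrier.beginOp
  void endOp()   { count -= 1; } // completeCacheFlush, or the "hidden exception" path

  public static void main(String[] args) {
    CounterDemo barrier = new CounterDemo();
    barrier.beginOp(); // flush begins
    barrier.endOp();   // exception hidden: counter decremented here...
    barrier.endOp();   // ...and decremented again by completeCacheFlush
    System.out.println("count = " + barrier.count); // prints -1: no longer paired
  }
}
{code}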



[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-24 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223976#comment-14223976
 ] 

Qiang Tian commented on HBASE-11902:


From the stack trace, DrainBarrier.stopAndDrainOps is waiting, which means DrainBarrier#endOp never notified it.

Looking at the DrainBarrier class, beginOp and endOp are expected to be called in pairs. The initial value of {{valueAndFlags}} is 2; it is incremented by 2 in beginOp and decremented by 2 in endOp.

In stopAndDrainOps, getValue(oldValAndFlags) == 1 means oldValAndFlags == 2, i.e. all ops have completed in pairs; otherwise it has to wait for the last endOp to notify it:

{code}
if (getValue(oldValAndFlags) == 1) return; // There were no operations outstanding.
synchronized (this) { this.wait(); }
{code}

So the problem could be that beginOp/endOp are not called in pairs; the hole looks to be here:

HRegion#internalFlushcache
{code}
// sync unflushed WAL changes when deferred log sync is enabled
// see HBASE-8208 for details
if (wal != null && !shouldSyncLog()) {
  wal.sync();
}
{code} 

At that point wal.startCacheFlush -> closeBarrier.beginOp has already been called, but completeCacheFlush -> closeBarrier.endOp() is not protected by a try block, so if the WAL/HDFS layer throws an exception, endOp will never be called.
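
Schematically, the unprotected sequence looks like this (a simplified sketch with a hypothetical encodedRegionName variable, not the actual HRegion code; the concrete fix proposed later in this thread takes the same shape):

{code}
// The barrier op is opened by startCacheFlush, but nothing guarantees the
// matching endOp when wal.sync() throws.
wal.startCacheFlush(encodedRegionName);      // closeBarrier.beginOp()
try {
  wal.sync();                                // may throw IOException when datanodes are bad
  // ... flush memstore to store files ...
  wal.completeCacheFlush(encodedRegionName); // closeBarrier.endOp()
} catch (IOException ioe) {
  // Without a branch like this the barrier stays unbalanced and
  // stopAndDrainOps() waits forever during abort.
  wal.abortCacheFlush(encodedRegionName);    // also releases the barrier op
  throw ioe;
}
{code}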

Related info in the log:

  
{quote}
2014-09-03 13:38:03,789 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Error while AsyncWriter write, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Error while AsyncSyncer sync, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for 
region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c. 
 // <== MemStoreFlusher#flushRegion
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
{quote}


The exception thrown to the caller should come from here:
{code}
FSHLog#syncer:

if (txid <= this.failedTxid.get()) {
  assert asyncIOE != null :
    "current txid is among(under) failed txids, but asyncIOE is null!";
  throw asyncIOE;
}

{code}

The master branch does catch the HDFS exception, but it just ignores it, which looks incorrect:
{code}
  if (wal != null) {
    try {
      wal.sync(); // ensure that flush marker is sync'ed
    } catch (IOException ioe) {
      LOG.warn("Unexpected exception while wal.sync(), ignoring. Exception: "
          + StringUtils.stringifyException(ioe));
    }
  }
{code}

Personally, I think the exception should not be ignored, since it is a severe HDFS error.




[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-24 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223989#comment-14223989
 ] 

Qiang Tian commented on HBASE-11902:


proposed fix for 0.98:

{code}
--- hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
+++ hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
@@ -1760,7 +1760,13 @@ public class HRegion implements HeapSize { // , Writable{
     // sync unflushed WAL changes when deferred log sync is enabled
     // see HBASE-8208 for details
     if (wal != null && !shouldSyncLog()) {
-      wal.sync();
+      try {
+        wal.sync();
+      } catch (IOException e) {
+        wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
+        LOG.warn("Unexpected exception while wal.sync(), re-throw");
+        throw e;
+      }
     }
{code}

The master branch code writes an ABORT_FLUSH marker before we call wal.abortCacheFlush. Is that also needed if wal.sync aborts?

I am also thinking about whether we could build an error-injection test for this kind of failure, which can easily happen in a real environment but would not normally happen in a UT.
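
One possible shape for such error injection (purely a sketch with hypothetical names, not an existing HBase test utility): wrap the WAL used by the test region and make sync() fail on demand, so a unit test can exercise the same path a bad datanode pipeline would.

{code}
// Hypothetical fault-injecting WAL wrapper for a unit test. "WriteAheadLog"
// stands in for whatever WAL interface the test region uses; the real HBase
// interface differs.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

interface WriteAheadLog {
  void sync() throws IOException;
}

class FaultInjectingWal implements WriteAheadLog {
  private final WriteAheadLog delegate;
  private final AtomicBoolean failSync = new AtomicBoolean(false);

  FaultInjectingWal(WriteAheadLog delegate) {
    this.delegate = delegate;
  }

  /** Flip this from the test to simulate an HDFS pipeline failure. */
  void setFailSync(boolean fail) {
    failSync.set(fail);
  }

  @Override
  public void sync() throws IOException {
    if (failSync.get()) {
      throw new IOException("Injected failure: all datanodes are bad. Aborting...");
    }
    delegate.sync();
  }
}
{code}

The test would then flip setFailSync(true) before triggering a flush and assert that the close barrier still drains, i.e. that the region server can finish aborting.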




[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224070#comment-14224070
 ] 

Hadoop QA commented on HBASE-11902:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12683469/hbase11902-master.patch
  against master branch at commit e83082a88816684714d8a563967046e582f9b8c7.
  ATTACHMENT ID: 12683469

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.regionserver.wal.TestLogRolling

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11821//console

This message is automatically generated.



[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-09-05 Thread Victor Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122682#comment-14122682
 ] 

Victor Xu commented on HBASE-11902:
---

Yes, stack. The RS main thread is waiting at org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps, but the root cause of the abort is the DataNodes. You can find the details in the log:
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Error while AsyncSyncer sync, request close of hlog 
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for 
region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,801 ERROR 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter: Got IOException 
while writing trailer
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,802 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Failed close of HLog writer
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,802 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Riding over HLog close failure! error count=1
2014-09-03 13:38:03,804 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Rolled WAL 
/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409722420708
 with entries=32565, filesize=118.6 M; new WAL 
/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409722683780
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
log file is ready for archiving 
hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707475254
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
log file is ready for archiving 
hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707722202
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
log file is ready for archiving 
hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409707946159
2014-09-03 13:38:03,804 DEBUG org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
log file is ready for archiving 
hdfs://hadoopnnvip.cm6:9000/hbase/WALs/hadoop461.cm6.tbsite.net,60020,1409003284950/hadoop461.cm6.tbsite.net%2C60020%2C1409003284950.1409708155788
2014-09-03 13:38:03,839 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Flush requested on 
page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:03,839 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Started memstore flush for 
page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c., 
current region memstore size 218.5 M
2014-09-03 13:38:03,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Flush requested on 
page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.
2014-09-03 13:38:03,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
Started memstore flush for 
page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c., 
current region memstore size 218.5 M
2014-09-03 13:38:03,897 DEBUG 

[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-09-04 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122467#comment-14122467
 ] 

stack commented on HBASE-11902:
---

You mean here:

{code}
"regionserver60020" prio=10 tid=0x7f85011ca800 nid=0x74d0 in Object.wait() [0x4405f000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:503)
at 
org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps(DrainBarrier.java:115)
- locked <0x0002bb325248> (a org.apache.hadoop.hbase.util.DrainBarrier)
at 
org.apache.hadoop.hbase.util.DrainBarrier.stopAndDrainOps(DrainBarrier.java:85)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.close(FSHLog.java:923)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.closeWAL(HRegionServer.java:1208)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1001)
at java.lang.Thread.run(Thread.java:744)
{code}

Doesn't seem to be an HDFS issue, just waiting on flushes to complete. Do you see issues flushing, Victor? (I've not looked at the log.)


