[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223989#comment-14223989
 ] 

Qiang Tian commented on HBASE-11902:
------------------------------------

Proposed fix for 0.98:

{code}
--- hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
+++ hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
@@ -1760,7 +1760,13 @@ public class HRegion implements HeapSize { // , Writable{
     // sync unflushed WAL changes when deferred log sync is enabled
     // see HBASE-8208 for details
     if (wal != null && !shouldSyncLog()) {
-      wal.sync();
+      try {
+        wal.sync();
+      } catch (IOException e) {
+        wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
+        LOG.warn("Unexpected exception while wal.sync(), re-throw", e);
+        throw e;
+      }
     }
{code}

The master branch code writes an ABORT_FLUSH log entry before it calls 
wal.abortCacheFlush. Is that entry also needed when wal.sync() fails?

Also, I am wondering whether we could write an error-injection test for this 
kind of failure, which is likely to happen in a real environment but would not 
happen in unit tests.
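To make the error-injection idea concrete, here is a minimal sketch of what such a test could look like. It does not use the real HBase classes: the Wal interface, FaultInjectingWal, and syncBeforeFlush below are hypothetical stand-ins that only mirror the two calls touched by the patch (sync and abortCacheFlush). The point is just to show a WAL double that fails sync() on demand, so the abort path can be exercised without real datanode failures.

```java
import java.io.IOException;

public class WalSyncFaultInjectionSketch {

  /** Hypothetical stand-in for the two WAL calls used in the patch; not the HBase interface. */
  interface Wal {
    void sync() throws IOException;
    void abortCacheFlush(byte[] encodedRegionName);
  }

  /** Test double that fails sync() on demand and counts abortCacheFlush() calls. */
  static class FaultInjectingWal implements Wal {
    boolean failSync;
    int abortCalls;

    @Override
    public void sync() throws IOException {
      if (failSync) {
        throw new IOException("injected sync failure");
      }
    }

    @Override
    public void abortCacheFlush(byte[] encodedRegionName) {
      abortCalls++;
    }
  }

  /** Mirrors the try/catch from the proposed patch: abort the cache flush, then re-throw. */
  static void syncBeforeFlush(Wal wal, byte[] encodedRegionName) throws IOException {
    try {
      wal.sync();
    } catch (IOException e) {
      wal.abortCacheFlush(encodedRegionName);
      throw e;
    }
  }

  /** True if an injected sync failure propagates and aborts the flush exactly once. */
  static boolean failureAbortsFlush() {
    FaultInjectingWal wal = new FaultInjectingWal();
    wal.failSync = true;
    try {
      syncBeforeFlush(wal, new byte[] {1});
      return false; // the injected failure was swallowed
    } catch (IOException expected) {
      return wal.abortCalls == 1;
    }
  }

  /** True if a successful sync leaves the flush untouched. */
  static boolean successDoesNotAbort() {
    FaultInjectingWal wal = new FaultInjectingWal();
    try {
      syncBeforeFlush(wal, new byte[] {1});
      return wal.abortCalls == 0;
    } catch (IOException unexpected) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("failure aborts flush: " + failureAbortsFlush());
    System.out.println("success does not abort: " + successDoesNotAbort());
  }
}
```

In a real test the double would presumably wrap or subclass the actual WAL implementation, and the check would be made against the region's flush state rather than a counter; that part is left open here.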


> RegionServer was blocked while aborting
> ---------------------------------------
>
>                 Key: HBASE-11902
>                 URL: https://issues.apache.org/jira/browse/HBASE-11902
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 0.98.4
>         Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
>            Reporter: Victor Xu
>            Assignee: Qiang Tian
>         Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
> jstack_hadoop461.cm6.log
>
>
> Generally, the regionserver aborts automatically when isHealth() returns false. 
> But it sometimes gets blocked while aborting. I saved the jstack output and logs, 
> and found that the hang was caused by datanode failures. The "regionserver60020" 
> thread was blocked while closing the WAL. 
> This issue doesn't happen often, but when it does, it always leads 
> to a huge number of request failures. The only way out is kill -9.
> I think it's a bug, but I haven't found a decent solution. Has anyone else seen 
> the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
