[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223989#comment-14223989 ]
Qiang Tian commented on HBASE-11902:
------------------------------------

Proposed fix for 0.98:
{code}
--- hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
+++ hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
@@ -1760,7 +1760,13 @@ public class HRegion implements HeapSize { // , Writable{
     // sync unflushed WAL changes when deferred log sync is enabled
     // see HBASE-8208 for details
     if (wal != null && !shouldSyncLog()) {
-      wal.sync();
+      try {
+        wal.sync();
+      } catch (IOException e) {
+        wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
+        LOG.warn("Unexpected exception while wal.sync(), re-throw");
+        throw e;
+      }
     }
{code}

The master branch code writes the ABORT_FLUSH log before it calls wal.abortCacheFlush. Is that also needed when wal.sync fails?

I am also wondering whether we could build an error injection test for this kind of failure, which mostly happens in a real environment but does not happen in unit tests.

> RegionServer was blocked while aborting
> ---------------------------------------
>
>                 Key: HBASE-11902
>                 URL: https://issues.apache.org/jira/browse/HBASE-11902
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 0.98.4
>         Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
>            Reporter: Victor Xu
>            Assignee: Qiang Tian
>         Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, jstack_hadoop461.cm6.log
>
>
> Generally, regionserver automatically aborts when isHealth() returns false.
> But it sometimes got blocked while aborting. I saved the jstack and logs, and
> found out that it was caused by datanodes failures. The "regionserver60020"
> thread was blocked while closing WAL.
> This issue doesn't happen so frequently, but if it happens, it always leads
> to huge amount of requests failure. The only way to do is KILL -9.
> I think it's a bug, but I haven't found a decent solution. Does anyone have
> the same problem?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)