[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223976#comment-14223976
 ] 

Qiang Tian commented on HBASE-11902:
------------------------------------

From the stack trace, DrainBarrier.stopAndDrainOps is stuck waiting, which 
means DrainBarrier#endOp never notified it.

Looking at class DrainBarrier, beginOp and endOp are expected to be called in 
pairs. The initial value of {{valueAndFlags}} is 2; beginOp increments it by 2 
and endOp decrements it by 2.

In stopAndDrainOps, getValue(oldValAndFlags) == 1 means oldValAndFlags is 2, 
i.e. every beginOp has been matched by an endOp. Otherwise it has to wait for 
the last endOp to notify it:

{code}
    if (getValue(oldValAndFlags) == 1) return; // There were no operations outstanding.
    synchronized (this) { this.wait(); }
{code}
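
To make the arithmetic concrete, here is a minimal sketch of the counter 
behaviour described above. It is a simplified model, not the HBase class: the 
name BarrierSketch, the FLAG_BITS constant, and the drainWouldBlock helper are 
made up for illustration, and the real wait/notify and flag handling are left 
out.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified, hypothetical model of the counter arithmetic; not the HBase class.
class BarrierSketch {
    private static final int FLAG_BITS = 1; // assumed: lowest bit holds a flag

    // Starts at 2: a count of 1 with the flag bit clear.
    private final AtomicLong valueAndFlags = new AtomicLong(1L << FLAG_BITS);

    // Strip the flag bit to recover the count.
    static long getValue(long valueAndFlags) {
        return valueAndFlags >> FLAG_BITS;
    }

    void beginOp() { valueAndFlags.addAndGet(2); }

    void endOp() { valueAndFlags.addAndGet(-2); }

    // True if a drain started now would have to wait for a later endOp.
    boolean drainWouldBlock() {
        return getValue(valueAndFlags.get()) != 1;
    }

    public static void main(String[] args) {
        BarrierSketch barrier = new BarrierSketch();
        barrier.beginOp();
        barrier.endOp();
        System.out.println("paired:   blocks=" + barrier.drainWouldBlock());
        barrier.beginOp(); // endOp skipped, e.g. because an exception was thrown
        System.out.println("unpaired: blocks=" + barrier.drainWouldBlock());
    }
}
```

With a single unmatched beginOp the count never returns to 1, so in this model 
a drain would wait for a notification that never comes.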

So the problem could be that beginOp/endOp are not called in pairs. The hole 
appears to be here:

HRegion#internalFlushcache
{code}
    // sync unflushed WAL changes when deferred log sync is enabled
    // see HBASE-8208 for details
    if (wal != null && !shouldSyncLog()) {
      wal.sync();
    }
{code} 

At that point wal.startCacheFlush -> closeBarrier.beginOp has already been 
called, but completeCacheFlush -> closeBarrier.endOp() is not protected by a 
try block. So if the WAL/HDFS layer throws an exception, endOp is never called.
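
A rough sketch of the difference a try/finally makes, under the assumption 
that the fix is simply to guarantee the matching endOp. FlushSketch, Syncer 
and the two flush methods are hypothetical names used only here; the actual 
fix in HRegion may well look different (for example by going through an abort 
path instead).

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins; none of these names come from HBase.
class FlushSketch {
    // Stand-in for a WAL whose sync() can fail with an HDFS error.
    interface Syncer { void sync() throws IOException; }

    final AtomicInteger outstandingOps = new AtomicInteger();

    void beginOp() { outstandingOps.incrementAndGet(); }
    void endOp() { outstandingOps.decrementAndGet(); }

    // Unprotected: an exception from sync() skips endOp().
    void flushUnprotected(Syncer wal) throws IOException {
        beginOp();
        wal.sync();
        endOp();
    }

    // Protected: endOp() runs whether or not sync() throws.
    void flushProtected(Syncer wal) throws IOException {
        beginOp();
        try {
            wal.sync();
        } finally {
            endOp();
        }
    }

    // Runs one flush against a failing syncer and reports the ops left open.
    static int outstandingAfterFailure(boolean protect) {
        FlushSketch flush = new FlushSketch();
        Syncer failing = () -> { throw new IOException("simulated HDFS failure"); };
        try {
            if (protect) {
                flush.flushProtected(failing);
            } else {
                flush.flushUnprotected(failing);
            }
        } catch (IOException expected) {
            // The caller still sees the failure in both variants.
        }
        return flush.outstandingOps.get();
    }

    public static void main(String[] args) {
        System.out.println("unprotected: " + outstandingAfterFailure(false));
        System.out.println("protected:   " + outstandingAfterFailure(true));
    }
}
```

In the unprotected variant one op stays outstanding after the failure, which 
is the state that would leave a later drain waiting forever.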

Related info in the log:

  
{quote}    
2014-09-03 13:38:03,789 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncWriter write, request close of hlog<
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncSyncer sync, request close of hlog<
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c.  //<==========MemStoreFlusher#flushRegion
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
{quote}


The exception propagated to the caller is most likely thrown here:
{code}
FSHLog#syncer:

    if (txid <= this.failedTxid.get()) {
        assert asyncIOE != null :
          "current txid is among(under) failed txids, but asyncIOE is null!";
        throw asyncIOE;
    }

{code}

The master branch does catch the HDFS exception, but it just ignores it, which 
looks incorrect:
{code}
      if (wal != null) {
        try {
          wal.sync(); // ensure that flush marker is sync'ed
        } catch (IOException ioe) {
          LOG.warn("Unexpected exception while wal.sync(), ignoring. Exception: "
              + StringUtils.stringifyException(ioe));
        }
      }
{code}

Personally, I think the exception should not be ignored, since it is a severe 
HDFS error.


> RegionServer was blocked while aborting
> ---------------------------------------
>
>                 Key: HBASE-11902
>                 URL: https://issues.apache.org/jira/browse/HBASE-11902
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 0.98.4
>         Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
>            Reporter: Victor Xu
>         Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
> jstack_hadoop461.cm6.log
>
>
> Generally, regionserver automatically aborts when isHealth() returns false. 
> But it sometimes got blocked while aborting. I saved the jstack and logs, and 
> found out that it was caused by datanodes failures. The "regionserver60020" 
> thread was blocked while closing WAL. 
> This issue doesn't happen so frequently, but if it happens, it always leads 
> to huge amount of requests failure. The only way to do is KILL -9.
> I think it's a bug, but I haven't found a decent solution. Does anyone have 
> the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
