[
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223976#comment-14223976
]
Qiang Tian commented on HBASE-11902:
------------------------------------
From the stack trace, DrainBarrier.stopAndDrainOps is still waiting, which
means DrainBarrier#endOp never notified it.
Looking at class DrainBarrier, beginOp and endOp are expected to be called in
pairs. The initial value of {{valueAndFlags}} is 2; beginOp increments it by 2
and endOp decrements it by 2.
In stopAndDrainOps, getValue(oldValAndFlags) == 1 means oldValAndFlags is 2,
i.e. every beginOp has been matched by an endOp. Otherwise it has to wait for
the last endOp to notify it:
{code}
if (getValue(oldValAndFlags) == 1) return; // There were no operations outstanding.
synchronized (this) { this.wait(); }
{code}
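To make the counting argument concrete, here is a minimal sketch of such a barrier. It is a simplified model written for this comment, not the actual DrainBarrier source: the class name DrainBarrierSketch and the timeout parameter on stopAndDrainOps are assumptions, added only so that the demo terminates instead of hanging the way the regionserver did.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Simplified model of a drain barrier; NOT the real HBase DrainBarrier. */
class DrainBarrierSketch {
  private static final long DRAINING_FLAG = 0x1; // lowest bit marks "draining"
  private static final int FLAG_BIT_COUNT = 1;

  // Starts at 2, i.e. value 1 with the draining flag clear.
  private final AtomicLong valueAndFlags = new AtomicLong(2);

  static long getValue(long v) { return v >> FLAG_BIT_COUNT; }
  static boolean isDraining(long v) { return (v & DRAINING_FLAG) != 0; }

  /** Registers an operation; returns false once draining has started. */
  boolean beginOp() {
    long old;
    do {
      old = valueAndFlags.get();
      if (isDraining(old)) return false;
    } while (!valueAndFlags.compareAndSet(old, old + 2));
    return true;
  }

  /** Must be called exactly once for every successful beginOp. */
  void endOp() {
    long updated = valueAndFlags.addAndGet(-2);
    if (isDraining(updated) && getValue(updated) == 0) {
      synchronized (this) { notifyAll(); } // last op finished: wake the drainer
    }
  }

  /**
   * Sets the draining flag and waits for outstanding ops.
   * The timeout is an assumption of this sketch; returns false if ops remain.
   */
  boolean stopAndDrainOps(long timeoutMs) {
    long old;
    do {
      old = valueAndFlags.get();
      if (isDraining(old)) throw new IllegalStateException("already draining");
    } while (!valueAndFlags.compareAndSet(old, (old - 2) | DRAINING_FLAG));
    long deadline = System.currentTimeMillis() + timeoutMs;
    synchronized (this) {
      while (getValue(valueAndFlags.get()) != 0) {
        long remaining = deadline - System.currentTimeMillis();
        if (remaining <= 0) return false; // an endOp is still missing
        try {
          wait(remaining);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    DrainBarrierSketch paired = new DrainBarrierSketch();
    paired.beginOp();
    paired.endOp();
    System.out.println(paired.stopAndDrainOps(100)); // true: ops were paired

    DrainBarrierSketch leaked = new DrainBarrierSketch();
    leaked.beginOp(); // endOp is never called
    System.out.println(leaked.stopAndDrainOps(100)); // false: would wait forever without the timeout
  }
}
```

With an untimed wait(), the second case is the hang seen in the jstack: nothing is left to call notifyAll().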
So the problem could be that beginOp and endOp are not always called in pairs.
The hole looks to be here:
HRegion#internalFlushcache
{code}
// sync unflushed WAL changes when deferred log sync is enabled
// see HBASE-8208 for details
if (wal != null && !shouldSyncLog()) {
wal.sync();
}
{code}
At that point, wal.startCacheFlush->closeBarrier.beginOp has already been
called, but the path to completeCacheFlush->closeBarrier.endOp() is not
protected by a try block. So if the WAL/HDFS layer throws an exception here,
endOp is never called.
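A minimal sketch of the difference a try/finally makes. CountingBarrier and Syncer are hypothetical stand-ins for closeBarrier and the WAL, not HBase classes, and a real fix would probably need to call the WAL's abort path rather than a bare endOp:

```java
import java.io.IOException;

class FlushPairingSketch {
  /** Hypothetical stand-in for the barrier; counts unmatched beginOp calls. */
  static class CountingBarrier {
    int outstanding = 0;
    void beginOp() { outstanding++; }
    void endOp() { outstanding--; }
  }

  /** Hypothetical stand-in for a WAL sync that may fail in the HDFS layer. */
  interface Syncer {
    void sync() throws IOException;
  }

  /** Shape of the problematic path: an exception from sync() skips endOp. */
  static void flushUnprotected(CountingBarrier barrier, Syncer wal) throws IOException {
    barrier.beginOp();
    wal.sync();
    barrier.endOp();
  }

  /** Same path with the pairing guaranteed by try/finally. */
  static void flushProtected(CountingBarrier barrier, Syncer wal) throws IOException {
    barrier.beginOp();
    try {
      wal.sync();
    } finally {
      barrier.endOp(); // runs whether or not sync() threw
    }
  }

  public static void main(String[] args) {
    Syncer failing = () -> { throw new IOException("simulated HDFS failure"); };

    CountingBarrier a = new CountingBarrier();
    try { flushUnprotected(a, failing); } catch (IOException expected) { }
    System.out.println("unprotected outstanding = " + a.outstanding); // 1

    CountingBarrier b = new CountingBarrier();
    try { flushProtected(b, failing); } catch (IOException expected) { }
    System.out.println("protected outstanding = " + b.outstanding); // 0
  }
}
```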
related info in log:
{quote}
2014-09-03 13:38:03,789 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncWriter write, request close of hlog<
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncSyncer sync, request close of hlog<
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c. //<==========MemStoreFlusher#flushRegion
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
{quote}
The exception thrown to the caller most likely comes from here:
{code}
FSHLog#syncer:
if (txid <= this.failedTxid.get()) {
assert asyncIOE != null :
"current txid is among(under) failed txids, but asyncIOE is null!";
throw asyncIOE;
}
{code}
The master branch does catch the HDFS exception, but it just ignores it, which
looks incorrect:
{code}
if (wal != null) {
try {
wal.sync(); // ensure that flush marker is sync'ed
} catch (IOException ioe) {
LOG.warn("Unexpected exception while wal.sync(), ignoring. Exception: "
+ StringUtils.stringifyException(ioe));
}
}
{code}
Personally, I think the exception should not be ignored, since it is a severe HDFS error.
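As a rough illustration of the two choices (the Wal interface and both method names are hypothetical, not HBase code), compare swallowing the exception with rethrowing it so that the flush itself fails:

```java
import java.io.IOException;

class SyncErrorHandlingSketch {
  /** Hypothetical stand-in for the WAL. */
  interface Wal {
    void sync() throws IOException;
  }

  /** "Log and ignore" shape: the caller never learns that the sync failed. */
  static boolean syncIgnoringErrors(Wal wal) {
    try {
      wal.sync();
    } catch (IOException ioe) {
      System.err.println("Unexpected exception while wal.sync(), ignoring: " + ioe);
    }
    return true; // reported as success either way
  }

  /** Alternative: add context and rethrow, so the caller can fail the flush. */
  static void syncOrFail(Wal wal) throws IOException {
    try {
      wal.sync();
    } catch (IOException ioe) {
      throw new IOException("WAL sync failed during flush", ioe);
    }
  }
}
```

Whether rethrowing is the right behavior for the flush marker sync is a design decision for the patch; the sketch only shows that the first form hides the HDFS failure from the caller.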
> RegionServer was blocked while aborting
> ---------------------------------------
>
> Key: HBASE-11902
> URL: https://issues.apache.org/jira/browse/HBASE-11902
> Project: HBase
> Issue Type: Bug
> Components: regionserver, wal
> Affects Versions: 0.98.4
> Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
> Reporter: Victor Xu
> Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log,
> jstack_hadoop461.cm6.log
>
>
> Generally, the regionserver automatically aborts when isHealth() returns false.
> But it sometimes gets blocked while aborting. I saved the jstack and logs, and
> found that it was caused by datanode failures. The "regionserver60020"
> thread was blocked while closing the WAL.
> This issue doesn't happen very frequently, but when it does, it always leads
> to a huge number of failed requests. The only way out is KILL -9.
> I think it's a bug, but I haven't found a decent solution. Does anyone have
> the same problem?