Michael Stack created HBASE-23181:
-------------------------------------
Summary: Blocked WAL archive: "LogRoller: Failed to schedule flush
of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us"
Key: HBASE-23181
URL: https://issues.apache.org/jira/browse/HBASE-23181
Project: HBase
Issue Type: Bug
Reporter: Michael Stack
On a heavily loaded cluster, the WAL count keeps rising and we can get into a state
where we are not rolling logs off fast enough. In particular, there is an
interesting state at the extreme where we pick a region to flush because of 'Too
many WALs', but that region is actually not online. As the WAL count rises, we
keep picking a region-to-flush that is no longer on the server. This condition
blocks our ability to clear WALs; eventually WALs climb into the hundreds
and the RS goes zombie with a full Call queue that starts throwing
CallQueueTooLargeExceptions (bad if this server is the one carrying
hbase:meta).
Here is how it looks in the log:
{code}
# Here is region closing....
2019-10-16 23:10:55,897 INFO
org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed
8ee433ad59526778c53cc85ed3762d0b
....
# Then soon after ...
2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller:
Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not
online on us
2019-10-16 23:11:45,006 INFO
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs;
count=45, max=32; forcing flush of 1 regions(s):
8ee433ad59526778c53cc85ed3762d0b
...
# Later...
2019-10-16 23:20:25,427 INFO
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs;
count=542, max=32; forcing flush of 1 regions(s):
8ee433ad59526778c53cc85ed3762d0b
2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller:
Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not
online on us
{code}
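To make the loop above concrete, here is a minimal, self-contained model (hypothetical code, not actual HBase classes) of the 'Too many WALs' flush selection. The class name, method names, and region names are all illustrative. Without a guard, the closed region is picked on every roll and nothing ever unblocks archiving; evicting the offline region's sequence-id accounting is one way out:

```java
import java.util.*;

// Hypothetical model of LogRoller's forced-flush selection; not HBase code.
public class WalFlushPicker {
  // region name -> oldest unflushed WAL sequence id for that region
  private final Map<String, Long> oldestUnflushed = new HashMap<>();
  private final Set<String> onlineRegions = new HashSet<>();

  void track(String region, long oldestSeqId, boolean online) {
    oldestUnflushed.put(region, oldestSeqId);
    if (online) onlineRegions.add(region);
  }

  // Naive selection: the region pinning the oldest WAL entry. This is what
  // keeps choosing the closed region in the logs above.
  String pickRegionToForceFlush() {
    return oldestUnflushed.entrySet().stream()
        .min(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse(null);
  }

  // With a guard: if the chosen region is no longer online, drop its
  // sequence-id accounting instead of retrying the flush forever, so the
  // WALs that only held its edits become archivable.
  String pickWithOfflineGuard() {
    String region;
    while ((region = pickRegionToForceFlush()) != null
        && !onlineRegions.contains(region)) {
      oldestUnflushed.remove(region);
    }
    return region;
  }

  public static void main(String[] args) {
    WalFlushPicker picker = new WalFlushPicker();
    picker.track("8ee433ad59526778c53cc85ed3762d0b", 100L, false); // closed
    picker.track("online-region", 200L, true);
    // Naive pick: the closed region, every time the roller runs.
    System.out.println(picker.pickRegionToForceFlush());
    // Guarded pick: closed region's accounting is evicted, an online region
    // (or nothing) is chosen, and archiving can proceed.
    System.out.println(picker.pickWithOfflineGuard());
  }
}
```

The point of the sketch is only that flush-selection must tolerate a region that closed between the roll and the flush request; the actual fix would live in the WAL sequence-id accounting.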
--
This message was sent by Atlassian Jira
(v8.3.4#803005)