[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960662#comment-16960662 ]
Michael Stack commented on HBASE-23181:
---------------------------------------

Ran a cluster test with heavy load overrunning hbase against branch-2.2 (i.e. w/ this patch). This problem did not show (plenty of others did, but that is another story). Will run more, but looking good.

> Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
> --------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-23181
>                 URL: https://issues.apache.org/jira/browse/HBASE-23181
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 2.2.1
>            Reporter: Michael Stack
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.1.8, 2.2.3
>
>
> On a heavily loaded cluster, the WAL count keeps rising and we can get into a state where we are not rolling the logs off fast enough. In particular, there is an interesting state at the extreme where we pick a region to flush because of 'Too many WALs', but the region is actually not online. As the WAL count rises, we keep picking a region-to-flush that is no longer on the server. This condition blocks our being able to clear WALs; eventually WALs climb into the hundreds and the RS goes zombie with a full Call queue that starts throwing CallQueueTooLargeExceptions (bad if this server is the one carrying hbase:meta), i.e. clients fail to access the RegionServer.
> One symptom is a fast spike in WAL count for the RS. A restart of the RS will break the bind.
> Here is how it looks in the log:
> {code}
> # Here is the region closing....
> 2019-10-16 23:10:55,897 INFO org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed 8ee433ad59526778c53cc85ed3762d0b
> ....
> # Then soon after...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us
> 2019-10-16 23:11:45,006 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=45, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=542, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us
> {code}
> I've seen these runaway WALs in 2.2.1. I've also seen runaway WALs regularly in a 1.2.x version that had the HBASE-16721 fix in it, but can't say yet if it was for the same reason as above.
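
For anyone trying to picture the feedback loop from the log above, here is a minimal, hypothetical sketch of the failure mode. All names and structure are illustrative only, not the actual LogRoller/AbstractFSWAL code: the point is that the roller keeps electing the same offline region as the flush victim, nothing ever advances that region's unflushed sequence id, so the oldest WAL can never be archived and the count climbs on every roll.

{code}
// Hypothetical, simplified model of the described bug; not HBase source.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class WalBacklogSketch {
  static final int MAX_WALS = 32;

  // Per-region accounting: region name -> oldest unflushed sequence id.
  // An entry here "pins" the WAL files holding that region's edits.
  final Map<String, Long> unflushedByRegion = new ConcurrentHashMap<>();
  // Regions currently open on this server.
  final Set<String> onlineRegions = ConcurrentHashMap.newKeySet();
  int walCount = 0;

  // Called on each log roll, mirroring the "Too many WALs" path.
  void maybeForceFlush() {
    walCount++;
    if (walCount <= MAX_WALS) {
      return;
    }
    // Pick the region pinning the oldest WAL as the flush victim.
    String victim = unflushedByRegion.entrySet().stream()
        .min(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse(null);
    if (victim == null) {
      return;
    }
    if (!onlineRegions.contains(victim)) {
      // The bug: the flush request is simply dropped, the victim's entry
      // stays in unflushedByRegion, and the same region is chosen again
      // on every subsequent roll while the WAL count grows without bound.
      System.out.println("Failed to schedule flush of " + victim
          + ", because it is not online on us");
      return;
    }
    // Normal path: the flush advances the region's sequence id, its entry
    // is cleared, and the oldest WAL becomes archivable (not modeled here).
  }
}
{code}

Reading it this way, the loop has to be broken by unpinning the WAL when a region goes offline, e.g. clearing the closed region from the sequence-id accounting so it can never be elected victim again; see the actual patch on this issue for how it is handled for real.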