[ https://issues.apache.org/jira/browse/HBASE-25612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338824#comment-17338824 ]
Reid Chan commented on HBASE-25612:
-----------------------------------

Could you update the branch-1 pull request again? I think the pre-commit works now.

> HMaster should abort if ReplicationLogCleaner is not able to delete oldWALs.
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-25612
>                 URL: https://issues.apache.org/jira/browse/HBASE-25612
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 1.8.0
>
>
> In our production cluster, we encountered an issue where the number of files in the /hbase/oldWALs directory was growing rapidly, from a baseline of about 4,000 to 150,000, at a rate of 333 files per minute.
> On further investigation, we found that the ReplicationLogCleaner thread was getting aborted because it was not able to talk to ZooKeeper. Stack trace below:
> {noformat}
> 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] zookeeper.ZKUtil - replicationLogCleaner-0x3000002e05e0d8f, quorum=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181,zookeeper-4:2181, baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:374)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
>   at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:87)
>   at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
>   at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:262)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$200(CleanerChore.java:52)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:413)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:410)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:481)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:410)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$100(CleanerChore.java:52)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore$1.run(CleanerChore.java:220)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. Reason: Failed to get stat of replication rs node
> 2021-02-25 23:05:01,149 DEBUG [an-pool3-thread-1729] master.ReplicationLogCleaner - org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs
> 2021-02-25 23:05:01,150 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - Failed to read zookeeper, skipping checking deletable files
> {noformat}
>
> {quote}
> 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. Reason: Failed to get stat of replication rs node
> {quote}
>
> This line is the scarier one: HMaster invoked the Abortable but simply ignored it, and the HMaster went about its business as usual.
> We have the max-files-per-directory configuration in the NameNode set to 1M in our clusters. If this directory had reached that limit, it would have brought down the whole cluster.
> We shouldn't ignore the Abortable; we should crash the HMaster whenever the Abortable is invoked.
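To make the proposal concrete, below is a minimal, self-contained sketch of the intended behavior. The Abortable interface here mirrors the shape of org.apache.hadoop.hbase.Abortable; the Master and LogCleanerDelegate classes are hypothetical stand-ins for illustration, not the actual HBASE-25612 patch.

{code:java}
public class AbortPropagationSketch {

  /** Mirrors the shape of org.apache.hadoop.hbase.Abortable. */
  interface Abortable {
    void abort(String why, Throwable e);
    boolean isAborted();
  }

  /** Hypothetical stand-in for the HMaster process. */
  static class Master implements Abortable {
    private volatile boolean aborted = false;

    @Override
    public void abort(String why, Throwable e) {
      aborted = true;
      System.err.println("Master aborting: " + why);
      // A real master would stop its services and exit here;
      // sketched as a flag flip for this example.
    }

    @Override
    public boolean isAborted() {
      return aborted;
    }
  }

  /** Hypothetical cleaner delegate that propagates abort instead of swallowing it. */
  static class LogCleanerDelegate implements Abortable {
    private final Abortable master;

    LogCleanerDelegate(Abortable master) {
      this.master = master;
    }

    @Override
    public void abort(String why, Throwable e) {
      // Before: log "ReplicationLogCleaner received abort, ignoring."
      // After: escalate to the enclosing master so the process restarts
      // with a fresh ZooKeeper session instead of silently never deleting.
      master.abort("ReplicationLogCleaner aborting, reason: " + why, e);
    }

    @Override
    public boolean isAborted() {
      return master.isAborted();
    }
  }

  public static void main(String[] args) {
    Master master = new Master();
    LogCleanerDelegate cleaner = new LogCleanerDelegate(master);
    // Simulate the ZooKeeper session expiry seen in the stack trace above.
    cleaner.abort("Failed to get stat of replication rs node",
        new IllegalStateException("KeeperErrorCode = Session expired"));
    System.out.println("master.isAborted() = " + master.isAborted());
  }
}
{code}

The one-line difference from the current behavior is that abort() escalates to the master's Abortable instead of only logging a warning, so a cleaner stuck on an expired ZooKeeper session takes the master down (where supervision can restart it) rather than silently letting oldWALs pile up.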
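As for the 1M per-directory limit mentioned in the description: that is presumably the HDFS NameNode setting dfs.namenode.fs-limits.max-directory-items, whose default of 1048576 matches the quoted figure. A sketch of how it would appear in hdfs-site.xml:

{code:xml}
<!-- hdfs-site.xml: caps how many items (files/dirs) one directory may hold.
     Once /hbase/oldWALs hits this limit, every new WAL archival rename
     fails, which is the cluster-down scenario described above. -->
<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>1048576</value>
</property>
{code}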