[ 
https://issues.apache.org/jira/browse/HBASE-25612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294715#comment-17294715
 ] 

Rushabh Shah commented on HBASE-25612:
--------------------------------------

In branch-1, when ReplicationLogCleaner encounters any zk-related exception 
(a KeeperException, to be specific), it invokes 
[WarnOnlyAbortable|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/master/ReplicationLogCleaner.java#L181-L195],
 shown below.

{noformat}
public static class WarnOnlyAbortable implements Abortable {
    @Override
    public void abort(String why, Throwable e) {
      LOG.warn("ReplicationLogCleaner received abort, ignoring.  Reason: " + 
why);
      if (LOG.isDebugEnabled()) {
        LOG.debug(e);
      }
    }

    @Override
    public boolean isAborted() {
      return false;
    }
  }
{noformat}

This change was introduced as part of HBASE-15234, which tried to fix the same 
issue we are encountering now. Before HBASE-15234, a *transient* zk-related 
issue would silently abort the ReplicationLogCleaner thread, it would no longer 
clean the oldWALs directory, and that directory would grow unbounded even after 
the connection to zk was back to normal.
So in HBASE-15234 we swallowed the KeeperException and kept the 
ReplicationLogCleaner thread running.

But there was a big assumption there: that zk-related issues are transient. For 
a non-transient issue we still swallow the exception, and the oldWALs directory 
keeps growing.

{quote}
 For the log cleaner chores, do we ever want a KeeperException here to be 
fatal? Either the log cleaner should warn but continue on these exceptions 
(which seems to be the original intent), or else it should delegate back up on 
the abort and fully abort the running master. The current situation of stopping 
the cleaner chore just leaves the system in a bad state.
{quote}

Also, in [one of the 
comments|https://issues.apache.org/jira/browse/HBASE-15234?focusedCommentId=15137632&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15137632]
 on HBASE-15234, it was discussed whether to abort the HMaster or keep going 
when the cleaner encounters a KeeperException; we went with the latter since we 
thought these were transient errors.

We have two options here:
1. Abort the HMaster on any KeeperException.
2. Classify KeeperExceptions into fatal and non-fatal (fatal ones being 
SessionExpiredException and the like): abort the HMaster on a fatal exception 
and keep going on non-fatal ones. A rough sketch of this option is below.
[~apurtell] Any suggestions? Thank you!
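A minimal sketch of what option 2 could look like, assuming the cleaner is 
handed the master's Abortable so it can delegate fatal failures upward. The 
class name FatalAwareAbortable, the constructor wiring, and the exact set of 
fatal exception types are illustrative, not existing code:

{noformat}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.Abortable;
import org.apache.zookeeper.KeeperException;

// Hypothetical replacement for WarnOnlyAbortable: escalate fatal zk errors to
// the master's Abortable, keep ignoring the presumably transient ones.
public class FatalAwareAbortable implements Abortable {
  private static final Log LOG = LogFactory.getLog(FatalAwareAbortable.class);

  private final Abortable masterAbortable; // the HMaster's own Abortable (assumed to be passed in)
  private volatile boolean aborted = false;

  public FatalAwareAbortable(Abortable masterAbortable) {
    this.masterAbortable = masterAbortable;
  }

  // Which exceptions count as fatal is a policy choice; session expiry and
  // auth failure are shown here as examples of non-transient failures.
  private static boolean isFatal(Throwable e) {
    return e instanceof KeeperException.SessionExpiredException
        || e instanceof KeeperException.AuthFailedException;
  }

  @Override
  public void abort(String why, Throwable e) {
    if (isFatal(e)) {
      aborted = true;
      // Delegate upward so the HMaster actually aborts instead of limping on.
      masterAbortable.abort("ReplicationLogCleaner: " + why, e);
    } else {
      LOG.warn("ReplicationLogCleaner received abort for a transient error, ignoring. Reason: " + why, e);
    }
  }

  @Override
  public boolean isAborted() {
    return aborted;
  }
}
{noformat}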

 

> HMaster should abort if ReplicationLogCleaner is not able to delete oldWALs.
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-25612
>                 URL: https://issues.apache.org/jira/browse/HBASE-25612
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>
> In our production cluster, we encountered an issue where the number of files 
> within the /hbase/oldWALs directory was growing rapidly, from a baseline of 
> about 4000 to 150000, at a rate of 333 files per minute.
> On further investigation we found that the ReplicationLogCleaner thread was 
> getting aborted since it was not able to talk to zookeeper. Stack trace below:
> {noformat}
> 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] zookeeper.ZKUtil - 
> replicationLogCleaner-0x3000002e05e0d8f, 
> quorum=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181,zookeeper-4:2181,
>  baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /hbase/replication/rs
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
>  at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:374)
>  at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
>  at 
> org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:87)
>  at 
> org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
>  at 
> org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:262)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$200(CleanerChore.java:52)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:413)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:410)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:481)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:410)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$100(CleanerChore.java:52)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore$1.run(CleanerChore.java:220)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-02-25 23:05:01,149 WARN  [an-pool3-thread-1729] 
> master.ReplicationLogCleaner - ReplicationLogCleaner received abort, 
> ignoring.  Reason: Failed to get stat of replication rs node
> 2021-02-25 23:05:01,149 DEBUG [an-pool3-thread-1729] 
> master.ReplicationLogCleaner - 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /hbase/replication/rs
> 2021-02-25 23:05:01,150 WARN  [an-pool3-thread-1729] 
> master.ReplicationLogCleaner - Failed to read zookeeper, skipping checking 
> deletable files
>  {noformat}
>  
> {quote} 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] 
> master.ReplicationLogCleaner - ReplicationLogCleaner received abort, 
> ignoring. Reason: Failed to get stat of replication rs node
> {quote}
>  
> This line is the scarier part: the HMaster invoked Abortable but just ignored 
> it and went about its business as usual.
> We have the max-files-per-directory configuration in the namenode set to 1M in 
> our clusters. If this directory had reached that limit, it would have brought 
> down the whole cluster.
> We shouldn't ignore the Abortable; we should crash the HMaster if Abortable is 
> invoked.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
