[
https://issues.apache.org/jira/browse/HBASE-29376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
guluo resolved HBASE-29376.
---------------------------
Fix Version/s: 3.0.0-beta-2
Resolution: Fixed
> ReplicationLogCleaner.preClean/getDeletableFiles should return early when
> asyncClusterConnection closes during HMaster stopping
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29376
> URL: https://issues.apache.org/jira/browse/HBASE-29376
> Project: HBase
> Issue Type: Improvement
> Components: master, Replication
> Environment: HBase master
> Reporter: guluo
> Assignee: guluo
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0-beta-2
>
>
> When HMaster is stopping, HBase prints a large number of exception logs
> (observed with hbase.master.cleaner.interval = 10000 ms; a smaller interval
> also reproduces it), as follows:
> 2025-06-04T20:49:37,614 ERROR [master/hbase001:16000.Chore.2]
> master.ReplicationLogCleaner: Error occurred while executing
> queueStorage.hasData()
> org.apache.hadoop.hbase.replication.ReplicationException: failed to get
> replication queue table
> at
> org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:538)
> ~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86)
> ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
> at
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.preRunCleaner(CleanerChore.java:282)
> ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:257)
> ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161)
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
> ~[?:?]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> ~[?:?]
> at
> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> ~[?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> ~[?:?]
> at java.lang.Thread.run(Thread.java:833) ~[?:?]
> Caused by: org.apache.hadoop.hbase.ipc.StoppedRpcClientException: Call to
> address=hbase001:16020 failed on local exception:
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
> at java.lang.Thread.getStackTrace(Thread.java:1610) ~[?:?]
> at
> org.apache.hadoop.hbase.util.FutureUtils.setStackTrace(FutureUtils.java:144)
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> org.apache.hadoop.hbase.util.FutureUtils.rethrow(FutureUtils.java:163)
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:186)
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> org.apache.hadoop.hbase.client.AdminOverAsyncAdmin.tableExists(AdminOverAsyncAdmin.java:130)
> ~[hbase-client-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:536)
> ~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at
> org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86)
> ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
> at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
>
> Root cause:
> When the HMaster service enters its stopping phase, the ReplicationLogCleaner
> task continues to execute periodically. During these executions, it invokes
> the rpm.getQueueStorage().hasData() method to check for the existence of
> pending data in the replication queue.
> However, once the HMaster service closes its asyncClusterConnection, we can
> no longer properly retrieve replication queue data because the underlying RPC
> client has been shut down at that point.
> So I think ReplicationLogCleaner should check whether
> HMaster.asyncClusterConnection is closed and return early if it is, to
> ensure a graceful shutdown of HMaster.
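
The early-return check proposed above could look roughly like the sketch below. It is illustrative only and is not the committed patch: Connection, QueueStorage and EarlyReturnCleanerSketch are hypothetical stand-ins for the master's asyncClusterConnection, the replication queue storage and ReplicationLogCleaner, so that the example compiles without HBase on the classpath.

```java
// Hypothetical sketch: none of these types are HBase classes.
final class EarlyReturnCleanerSketch {

  /** Stand-in for the master's cluster connection (assumed to expose isClosed()). */
  interface Connection {
    boolean isClosed();
  }

  /** Stand-in for the replication queue storage queried by the cleaner. */
  interface QueueStorage {
    boolean hasData() throws Exception;
  }

  private final Connection conn;
  private final QueueStorage storage;

  EarlyReturnCleanerSketch(Connection conn, QueueStorage storage) {
    this.conn = conn;
    this.storage = storage;
  }

  /**
   * Runs the pre-clean step.
   *
   * @return true if the queue storage was queried, false if the step was
   *         skipped because the connection is already closed
   */
  boolean preClean() {
    if (conn.isClosed()) {
      // The master is stopping and its RPC client is gone, so any call to
      // the storage would fail. Skip this round instead of logging an error.
      return false;
    }
    try {
      storage.hasData();
    } catch (Exception e) {
      // A failure while the connection is still open is a real error.
      System.err.println("Error occurred while executing hasData(): " + e);
    }
    return true;
  }
}
```

With a closed connection, preClean() returns false without touching the storage, so a chore that keeps firing during shutdown would no longer produce the stack trace quoted above. The same guard would apply to getDeletableFiles.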