guluo created HBASE-29376:
-----------------------------
Summary: ReplicationLogCleaner.preClean/getDeletableFiles should
return early when asyncClusterConnection closes during HMaster stopping
Key: HBASE-29376
URL: https://issues.apache.org/jira/browse/HBASE-29376
Project: HBase
Issue Type: Improvement
Components: master, Replication
Environment: HBase master
Reporter: guluo
While HMaster was stopping, I found that HBase printed a lot of exception logs
(hbase.master.cleaner.interval = 10000 ms here; you can also configure a smaller
interval), as follows:
2025-06-04T20:49:37,614 ERROR [master/hbase001:16000.Chore.2]
master.ReplicationLogCleaner: Error occurred while executing
queueStorage.hasData()
org.apache.hadoop.hbase.replication.ReplicationException: failed to get
replication queue table
at
org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:538)
~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at
org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86)
~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
at
org.apache.hadoop.hbase.master.cleaner.CleanerChore.preRunCleaner(CleanerChore.java:282)
~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at
org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:257)
~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161)
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
~[?:?]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
~[?:?]
at
org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
~[?:?]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
~[?:?]
at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: org.apache.hadoop.hbase.ipc.StoppedRpcClientException: Call to
address=hbase001:16020 failed on local exception:
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
at java.lang.Thread.getStackTrace(Thread.java:1610) ~[?:?]
at
org.apache.hadoop.hbase.util.FutureUtils.setStackTrace(FutureUtils.java:144)
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at
org.apache.hadoop.hbase.util.FutureUtils.rethrow(FutureUtils.java:163)
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:186)
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at
org.apache.hadoop.hbase.client.AdminOverAsyncAdmin.tableExists(AdminOverAsyncAdmin.java:130)
~[hbase-client-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at
org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:536)
~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at
org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86)
~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
The reason:
When the HMaster service enters its stopping phase, the ReplicationLogCleaner
task continues to execute periodically. During these executions, it invokes the
rpm.getQueueStorage().hasData() method to check for the existence of pending
data in the replication queue.
However, once the HMaster service closes its asyncClusterConnection, we can no
longer properly retrieve replication queue data because the underlying RPC
client has been shut down at that point.
So I think ReplicationLogCleaner should check whether
HMaster.asyncClusterConnection is closed and return early from
preClean/getDeletableFiles, to ensure a graceful shutdown of HMaster.
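
To illustrate the idea, here is a minimal, self-contained sketch of such a guard. It is not the actual HBase code or the final patch: ClusterConnection, QueueStorage and GuardedCleaner are hypothetical stand-ins for the master's asyncClusterConnection, the replication queue storage and ReplicationLogCleaner, and the sketch only assumes that the connection exposes an isClosed()-style check.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class CleanerShutdownGuardSketch {

  /** Hypothetical stand-in for the master's cluster connection. */
  interface ClusterConnection {
    boolean isClosed();
  }

  /** Hypothetical stand-in for the replication queue storage. */
  interface QueueStorage {
    boolean hasData() throws Exception;
  }

  /** Simplified cleaner: only the shutdown guard is modelled here. */
  static final class GuardedCleaner {
    private final ClusterConnection conn;
    private final QueueStorage storage;

    GuardedCleaner(ClusterConnection conn, QueueStorage storage) {
      this.conn = conn;
      this.storage = storage;
    }

    /** Returns true if the queue storage was queried, false if the round was skipped. */
    boolean preClean() {
      if (conn.isClosed()) {
        // The RPC client is already stopped; any call would fail, so skip this round quietly.
        return false;
      }
      try {
        storage.hasData();
      } catch (Exception e) {
        // Failures while the connection is still open remain worth logging.
        System.err.println("Error occurred while executing queueStorage.hasData(): " + e);
      }
      return true;
    }
  }

  public static void main(String[] args) {
    AtomicBoolean closed = new AtomicBoolean(false);
    AtomicInteger calls = new AtomicInteger();
    GuardedCleaner cleaner = new GuardedCleaner(closed::get, () -> {
      calls.incrementAndGet();
      return true;
    });

    cleaner.preClean();   // connection open: storage is queried
    closed.set(true);     // simulate HMaster closing its connection
    cleaner.preClean();   // connection closed: returns early, no RPC attempted

    System.out.println("hasData() calls: " + calls.get());
  }
}
```

In this sketch the second preClean() call never reaches the storage, so no exception is raised or logged once the connection is closed. The same check could be placed at the top of getDeletableFiles.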