Wei-Chiu Chuang created HDDS-15327:
--------------------------------------
Summary: SCM does not proactively clear failed replication tasks
Key: HDDS-15327
URL: https://issues.apache.org/jira/browse/HDDS-15327
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Reporter: Wei-Chiu Chuang
Suppose SCM is configured to allow up to 200 replication commands
(hdds.datanode.replication.outofservice.limit.factor = 2,
hdds.scm.replication.datanode.replication.limit = 100, i.e. 2 x 100 = 200),
but the Datanode only sees at most 12 replication tasks (in-flight + queued).
The reason SCM thinks 200 commands are active while the Datanode only sees 12
is that SCM does not proactively clear failed replication commands.
* Accounting Logic: When SCM sends a replication command, it increments its
"in-flight" count. It only decrements this count if it receives a successful
Container Report from the Datanode OR if the command times out.
* The Leak: If a command fails on the Datanode (e.g., due to a temporary
network blip, or the DN being busy), the Datanode sends a failure report.
However, SCM's CommandStatusReportHandler currently ignores replication
failure reports.
* The Timeout: Those failed commands stay in SCM's "in-flight" quota for 12
minutes (the default for hdds.scm.replication.event.timeout).
* Decommissioning Impact: Because decommissioning triggers a burst of
thousands of commands, any small percentage of failures quickly "leaks" and
fills up the 200-command quota with stale entries that won't disappear for 12
minutes, blocking new progress.
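The accounting described in the list above can be illustrated with a toy model. This is only a sketch and not Ozone code: the class InflightLeakSketch, its methods, the burst size of 5000 and the 1-in-20 failure pattern are all hypothetical, chosen to show how ignored failure reports fill the quota while clearing on failure does not.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the in-flight accounting described above; not actual Ozone code. */
class InflightLeakSketch {
    static final int LIMIT = 200;                    // quota from the report (100 x 2)
    static final long TIMEOUT_MS = 12 * 60 * 1000L;  // 12-minute event timeout

    private final Map<Integer, Long> inflight = new HashMap<>(); // command id -> deadline
    private final boolean clearOnFailure;
    private int rejected = 0;

    InflightLeakSketch(boolean clearOnFailure) {
        this.clearOnFailure = clearOnFailure;
    }

    /** Tries to send a command; returns false when the quota is exhausted. */
    boolean send(int id, long nowMs) {
        if (inflight.size() >= LIMIT) {
            rejected++;
            return false;
        }
        inflight.put(id, nowMs + TIMEOUT_MS);
        return true;
    }

    /** A successful report always releases the slot. */
    void onSuccess(int id) { inflight.remove(id); }

    /** A failure report releases the slot only in the "fixed" variant. */
    void onFailure(int id) {
        if (clearOnFailure) inflight.remove(id);
    }

    /** Drops every entry whose deadline has passed. */
    void expire(long nowMs) {
        inflight.values().removeIf(deadline -> deadline <= nowMs);
    }

    int inflightCount() { return inflight.size(); }
    int rejectedCount() { return rejected; }

    /** Burst of 5000 commands at t=0; hypothetically every 20th one fails. */
    static InflightLeakSketch simulateBurst(boolean clearOnFailure) {
        InflightLeakSketch scm = new InflightLeakSketch(clearOnFailure);
        for (int id = 0; id < 5000; id++) {
            if (!scm.send(id, 0L)) continue;
            if (id % 20 == 0) scm.onFailure(id); else scm.onSuccess(id);
        }
        return scm;
    }

    public static void main(String[] args) {
        InflightLeakSketch leaky = simulateBurst(false);
        InflightLeakSketch fixed = simulateBurst(true);
        System.out.println("ignore failures: in-flight=" + leaky.inflightCount()
            + " rejected=" + leaky.rejectedCount());
        System.out.println("clear on failure: in-flight=" + fixed.inflightCount()
            + " rejected=" + fixed.rejectedCount());
        leaky.expire(TIMEOUT_MS);
        System.out.println("after timeout: in-flight=" + leaky.inflightCount());
    }
}
```

In this model the variant that ignores failures ends the burst with all 200 slots held by stale entries and starts rejecting new commands, whereas the variant that clears on failure ends with an empty in-flight set.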
A possible workaround is to reduce hdds.scm.replication.event.timeout
aggressively, e.g. down to 1m.
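As a rough illustration of why a shorter timeout could help: if failed commands arrive at a steady rate and each one occupies a slot until it times out, the number of stale entries is approximately the failure rate multiplied by the timeout, capped at the quota. The sketch below assumes this simple steady-state model; the class name and the failure rate of 20 per minute are hypothetical and not measured values.

```java
/** Back-of-the-envelope estimate of stale in-flight entries; not Ozone code. */
class TimeoutWorkaroundSketch {
    /** Stale entries ~= failure rate x timeout, capped at the quota. */
    static long staleEntries(double failuresPerMinute, long timeoutMinutes, long limit) {
        return Math.min(limit, Math.round(failuresPerMinute * timeoutMinutes));
    }

    public static void main(String[] args) {
        double failuresPerMinute = 20.0; // hypothetical rate, for illustration only
        long limit = 200;                // quota from the report
        System.out.println("12m timeout: " + staleEntries(failuresPerMinute, 12, limit));
        System.out.println("1m timeout:  " + staleEntries(failuresPerMinute, 1, limit));
    }
}
```

Under these assumed numbers the 12-minute timeout saturates the 200-command quota, while a 1-minute timeout leaves most of it free.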