[
https://issues.apache.org/jira/browse/HDDS-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang updated HDDS-15327:
-----------------------------------
Description:
Let's say SCM is configured to allow up to 200 replications
(hdds.datanode.replication.outofservice.limit.factor = 2,
hdds.scm.replication.datanode.replication.limit = 100), but the Datanode only
sees at most 12 replication tasks (in-flight + queued).
The result? SCM does not push replication as hard as configured, and
decommissioning becomes slow.
The reason SCM thinks 200 commands are active while the Datanode only sees 12
is that SCM does not proactively clear failed replication commands.
* Accounting Logic: When SCM sends a replication command, it increments its
"in-flight" count. It only decrements this count if it receives a successful
Container Report from the Datanode OR if the command times out.
* The Leak: If a command fails on the Datanode (e.g., due to a temporary
network blip, or the DN being busy), the Datanode sends a failure report.
However, SCM's CommandStatusReportHandler currently ignores replication
failure reports.
* The Timeout: Those failed commands stay in SCM's "in-flight" quota for 12
minutes (the default for hdds.scm.replication.event.timeout).
* Decommissioning Impact: Because decommissioning triggers a burst of
thousands of commands, any small percentage of failures quickly "leaks" and
fills up the 200-command quota with stale entries that won't disappear for 12
minutes, blocking new progress.
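To make the accounting above concrete, here is a minimal toy model in Java. It
is only a sketch of the behaviour described in this issue, not Ozone code: the
class InflightQuotaSketch, its methods, and the clearOnFailure switch are all
hypothetical names, and time is counted in whole minutes for simplicity.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of SCM's in-flight accounting; hypothetical, not Ozone's classes. */
final class InflightQuotaSketch {
  private final int limit;              // e.g. the 200-command quota
  private final long timeoutMinutes;    // e.g. 12, the event timeout
  private final boolean clearOnFailure; // false = behaviour described in this issue
  private final Map<Long, Long> sentAt = new HashMap<>(); // command id -> minute sent
  private long nextId = 0;

  InflightQuotaSketch(int limit, long timeoutMinutes, boolean clearOnFailure) {
    this.limit = limit;
    this.timeoutMinutes = timeoutMinutes;
    this.clearOnFailure = clearOnFailure;
  }

  /** Sends a command if quota allows; returns its id, or -1 if the quota is full. */
  long trySend(long nowMinutes) {
    expire(nowMinutes);
    if (sentAt.size() >= limit) {
      return -1;
    }
    long id = nextId++;
    sentAt.put(id, nowMinutes);
    return id;
  }

  /** A successful report frees the slot immediately. */
  void onSuccess(long id) {
    sentAt.remove(id);
  }

  /** A failure report frees the slot only if the hypothetical fix is enabled. */
  void onFailure(long id) {
    if (clearOnFailure) {
      sentAt.remove(id);
    }
  }

  /** Number of commands SCM still counts as in-flight at the given minute. */
  int inflight(long nowMinutes) {
    expire(nowMinutes);
    return sentAt.size();
  }

  /** Drops entries whose age has reached the timeout. */
  private void expire(long nowMinutes) {
    sentAt.values().removeIf(sent -> nowMinutes - sent >= timeoutMinutes);
  }

  public static void main(String[] args) {
    InflightQuotaSketch scm = new InflightQuotaSketch(200, 12, false);
    for (int i = 0; i < 200; i++) {
      scm.onFailure(scm.trySend(0)); // every command fails on the Datanode
    }
    System.out.println("in-flight at minute 1:  " + scm.inflight(1));
    System.out.println("in-flight at minute 12: " + scm.inflight(12));
  }
}
```

With clearOnFailure set to false, the 200 failed commands keep the quota full
until the timeout elapses, which is the leak described above. With it set to
true, the slots are released as soon as the failure is reported.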
A possible workaround is to reduce hdds.scm.replication.event.timeout
aggressively, e.g. down to 1m.
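A rough way to see why a shorter timeout helps: in steady state, the number of
stale entries is roughly the failure rate multiplied by the timeout, capped at
the quota. The Java sketch below is a back-of-the-envelope estimate only. The
class and method names are hypothetical, and the failure rate of 10 per minute
is an assumed figure for illustration, not a measurement from any cluster.

```java
/** Back-of-the-envelope estimate of quota slots held by failed commands. */
final class StaleSlotEstimate {

  /**
   * Steady-state stale entries: failures per minute times the timeout in
   * minutes, capped at the in-flight quota.
   */
  static long staleSlots(double failuresPerMinute, long timeoutMinutes, long limit) {
    return Math.min(limit, Math.round(failuresPerMinute * timeoutMinutes));
  }

  public static void main(String[] args) {
    double failuresPerMinute = 10.0; // assumed rate, for illustration only
    long limit = 200;                // quota from the description
    System.out.println("timeout 12m: " + staleSlots(failuresPerMinute, 12, limit));
    System.out.println("timeout 1m:  " + staleSlots(failuresPerMinute, 1, limit));
  }
}
```

Under these assumptions, a 12-minute timeout leaves more than half of the quota
occupied by stale entries, while a 1-minute timeout leaves only a small
fraction. The trade-off is that commands which are merely slow would also be
timed out sooner.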
> SCM does not proactively clear failed replication tasks
> -------------------------------------------------------
>
> Key: HDDS-15327
> URL: https://issues.apache.org/jira/browse/HDDS-15327
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Reporter: Wei-Chiu Chuang
> Priority: Major