[jira] [Updated] (HDDS-15327) SCM does not proactively clear failed replication tasks

Wei-Chiu Chuang (Jira) Tue, 19 May 2026 16:37:15 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wei-Chiu Chuang updated HDDS-15327:
-----------------------------------
    Description: 
Suppose SCM is configured to allow up to 200 replications 
(hdds.datanode.replication.outofservice.limit.factor = 2, 
hdds.scm.replication.datanode.replication.limit = 100), but the Datanode only 
sees at most 12 replication tasks (in-flight + queued).

 

The result? SCM does not push replication as hard as it is configured to, and 
decommission becomes slow.

 

The reason SCM thinks 200 commands are active while the Datanode only sees 12 
is that SCM does not proactively clear failed replication commands.
   * Accounting Logic: When SCM sends a replication command, it increments its 
     "in-flight" count. It only decrements this count if it receives a 
     successful Container Report from the Datanode OR if the command times out.
   * The Leak: If a command fails on the Datanode (e.g., due to a temporary 
     network blip, or the DN being busy), the Datanode sends a failure report. 
     However, SCM's CommandStatusReportHandler currently ignores replication 
     failure reports.
   * The Timeout: Those failed commands stay in SCM's "in-flight" quota for 12 
     minutes (the default for hdds.scm.replication.event.timeout).
   * Decommissioning Impact: Because decommissioning triggers a burst of 
     thousands of commands, even a small percentage of failures quickly "leaks" 
     and fills up the 200-command quota with stale entries that won't disappear 
     for 12 minutes, blocking new progress.
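
The accounting described above can be modelled with a small, self-contained 
sketch. This is not Ozone code: the class InFlightQuotaSketch, its Tracker, and 
the clearOnFailure switch are hypothetical names used only to illustrate how 
ignored failure reports pin quota slots until the timeout, and how releasing a 
slot on failure would behave instead.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the in-flight accounting; not actual Ozone code. */
final class InFlightQuotaSketch {

  static final class Tracker {
    private final int limit;              // e.g. the 200-command quota
    private final long timeoutMillis;     // e.g. 12 minutes
    private final boolean clearOnFailure; // false = behaviour described in this issue
    // command id -> time the command was sent, in milliseconds
    private final Map<Long, Long> inFlight = new HashMap<>();

    Tracker(int limit, long timeoutMillis, boolean clearOnFailure) {
      this.limit = limit;
      this.timeoutMillis = timeoutMillis;
      this.clearOnFailure = clearOnFailure;
    }

    /** Returns false when the quota is full and nothing more can be sent. */
    boolean trySend(long commandId, long nowMillis) {
      expire(nowMillis);
      if (inFlight.size() >= limit) {
        return false;
      }
      inFlight.put(commandId, nowMillis);
      return true;
    }

    /** A successful report frees the slot. */
    void onSuccess(long commandId) {
      inFlight.remove(commandId);
    }

    /** A failure report frees the slot only if clearOnFailure is set. */
    void onFailure(long commandId) {
      if (clearOnFailure) {
        inFlight.remove(commandId);
      }
    }

    /** Drops every entry that has been in flight for at least the timeout. */
    void expire(long nowMillis) {
      inFlight.values().removeIf(sentAt -> nowMillis - sentAt >= timeoutMillis);
    }

    int inFlightCount() {
      return inFlight.size();
    }
  }

  public static void main(String[] args) {
    long timeout = 12 * 60 * 1000L;
    Tracker ignoring = new Tracker(200, timeout, false);
    Tracker clearing = new Tracker(200, timeout, true);
    // Send 200 commands at t=0 and let every one of them fail on the Datanode.
    for (long id = 0; id < 200; id++) {
      ignoring.trySend(id, 0L);
      clearing.trySend(id, 0L);
      ignoring.onFailure(id);
      clearing.onFailure(id);
    }
    System.out.println("failures ignored, slots used: " + ignoring.inFlightCount());
    System.out.println("failures cleared, slots used: " + clearing.inFlightCount());
    System.out.println("failures ignored, can send after 1 min: "
        + ignoring.trySend(1000L, 60_000L));
    System.out.println("failures ignored, can send after 12 min: "
        + ignoring.trySend(1001L, timeout));
  }
}
```

In this model the tracker that ignores failures stays full until the 12-minute 
timeout passes, while the one that clears on failure can accept new commands 
immediately.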

 

A possible workaround is to reduce hdds.scm.replication.event.timeout 
aggressively, e.g. down to 1m.
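
As an illustration only, the workaround could look like the fragment below, 
assuming the property is set in the SCM's ozone-site.xml and that the 1m value 
from the description is accepted as written. A shorter timeout may also expire 
commands that are still running on the Datanode, so treat the value as 
something to tune rather than a recommendation.

```xml
<!-- Assumed placement: ozone-site.xml on the SCM host -->
<property>
  <name>hdds.scm.replication.event.timeout</name>
  <value>1m</value>
</property>
```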

  was:
Let's say SCM is configured to allow up to 200 replications 
(hdds.datanode.replication.outofservice.limit.factor = 2, 
hdds.scm.replication.datanode.replication.limit = 100), but Datanode only sees 
at most 12 replication tasks (inflight+ queued)

 

  The reason SCM thinks 200 commands are active while the Datanode only sees 12 
is that SCM does not proactively clear failed replication commands.
   * Accounting Logic: When SCM sends a replication command, it increments its 
"in-flight" count. It only decrements this count if it receives a successful
     Container Report from the Datanode OR if the command times out.
   * The Leak: If a command fails on the Datanode (e.g., due to a temporary 
network blip, or the DN being busy), the Datanode sends a failure report.
     However, SCM's CommandStatusReportHandler currently ignores replication 
failure reports.
   * The Timeout: Those failed commands stay in SCM's "in-flight" quota for 12 
minutes (the default for hdds.scm.replication.event.timeout).
   * Decommissioning Impact: Because decommissioning triggers a burst of 
thousands of commands, any small percentage of failures quickly "leaks" and 
fills
     up the 200-command quota with stale entries that won't disappear for 12 
minutes, blocking new progress.

 

The workaround might be to reduce hdds.scm.replication.event.timeout 
aggressively down to e.g. 1m.


> SCM does not proactively clear failed replication tasks
> -------------------------------------------------------
>
>                 Key: HDDS-15327
>                 URL: https://issues.apache.org/jira/browse/HDDS-15327
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Wei-Chiu Chuang
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

