[
https://issues.apache.org/jira/browse/HDDS-15330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-15330:
-------------------------------
Attachment: scm_Top_Components_oom.zip
scm_System_Overview_oom.zip
scm_Leak_Suspects_oom.zip
> Implement SCM FCR rate limit
> ----------------------------
>
> Key: HDDS-15330
> URL: https://issues.apache.org/jira/browse/HDDS-15330
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: scm_Leak_Suspects_oom.zip, scm_System_Overview_oom.zip,
> scm_Top_Components_oom.zip
>
>
> We have seen previous instances where a newly bootstrapped SCM ran out of
> memory (OOM), even with a 96 GB heap. We suspect that this is caused by
> the many FCRs (full container reports) being processed concurrently in SCM.
> HDFS implemented a rate limit on full block reports in HDFS-7923, using
> BlockReportLeaseManager to reduce the number of concurrent block reports
> residing in the NameNode. Ozone should implement a similar mechanism to
> prevent FCR storms.
> A possible design is to register the DN first without including the full
> FCR in the registration. SCM then grants only N datanodes permission to send
> FCRs at a time, similar to the HDFS implementation.
> Another possibility for reducing the size of a single FCR is to split it
> into one FCR per volume (can be considered in the future).
> One tradeoff of rate limiting is that a new SCM might take longer to exit
> SafeMode. However, this is better than an SCM OOM. Another tradeoff is that
> FCRs might be delayed in large clusters (we need to think about this).
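
The "grant only N datanodes permission at once" idea could be sketched
roughly as below. This is only an illustration under stated assumptions, not
Ozone or HDFS code: the class name FcrLeaseManager, its methods, and the
lease timeout are all hypothetical, and the real BlockReportLeaseManager in
HDFS differs in detail. Time is passed in explicitly to keep the sketch easy
to test.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of an FCR lease manager (not an existing Ozone class).
 * At most maxConcurrentLeases datanodes may hold a lease at any time; a
 * lease that is not released before its timeout is reclaimed.
 */
final class FcrLeaseManager {
  private final int maxConcurrentLeases;
  private final long leaseTimeoutMs;
  // Datanode ID -> lease expiry time in milliseconds.
  private final Map<String, Long> activeLeases = new LinkedHashMap<>();

  FcrLeaseManager(int maxConcurrentLeases, long leaseTimeoutMs) {
    if (maxConcurrentLeases <= 0 || leaseTimeoutMs <= 0) {
      throw new IllegalArgumentException("limits must be positive");
    }
    this.maxConcurrentLeases = maxConcurrentLeases;
    this.leaseTimeoutMs = leaseTimeoutMs;
  }

  /** Returns true if the datanode may send its FCR now. */
  synchronized boolean tryAcquire(String datanodeId, long nowMs) {
    expireStaleLeases(nowMs);
    if (activeLeases.containsKey(datanodeId)) {
      return true; // The datanode already holds a valid lease.
    }
    if (activeLeases.size() >= maxConcurrentLeases) {
      return false; // Limit reached; the datanode should retry later.
    }
    activeLeases.put(datanodeId, nowMs + leaseTimeoutMs);
    return true;
  }

  /** Called once the FCR from this datanode has been processed. */
  synchronized void release(String datanodeId) {
    activeLeases.remove(datanodeId);
  }

  synchronized int activeCount() {
    return activeLeases.size();
  }

  // Drops leases whose expiry time has passed, so that a datanode that
  // never sends its report cannot hold a slot forever.
  private void expireStaleLeases(long nowMs) {
    activeLeases.values().removeIf(expiry -> expiry <= nowMs);
  }
}
```

In such a design, SCM would presumably call tryAcquire when a datanode asks
to send its FCR (for example, while handling a heartbeat) and call release
after the report has been processed. The timeout bounds how long a slow or
dead datanode can hold up the others, which matters for the SafeMode exit
delay mentioned above.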
--
This message was sent by Atlassian Jira
(v8.20.10#820010)