[ 
https://issues.apache.org/jira/browse/HDDS-15330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-15330:
-------------------------------
    Description: 
We have seen cases where a newly bootstrapped SCM ran out of memory (OOM), 
even though the affected SCM has a 96 GB heap. We suspect the cause is the 
number of full container reports (FCRs) that SCM processes concurrently. 

HDFS implements a rate limit on full block reports in HDFS-7923, using 
BlockReportLeaseManager to reduce the number of block reports held 
concurrently in the NameNode. Ozone should implement a similar mechanism to 
prevent FCR storms.

A possible design is to register the DN first, without sending its FCR 
immediately. SCM then grants only N datanodes permission to send their FCRs at 
a time, similar to the HDFS implementation.
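
The design above could be sketched roughly as follows. This is only an 
illustration, not existing Ozone code: the class name FcrLeaseManager, its 
methods, and the lease expiry (added here so that a crashed DN cannot hold a 
slot forever) are all assumptions, and the HDFS BlockReportLeaseManager differs 
in its details.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: caps how many datanodes may send an FCR at once.
 * None of these names exist in Ozone; they are placeholders for illustration.
 */
final class FcrLeaseManager {
  private final int maxLeases;         // the "N" from the description
  private final long leaseDurationMs;  // assumed expiry for abandoned leases
  private final Map<String, Long> expiryByDatanode = new HashMap<>();

  FcrLeaseManager(int maxLeases, long leaseDurationMs) {
    if (maxLeases <= 0 || leaseDurationMs <= 0) {
      throw new IllegalArgumentException("limits must be positive");
    }
    this.maxLeases = maxLeases;
    this.leaseDurationMs = leaseDurationMs;
  }

  /** A registered DN asks (e.g. on heartbeat) whether it may send its FCR. */
  synchronized boolean tryAcquire(String datanodeId, long nowMs) {
    removeExpired(nowMs);
    if (expiryByDatanode.containsKey(datanodeId)) {
      return true;  // this DN already holds a valid lease
    }
    if (expiryByDatanode.size() >= maxLeases) {
      return false; // all N slots are taken; the DN retries later
    }
    expiryByDatanode.put(datanodeId, nowMs + leaseDurationMs);
    return true;
  }

  /** Called once SCM has finished processing the DN's FCR. */
  synchronized void release(String datanodeId) {
    expiryByDatanode.remove(datanodeId);
  }

  /** Number of leases that are still valid at the given time. */
  synchronized int activeLeases(long nowMs) {
    removeExpired(nowMs);
    return expiryByDatanode.size();
  }

  private void removeExpired(long nowMs) {
    // Removing from the values view also removes the map entries.
    expiryByDatanode.values().removeIf(expiry -> expiry <= nowMs);
  }
}
```

In this sketch a DN whose request is refused would simply keep heartbeating 
and ask again later, so at most N FCRs are in flight in SCM at any time.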

Another possibility, to reduce the size of a single FCR, is to split the FCR 
into one FCR per volume (can be considered in the future). 

One tradeoff of rate limiting is that it might delay a new SCM's SafeMode 
exit. However, this is better than an SCM OOM.

  was:
We have seen cases where a newly bootstrapped SCM ran out of memory (OOM), 
even though the affected SCM has a 96 GB heap. We suspect the cause is the 
number of full container reports (FCRs) that SCM processes concurrently. 

HDFS implements a rate limit on full block reports in HDFS-7923, using 
BlockReportLeaseManager to reduce the number of block reports held 
concurrently in the NameNode. Ozone should implement a similar mechanism to 
prevent FCR storms.

A possible design is to register the DN first, without sending its FCR 
immediately. SCM then grants only N datanodes permission to send their FCRs at 
a time, similar to the HDFS implementation.

One tradeoff of rate limiting is that it might delay a new SCM's SafeMode 
exit. However, this is better than an SCM OOM.


> Implement SCM FCR rate-limit
> ----------------------------
>
>                 Key: HDDS-15330
>                 URL: https://issues.apache.org/jira/browse/HDDS-15330
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> We have seen cases where a newly bootstrapped SCM ran out of memory (OOM), 
> even though the affected SCM has a 96 GB heap. We suspect the cause is the 
> number of full container reports (FCRs) that SCM processes concurrently. 
> HDFS implements a rate limit on full block reports in HDFS-7923, using 
> BlockReportLeaseManager to reduce the number of block reports held 
> concurrently in the NameNode. Ozone should implement a similar mechanism to 
> prevent FCR storms.
> A possible design is to register the DN first, without sending its FCR 
> immediately. SCM then grants only N datanodes permission to send their FCRs 
> at a time, similar to the HDFS implementation.
> Another possibility, to reduce the size of a single FCR, is to split the FCR 
> into one FCR per volume (can be considered in the future). 
> One tradeoff of rate limiting is that it might delay a new SCM's SafeMode 
> exit. However, this is better than an SCM OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
