[jira] [Updated] (HDDS-4754) A restarted SCM quickly OOM due to ContainerReport Storm from DN cluster.

Glen Geng (Jira) Wed, 27 Jan 2021 00:27:18 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Glen Geng updated HDDS-4754:
----------------------------
    Description: 
During the monthly upgrade at Tencent, we restart all DNs first, then stop the 
SCM, wait for a while, and start it again. The SCM goes OOM shortly after it 
starts.

The current DN retry policy is to retry sending with a fixed 1s interval. When 
the SCM restarts, all DNs lose their connection to it at the same moment, so 
they all send their container reports to the SCM at nearly the same time. This 
is a ContainerReport storm.

 

We propose changing the retry policy that the datanode uses to connect to the 
SCM. The current fixed-sleep policy is set in addSCMServer:
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }

    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);

    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}
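
As a rough illustration of one direction such a change could take (an assumption for discussion, not the patch for this issue): replace the fixed 1s sleep with exponential backoff plus random jitter, so that DNs which lose the SCM at the same moment spread their retries over a growing window instead of retrying in lockstep. The class and method names below (BackoffSketch, ceilingMillis, sleepMillis) and the 1s base / 30s cap values are hypothetical and are not Ozone or Hadoop APIs.

```java
import java.util.Random;

/** Hypothetical sketch: exponential backoff with full jitter for DN retries. */
final class BackoffSketch {
  private final long baseMillis;
  private final long capMillis;
  private final Random random;

  BackoffSketch(long baseMillis, long capMillis, Random random) {
    if (baseMillis <= 0 || capMillis < baseMillis) {
      throw new IllegalArgumentException("need 0 < baseMillis <= capMillis");
    }
    this.baseMillis = baseMillis;
    this.capMillis = capMillis;
    this.random = random;
  }

  /** Upper bound of the sleep for a 0-based attempt: min(cap, base * 2^attempt). */
  long ceilingMillis(int attempt) {
    long ceiling = baseMillis;
    for (int i = 0; i < attempt && ceiling < capMillis; i++) {
      // Double the window, clamping at the cap and avoiding long overflow.
      ceiling = (ceiling > capMillis / 2) ? capMillis : ceiling * 2;
    }
    return ceiling;
  }

  /** Random sleep in [0, ceilingMillis(attempt)], so nodes do not retry together. */
  long sleepMillis(int attempt) {
    long ceiling = ceilingMillis(attempt);
    long jittered = (long) (random.nextDouble() * (ceiling + 1));
    return Math.min(ceiling, jittered);
  }

  public static void main(String[] args) {
    // Assumed values: 1s base (the current fixed interval) and a 30s cap.
    BackoffSketch backoff = new BackoffSketch(1000L, 30000L, new Random());
    for (int attempt = 0; attempt < 6; attempt++) {
      System.out.println("attempt " + attempt
          + ": sleep up to " + backoff.ceilingMillis(attempt)
          + " ms, picked " + backoff.sleepMillis(attempt) + " ms");
    }
  }
}
```

Each DN would draw its own random sleep per attempt, so reconnects after an SCM restart are spread over the window rather than arriving within the same second. Hadoop's RetryPolicies class also provides an exponentialBackoffRetry policy that could be evaluated in place of retryUpToMaximumCountWithFixedSleep; whether it fits this code path is not verified here.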

  was:
 

During our upgrade, we restart all DNs first, then stop the SCM, wait for a 
while, start it.

Current retry policy is retry sending with a 1s interval. 

Given at some time-point, all the DNs lost connection with the SCM at the same 
time, due to restart of SCM, all DNs will send container report to SCM nearly 
at the same time.

 

We propose to change datanode retry policy to connect SCM.
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }

    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);

    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}


> A restarted SCM quickly OOM due to ContainerReport Storm from DN cluster.
> -------------------------------------------------------------------------
>
>                 Key: HDDS-4754
>                 URL: https://issues.apache.org/jira/browse/HDDS-4754
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: runzhiwang
>            Priority: Major
>         Attachments: 企业微信截图_1611734015772.png


