[ https://issues.apache.org/jira/browse/HDDS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Glen Geng updated HDDS-4754:
----------------------------
    Description:
During the Tencent monthly upgrade, we restart all DNs first, then stop the SCM, wait for a while, and start it again. The SCM goes OOM shortly after it starts.

The current DN retry policy is to retry sending at a fixed 1s interval. If all DNs lose their connection to the SCM at the same moment because the SCM restarts, they all send their container reports to the SCM at nearly the same time, which creates a ContainerReport storm.

We propose changing the retry policy that the datanode uses to connect to the SCM.
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }
    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);

    // Fixed 1s sleep between retries: the policy this issue proposes to change.
    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}

  was:
During our upgrade, we restart all DNs first, then stop the SCM, wait for a while, and start it again.

The current retry policy is to retry sending at a fixed 1s interval. If all DNs lose their connection to the SCM at the same moment because the SCM restarts, they all send their container reports to the SCM at nearly the same time.

We propose changing the retry policy that the datanode uses to connect to the SCM.
{code:java}
public void addSCMServer(InetSocketAddress address) throws IOException {
  writeLock();
  try {
    if (scmMachines.containsKey(address)) {
      LOG.warn("Trying to add an existing SCM Machine to Machines group. " +
          "Ignoring the request.");
      return;
    }
    Configuration hadoopConfig =
        LegacyHadoopConfigurationSource.asHadoopConfiguration(this.conf);
    RPC.setProtocolEngine(
        hadoopConfig,
        StorageContainerDatanodeProtocolPB.class,
        ProtobufRpcEngine.class);
    long version =
        RPC.getProtocolVersion(StorageContainerDatanodeProtocolPB.class);

    // Fixed 1s sleep between retries: the policy this issue proposes to change.
    RetryPolicy retryPolicy =
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            getScmRpcRetryCount(conf),
            1000, TimeUnit.MILLISECONDS);
{code}

> A restarted SCM quickly OOM due to ContainerReport Storm from DN cluster.
> -------------------------------------------------------------------------
>
>                 Key: HDDS-4754
>                 URL: https://issues.apache.org/jira/browse/HDDS-4754
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>            Reporter: runzhiwang
>            Priority: Major
>         Attachments: 企业微信截图_1611734015772.png