[ 
https://issues.apache.org/jira/browse/HDDS-2972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Lin updated HDDS-2972:
----------------------------
    Description: 
I found that any container replication error thrown in ReplicationManager can 
terminate the SCM service. Terminating the SCM service because of a single 
container replication error is a very expensive reaction.

It's not worth shutting down the SCM for this. We can handle it more gracefully: 
catch the exception and log a warning message that includes the thrown exception.
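
A minimal sketch of the handling proposed above, not the actual ReplicationManager code: each container is processed inside its own try/catch, so one failure is logged as a warning and the loop moves on to the next container. The class name `ReplicationLoopSketch`, the method `processAll`, the string container IDs, and the use of `java.util.logging` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the replication monitor loop; not the real SCM class.
public class ReplicationLoopSketch {
  private static final Logger LOG =
      Logger.getLogger(ReplicationLoopSketch.class.getName());

  /**
   * Runs the processor on every container ID. A runtime failure for one
   * container is logged as a warning and does not stop the loop.
   * Returns the IDs whose processing failed.
   */
  public static List<String> processAll(Iterable<String> containerIds,
                                        Consumer<String> processor) {
    List<String> failed = new ArrayList<>();
    for (String id : containerIds) {
      try {
        processor.accept(id);
      } catch (RuntimeException e) {
        // Log with the thrown exception and continue with the next container.
        LOG.log(Level.WARNING,
            "Failed to process container " + id + ", skipping it this round.", e);
        failed.add(id);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    List<String> processed = new ArrayList<>();
    List<String> failed = processAll(Arrays.asList("c1", "c2", "c3"), id -> {
      if (id.equals("c2")) {
        // Simulates the placement failure from the log in this issue.
        throw new IllegalArgumentException(
            "Affinity node is not a member of topology");
      }
      processed.add(id);
    });
    System.out.println("processed=" + processed + " failed=" + failed);
  }
}
```

Only `RuntimeException` is caught here, so serious JVM errors still propagate; whether the real fix should catch more broadly is a separate design question.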

The shutdown info:
{noformat}
2020-01-30 08:16:04,705 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
        at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
        at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
        at java.lang.Thread.run(Thread.java:745)
2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
2020-01-30 08:16:04,734 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
{noformat}

  was:
I found there any container replication error running in ReplicationManager can 
terminates SCM service. It's a very expensive behavior to terminate the SCM 
service just because of one container replication error.

It's not worth to shutdown the SCM. We can be friendly to deal with this, catch 
the exception and print the warn message with thrown exception.

The shutdown info:
{noformat}
2020-01-30 08:16:04,705 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread.
java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
        at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
        at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
        at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
        at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
        at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
        at java.lang.Thread.run(Thread.java:745)
2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
2020-01-30 08:16:04,734 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
{noformat}


> Any container replication error can terminate SCM service
> ---------------------------------------------------------
>
>                 Key: HDDS-2972
>                 URL: https://issues.apache.org/jira/browse/HDDS-2972
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 0.4.1
>            Reporter: Yiqun Lin
>            Assignee: Yiqun Lin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
