[ 
https://issues.apache.org/jira/browse/HDFS-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-10320:
-----------------------------
    Description: 
If rack failures leave only one rack available, 
{{BlockPlacementPolicyDefault#chooseRandom}} may get an 
{{InvalidTopologyException}} when calling {{NetworkTopology#chooseRandom}}. 
The exception then propagates all the way out to {{BlockManager}}'s 
{{ReplicationMonitor}} thread and terminates the NN.

Log:
{noformat}
2016-02-24 09:22:01,514  WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2016-02-24 09:22:01,958  ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception. 
org.apache.hadoop.net.NetworkTopology$InvalidTopologyException: Failed to find datanode (scope="" excludedScope="/rack_a5").
        at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:729)
        at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:694)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:635)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:580)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:348)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:214)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:111)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:3746)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$200(BlockManager.java:3711)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1400)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1306)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3682)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3634)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
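
To illustrate the failure mode, here is a minimal, self-contained sketch. It is not Hadoop code: {{ToyTopology}}, {{chooseRemoteRackOrNull}}, and the local {{InvalidTopologyException}} class are hypothetical stand-ins, and the guard shown is only one possible way to handle the case, not necessarily the fix for this issue. It shows how excluding the only remaining rack leaves no candidate, how the resulting unchecked exception would escape the caller, and how a caller could treat it as "no target found" instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RackFailureSketch {
    /** Hypothetical stand-in for an unchecked topology exception. */
    static class InvalidTopologyException extends RuntimeException {
        InvalidTopologyException(String msg) { super(msg); }
    }

    /** Toy topology holding node paths such as "/rack_a5/dn1". */
    static class ToyTopology {
        private final List<String> nodes = new ArrayList<>();
        private final Random random = new Random();

        void add(String nodePath) { nodes.add(nodePath); }

        /** Picks a random node outside excludedScope; throws if none exists. */
        String chooseRandom(String excludedScope) {
            List<String> candidates = new ArrayList<>();
            for (String n : nodes) {
                if (!n.startsWith(excludedScope + "/")) {
                    candidates.add(n);
                }
            }
            if (candidates.isEmpty()) {
                throw new InvalidTopologyException(
                    "Failed to find datanode (scope=\"\" excludedScope=\""
                        + excludedScope + "\").");
            }
            return candidates.get(random.nextInt(candidates.size()));
        }
    }

    /** Hypothetical guard: report "no target" instead of letting the exception escape. */
    static String chooseRemoteRackOrNull(ToyTopology topology, String localRack) {
        try {
            return topology.chooseRandom(localRack);
        } catch (InvalidTopologyException e) {
            return null; // the caller can log this and retry on a later pass
        }
    }

    public static void main(String[] args) {
        // Only one rack survives, and it is also the excluded (local) rack.
        ToyTopology topology = new ToyTopology();
        topology.add("/rack_a5/dn1");
        topology.add("/rack_a5/dn2");

        try {
            topology.chooseRandom("/rack_a5"); // unguarded call
        } catch (InvalidTopologyException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
        System.out.println("guarded: " + chooseRemoteRackOrNull(topology, "/rack_a5"));
    }
}
```

In this toy model the unguarded call throws because every node lives under the excluded scope, which matches the {{scope="" excludedScope="/rack_a5"}} message in the log above. A long-running monitor thread that does not catch such an exception would stop, whereas the guarded variant lets it continue and try again later.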

> Rack failures may result in NN terminate
> ----------------------------------------
>
>                 Key: HDFS-10320
>                 URL: https://issues.apache.org/jira/browse/HDFS-10320
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen


