[ https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855822#comment-16855822 ]
He Xiaoqiao commented on HDFS-14527:
------------------------------------

Thanks [~elgoiri] for your detailed reviews. Uploaded [^HDFS-14527.003.patch] to address the review comments. Pending Jenkins. Further reviews are welcome. Thanks again.

> Stop all DataNodes may result in NN terminate
> ---------------------------------------------
>
>                 Key: HDFS-14527
>                 URL: https://issues.apache.org/jira/browse/HDFS-14527
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>         Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch,
> HDFS-14527.003.patch
>
>
> If we stop all DataNodes in the cluster, BlockPlacementPolicyDefault#chooseTarget
> may hit an ArithmeticException when calling #getMaxNodesPerRack. The runtime
> exception propagates to BlockManager's ReplicationMonitor thread and
> terminates the NN.
> The root cause is that BlockPlacementPolicyDefault#chooseTarget does not hold the
> global lock, so if all DataNodes die between the
> {{clusterMap.getNumOfLeaves()}} check and the {{getMaxNodesPerRack}} call,
> {{getMaxNodesPerRack}} throws {{ArithmeticException}}.
> {code:java}
>   private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
>                                     Node writer,
>                                     List<DatanodeStorageInfo> chosenStorage,
>                                     boolean returnChosenNodes,
>                                     Set<Node> excludedNodes,
>                                     long blocksize,
>                                     final BlockStoragePolicy storagePolicy,
>                                     EnumSet<AddBlockFlag> addBlockFlags,
>                                     EnumMap<StorageType, Integer> sTypes) {
>     if (numOfReplicas == 0 || clusterMap.getNumOfLeaves()==0) {
>       return DatanodeStorageInfo.EMPTY_ARRAY;
>     }
>     ......
>     int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
>     ......
> }
> {code}
> A detailed log excerpt follows.
> {code:java}
> 2019-05-31 12:29:21,803 ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> java.lang.ArithmeticException: / by zero
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getMaxNodesPerRack(BlockPlacementPolicyDefault.java:282)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:228)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:132)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4533)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$1800(BlockManager.java:4493)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1954)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1830)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4453)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4388)
>         at java.lang.Thread.run(Thread.java:745)
> 2019-05-31 12:29:21,805 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
> {code}
> To be honest, this is not a serious bug and is not easy to reproduce: if we stop
> all DataNodes and keep only the NameNode alive, HDFS cannot offer normal service
> and we can only retrieve directory information. It may be a corner case.
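
For illustration only, here is a minimal sketch of the failure mode described in the issue, not the actual Hadoop code and not the contents of any attached patch. It assumes the per-rack limit is computed by integer division by a rack count that can drop to zero after the earlier emptiness check has already passed. The class name `MaxNodesPerRackSketch`, the methods `unguarded` and `guarded`, and the choice to return 0 for an empty topology are all hypothetical.

```java
public class MaxNodesPerRackSketch {

    // Unguarded variant: trusts that the caller already verified the cluster
    // is non-empty. If the rack count has since dropped to 0, the integer
    // division throws java.lang.ArithmeticException.
    static int unguarded(int totalReplicas, int numOfRacks) {
        return (totalReplicas - 1) / numOfRacks + 2;
    }

    // Guarded variant (hypothetical): re-checks the rack count at the point
    // of use, so a topology that emptied concurrently yields 0 instead of
    // an exception reaching the calling thread.
    static int guarded(int totalReplicas, int numOfRacks) {
        if (numOfRacks == 0) {
            return 0; // assumed sentinel meaning "no targets can be chosen"
        }
        return (totalReplicas - 1) / numOfRacks + 2;
    }

    public static void main(String[] args) {
        // Simulate: the emptiness check passed earlier, then all nodes died.
        int racksSeenNow = 0;
        try {
            unguarded(3, racksSeenNow);
        } catch (ArithmeticException e) {
            System.out.println("unguarded failed: " + e.getMessage());
        }
        System.out.println("guarded returned: " + guarded(3, racksSeenNow));
    }
}
```

The point of the sketch is only that a check made before the value is re-read does not protect the later division unless both happen under the same lock or the value is validated again where it is used.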