[ https://issues.apache.org/jira/browse/HDFS-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HDFS-16775:
----------------------------------
    Labels: pull-request-available  (was: )

> Improve BlockPlacementPolicyRackFaultTolerant's chooseOnce
> ----------------------------------------------------------
>
>                 Key: HDFS-16775
>                 URL: https://issues.apache.org/jira/browse/HDFS-16775
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>
> In our online cluster, because of EC blocks, datanode decommissioning is relatively slow, and the NameNode log contains many INFO messages like 'Not enough replicas was chosen. Reason: {TOO_MANY_NODES_ON_RACK=13, NO_REQUIRED_STORAGE_TYPE=1, NOT_IN_SERVICE=5}', as follows:
> {code:java}
> 2022-08-17 14:22:53,133 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(904)) - [
> Node /rack1/ip1:50010 [
> Datanode ip1:50010 is not chosen since the rack has too many chosen nodes.
> Node /rack1/ip2:50010 [
> Datanode ip2:50010 is not chosen since the rack has too many chosen nodes.
> Node /rack1/ip3:50010 [
> Datanode ip3:50010 is not chosen since the rack has too many chosen nodes.
> Node /rack1/ip5:50010 [
> Datanode ip5:50010 is not chosen since the node is not in service.
> Node /rack1/ip6:50010 [
> Datanode ip6:50010 is not chosen since the rack has too many chosen nodes.
> Datanode None is not chosen since required storage types are unavailable for storage type DISK.
> 2022-08-17 14:22:53,133 INFO blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was chosen.
> Reason: {TOO_MANY_NODES_ON_RACK=4, NO_REQUIRED_STORAGE_TYPE=1, NOT_IN_SERVICE=1}
> 2022-08-17 14:22:53,133 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseLocalRack(718))
> - Failed to choose from local rack (location = /rack1), retry with the rack of the next replica (location = /rack2)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:914)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:710)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:670)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseOnce(BlockPlacementPolicyRackFaultTolerant.java:220)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseTargetInOrder(BlockPlacementPolicyRackFaultTolerant.java:96)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:478)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:350)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:170)
> at org.apache.hadoop.hdfs.server.blockmanagement.ErasureCodingWork.chooseTargets(ErasureCodingWork.java:63)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:2089)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:2027)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:5137)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:5003)
> at java.lang.Thread.run(Thread.java:748)
>
> 2022-08-17 14:22:53,133 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(904)) - [
> Node /rack2/ip6:50010 [
> Datanode ip6:50010 is not chosen since the node is not in service.
> Node /rack2/ip7:50010 [
> Datanode ip7:50010 is not chosen since the rack has too many chosen nodes.
> Node /rack2/ip8:50010 [
> Datanode ip8:50010 is not chosen since the rack has too many chosen nodes.
> Node /rack2/ip9:50010 [
> Datanode ip9:50010 is not chosen since the rack has too many chosen nodes.
> Node /rack2/ip10:50010 [
> Datanode ip10:50010 is not chosen since the rack has too many chosen nodes.
> Node /rack2/ip11:50010 [
> Datanode ip11:50010 is not chosen since the rack has too many chosen nodes.
> Node /rack2/ip12:50010 [
> Datanode ip12:50010 is not chosen since the rack has too many chosen nodes.
> ...
> Datanode None is not chosen since required storage types are unavailable for storage type DISK.
> 2022-08-17 14:22:53,133 INFO blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseRandom(912)) - Not enough replicas was chosen.
> Reason: {TOO_MANY_NODES_ON_RACK=16, NO_REQUIRED_STORAGE_TYPE=1, NOT_IN_SERVICE=1}
> 2022-08-17 14:22:53,133 DEBUG blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseFromNextRack(748))
> - Failed to choose from the next rack (location = /rack2), retry choosing randomly
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:914)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:745)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:722)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:670)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseOnce(BlockPlacementPolicyRackFaultTolerant.java:220)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseTargetInOrder(BlockPlacementPolicyRackFaultTolerant.java:96)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:478)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:350)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:170)
> at org.apache.hadoop.hdfs.server.blockmanagement.ErasureCodingWork.chooseTargets(ErasureCodingWork.java:63)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:2089)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:2027)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:5137)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:5003)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> This seriously affects the datanode decommissioning speed.
> The current process of choosing a target datanode for an EC block is:
> chooseLocalStorage -> chooseLocalRack -> chooseFromNextRack -> chooseRandom
> 1. chooseLocalStorage chooses localMachine as the target. localMachine is srcNodes[0]; it may not be decommissioning but is already in the excluded list, so it is not available.
> 2. chooseLocalRack chooses one node from the rack that localMachine is on, but that rack may already contain a chosen node, so no node there is available.
> 3. chooseFromNextRack retries with the rack of the next node in srcNodes, but that rack may also already contain a chosen node, so no node there is available either.
> 4. Finally, it retries by choosing randomly.
> So we can optimize this logic.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
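The four-step fallback chain described in the issue can be sketched as follows. This is a simplified, hypothetical model (plain strings for nodes, a map for the topology), not the actual HDFS `BlockPlacementPolicyDefault` code; it only illustrates why the first three strategies all fail when srcNodes[0] is excluded and the source racks already hold chosen replicas, so every attempt ends up in the random fallback:

```java
import java.util.*;

// Hypothetical sketch of the chooseLocalStorage -> chooseLocalRack ->
// chooseFromNextRack -> chooseRandom fallback chain. Names and data
// structures are simplified stand-ins, not the real HDFS classes.
public class ChooseTargetSketch {
    // rack -> nodes, e.g. "/rack1" -> ["ip1", "ip2"]
    static Map<String, List<String>> topology = new LinkedHashMap<>();

    static String rackOf(String node) {
        for (Map.Entry<String, List<String>> e : topology.entrySet())
            if (e.getValue().contains(node)) return e.getKey();
        return null;
    }

    // A rack is unusable once it already holds a chosen node
    // (the TOO_MANY_NODES_ON_RACK condition from the log).
    static String chooseFromRack(String rack, Set<String> excluded, Set<String> chosen) {
        Set<String> chosenRacks = new HashSet<>();
        for (String c : chosen) chosenRacks.add(rackOf(c));
        if (chosenRacks.contains(rack)) return null;
        for (String n : topology.getOrDefault(rack, List.of()))
            if (!excluded.contains(n) && !chosen.contains(n)) return n;
        return null;
    }

    static String chooseOnce(List<String> srcNodes, Set<String> excluded, Set<String> chosen) {
        // 1. chooseLocalStorage: try srcNodes[0] itself; it is usually
        //    already in the excluded list, so this fails.
        String local = srcNodes.get(0);
        if (!excluded.contains(local) && !chosen.contains(local)) return local;
        // 2. chooseLocalRack: try the rack of srcNodes[0]; it often
        //    already holds a chosen replica, so this fails too.
        String t = chooseFromRack(rackOf(local), excluded, chosen);
        if (t != null) return t;
        // 3. chooseFromNextRack: retry with the rack of the next source
        //    node, which may be full for the same reason.
        if (srcNodes.size() > 1) {
            t = chooseFromRack(rackOf(srcNodes.get(1)), excluded, chosen);
            if (t != null) return t;
        }
        // 4. chooseRandom: fall back to scanning the remaining racks.
        for (String rack : topology.keySet()) {
            t = chooseFromRack(rack, excluded, chosen);
            if (t != null) return t;
        }
        return null;
    }

    public static void main(String[] args) {
        topology.put("/rack1", List.of("ip1", "ip2"));
        topology.put("/rack2", List.of("ip3", "ip4"));
        topology.put("/rack3", List.of("ip5"));
        // srcNodes[0] (ip1) is excluded, and /rack1 and /rack2 each
        // already hold a chosen replica, so steps 1-3 all fail and
        // only the random fallback finds a node on /rack3.
        List<String> srcNodes = List.of("ip1", "ip3");
        Set<String> excluded = new HashSet<>(Set.of("ip1", "ip3"));
        Set<String> chosen = new HashSet<>(Set.of("ip2", "ip4"));
        System.out.println(chooseOnce(srcNodes, excluded, chosen)); // prints ip5
    }
}
```

In this toy run every earlier strategy wastes work before the random pass succeeds, which mirrors the repeated "Failed to choose from local rack ... retry" DEBUG lines in the log above and motivates skipping the doomed local attempts.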