[ https://issues.apache.org/jira/browse/HDFS-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang resolved HDFS-9361.
-----------------------------------
    Resolution: Not A Problem

I spent some time discussing the issue with [~walter.k.su] and I also agree this is not a problem. The test can be configured to ignore load factor.
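For reference, a minimal sketch of what "configured to ignore load factor" can look like in a MiniDFSCluster-based test. {{dfs.namenode.replication.considerLoad}} is the key that gates the load check in the default placement policy, and the replace-datanode-on-failure keys are what such a test relies on to always repair the pipeline; the class name and exact wiring below are illustrative, not the actual test patch:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Sketch only: a MiniDFSCluster configured so that block placement
// never rejects a datanode for being "too busy".
public class IgnoreLoadFactorSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Disable the load-based rejection in BlockPlacementPolicyDefault.
    conf.setBoolean("dfs.namenode.replication.considerLoad", false);

    // Have the client always add a replacement datanode when one fails
    // in the write pipeline.
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "ALWAYS");

    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
    try {
      cluster.waitActive();
      // ... run the write/replace-on-failure scenario here ...
    } finally {
      cluster.shutdown();
    }
  }
}
{code}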
> Default block placement policy causes TestReplaceDatanodeOnFailure to fail intermittently
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-9361
>                 URL: https://issues.apache.org/jira/browse/HDFS-9361
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: HDFS
>            Reporter: Wei-Chiu Chuang
>
> TestReplaceDatanodeOnFailure sometimes fails (see HDFS-6101).
> (For background: the test case sets up a cluster with three datanodes, adds two more datanodes, removes one datanode, and verifies that clients can correctly recover from the failure and re-establish three replicas.)
> I traced it down and found that sometimes a client sets up a pipeline with only two datanodes, one fewer than the test case configures, even though the test case is configured to always replace failed nodes.
> Digging into the log, I saw:
> {noformat}
> 2015-11-02 12:07:38,634 [IPC Server handler 8 on 50673] WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(355)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException: [
> Node /rack0/127.0.0.1:32931 [
>   Datanode 127.0.0.1:32931 is not chosen since the rack has too many chosen nodes.
> ]
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:723)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:624)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:429)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:342)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:220)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:105)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:120)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1727)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2457)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:796)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
> {noformat}
> So from the log, it seems the policy causes the pipeline selection to give up on the data node.
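> (For illustration of what "load factor" means here: as I read BlockPlacementPolicyDefault, a candidate datanode is rejected when its active transceiver count exceeds roughly twice the in-service cluster average. The standalone sketch below mirrors that check with made-up names and an assumed 2.0 threshold factor; it is not the actual Hadoop code.)
> {code:java}
> // Illustrative sketch of the default policy's load check: reject a
> // candidate whose xceiver count exceeds twice the cluster-wide
> // in-service average (threshold factor assumed to be 2.0).
> static boolean exceedsLoadThreshold(int nodeXceiverCount, double avgInServiceXceiverCount) {
>   final double maxLoad = 2.0 * avgInServiceXceiverCount;
>   return nodeXceiverCount > maxLoad;
> }
> {code}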
> I wonder whether this is appropriate or not. If the load factor exceeds a certain threshold, but the file has insufficient replicas, should the policy accept the pipeline as is, or should it attempt to acquire more replicas?
> I am filing this JIRA for discussion. I am not very familiar with block placement, so I may be wrong in my hypothesis.
> (Edit: I turned on the DEBUG option for Log4j and changed the logging message a bit to make it show the stack trace.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)