[ https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584630#comment-16584630 ]
He Xiaoqiao commented on HDFS-13833:
------------------------------------

IIUC, inconsistent {{stats}} is the root cause of this corner case:
a. {{chooseTarget}} in {{BlockPlacementPolicyDefault}} does not run under the global lock, so it can execute concurrently with {{sendHeartbeat}}.
b. Once {{chooseTarget}} has chosen the target node, that node's stats (including {{load}}) are a fixed snapshot, because the {{DatanodeStorageInfo}} instances are created by {{DatanodeDescriptor#getStorageInfos}}.
c. On the other hand, when the target node's load is compared against the average load of the whole cluster (times a factor), the real-time value is read.
d. If a heartbeat update lands between steps b and c, and the latest load of the only node unfortunately drops to zero, {{chooseTarget}} fails with the exception stack shown above.
e. The good news is that the average load is refreshed by the next heartbeat at the latest, after which {{chooseTarget}} works as expected.
{code:java}
// check the communication traffic of the target machine
if (considerLoad) {
  final double maxLoad = 2.0 * stats.getInServiceXceiverAverage();
  final int nodeLoad = node.getXceiverCount();
  if (nodeLoad > maxLoad) {
    logNodeIsNotChosen(storage, "the node is too busy (load: " + nodeLoad
        + " > " + maxLoad + ") ");
    return false;
  }
}
{code}
My suggestion is to add more datanodes (for instance, 3) to the cluster. [~jojochuang],[~xiaochen] branch trunk also has this problem.
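To illustrate steps b–d, here is a minimal standalone sketch of the arithmetic only. It is not the real {{BlockPlacementPolicyDefault}} code; the class and method names are illustrative assumptions. It shows how a node whose cached load snapshot is 8 passes the check while the single-node cluster average is still 8 (maxLoad = 16), but is rejected the moment a concurrent heartbeat drives the real-time average to 0 (maxLoad = 0.0), producing the "load: 8 > 0.0" rejection.
{code:java}
// Hypothetical sketch: reproduces the considerLoad comparison in isolation.
public class StaleLoadCheck {

    /** Mirrors the considerLoad check: reject if nodeLoad > 2 * average. */
    static boolean isNodeTooBusy(int nodeLoad, double clusterAvgXceivers) {
        final double maxLoad = 2.0 * clusterAvgXceivers;
        return nodeLoad > maxLoad;
    }

    public static void main(String[] args) {
        // Step b: chooseTarget captured a snapshot of the node with load 8.
        int cachedNodeLoad = 8;

        // Before the heartbeat, the 1-node cluster average is 8, maxLoad = 16.
        System.out.println(isNodeTooBusy(cachedNodeLoad, 8.0)); // false

        // Step d: a heartbeat reporting 0 xceivers lands in between, so the
        // real-time average drops to 0 and maxLoad becomes 0.0.
        System.out.println(isNodeTooBusy(cachedNodeLoad, 0.0)); // true
    }
}
{code}
The next heartbeat restores a non-zero average, which is why the failure clears itself (step e); adding more datanodes makes a single zero-load report far less likely to zero out the cluster average.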
> Failed to choose from local rack (location = /default); the second replica is
> not found, retry choosing ramdomly
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13833
>                 URL: https://issues.apache.org/jira/browse/HDFS-13833
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Henrique Barros
>            Priority: Critical
>
> I'm having a random problem with block replication on Hadoop 2.6.0-cdh5.15.0,
> with Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21.
>
> In my case we are getting this error very randomly (after some hours) and
> with only one datanode (for now, we are trying this Cloudera cluster for a
> POC).
> Here is the log:
> {code:java}
> Choosing random from 1 available nodes on node /default, scope=/default, excludedScope=null, excludeNodes=[]
> 2:38:20.527 PM DEBUG NetworkTopology
> Choosing random from 0 available nodes on node /default, scope=/default, excludedScope=null, excludeNodes=[192.168.220.53:50010]
> 2:38:20.527 PM DEBUG NetworkTopology
> chooseRandom returning null
> 2:38:20.527 PM DEBUG BlockPlacementPolicy
> [
> Node /default/192.168.220.53:50010 [
> Datanode 192.168.220.53:50010 is not chosen since the node is too busy (load: 8 > 0.0).
> 2:38:20.527 PM DEBUG NetworkTopology
> chooseRandom returning 192.168.220.53:50010
> 2:38:20.527 PM INFO BlockPlacementPolicy
> Not enough replicas was chosen.
> Reason:{NODE_TOO_BUSY=1}
> 2:38:20.527 PM DEBUG StateChange
> closeFile: /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9 with 1 blocks is persisted to the file system
> 2:38:20.527 PM DEBUG StateChange
> *BLOCK* NameNode.addBlock: file /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660 fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
> 2:38:20.527 PM DEBUG BlockPlacementPolicy
> Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
>     at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
> {code}
> This part makes no sense at all:
> {code:java}
> load: 8 > 0.0{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)