[jira] [Commented] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly

He Xiaoqiao (JIRA) Fri, 17 Aug 2018 21:41:47 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584630#comment-16584630
 ]


He Xiaoqiao commented on HDFS-13833:
------------------------------------

IIUC I think inconsistent {{stats}} is the main reason about this corner case:
a. {{chooseTarget}} at {{BlockPlacementPolicyDefault}} is not under global 
lock, and it can been invoke with {{sendHeartbeat}} at same time.
b. when {{chooseTarget}} has choose the target node, it's stat (of course 
include {{load}}) is immutable, because {{DatanodeStorageInfo}} instances is 
created by {{DatanodeDescriptor#getStorageInfos}}.
c. On the other hand, when compare load about target node to average load (* 
factor) of the whole cluster, it get the real time value.
d. if update heartbeat between step b and step c, and the latest load of the 
only node is zero unfortunately, chooseTarget will fail, and exception stack 
show as depict above. 
e. the good news is the average load will update when next heartbeat at the 
latest, and chooseTarget will work as expect. 

{code:java}
    // check the communication traffic of the target machine
    if (considerLoad) {
      final double maxLoad = 2.0 * stats.getInServiceXceiverAverage();
      final int nodeLoad = node.getXceiverCount();
      if (nodeLoad > maxLoad) {
        logNodeIsNotChosen(storage, "the node is too busy (load: " + nodeLoad
            + " > " + maxLoad + ") ");
        return false;
      }
    }
{code}
my suggestion is try to add more datanode (for instance: 3) in cluster.
[~jojochuang],[~xiaochen] branch trunk also has this problem.

> Failed to choose from local rack (location = /default); the second replica is 
> not found, retry choosing ramdomly
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13833
>                 URL: https://issues.apache.org/jira/browse/HDFS-13833
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Henrique Barros
>            Priority: Critical
>
> I'm having a random problem with blocks replication with Hadoop 
> 2.6.0-cdh5.15.0
> With Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21
>  
> In my case we are getting this error very randomly (after some hours) and 
> with only one Datanode (for now, we are trying this cloudera cluster for a 
> POC)
> Here is the Log.
> {code:java}
> Choosing random from 1 available nodes on node /default, scope=/default, 
> excludedScope=null, excludeNodes=[]
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> Choosing random from 0 available nodes on node /default, scope=/default, 
> excludedScope=null, excludeNodes=[192.168.220.53:50010]
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> chooseRandom returning null
> 2:38:20.527 PM        DEBUG   BlockPlacementPolicy    
> [
> Node /default/192.168.220.53:50010 [
>   Datanode 192.168.220.53:50010 is not chosen since the node is too busy 
> (load: 8 > 0.0).
> 2:38:20.527 PM        DEBUG   NetworkTopology 
> chooseRandom returning 192.168.220.53:50010
> 2:38:20.527 PM        INFO    BlockPlacementPolicy    
> Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
> 2:38:20.527 PM        DEBUG   StateChange     
> closeFile: 
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9
>  with 1 blocks is persisted to the file system
> 2:38:20.527 PM        DEBUG   StateChange     
> *BLOCK* NameNode.addBlock: file 
> /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660
>  fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
> 2:38:20.527 PM        DEBUG   BlockPlacementPolicy    
> Failed to choose from local rack (location = /default); the second replica is 
> not found, retry choosing ramdomly
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
>  
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
>       at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
>       at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
> {code}
> This part makes no sense at all:
> {code:java}
> load: 8 > 0.0{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly

Reply via email to