[jira] [Comment Edited] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly
[ https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585717#comment-16585717 ]

Henrique Barros edited comment on HDFS-13833 at 8/20/18 9:54 AM:

Yes, [~hexiaoqiao] and [~xiaochen], it appears that the logic you explained is what is happening. The only thing that conflicts with it is that the load message is out of order with the message that reports the excluded node: the node appears to be excluded first, and the load message is printed afterwards.

However, you were able to reproduce it and your analysis makes total sense: it can only be inconsistent stats, with chooseTarget and {{sendHeartbeat}} being invoked at the same time. If that is the case, I think the stats should be preserved somehow until the next heartbeat. Your explanation would also account for why it happens so randomly.

I will keep 'considerLoad' disabled for now and will check whether it also happens with more than one DataNode.

Thank you very much for your fast response and help. I will come back soon with a conclusion on this issue so we can decide whether to close it or recheck.

was (Author: rikeppb100):

Yes, [~hexiaoqiao] and [~xiaochen], it appears that the logic you explained is what is happening. The only thing that conflicts with it is that the load message is out of order with the message that reports the excluded node: the node appears to be excluded first, and the load message is printed afterwards.

However, you were able to reproduce it and your analysis makes total sense: it can only be inconsistent stats, with chooseTarget and {{sendHeartbeat}} being invoked at the same time. If that is the case, I think the stats should be preserved somehow until the next heartbeat.

I will keep 'considerLoad' disabled for now and will check whether it also happens with more than one DataNode.

Thank you very much for your fast response and help. I will come back soon with a conclusion on this issue so we can decide whether to close it or recheck.
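The suspected window between a heartbeat update and a placement decision can be illustrated with a small model. This is only a sketch under stated assumptions, not the NameNode code: the class `HeartbeatStatsSketch` and its methods are hypothetical, and it assumes the heartbeat handler refreshes cluster stats by first subtracting a node's old values and then adding the new ones. It runs single-threaded so that the intermediate state is visible deterministically.

```java
/** Hypothetical model of cluster load stats; not the Hadoop implementation. */
final class HeartbeatStatsSketch {
    private int totalLoad;
    private int nodesInService;

    /** Counts a node and its load in the cluster-wide stats. */
    void add(int load) {
        totalLoad += load;
        nodesInService++;
    }

    /** Removes a node and its load from the cluster-wide stats. */
    void subtract(int load) {
        totalLoad -= load;
        nodesInService--;
    }

    /** Average load per counted node; 0.0 when no node is counted. */
    double averageLoad() {
        return nodesInService == 0 ? 0.0 : (double) totalLoad / nodesInService;
    }

    public static void main(String[] args) {
        HeartbeatStatsSketch stats = new HeartbeatStatsSketch();
        stats.add(8);                          // the only DataNode, load 8
        double before = stats.averageLoad();

        stats.subtract(8);                     // assumed heartbeat step 1: drop old values
        double during = stats.averageLoad();   // a reader here sees an empty cluster

        stats.add(10);                         // assumed heartbeat step 2: add new values
        double after = stats.averageLoad();

        System.out.println(before + " " + during + " " + after);
    }
}
```

With a single DataNode, any reader that observes the stats between the two steps sees an average of 0.0. In this model, holding one lock across both steps, or having readers work from a snapshot, would close the window.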
[jira] [Commented] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly
[ https://issues.apache.org/jira/browse/HDFS-13833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584285#comment-16584285 ]

Henrique Barros commented on HDFS-13833:

*Context:* We are migrating from HortonWorks Hadoop v2.3 to this one. The POC is crucial, since we decommissioned the HW nodes in order to install this Cloudera POC. With this random error we cannot accept the solution, at least not without knowing the real cause.

We already tried turning that check off (dfs.namenode.replication.considerLoad) and it works, but it only hides the problem. It is not caused by the load. Our load is really low across the whole cluster (2 NN and one DN). Disks, CPU and memory are all idle, we have no network or disk issues, and we are getting around 1 Gbit per second between all 3 machines.

It seems to me that the node is being excluded for some reason that we cannot find in the logs, and then the total load becomes equal to 0 and the message {code:java} load: 8 > 0.0{code} shows up. Sometimes that load is 2, other times it is 10, but the total load (the number on the right) is always zero, which seems to be a consequence of the only DN being excluded.

Do you know of any other crucial classes I can enable DEBUG logging on in order to find out more about this?

Any help is appreciated. We have already tried many configurations, including upgrading the Cloudera CDH version (it is now the one in the description box), and we even tried upgrading our Flink version from 1.3.2 to 1.6.0, and the same happens. Flink is our client, and this exception only happens with the Flink checkpoints pointing to HDFS.
Best Regards, Barros
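The comparison behind the `load: 8 > 0.0` message can be modelled in a few lines. This is a minimal sketch, not the Hadoop source: the class `LoadCheckSketch`, its method names, and the factor of 2.0 are assumptions made for illustration. It shows why a threshold derived from a cluster-wide load of zero rejects a node with any non-zero load.

```java
/** Simplified model of a load-based placement check; not the Hadoop source. */
final class LoadCheckSketch {
    /** Assumed multiplier applied to the average load to get the rejection threshold. */
    static final double ASSUMED_FACTOR = 2.0;

    /** Average load over the counted nodes; 0.0 when no node is counted. */
    static double averageLoad(int totalLoad, int nodesInService) {
        return nodesInService == 0 ? 0.0 : (double) totalLoad / nodesInService;
    }

    /** A node is rejected when its own load exceeds factor * average. */
    static boolean isTooBusy(int nodeLoad, double averageLoad) {
        double maxLoad = ASSUMED_FACTOR * averageLoad;
        return nodeLoad > maxLoad;
    }

    public static void main(String[] args) {
        // Single DataNode counted normally: threshold is 2.0 * 8 = 16.0, node accepted.
        System.out.println(isTooBusy(8, averageLoad(8, 1)));
        // Same node while it is missing from the stats: threshold is 0.0, node rejected.
        System.out.println(isTooBusy(8, averageLoad(0, 0)));
    }
}
```

Under these assumptions, the actual load on the node is irrelevant once the right-hand side is zero: a load of 2, 8 or 10 is rejected equally, which matches the varying left-hand values reported above.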
[jira] [Commented] (HDFS-5970) callers of NetworkTopology's chooseRandom method to expect null return value
[ https://issues.apache.org/jira/browse/HDFS-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584198#comment-16584198 ]

Henrique Barros commented on HDFS-5970:

I just reproduced it returning null. See the issue I created please: https://issues.apache.org/jira/browse/HDFS-13833

> callers of NetworkTopology's chooseRandom method to expect null return value
>
> Key: HDFS-5970
> URL: https://issues.apache.org/jira/browse/HDFS-5970
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.0.0-alpha1
> Reporter: Yongjun Zhang
> Priority: Minor
>
> Class NetworkTopology's method
>    public Node chooseRandom(String scope)
> calls
>    private Node chooseRandom(String scope, String excludedScope)
> which may return null value.
> Callers of this method such as BlockPlacementPolicyDefault etc need to be
> aware that.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
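One way a caller can make the possible null result explicit is to wrap the lookup. The following is a sketch only: `ChooseRandomNullSketch`, `chooseSafely`, and the two stand-in choosers are hypothetical names, and nodes are represented as plain strings rather than Hadoop's `Node` type so that the example stays self-contained.

```java
import java.util.Optional;
import java.util.function.Function;

/** Illustrates defensive handling of a chooser that may return null. */
final class ChooseRandomNullSketch {
    /** Wraps a chooser that may return null so callers must handle the empty case. */
    static Optional<String> chooseSafely(Function<String, String> chooser, String scope) {
        return Optional.ofNullable(chooser.apply(scope));
    }

    public static void main(String[] args) {
        // Hypothetical chooser that finds no node, like a scope with every node excluded.
        Function<String, String> emptyTopology = scope -> null;
        // Hypothetical chooser that always finds the same node.
        Function<String, String> singleNode = scope -> "192.168.220.53:50010";

        System.out.println(chooseSafely(emptyTopology, "/default").orElse("no node available"));
        System.out.println(chooseSafely(singleNode, "/default").orElse("no node available"));
    }
}
```

A plain `if (node == null)` check at each call site achieves the same thing; the wrapper simply makes the empty case hard to overlook.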
[jira] [Comment Edited] (HDFS-10453) ReplicationMonitor thread could stuck for long time due to the race between replication and delete of same file in a large cluster.
[ https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584189#comment-16584189 ]

Henrique Barros edited comment on HDFS-10453 at 8/17/18 5:26 PM:

I have the same problem with Hadoop 2.6.0-cdh5.15.0
With Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21

In my case we are getting this error at random and with only one Datanode (for now).
Here is the log.

{code:java}
Choosing random from 1 available nodes on node /default, scope=/default, excludedScope=null, excludeNodes=[]
2:38:20.527 PM DEBUG NetworkTopology
Choosing random from 0 available nodes on node /default, scope=/default, excludedScope=null, excludeNodes=[192.168.220.53:50010]
2:38:20.527 PM DEBUG NetworkTopology
chooseRandom returning null
2:38:20.527 PM DEBUG BlockPlacementPolicy
[
Node /default/192.168.220.53:50010 [
Datanode 192.168.220.53:50010 is not chosen since the node is too busy (load: 8 > 0.0).
2:38:20.527 PM DEBUG NetworkTopology
chooseRandom returning 192.168.220.53:50010
2:38:20.527 PM INFO BlockPlacementPolicy
Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
2:38:20.527 PM DEBUG StateChange
closeFile: /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9 with 1 blocks is persisted to the file system
2:38:20.527 PM DEBUG StateChange
*BLOCK* NameNode.addBlock: file /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660 fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
2:38:20.527 PM DEBUG BlockPlacementPolicy
Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
{code}

This part makes no sense at all:
{code:java}
load: 8 > 0.0{code}

I created a dedicated bug for this case, since it may have nothing to do with this one:
https://issues.apache.org/jira/browse/HDFS-13833

was (Author: rikeppb100):

I have the same problem with Hadoop 2.6.0-cdh5.15.0 With Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21 In my case we are
[jira] [Created] (HDFS-13833) Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly
Henrique Barros created HDFS-13833:

    Summary: Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly
    Key: HDFS-13833
    URL: https://issues.apache.org/jira/browse/HDFS-13833
    Project: Hadoop HDFS
    Issue Type: Bug
    Reporter: Henrique Barros

I'm having a random problem with block replication with Hadoop 2.6.0-cdh5.15.0
With Cloudera CDH-5.15.0-1.cdh5.15.0.p0.21

In my case we are getting this error at random (after some hours) and with only one Datanode (for now, we are trying this Cloudera cluster for a POC).
Here is the log.

{code:java}
Choosing random from 1 available nodes on node /default, scope=/default, excludedScope=null, excludeNodes=[]
2:38:20.527 PM DEBUG NetworkTopology
Choosing random from 0 available nodes on node /default, scope=/default, excludedScope=null, excludeNodes=[192.168.220.53:50010]
2:38:20.527 PM DEBUG NetworkTopology
chooseRandom returning null
2:38:20.527 PM DEBUG BlockPlacementPolicy
[
Node /default/192.168.220.53:50010 [
Datanode 192.168.220.53:50010 is not chosen since the node is too busy (load: 8 > 0.0).
2:38:20.527 PM DEBUG NetworkTopology
chooseRandom returning 192.168.220.53:50010
2:38:20.527 PM INFO BlockPlacementPolicy
Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=1}
2:38:20.527 PM DEBUG StateChange
closeFile: /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/eef8bff6-75a9-43c1-ae93-4b1a9ca31ad9 with 1 blocks is persisted to the file system
2:38:20.527 PM DEBUG StateChange
*BLOCK* NameNode.addBlock: file /mobi.me/development/apps/flink/checkpoints/a5a6806866c1640660924ea1453cbe34/chk-2118/1cfe900d-6f45-4b55-baaa-73c02ace2660 fileId=129628869 for DFSClient_NONMAPREDUCE_467616914_65
2:38:20.527 PM DEBUG BlockPlacementPolicy
Failed to choose from local rack (location = /default); the second replica is not found, retry choosing ramdomly
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:784)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:694)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:601)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:561)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:464)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:395)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:270)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:142)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:158)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1715)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3505)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:694)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:219)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:507)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
{code}

This part makes no sense at all:
{code:java}
load: 8 > 0.0{code}