[ 
https://issues.apache.org/jira/browse/HDFS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-9619:
----------------------------------
    Description: 
We sometimes see {{TestBalancerWithMultipleNameNodes.testBalancer}} failed to 
replicate a file, because a data node is excluded.

{noformat}
File /tmp.txt could only be replicated to 0 nodes instead of minReplication 
(=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this 
operation.
 at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
{noformat}

Relevent logs suggest root cause is due to block pool not found.  
{noformat}
2016-01-03 22:11:43,174 [DataXceiver for client 
DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block 
BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR 
datanode.DataNode (DataXceiver.java:run(280)) - host0.foo.com:49997:DataXceiver 
error processing WRITE_BLOCK operation src: /127.0.0.1:47318 dst: 
/127.0.0.1:49997
java.io.IOException: Non existent blockpool 
BP-1927700312-172.26.2.1-1451887902222
at 
org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
at 
org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
at 
org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
at java.lang.Thread.run(Thread.java:745)
{noformat}

For a bit more context, this test starts a cluster with two name nodes and one 
data node. The block pools are added, but one of them is not found after added. 
The root cause is due to an undetected concurrent access in a hash map in 
SimulatedFSDataset. The solution would be to use a thread safe class instead, 
like ConcurrentHashMap.

  was:
We sometimes see TestBalancerWithMultipleNameNodes.testBalancer failed to 
replicate a file, because a data node is excluded.

{noformat}
File /tmp.txt could only be replicated to 0 nodes instead of minReplication 
(=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this 
operation.
 at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
{noformat}

Relevent logs suggest root cause is due to block pool not found.  
{noformat}
2016-01-03 22:11:43,174 [DataXceiver for client 
DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block 
BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR 
datanode.DataNode (DataXceiver.java:run(280)) - host0.foo.com:49997:DataXceiver 
error processing WRITE_BLOCK operation src: /127.0.0.1:47318 dst: 
/127.0.0.1:49997
java.io.IOException: Non existent blockpool 
BP-1927700312-172.26.2.1-1451887902222
at 
org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
at 
org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
at 
org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
at java.lang.Thread.run(Thread.java:745)
{noformat}

For a bit more context, this test starts a cluster with two name nodes and one 
data node. The block pools are added, but one of them is not found after added. 
The root cause is due to an undetected concurrent access in a hash map in 
SimulatedFSDataset. The solution would be to use a thread safe class instead, 
like ConcurrentHashMap.


> DataNode sometimes can not find blockpool for the correct namenode
> ------------------------------------------------------------------
>
>                 Key: HDFS-9619
>                 URL: https://issues.apache.org/jira/browse/HDFS-9619
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>         Environment: Jenkins
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>
> We sometimes see {{TestBalancerWithMultipleNameNodes.testBalancer}} failed to 
> replicate a file, because a data node is excluded.
> {noformat}
> File /tmp.txt could only be replicated to 0 nodes instead of minReplication 
> (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this 
> operation.
>  at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
> {noformat}
> Relevent logs suggest root cause is due to block pool not found.  
> {noformat}
> 2016-01-03 22:11:43,174 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block 
> BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR 
> datanode.DataNode (DataXceiver.java:run(280)) - 
> host0.foo.com:49997:DataXceiver error processing WRITE_BLOCK operation src: 
> /127.0.0.1:47318 dst: /127.0.0.1:49997
> java.io.IOException: Non existent blockpool 
> BP-1927700312-172.26.2.1-1451887902222
> at 
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
> at 
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
> at 
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> For a bit more context, this test starts a cluster with two name nodes and 
> one data node. The block pools are added, but one of them is not found after 
> added. The root cause is due to an undetected concurrent access in a hash map 
> in SimulatedFSDataset. The solution would be to use a thread safe class 
> instead, like ConcurrentHashMap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to