[ 
https://issues.apache.org/jira/browse/HDFS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinayakumar B updated HDFS-9619:
--------------------------------
    Summary: SimulatedFSDataset sometimes can not find blockpool for the 
correct namenode  (was: DataNode sometimes can not find blockpool for the 
correct namenode)

> SimulatedFSDataset sometimes can not find blockpool for the correct namenode
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-9619
>                 URL: https://issues.apache.org/jira/browse/HDFS-9619
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, test
>    Affects Versions: 3.0.0
>         Environment: Jenkins
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>              Labels: test
>         Attachments: HDFS-9619.001.patch, HDFS-9619.002.patch
>
>
> We sometimes see {{TestBalancerWithMultipleNameNodes.testBalancer}} fail to 
> replicate a file because a data node is excluded.
> {noformat}
> File /tmp.txt could only be replicated to 0 nodes instead of minReplication 
> (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this 
> operation.
>  at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
>  at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
>  at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
> {noformat}
> Relevant logs suggest the root cause is that the block pool was not found.
> {noformat}
> 2016-01-03 22:11:43,174 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block 
> BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR 
> datanode.DataNode (DataXceiver.java:run(280)) - 
> host0.foo.com:49997:DataXceiver error processing WRITE_BLOCK operation src: 
> /127.0.0.1:47318 dst: /127.0.0.1:49997
> java.io.IOException: Non existent blockpool 
> BP-1927700312-172.26.2.1-1451887902222
> at 
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
> at 
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
> at 
> org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> For a bit more context, this test starts a cluster with two name nodes and 
> one data node. The block pools are added, but one of them can not be found 
> afterwards. The root cause is undetected concurrent access to a hash map in 
> SimulatedFSDataset: the two block pools are added simultaneously. I added 
> logging to print blockMap and saw a few ConcurrentModificationExceptions. 
> The solution is to use a thread-safe class such as ConcurrentHashMap 
> instead.
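The race described above can be sketched as follows. This is a minimal illustration of the proposed fix, not the actual patch: the class and method names (BlockPoolMapSketch, addBlockPool, hasBlockPool) and the map layout are assumptions chosen to mirror the description of SimulatedFSDataset's blockMap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a datanode-side map from block-pool ID to per-block
// state. When two namenodes register their block pools at the same time,
// concurrent puts into a plain HashMap can corrupt the table or lose an
// entry, so a later lookup fails with "Non existent blockpool". Backing the
// map with ConcurrentHashMap makes the registration path thread safe.
public class BlockPoolMapSketch {
    // bpid -> (blockId -> per-block state); value type is illustrative only
    private final Map<String, Map<Long, String>> blockMap =
        new ConcurrentHashMap<>();

    void addBlockPool(String bpid) {
        // putIfAbsent is atomic, so concurrent registrations cannot clobber
        // each other or leave the map in an inconsistent state
        blockMap.putIfAbsent(bpid, new ConcurrentHashMap<>());
    }

    boolean hasBlockPool(String bpid) {
        return blockMap.containsKey(bpid);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockPoolMapSketch ds = new BlockPoolMapSketch();
        // Two namenodes adding their block pools simultaneously
        Thread t1 = new Thread(() -> ds.addBlockPool("BP-1"));
        Thread t2 = new Thread(() -> ds.addBlockPool("BP-2"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Both pools must be visible afterwards
        System.out.println(ds.hasBlockPool("BP-1") && ds.hasBlockPool("BP-2"));
    }
}
```

With a plain HashMap the same interleaving can intermittently throw ConcurrentModificationException or drop one of the pools, which matches the logging observations above.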



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
