[ https://issues.apache.org/jira/browse/HDFS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinayakumar B updated HDFS-9619:
--------------------------------
    Resolution: Fixed
  Hadoop Flags: Reviewed
 Fix Version/s: 2.8.0
        Status: Resolved  (was: Patch Available)

Committed to trunk, branch-2 and branch-2.8. Thanks [~jojochuang].

> SimulatedFSDataset sometimes can not find blockpool for the correct namenode
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-9619
>                 URL: https://issues.apache.org/jira/browse/HDFS-9619
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, test
>    Affects Versions: 3.0.0
>         Environment: Jenkins
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>              Labels: test
>             Fix For: 2.8.0
>
>         Attachments: HDFS-9619.001.patch, HDFS-9619.002.patch
>
>
> We sometimes see {{TestBalancerWithMultipleNameNodes.testBalancer}} fail to
> replicate a file because a data node is excluded.
> {noformat}
> File /tmp.txt could only be replicated to 0 nodes instead of minReplication
> (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this
> operation.
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745)
> 	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
> {noformat}
> Relevant logs suggest the root cause is a block pool that cannot be found.
> {noformat}
> 2016-01-03 22:11:43,174 [DataXceiver for client DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR datanode.DataNode (DataXceiver.java:run(280)) - host0.foo.com:49997:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:47318 dst: /127.0.0.1:49997
> java.io.IOException: Non existent blockpool BP-1927700312-172.26.2.1-1451887902222
> 	at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583)
> 	at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955)
> 	at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:203)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> For a bit more context: this test starts a cluster with two name nodes and
> one data node. The block pools are added, but one of them cannot be found
> after being added. The root cause is undetected concurrent access to a hash
> map in SimulatedFSDataset (the two block pools are added simultaneously). I
> added some logs to print blockMap and saw a few
> ConcurrentModificationExceptions. The solution would be to use a thread-safe
> class such as ConcurrentHashMap instead.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
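The race described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the class, field, and method names are simplified stand-ins, not the actual SimulatedFSDataset code or the committed patch): two threads register different block pools concurrently, and a ConcurrentHashMap guarantees both insertions survive, whereas a plain HashMap offers no such guarantee and can throw ConcurrentModificationException or silently lose an entry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the fix: a block map keyed by block pool id,
// mutated by two threads at once (one per namenode registering its pool).
// Using ConcurrentHashMap instead of HashMap makes addBlockPool safe.
public class BlockMapSketch {
    // Thread-safe map: block pool id -> (block id -> block data stand-in).
    static final Map<String, Map<Long, String>> blockMap =
        new ConcurrentHashMap<>();

    static void addBlockPool(String bpid) {
        // putIfAbsent is atomic on ConcurrentHashMap, so concurrent
        // registrations cannot corrupt the map or drop a pool.
        blockMap.putIfAbsent(bpid, new ConcurrentHashMap<>());
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> addBlockPool("BP-1"));
        Thread t2 = new Thread(() -> addBlockPool("BP-2"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Both pools must be visible afterwards; the failing test saw one
        // of them missing ("Non existent blockpool").
        System.out.println(blockMap.containsKey("BP-1")
                && blockMap.containsKey("BP-2"));
    }
}
```

With a plain HashMap the same interleaving can resize the internal table in both threads at once, which is exactly the kind of undetected concurrent access the reporter observed when printing blockMap.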