[ https://issues.apache.org/jira/browse/HDFS-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532857#comment-13532857 ]
Aaron T. Myers commented on HDFS-4315:
--------------------------------------

In all of the failing test runs that I saw, the client would end up failing with an error like the following:

{noformat}
2012-12-14 16:30:36,818 WARN  hdfs.DFSClient (DFSOutputStream.java:run(562)) - DataStreamer Exception
java.io.IOException: Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT. (Nodes: current=[127.0.0.1:52552, 127.0.0.1:43557], original=[127.0.0.1:43557, 127.0.0.1:52552])
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:792)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:852)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:958)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:469)
{noformat}

This suggests that either an entire DN or one of the BPOfferServices of one of the DNs was not starting correctly, or had not started by the time the client was trying to access it. Unfortunately, TestWebHdfsWithMultipleNameNodes disables the DN logger, so it wasn't obvious what was causing that problem.
Upon changing the test to not disable the logger and looping the test, I would occasionally see an error like the following:

{noformat}
java.lang.NullPointerException
  at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:850)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:819)
  at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:308)
  at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:218)
  at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:660)
  at java.lang.Thread.run(Thread.java:662)
{noformat}

This error would cause one of the BPOfferServices in one of the DNs to not come up. The reason is that concurrent, unsynchronized puts to the HashMap DataStorage#bpStorageMap result in undefined behavior, including previously added entries no longer appearing to be in the map.

> DNs with multiple BPs can have BPOfferServices fail to start due to unsynchronized map access
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4315
>                 URL: https://issues.apache.org/jira/browse/HDFS-4315
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.2-alpha
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>
> In some nightly test runs we've seen pretty frequent failures of TestWebHdfsWithMultipleNameNodes. I've traced the root cause to an unsynchronized map access in the DataStorage class.
>
> More details in the first comment.
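
To make the hazard in the comment concrete, here is a minimal, standalone sketch of one way to guard a map that several threads write to at startup. It is an illustration under assumptions, not the patch for this issue: the class BpStorageMapSketch, its methods, and the String values standing in for per-block-pool storage objects are all hypothetical. Only the field name bpStorageMap is taken from the comment. The sketch wraps the HashMap with java.util.Collections.synchronizedMap so that each put is serialized and no entry is lost.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a class that holds one storage entry per block pool.
public class BpStorageMapSketch {
    // Assumption: block pool ID -> storage placeholder. Wrapping the HashMap
    // makes individual put/get/size calls mutually exclusive.
    private final Map<String, String> bpStorageMap =
            Collections.synchronizedMap(new HashMap<String, String>());

    public void addBlockPoolStorage(String bpId, String storage) {
        bpStorageMap.put(bpId, storage);
    }

    public String getBlockPoolStorage(String bpId) {
        return bpStorageMap.get(bpId);
    }

    public int size() {
        return bpStorageMap.size();
    }

    // Simulates several actor threads registering their block pools at once.
    public static BpStorageMapSketch registerConcurrently(int threads, int poolsPerThread)
            throws InterruptedException {
        final BpStorageMapSketch storage = new BpStorageMapSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < poolsPerThread; i++) {
                    String bpId = "BP-" + id + "-" + i;
                    storage.addBlockPoolStorage(bpId, "storage-for-" + bpId);
                }
            });
            workers[t].start();
        }
        // join() ensures every put is complete and visible before we read.
        for (Thread w : workers) {
            w.join();
        }
        return storage;
    }

    public static void main(String[] args) throws InterruptedException {
        BpStorageMapSketch s = registerConcurrently(8, 1000);
        System.out.println("entries: " + s.size());
    }
}
```

With the synchronized wrapper, every key written by every thread is present after the joins. Replacing the wrapper with a bare HashMap in the same harness may intermittently drop entries, so that a later lookup returns null, which is consistent with the kind of NullPointerException shown in the stack trace. A java.util.concurrent.ConcurrentHashMap would be another reasonable choice for the same purpose.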