[ 
https://issues.apache.org/jira/browse/HDFS-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532857#comment-13532857
 ] 

Aaron T. Myers commented on HDFS-4315:
--------------------------------------

In all of the failing test runs that I saw, the client would end up failing 
with an error like the following:

{noformat}
2012-12-14 16:30:36,818 WARN  hdfs.DFSClient (DFSOutputStream.java:run(562)) - DataStreamer Exception
java.io.IOException: Failed to add a datanode.  User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT.  (Nodes: current=[127.0.0.1:52552, 127.0.0.1:43557], original=[127.0.0.1:43557, 127.0.0.1:52552])
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:792)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:852)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:958)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:469)
{noformat}

This suggests that either an entire DN or one of the BPOfferServices of one of 
the DNs was not starting correctly, or had not started by the time the client 
was trying to access it. Unfortunately, TestWebHdfsWithMultipleNameNodes 
disables the DN logger, so it wasn't obvious what was causing that problem. 
Upon changing the test to not disable the logger and looping the test, I would 
occasionally see an error like the following:

{noformat}
java.lang.NullPointerException
  at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:850)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:819)
  at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:308)
  at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:218)
  at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:660)
  at java.lang.Thread.run(Thread.java:662)
{noformat}

This error would cause one of the BPOfferServices in one of the DNs to not come 
up. The reason for this is that concurrent, unsynchronized puts to the HashMap 
DataStorage#bpStorageMap result in undefined behavior, including entries that 
were previously added no longer appearing to be in the map.
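
As a rough illustration of the kind of fix this implies, here is a minimal, 
self-contained sketch. It is not the actual patch: the class 
BpStorageMapSketch, its method names, and the String value type are 
hypothetical stand-ins for DataStorage and its bpStorageMap. The only point is 
that wrapping the HashMap (or otherwise synchronizing access to it) lets 
several threads register block pools at the same time without losing entries.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for DataStorage; names are illustrative, not the HDFS API.
class BpStorageMapSketch {
    // Assumption: a synchronized wrapper is one possible fix. Each individual
    // put/get on the wrapper holds the map's lock, so concurrent puts are safe.
    private final Map<String, String> bpStorageMap =
        Collections.synchronizedMap(new HashMap<String, String>());

    void addBlockPoolStorage(String bpid, String storage) {
        bpStorageMap.put(bpid, storage);
    }

    String getBPStorage(String bpid) {
        return bpStorageMap.get(bpid);
    }

    int size() {
        return bpStorageMap.size();
    }

    // Simulates several threads registering block pools at the same time and
    // returns how many entries ended up in the map.
    static int registerConcurrently(int threads, final int perThread) {
        final BpStorageMapSketch storage = new BpStorageMapSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // Keys are unique per (thread, index) pair.
                    storage.addBlockPoolStorage("BP-" + id + "-" + i, "storage-" + id);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return storage.size();
    }
}
```

With the synchronized wrapper, registerConcurrently(8, 1000) should always 
leave all 8000 entries in the map. Swapping in a plain HashMap makes the 
outcome unpredictable, which matches the intermittent nature of the test 
failures.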
                
> DNs with multiple BPs can have BPOfferServices fail to start due to 
> unsynchronized map access
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4315
>                 URL: https://issues.apache.org/jira/browse/HDFS-4315
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.2-alpha
>            Reporter: Aaron T. Myers
>            Assignee: Aaron T. Myers
>
> In some nightly test runs we've seen pretty frequent failures of 
> TestWebHdfsWithMultipleNameNodes. I've traced the root cause to an 
> unsynchronized map access in the DataStorage class.
> More details in the first comment.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
