[ 
https://issues.apache.org/jira/browse/HDFS-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866322#action_12866322
 ] 

Todd Lipcon commented on HDFS-889:
----------------------------------

Is this just a test bug? i.e., is the contract of BlockManager that its methods 
require the FSN lock to be held, and the test is at fault for not doing so? Or 
do we have other cases in the NN where we access these iterators w/o 
synchronization?

> Possible race condition in BlocksMap.NodeIterator.
> --------------------------------------------------
>
>                 Key: HDFS-889
>                 URL: https://issues.apache.org/jira/browse/HDFS-889
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Steve Loughran
>
> Hudson's test run for HDFS-165 is showing an NPE in 
> {{org.apache.hadoop.hdfs.server.namenode.TestNodeCount.testNodeCount()}}
> One problem could be in {{BlocksMap.NodeIterator}}. Its {{hasNext()}} method 
> checks the next entry isn't null. But what if, between the {{hasNext()}} call 
> and the {{next()}} operation, the map changes and an entry goes away? In that 
> situation, the node returned from {{next()}} will be null. 
> This is potentially serious, as a quick look through the code shows that the 
> iterator gets retrieved a lot, and everywhere Hadoop does so it assumes the 
> value is not null. It's also one of those problems that doesn't have a simple 
> "make it go away" fix.
> Options
> # Ignore it, hope it doesn't happen very often and that the test failure was a 
> one-off that will never happen in a production datacentre. This is the default. 
> The iterator is only used in the namenode, so while it does depend on the 
> number of datanodes, it isn't running on 4000 machines in a big cluster.
> # Leave the iterator as is, and have all the in-Hadoop code check for a 
> null value and break out of the loop.
> # Patch the {{NodeIterator}} to be consistent with the {{Iterator}} 
> specification and throw a {{NoSuchElementException}} if the next value is 
> null. This does not make the problem go away, but it can now be handled by 
> having every use in Hadoop catch the exception at the right point and exit 
> the loop. 
> Testing. This should be possible.
> # Create a block map
> # Iterate over a block.
> # While the iterator is in progress, remove the next block in the list. Expect 
> the next call to {{next()}} to fail in whatever way you choose. 
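
For illustration only, here is a minimal sketch of option 3 together with the 
test steps from the description. It is not the real BlocksMap code: Slots and 
StrictNodeIterator are hypothetical stand-ins, datanodes are plain strings, and 
the "race" is simulated in a single thread by removing the entry between the 
hasNext() and next() calls.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a block's node list; not the real BlocksMap. */
class NodeIteratorSketch {

    /** Fixed-capacity slot array; a null slot means "no node here". */
    static final class Slots {
        private final String[] nodes;

        Slots(String... initial) {
            this.nodes = initial.clone();
        }

        /** Removes the node at idx, shifts later nodes down, leaves a trailing null. */
        void remove(int idx) {
            System.arraycopy(nodes, idx + 1, nodes, idx, nodes.length - idx - 1);
            nodes[nodes.length - 1] = null;
        }

        String get(int idx) {
            return nodes[idx];
        }

        int capacity() {
            return nodes.length;
        }
    }

    /** Iterator that throws instead of returning null when the entry is gone. */
    static final class StrictNodeIterator implements Iterator<String> {
        private final Slots slots;
        private int nextIdx = 0;

        StrictNodeIterator(Slots slots) {
            this.slots = slots;
        }

        @Override
        public boolean hasNext() {
            return nextIdx < slots.capacity() && slots.get(nextIdx) != null;
        }

        @Override
        public String next() {
            // Re-check here: the entry seen by hasNext() may have gone away since.
            if (!hasNext()) {
                throw new NoSuchElementException("no node at index " + nextIdx);
            }
            return slots.get(nextIdx++);
        }
    }

    public static void main(String[] args) {
        Slots slots = new Slots("dn1", "dn2");
        Iterator<String> it = new StrictNodeIterator(slots);
        it.next();                      // consume the first node
        boolean sawNext = it.hasNext(); // the second node is still present here
        slots.remove(1);                // entry disappears before next() is called
        try {
            it.next();
            System.out.println("unexpected: next() returned a value");
        } catch (NoSuchElementException e) {
            System.out.println("hasNext() was " + sawNext + ", then next() threw");
        }
    }
}
```

As the description notes, this does not remove the race; it only turns a 
silent null into an exception that callers must handle. If the contract is 
instead that callers hold the lock while iterating, the re-check in next() 
would never fire.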

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
