[ https://issues.apache.org/jira/browse/HDFS-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Foley updated HDFS-1878:
-----------------------------

    Description: 
In 20.204, TestHDFSServerPorts was observed to intermittently throw a 
NullPointerException.  This only happens when FSNamesystem.close() is called, 
which means system termination for the Namenode, so this is not a serious bug 
for .204.  TestHDFSServerPorts is more likely than normal execution to 
stimulate the race, because it runs two Namenodes in the same JVM, causing more 
interleaving and more potential to see a race condition.

The race is in FSNamesystem.close().  At line 566, we have:
      if (replthread != null) replthread.interrupt();
      if (replmon != null) replmon = null;

Since the interrupted replthread is not waited on, replmon can be nulled before 
replthread is dead.  replthread still references replmon in 
computeDatanodeWork(), which is where the NullPointerException occurs.

The solution is either to wait on replthread or simply not to null replmon.  
The latter is preferred, since none of the sibling Namenode processing threads 
are waited on in close().
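
For illustration only, the two options can be sketched in a small standalone 
Java program.  This is not the FSNamesystem code and not the attached patch: 
the class CloseRaceSketch, the nested ReplicationMonitor stand-in, the Fix enum 
and the closesCleanly() helper are all hypothetical names chosen for the 
sketch.  Only replthread, replmon, close() and computeDatanodeWork() are taken 
from the description above.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CloseRaceSketch {
    /** Hypothetical stand-in for the replication monitor object. */
    static class ReplicationMonitor {
        void computeDatanodeWork() { /* no real work in this sketch */ }
    }

    /** The two options discussed above (names are assumptions). */
    enum Fix { JOIN_THEN_NULL, KEEP_REFERENCE }

    private volatile ReplicationMonitor replmon = new ReplicationMonitor();
    private final AtomicReference<Throwable> failure = new AtomicReference<>();
    private final Thread replthread;

    CloseRaceSketch() {
        replthread = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // An NPE would surface here if replmon were nulled
                    // while this thread is still running.
                    replmon.computeDatanodeWork();
                }
            } catch (Throwable t) {
                failure.set(t);
            }
        });
        replthread.start();
    }

    void close(Fix fix) throws InterruptedException {
        replthread.interrupt();
        if (fix == Fix.JOIN_THEN_NULL) {
            replthread.join();  // worker is dead, so nulling is now safe
            replmon = null;
        }
        // KEEP_REFERENCE: leave replmon alone; the worker exits by itself.
    }

    /** Returns true if the worker thread ended without any exception. */
    static boolean closesCleanly(Fix fix) {
        try {
            CloseRaceSketch fs = new CloseRaceSketch();
            Thread.sleep(10);   // let the worker spin for a moment
            fs.close(fix);
            fs.replthread.join();
            return fs.failure.get() == null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        for (Fix fix : Fix.values()) {
            System.out.println(fix + " clean=" + closesCleanly(fix));
        }
    }
}
```

Under these assumptions, both variants let the worker finish without touching 
a null reference.  KEEP_REFERENCE avoids blocking in close(), which matches the 
reasoning above that the sibling threads are not waited on either.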

I'll attach a patch for .205.


  was:
TestHDFSServerPorts was observed to intermittently throw a 
NullPointerException.  This only happens when FSNamesystem.close() is called, 
which means system termination for the Namenode, so this is not a serious bug 
for .204.  TestHDFSServerPorts is more likely than normal execution to 
stimulate the race, because it runs two Namenodes in the same JVM, causing more 
interleaving and more potential to see a race condition.

The race is in FSNamesystem.close().  At line 566, we have:
      if (replthread != null) replthread.interrupt();
      if (replmon != null) replmon = null;

Since the interrupted replthread is not waited on, replmon can be nulled before 
replthread is dead.  replthread still references replmon in 
computeDatanodeWork(), which is where the NullPointerException occurs.

The solution is either to wait on replthread or simply not to null replmon.  
The latter is preferred, since none of the sibling Namenode processing threads 
are waited on in close().

I'll attach a patch for .205.



> race condition in FSNamesystem.close() causes NullPointerException without 
> serious consequence - TestHDFSServerPorts unit test failure
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1878
>                 URL: https://issues.apache.org/jira/browse/HDFS-1878
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.204.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>            Priority: Minor
>             Fix For: 0.20.205.0
>
>
> In 20.204, TestHDFSServerPorts was observed to intermittently throw a 
> NullPointerException.  This only happens when FSNamesystem.close() is called, 
> which means system termination for the Namenode, so this is not a serious bug 
> for .204.  TestHDFSServerPorts is more likely than normal execution to 
> stimulate the race, because it runs two Namenodes in the same JVM, causing 
> more interleaving and more potential to see a race condition.
> The race is in FSNamesystem.close().  At line 566, we have:
>       if (replthread != null) replthread.interrupt();
>       if (replmon != null) replmon = null;
> Since the interrupted replthread is not waited on, replmon can be nulled 
> before replthread is dead.  replthread still references replmon in 
> computeDatanodeWork(), which is where the NullPointerException occurs.
> The solution is either to wait on replthread or simply not to null replmon.  
> The latter is preferred, since none of the sibling Namenode processing 
> threads are waited on in close().
> I'll attach a patch for .205.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
