[jira] [Commented] (YARN-112) Race in localization can cause containers to fail

Robert Joseph Evans (JIRA) Tue, 26 Mar 2013 11:33:20 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614402#comment-13614402
 ]


Robert Joseph Evans commented on YARN-112:
------------------------------------------

I am not really sure that we fixed the underlying issue.  

{code}files.rename(dst_work, destDirPath, Rename.OVERWRITE);{code}

threw an exception because there was something else in that directory already, 
but files.mkdir(destDirPath, cachePerms, false) is supposed to throw a 
FileAlreadyExistsException if the directory already exists.  

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileContext.html#mkdir%28org.apache.hadoop.fs.Path,%20org.apache.hadoop.fs.permission.FsPermission,%20boolean%29

files.rename should never get into this situation if files.mkdir threw the 
exception when it was supposed to.

I tested this and 
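{code}
FileContext lfc = FileContext.getLocalFSFileContext(new Configuration());
Path p = new Path("/tmp/bobby.12345");
FsPermission cachePerms = new FsPermission((short) 0755);
// First call creates the directory (createParent = false).
lfc.mkdir(p, cachePerms, false);
// Second call targets the same, now existing, directory and, per the
// documentation, should throw FileAlreadyExistsException.
lfc.mkdir(p, cachePerms, false);
{code}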

never throws an exception.  We first need to address the bug in FileContext, 
and then we can look at how we can make FSDownload deal with mkdir throwing an 
exception, or whatever the fix ends up being.

I filed HADOOP-9438 for this.

If the fix ends up being that we do not support throwing the exception in 
FileContext, then your current solution looks OK.
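
To make the "mkdir throws, caller retries" option concrete, here is a minimal sketch. It does not use FileContext or FSDownload. It uses java.nio.file.Files.createDirectory, which is documented to throw FileAlreadyExistsException when the target exists, as a stand-in for the behavior mkdir is supposed to have. The class name UniqueDirSketch, the helper claimUniqueDir, and the maxAttempts parameter are all hypothetical and only for illustration.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

public class UniqueDirSketch {

    // Hypothetical helper (not part of FSDownload): claims a directory named
    // after a random long under 'parent'. Creation is used as the atomic
    // "claim", so a name collision surfaces as an exception and we retry.
    static Path claimUniqueDir(Path parent, Random rand, int maxAttempts)
            throws IOException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Path candidate = parent.resolve(Long.toString(rand.nextLong()));
            try {
                // Throws FileAlreadyExistsException if 'candidate' exists.
                return Files.createDirectory(candidate);
            } catch (FileAlreadyExistsException e) {
                // Another localizer already owns this name; draw a new one.
            }
        }
        throw new IOException("No free directory name after "
                + maxAttempts + " attempts under " + parent);
    }

    public static void main(String[] args) throws IOException {
        Path parent = Files.createTempDirectory("localize-sketch");
        // Two generators with the same seed force a collision on the first draw.
        Path first = claimUniqueDir(parent, new Random(42L), 10);
        Path second = claimUniqueDir(parent, new Random(42L), 10);
        System.out.println("first  = " + first);
        System.out.println("second = " + second);
    }
}
```

Whether FSDownload should retry like this, or simply fail the download, depends on how HADOOP-9438 is resolved. The sketch only shows that a throwing mkdir lets the caller detect the collision before any rename happens.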

I also have a hard time believing that we are getting random collisions on a 
long value that should be fairly uniformly distributed.  We need to guard 
against it either way and I suppose it is possible, but if I remember correctly 
we were seeing a significant number of these errors and my gut tells me that 
there is either something very wrong with Random, or there is something else 
also going on here.
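
As a rough check on that intuition, the birthday approximation gives the chance that at least two of n uniformly drawn values collide in a space of 2^bits values: p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)). The sketch below evaluates it. It assumes ideal, independent, uniform draws, which is a simplification. The 64-bit case is the idealized long. The 48-bit case is included because the java.util.Random Javadoc notes that its 48-bit seed means nextLong() cannot return every possible long. The class and method names are hypothetical.

```java
public class CollisionOdds {

    // Birthday approximation for the probability that 'draws' independent,
    // uniform samples from a space of 2^bits values contain a duplicate.
    static double collisionProbability(long draws, int bits) {
        double n = (double) draws;
        double space = Math.pow(2.0, bits);
        // -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
        return -Math.expm1(-n * (n - 1.0) / (2.0 * space));
    }

    public static void main(String[] args) {
        long[] counts = {1_000L, 1_000_000L, 1_000_000_000L};
        for (long n : counts) {
            System.out.printf("%,d draws: 64-bit p=%.3e, 48-bit p=%.3e%n",
                    n, collisionProbability(n, 64), collisionProbability(n, 48));
        }
    }
}
```

Under these assumptions, a million draws from a full 64-bit space collide with probability on the order of 10^-8, so a significant number of observed failures would point to something other than plain bad luck.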
                
> Race in localization can cause containers to fail
> -------------------------------------------------
>
>                 Key: YARN-112
>                 URL: https://issues.apache.org/jira/browse/YARN-112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Assignee: omkar vinit joshi
>         Attachments: yarn-112-20130325.1.patch, yarn-112-20130325.patch, 
> yarn-112.20131503.patch
>
>
> On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
> two map tasks of a MR job, that were launched almost simultaneously on the 
> same node.  It appears they both tried to localize job.jar and job.xml at the 
> same time.  One of the containers failed when it couldn't rename the 
> temporary job.jar directory to its final name because the target directory 
> wasn't empty.  Shortly afterwards the second container failed because job.xml 
> could not be found, presumably because the first container removed it when it 
> cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
