[ 
https://issues.apache.org/jira/browse/YARN-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608333#comment-13608333
 ] 

omkar vinit joshi commented on YARN-112:
----------------------------------------

This problem occurs mainly because the createDir call on FileContext does not 
throw an exception when the file system is RawLocalFileSystem. If the 
directory is already present, a new createDir call silently returns instead of 
throwing an exception. This causes a race condition when two containers try to 
localize at the same time and get the same random number. The rename call, 
however, is atomic, so we should use it to avoid the race condition.
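
To illustrate the difference, here is a minimal sketch. It uses plain 
java.nio.file calls as a stand-in for FileContext on RawLocalFileSystem, so it 
only mirrors the behaviour described above and is not the NodeManager code. 
The class and method names (ClaimSketch, claimByMkdir, claimByRename, demo) 
are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ClaimSketch {
    // Mirrors the silent createDir: returns normally even if dir already exists,
    // so every caller believes it owns the directory.
    static boolean claimByMkdir(Path dir) {
        try {
            Files.createDirectories(dir);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // A rename consumes its source, so a second rename of the same source fails.
    static boolean claimByRename(Path from, Path to) {
        try {
            Files.move(from, to);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Runs both strategies twice on the same directory name (single-threaded).
    static String demo() {
        try {
            Path base = Files.createTempDirectory("yarn112");
            Path r1 = base.resolve("r1");
            Path r1Tmp = base.resolve("r1_tmp");
            boolean mkdirA = claimByMkdir(r1);
            boolean mkdirB = claimByMkdir(r1);          // also reports success
            boolean renameA = claimByRename(r1, r1Tmp);
            boolean renameB = claimByRename(r1, r1Tmp); // r1 is already gone
            return mkdirA + "," + mkdirB + "," + renameA + "," + renameB;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both mkdir-style calls report success, while only the first rename does, which 
is why the rename is the better point to decide which container owns r1.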

Earlier implementation
1) generate a random number (r1)
2) check if r1 is present; if present go to 1) else go to 3)
3) create directories r1 and r1_tmp
4) copy the files into r1_tmp
5) rename r1_tmp to r1 (this is an atomic call and only one thread will 
succeed; the rest of them will fail. The error listed is just one of the 
errors which might be logged).


Suggested Fix
1) generate a random number (r1)
2) check if r1 is present; if present go to 1) else go to 3)
3) create dir r1
4) rename r1 to r1_tmp (only one thread will succeed; the rest of the threads 
will get an exception and go back to 1)
5) check whether any file exists inside r1_tmp; if so, rename it back to r1 
and go to 1), else go to 6). (This check is needed because two threads may get 
the same random number and both pass check 2). If one thread then completely 
finishes its download, it renames r1_tmp back to r1, so the other thread's 
rename call (r1 to r1_tmp) will now succeed. This should be avoided, and we can 
avoid it by checking the contents of r1_tmp.)
6) create r1
7) continue with the actual file download
8) rename r1_tmp to r1
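
The suggested steps can be sketched roughly as follows. This is only an 
illustration under assumptions: it uses java.nio.file instead of FileContext, 
the names LocalizeSketch, Download, localize, isEmpty and demo are hypothetical, 
and the final move with REPLACE_EXISTING is not guaranteed to be atomic in 
java.nio the way the rename discussed above is.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.LongSupplier;

final class LocalizeSketch {
    // Hypothetical callback standing in for the actual file download.
    interface Download { void into(Path dir) throws IOException; }

    static Path localize(Path base, LongSupplier random, Download download) {
        try {
            while (true) {
                String name = Long.toString(random.getAsLong());  // step 1
                Path r1 = base.resolve(name);
                Path r1Tmp = base.resolve(name + "_tmp");
                if (Files.exists(r1)) continue;                   // step 2
                Files.createDirectories(r1);                      // step 3 (silent if present)
                try {
                    Files.move(r1, r1Tmp);                        // step 4: one winner
                } catch (IOException lostRace) {
                    continue;                                     // back to step 1
                }
                if (!isEmpty(r1Tmp)) {                            // step 5
                    Files.move(r1Tmp, r1);                        // finished download: put it back
                    continue;
                }
                Files.createDirectories(r1);                      // step 6: placeholder
                download.into(r1Tmp);                             // step 7
                Files.move(r1Tmp, r1, StandardCopyOption.REPLACE_EXISTING); // step 8
                return r1;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean isEmpty(Path dir) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            return !entries.iterator().hasNext();
        }
    }

    // Single-threaded usage: the second call draws 7 again and must retry with 8.
    static String demo() {
        try {
            Path base = Files.createTempDirectory("yarn112");
            long[] draws = {7, 7, 8};
            int[] next = {0};
            LongSupplier random = () -> draws[next[0]++];
            Download job = dir -> Files.write(dir.resolve("job.xml"), new byte[0]);
            Path first = localize(base, random, job);
            Path second = localize(base, random, job);
            return first.getFileName() + "," + second.getFileName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The demo only exercises the retry at step 2. The interleavings handled by 
steps 4 and 5 need two concurrent callers and are not reproduced here.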


                
> Race in localization can cause containers to fail
> -------------------------------------------------
>
>                 Key: YARN-112
>                 URL: https://issues.apache.org/jira/browse/YARN-112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>            Assignee: omkar vinit joshi
>
> On one of our 0.23 clusters, I saw a case of two containers, corresponding to 
> two map tasks of a MR job, that were launched almost simultaneously on the 
> same node.  It appears they both tried to localize job.jar and job.xml at the 
> same time.  One of the containers failed when it couldn't rename the 
> temporary job.jar directory to its final name because the target directory 
> wasn't empty.  Shortly afterwards the second container failed because job.xml 
> could not be found, presumably because the first container removed it when it 
> cleaned up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira