[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15656671#comment-15656671
 ] 

Bibin A Chundatt commented on YARN-5867:
----------------------------------------

Thank you [~jlowe] for looking into the issue.

Sorry, I missed mentioning the bad disk scenario. The following sequence of steps 
could also happen in an actual cluster.
# A bad disk (1 of the disks) was shown in the RM UI due to a hardware fault.
# The disk was formatted and mounted again, or a new disk was added.
# After the 2 min interval the node was shown as healthy in the RM UI. (The admin 
will also think the server is healthy.)
# But containers will start failing randomly.

I will implement a patch based on solution 1 and upload it soon. Additional logging 
mentioning that the {{nmlocal}} folder was created in {{DirectoryCollection#testDirs}} 
will be included.


> DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir
> ---------------------------------------------------------------------------
>
>                 Key: YARN-5867
>                 URL: https://issues.apache.org/jira/browse/YARN-5867
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>
> Steps to reproduce
> ===============
> # Set the umask to 077 for the user
> # Start the nodemanager with the nmlocal dir configured
> The nmlocal dir permission is *755*, as set in 
> {{LocalDirsHandlerService#serviceInit}}:
> {code} 
>     FsPermission perm = new FsPermission((short)0755);
>     boolean createSucceeded = localDirs.createNonExistentDirs(localFs, perm);
>     createSucceeded &= logDirs.createNonExistentDirs(localFs, perm);
> {code}
> # After startup, delete the nmlocal dir and wait for {{MonitoringTimerTask}} 
> to run (simulated here using delete)
> # Now check the permission of the {{nmlocal dir}}; it will be *700*
> *Root Cause*
> {{DirectoryCollection#testDirs}} performs the following check:
> {code}
>         // create a random dir to make sure fs isn't in read-only mode
>         verifyDirUsingMkdir(testDir);
> {code}
> which causes a new random directory to be created in {{localdir}} using
> {{DiskChecker.checkDir(dir)}} -> {{!mkdirsWithExistsCheck(dir)}}, causing the 
> nmlocal dir to be created with the wrong permission, *700*.
> A few applications fail at container launch due to permission denied.
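
The root cause above can be illustrated outside of Hadoop with plain {{java.nio}}. This is only a sketch, not the patch for this issue: the class name {{LocalDirPermissionSketch}} and the helper {{ensureDirWithPermissions}} are hypothetical, and it assumes a POSIX file system. It shows that a parent directory recreated as a side effect of creating a child gets whatever mode the process umask allows (700 under umask 077), while an explicit permission call afterwards is not masked.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class LocalDirPermissionSketch {

    /**
     * Hypothetical helper (not a Hadoop API): create the directory if it is
     * missing, then force the requested permissions on it.
     */
    static Path ensureDirWithPermissions(Path dir, String perms) throws IOException {
        Set<PosixFilePermission> wanted = PosixFilePermissions.fromString(perms);
        // The mode of a directory created here is still subject to the umask.
        Files.createDirectories(dir);
        // An explicit permission change is not filtered by the umask.
        Files.setPosixFilePermissions(dir, wanted);
        return dir;
    }

    /** Returns the permissions of dir in "rwxr-xr-x" form. */
    static String permissionsOf(Path dir) throws IOException {
        return PosixFilePermissions.toString(Files.getPosixFilePermissions(dir));
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("nmlocal-sketch");
        Path nmLocal = base.resolve("nmlocal");

        // Stand-in for the health check: creating a random test dir also
        // creates the missing parent, with a umask-dependent mode.
        Files.createDirectories(nmLocal.resolve("random-test-dir"));
        System.out.println("implicit: " + permissionsOf(nmLocal));

        // Restoring the intended mode explicitly.
        ensureDirWithPermissions(nmLocal, "rwxr-xr-x");
        System.out.println("explicit: " + permissionsOf(nmLocal));
    }
}
```

Running this under {{umask 077}} should make the difference between the two printed lines visible. Whether the actual fix sets the permission explicitly or avoids recreating the directory during the health check is a separate design decision for the patch.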



