[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654405#comment-15654405
 ] 

Jason Lowe commented on YARN-5867:
----------------------------------

I'm curious how the top-level local directory was deleted in the first place.  
It sounds like an incorrect setup, like tmpwatch or something was coming along 
and blowing away NM directories.  Arbitrary removal of NM directories while the 
NM is running is going to cause container failures at a minimum.

I'm somewhat torn on this.  Part of me thinks it would be best to treat this 
case like a bad disk, since something _clearly_ is wrong when top-level 
directories go missing out of the blue.  Either admins set up something wrong on 
the cluster or the filesystem is having difficulty persisting data.  Both are 
bad.  Someone should really look into it; otherwise, if we keep silently trying 
to fix it up after the fact, we just move the problem to debugging 
mysteriously failing containers.  However, I can see the benefit of not forcing 
an admin to intervene, as the node can hobble along automatically (with degraded 
performance due to reruns of mysteriously crashing containers).

If we do go with solution 1, we need to log an error when we detect it.
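
As a rough illustration of "log an error when we detect it", the sketch below 
scans a list of configured directories and logs each one that has gone 
missing.  It is not the NodeManager code: the class name 
MissingLocalDirSketch and the method findMissingDirs are hypothetical, and it 
uses plain java.nio and java.util.logging rather than Hadoop's own classes.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper, not part of Hadoop.
class MissingLocalDirSketch {
    private static final Logger LOG =
        Logger.getLogger(MissingLocalDirSketch.class.getName());

    /**
     * Returns the configured directories that no longer exist as
     * directories, logging an error for each one so the disappearance
     * is visible to an admin instead of being repaired silently.
     */
    static List<Path> findMissingDirs(List<Path> configuredDirs) {
        List<Path> missing = new ArrayList<>();
        for (Path dir : configuredDirs) {
            if (!Files.isDirectory(dir)) {
                LOG.severe("Configured local directory " + dir
                    + " is missing; it may have been removed externally");
                missing.add(dir);
            }
        }
        return missing;
    }
}
```

The caller can then decide what to do with the returned list: recreate the 
directories, or treat them like bad disks.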

> DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir
> ---------------------------------------------------------------------------
>
>                 Key: YARN-5867
>                 URL: https://issues.apache.org/jira/browse/YARN-5867
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>
> Steps to reproduce
> ===============
> # Set umask to 077 for user
> # Start nodemanager with nmlocal dir configured
> nmlocal dir permission is *755* 
> {{LocalDirsHandlerService#serviceInit}}
> {code} 
>     FsPermission perm = new FsPermission((short)0755);
>     boolean createSucceeded = localDirs.createNonExistentDirs(localFs, perm);
>     createSucceeded &= logDirs.createNonExistentDirs(localFs, perm);
> {code}
> # After startup, delete the nmlocal dir and wait for {{MonitoringTimerTask}} 
> to run (simulating the problem with a manual delete)
> # Now check the permission of {{nmlocal dir}}; it will be *700*
> *Root Cause*
> {{DirectoryCollection#testDirs}} checks as follows
> {code}
>         // create a random dir to make sure fs isn't in read-only mode
>         verifyDirUsingMkdir(testDir);
> {code}
> which causes a new random directory to be created in {{localdir}} using
> {{DiskChecker.checkDir(dir)}} -> {{!mkdirsWithExistsCheck(dir)}}, causing the 
> nmlocal dir to be recreated with the wrong permission, *700*.
> A few applications fail at container launch due to permission denied.
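
The root cause in the quoted description comes down to a directory being 
created with whatever mode the process umask allows.  The sketch below shows 
one generic way to avoid depending on the umask: create the directory, then 
set the permissions explicitly.  It is only an illustration using plain 
java.nio on a POSIX filesystem, not Hadoop's DiskChecker or FsPermission 
code, and the class name ExplicitPermDirSketch and the method 
createDirWithPerms are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper, not part of Hadoop. Requires a POSIX filesystem.
class ExplicitPermDirSketch {
    /**
     * Creates dir (and any missing parents) and then applies the
     * requested permissions, e.g. "rwxr-xr-x" for 755.
     */
    static Path createDirWithPerms(Path dir, String perms) throws IOException {
        Set<PosixFilePermission> wanted = PosixFilePermissions.fromString(perms);
        // The mode of a newly created directory is filtered by the umask,
        // so with umask 077 this yields 700.
        Files.createDirectories(dir);
        // An explicit chmod is not filtered by the umask.
        Files.setPosixFilePermissions(dir, wanted);
        return dir;
    }
}
```

With this approach the resulting mode is the same whether the user's umask is 
022 or 077, at the cost of a short window in which the directory exists with 
the umask-derived mode.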



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

