[jira] [Commented] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2016-11-11 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657228#comment-15657228
 ] 

Bibin A Chundatt commented on YARN-5867:


[~jlowe]
{quote}
If the disk was wiped and re-introduced then this may be more complicated than 
just fixing the one directory. The NM creates quite a few directories for each 
disk on startup with various permissions, 
{quote}
Totally agree with you; a lot of cases need to be handled, and admins have to 
handle them carefully.

IMO we could handle the scenarios that can be solved (nmlocal and nmlog). 
If not, I am open to closing this issue.




> DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir
> ---
>
> Key: YARN-5867
> URL: https://issues.apache.org/jira/browse/YARN-5867
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Steps to reproduce
> ===
> # Set umask to 077 for user
> # Start nodemanager with nmlocal dir configured
> nmlocal dir permission is *755* 
> {{LocalDirsHandlerService#serviceInit}}
> {code} 
> FsPermission perm = new FsPermission((short)0755);
> boolean createSucceeded = localDirs.createNonExistentDirs(localFs, perm);
> createSucceeded &= logDirs.createNonExistentDirs(localFs, perm);
> {code}
> # After startup, delete the nmlocal dir and wait for {{MonitoringTimerTask}} 
> to run (simulated using delete)
> # Now check the permission of {{nmlocal dir}}; it will be *700*
> *Root Cause*
> {{DirectoryCollection#testDirs}} checks as follows
> {code}
> // create a random dir to make sure fs isn't in read-only mode
> verifyDirUsingMkdir(testDir);
> {code}
> which causes a new random directory to be created in {{localdir}} using
> {{DiskChecker.checkDir(dir)}} -> {{!mkdirsWithExistsCheck(dir)}}, causing the 
> nmlocal dir to be recreated with the wrong permission, *700*.
> A few applications then fail at container launch due to permission denied.
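
The permission arithmetic behind the root cause can be sketched as follows. This is a minimal illustration, not Hadoop code: the class and method names are hypothetical, and it assumes the directory is created through a plain mkdir that requests mode 0777 and lets the process umask strip bits, which is the usual behaviour on Unix-like systems.

```java
// Illustrative only: models how a process umask masks the mode requested by mkdir.
// UmaskDemo and effectiveMode are hypothetical names, not part of Hadoop.
public class UmaskDemo {

    /** Returns the effective mode: the requested bits with the umask bits cleared. */
    static int effectiveMode(int requested, int umask) {
        return requested & ~umask & 0777;
    }

    public static void main(String[] args) {
        // A plain mkdir typically requests 0777; the umask then removes bits.
        System.out.println(Integer.toOctalString(effectiveMode(0777, 077))); // prints 700
        System.out.println(Integer.toOctalString(effectiveMode(0777, 027))); // prints 750
        System.out.println(Integer.toOctalString(effectiveMode(0777, 022))); // prints 755
    }
}
```

With umask 077 a directory recreated this way ends up as 700, so users other than the NM user can no longer traverse it.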



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2016-11-11 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657124#comment-15657124
 ] 

Jason Lowe commented on YARN-5867:
--

If the disk was wiped and re-introduced then this may be more complicated than 
just fixing the one directory.  The NM creates quite a few directories for each 
disk on startup with various permissions, and we'd need to ensure that all of 
them get properly recreated when the top-level directory is detected as missing.







[jira] [Commented] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2016-11-11 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15656671#comment-15656671
 ] 

Bibin A Chundatt commented on YARN-5867:


Thank you [~jlowe] for looking into the issue.

Sorry, I missed mentioning the bad-disk scenario. The following sequence of 
steps could also happen in an actual cluster.
# A bad disk was shown in the RM UI due to a hardware fault (one of the disks).
# The disk was formatted and mounted again, or a new disk was added.
# After the 2 min interval the node was shown as healthy in the RM UI (so the 
admin will also think the server is healthy).
# But containers will start failing randomly.

I will implement a patch based on solution 1 and upload it soon. It will include 
additional logging stating that the {{nmlocal}} folder was created in 
{{DirectoryCollection#testDirs}}.








[jira] [Commented] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2016-11-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654405#comment-15654405
 ] 

Jason Lowe commented on YARN-5867:
--

I'm curious how the top-level local directory was deleted in the first place.  
It sounds like an incorrect setup, like tmpwatch or something was coming along 
and blowing away NM directories.  Arbitrary removal of NM directories while it 
is running is going to cause container failures at a minimum.

I'm somewhat torn on this.  Part of me thinks it would be best to treat this 
case like a bad disk, since something _clearly_ is wrong when top-level 
directories go missing out of the blue.  Either admins set up something wrong on 
the cluster or the filesystem is having difficulty persisting data.  Both are 
bad.  Someone should really look into it, otherwise if we keep silently trying 
to fix it up after the fact then we just move the issue to debugging 
mysteriously failing containers.  However I can see the benefits of not forcing 
an admin to intervene, as it can hobble along automatically (with degraded 
performance due to reruns of mysteriously crashing containers).

If we do go with solution 1, we need to log an error when we detect it.
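
The detect-and-log step could be sketched roughly like this. It is only an illustration under assumptions: {{MissingDirCheck}} and {{findMissing}} are hypothetical names, and it uses java.util.logging and java.nio.file rather than the logging and filesystem classes the NM actually uses.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch: report configured top-level dirs that have gone missing.
public class MissingDirCheck {
    private static final Logger LOG = Logger.getLogger(MissingDirCheck.class.getName());

    /** Returns the configured dirs that no longer exist, logging an error for each. */
    static List<Path> findMissing(List<Path> configuredDirs) {
        List<Path> missing = new ArrayList<>();
        for (Path dir : configuredDirs) {
            if (!Files.isDirectory(dir)) {
                // Make the problem visible instead of silently repairing it.
                LOG.severe("Configured directory " + dir
                    + " is missing; it may have been deleted or the disk replaced");
                missing.add(dir);
            }
        }
        return missing;
    }
}
```

The caller can then decide whether to recreate the returned directories (solution 1) or to mark them as bad.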







[jira] [Commented] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2016-11-10 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654312#comment-15654312
 ] 

Bibin A Chundatt commented on YARN-5867:


cc/ [~jlowe] and [~vvasudev]. Could you please share your thoughts too?







[jira] [Commented] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2016-11-10 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653928#comment-15653928
 ] 

Bibin A Chundatt commented on YARN-5867:


[~naganarasimha...@apache.org]
Not related to appcache. This is for the root directory, i.e. the configured 
nmlocal dir.
{quote}
User with which NM is run ?
{quote}
The umask of the user that started the NM is 077.







[jira] [Commented] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2016-11-10 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653904#comment-15653904
 ] 

Naganarasimha G R commented on YARN-5867:
-

I think it's kind of related to YARN-5765 and YARN-5287, with restricted rights 
on the user. Not sure which user you are referring to here: the user with which 
the NM is run?

> DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir
> ---
>
> Key: YARN-5867
> URL: https://issues.apache.org/jira/browse/YARN-5867
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Steps to reproduce
> ===
> # Set umask to 077 for user
> # Start nodemanager with nmlocal dir configured
> nmlocal dir permission is *755* 
> {{LocalDirsHandlerService#serviceInit}}
> {code} 
> FsPermission perm = new FsPermission((short)0755);
> boolean createSucceeded = localDirs.createNonExistentDirs(localFs, perm);
> createSucceeded &= logDirs.createNonExistentDirs(localFs, perm);
> {code}
> # After startup, delete the nmlocal dir and wait for {{MonitoringTimerTask}} 
> to run (simulated using delete)
> # Now check the permission of {{nmlocal dir}}; it will be *700*
> *Root Cause*
> {{DirectoryCollection#testDirs}} checks as follows
> {code}
> // create a random dir to make sure fs isn't in read-only mode
> verifyDirUsingMkdir(testDir);
> {code}
> which causes a new random directory to be created in {{localdir}} using
> {{DiskChecker.checkDir(dir)}} -> {{!mkdirsWithExistsCheck(dir)}}, causing the 
> nmlocal dir to be recreated with the wrong permission, *700*.
> A few applications then fail at container launch due to permission denied.






[jira] [Commented] (YARN-5867) DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir

2016-11-10 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653868#comment-15653868
 ] 

Bibin A Chundatt commented on YARN-5867:


*Solution*
# Before testDirs() runs, check each localdir and, if it is missing, try to 
create it with *0755* permission.
# Create the random test directory only if the localdir exists, so that a 
missing localdir is considered bad.

In my opinion we should use *Solution 1*, since it makes the NM auto-recoverable. 
Thoughts?
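
A rough sketch of what Solution 1 amounts to, for discussion only. It is not the patch: {{EnsureLocalDir}} and {{ensureDir}} are hypothetical names, and it uses java.nio.file on a POSIX filesystem instead of the Hadoop FileContext/DiskChecker calls the NM really uses. The point is that the mode is set explicitly after creation, so the umask cannot narrow it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical sketch of Solution 1: recreate a missing local dir with 0755
// before the random-subdirectory write test runs.
public class EnsureLocalDir {
    // 0755, the mode requested in LocalDirsHandlerService#serviceInit.
    static final Set<PosixFilePermission> PERM_755 =
        PosixFilePermissions.fromString("rwxr-xr-x");

    /** Recreates dir with 0755 if it is missing; returns true if it was recreated. */
    static boolean ensureDir(Path dir) throws IOException {
        if (Files.isDirectory(dir)) {
            return false; // already present, leave its permissions alone
        }
        Files.createDirectories(dir);
        // Creation-time modes are masked by the umask, so set the mode explicitly.
        Files.setPosixFilePermissions(dir, PERM_755);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nm-demo").resolve("nm-local-dir");
        System.out.println("recreated: " + ensureDir(dir));
        System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(dir)));
    }
}
```

Only the leaf directory gets the explicit mode here; any parent directories created along the way still follow the umask.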


> DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir
> ---
>
> Key: YARN-5867
> URL: https://issues.apache.org/jira/browse/YARN-5867
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>
> Steps to reproduce
> ===
> # Set umask to 027 for user
> # Start nodemanager with nmlocal dir configured
> nmlocal dir permission is *755* 
> {{LocalDirsHandlerService#serviceInit}}
> {code} 
> FsPermission perm = new FsPermission((short)0755);
> boolean createSucceeded = localDirs.createNonExistentDirs(localFs, perm);
> createSucceeded &= logDirs.createNonExistentDirs(localFs, perm);
> {code}
> # After startup, delete the nmlocal dir and wait for {{MonitoringTimerTask}} 
> to run (simulated using delete)
> # Now check the permission of {{nmlocal dir}}; it will be *750*
> *Root Cause*
> {{DirectoryCollection#testDirs}} checks as follows
> {code}
> // create a random dir to make sure fs isn't in read-only mode
> verifyDirUsingMkdir(testDir);
> {code}
> which causes a new random directory to be created in {{localdir}} using
> {{DiskChecker.checkDir(dir)}} -> {{!mkdirsWithExistsCheck(dir)}}, causing the 
> nmlocal dir to be recreated with the wrong permission, *750*.
> A few applications then fail at container launch due to permission denied.


