[ https://issues.apache.org/jira/browse/YARN-5867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15656671#comment-15656671 ]
Bibin A Chundatt commented on YARN-5867:
----------------------------------------

Thank you [~jlowe] for looking into the issue.

Sorry, I missed mentioning the bad disk scenario. The following sequence of steps could also happen in an actual cluster.
# A bad disk was shown in the RM UI due to a hardware fault (1 of the disks).
# The disk was formatted and mounted again, or a new disk was added.
# After the 2 min interval, the node was shown as healthy in the RM UI (the admin will also think the server is healthy).
# But containers will start failing randomly.

I will implement a patch based on solution 1 and upload it soon.
Additional logging mentioning that the {{nmlocal}} folder is created in {{DirectoryCollection#testDirs}} will be included.

> DirectoryCollection#checkDirs can cause incorrect permission of nmlocal dir
> ---------------------------------------------------------------------------
>
>                 Key: YARN-5867
>                 URL: https://issues.apache.org/jira/browse/YARN-5867
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>
> Steps to reproduce
> ===============
> # Set umask to 077 for user
> # Start nodemanager with nmlocal dir configured
> nmlocal dir permission is *755*
> {{LocalDirsHandlerService#serviceInit}}
> {code}
> FsPermission perm = new FsPermission((short)0755);
> boolean createSucceeded = localDirs.createNonExistentDirs(localFs, perm);
> createSucceeded &= logDirs.createNonExistentDirs(localFs, perm);
> {code}
> # After startup delete the nmlocal dir and wait for {{MonitoringTimerTask}} to run (simulation using delete)
> # Now check the permission of {{nmlocal dir}}; it will be *700*
> *Root Cause*
> {{DirectoryCollection#testDirs}} checks as follows
> {code}
> // create a random dir to make sure fs isn't in read-only mode
> verifyDirUsingMkdir(testDir);
> {code}
> which causes a new random directory to be created in {{localdir}} using
> {{DiskChecker.checkDir(dir)}} -> {{!mkdirsWithExistsCheck(dir)}}, causing the
> nmlocal dir to be created with wrong permission.
> *700*
> A few applications fail at container launch due to permission denied.
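
To illustrate the root cause described in the issue: a directory created through a plain mkdir gets its mode filtered by the process umask, so with umask 077 a recreated nmlocal dir ends up as 700 instead of the intended 755. The following is a minimal standalone sketch, not the actual patch and not Hadoop code. The class {{LocalDirPermissionSketch}} and the helper {{ensureDirWithMode}} are hypothetical names used only for illustration. It assumes a POSIX file system and shows one way to make the resulting mode independent of the umask by setting it explicitly after creation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical illustration only; not part of Hadoop or of the YARN-5867 patch.
class LocalDirPermissionSketch {

    // Creates the directory if it is missing, then applies the requested mode
    // explicitly so the final permissions do not depend on the process umask.
    // Throws UnsupportedOperationException on non-POSIX file systems.
    static Set<PosixFilePermission> ensureDirWithMode(Path dir, String mode)
            throws IOException {
        Set<PosixFilePermission> wanted = PosixFilePermissions.fromString(mode);
        // The mode of a newly created directory is filtered by the umask
        // (with umask 077 it would come out as rwx------).
        Files.createDirectories(dir);
        // An explicit chmod is not filtered by the umask.
        Files.setPosixFilePermissions(dir, wanted);
        return Files.getPosixFilePermissions(dir);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a deleted and recreated nmlocal dir.
        Path base = Files.createTempDirectory("nmlocal-sketch");
        Path nmLocal = base.resolve("nmlocal");
        Set<PosixFilePermission> actual = ensureDirWithMode(nmLocal, "rwxr-xr-x");
        System.out.println(PosixFilePermissions.toString(actual));
    }
}
```

Running this with umask set to 077 should still report the requested mode for the directory, because the permissions are applied after creation rather than left to mkdir.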