[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555819#comment-14555819 ]
Lavkesh Lahngir commented on YARN-3591:
---------------------------------------

Hm.. Got your point. Is the DirectoryCollection class a good place to add newErrorDirs and newRepairedDirs?
So finally, this is my understanding; please correct me if I am wrong.
Def:
newErrorDirs -> Dirs which turned bad from localdirs or fulldirs.
newRepairedDirs -> Dirs which turned good from errorDirs.
After calling checkLocalizedResources() with localdirs and fulldirs, we can call {code}cleanUpLocalDir(lfs, del, localDir);{code} on newRepairedDirs.
We will put newErrorDirs into the statestore so that when the NM restarts it can do a cleanup. We also need to remove them from the statestore if they become repaired.

> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch,
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
>
>
> It happens when a resource is localised on a disk and that disk goes bad
> after localisation. NM keeps paths for localised resources in memory. At the
> time of a resource request, isResourcePresent(rsrc) will be called, which calls
> file.exists() on the localised path.
> In some cases when the disk has gone bad, inodes are still cached and
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good, it should return an array of
> paths with length at least 1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
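
For illustration only, here is a minimal sketch of the two ideas discussed in this message: computing newErrorDirs and newRepairedDirs from the directory lists before and after a disk check, and the proposed file.list()-based presence check. The class name DirStateDiff and all method signatures are hypothetical and are not taken from the YARN code base or from any attached patch. Directories are modelled as plain strings for simplicity; the real DirectoryCollection may track them differently.

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Hadoop YARN. */
final class DirStateDiff {

  private DirStateDiff() {
  }

  /**
   * Dirs which turned bad: they were in localDirs or fullDirs before the
   * disk check and are in errorDirs after it.
   */
  static Set<String> newErrorDirs(Set<String> prevLocalDirs,
      Set<String> prevFullDirs, Set<String> curErrorDirs) {
    Set<String> result = new HashSet<>(prevLocalDirs);
    result.addAll(prevFullDirs);
    result.retainAll(curErrorDirs);
    return result;
  }

  /**
   * Dirs which turned good: they were in errorDirs before the disk check
   * and are in localDirs or fullDirs after it.
   */
  static Set<String> newRepairedDirs(Set<String> prevErrorDirs,
      Set<String> curLocalDirs, Set<String> curFullDirs) {
    Set<String> curGood = new HashSet<>(curLocalDirs);
    curGood.addAll(curFullDirs);
    Set<String> result = new HashSet<>(prevErrorDirs);
    result.retainAll(curGood);
    return result;
  }

  /**
   * Presence check along the lines of the proposal in the issue
   * description: list the parent directory instead of calling exists().
   * File.list() returns null if the path is not a directory or an I/O
   * error occurs, so null or an empty listing is treated as "not present".
   */
  static boolean isResourcePresent(File resource) {
    File parent = resource.getParentFile();
    if (parent == null) {
      return false;
    }
    String[] children = parent.list();
    return children != null && children.length >= 1;
  }
}
```

Under these assumptions, the caller would run cleanUpLocalDir(lfs, del, localDir) for each entry of newRepairedDirs, add the entries of newErrorDirs to the statestore, and remove the repaired ones from it.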