[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570652#comment-14570652 ]
Lavkesh Lahngir commented on YARN-3591:
---------------------------------------

Thanks [~sunilg] and [~zxu] for the comments and review. I did it slightly differently: I added newRepairedDirs and newErrorDirs to DirectoryCollection. In this version, checkLocalizedResources(dirsTocheck) takes the list of good dirs.

{code:title=DirectoryCollection.java|borderStyle=solid}
+  private List<String> newErrorDirs;
+  private List<String> newRepariedDirs;
   private int numFailures;
@@ -159,6 +161,8 @@ public DirectoryCollection(String[] dirs,
     localDirs = new CopyOnWriteArrayList<String>(dirs);
     errorDirs = new CopyOnWriteArrayList<String>();
     fullDirs = new CopyOnWriteArrayList<String>();
+    newErrorDirs = new CopyOnWriteArrayList<String>();
+    newRepariedDirs = new CopyOnWriteArrayList<String>();
@@ -213,6 +217,20 @@ synchronized int getNumFailures() {
   }
   /**
+   * @return Recently discovered error dirs
+   */
+  synchronized List<String> getNewErrorDirs() {
+    return newErrorDirs;
+  }
+
+  /**
+   * @return Recently discovered repaired dirs
+   */
+  synchronized List<String> getNewRepairedDirs() {
+    return newRepariedDirs;
+  }
+
@@ -259,6 +277,8 @@ synchronized boolean checkDirs() {
     localDirs.clear();
     errorDirs.clear();
     fullDirs.clear();
+    newRepariedDirs.clear();
+    newErrorDirs.clear();
     for (Map.Entry<String, DiskErrorInformation> entry : dirsFailedCheck
         .entrySet()) {
@@ -292,6 +312,11 @@ synchronized boolean checkDirs() {
     }
     Set<String> postCheckFullDirs = new HashSet<String>(fullDirs);
     Set<String> postCheckOtherDirs = new HashSet<String>(errorDirs);
+    for (String dir : preCheckGoodDirs) {
+      if (postCheckOtherDirs.contains(dir)) {
+        newErrorDirs.add(dir);
+      }
+    }
     for (String dir : preCheckFullDirs) {
       if (postCheckOtherDirs.contains(dir)) {
         LOG.warn("Directory " + dir + " error "
@@ -304,6 +329,9 @@ synchronized boolean checkDirs() {
         LOG.warn("Directory " + dir + " error "
             + dirsFailedCheck.get(dir).message);
       }
+      if (localDirs.contains(dir) || postCheckFullDirs.contains(dir)) {
+        newRepariedDirs.add(dir);
+      }
     }
{code}

{code:title=LocalDirsHandlerService.java|borderStyle=solid}
+   * @return Recently added error dirs
+   */
+  public List<String> getDiskNewErrorDirs() {
+    return localDirs.getNewErrorDirs();
+  }
+
+  /**
+   * @return Recently added repaired dirs
+   */
+  public List<String> getDiskNewRepairedDirs() {
+    return localDirs.getNewRepairedDirs();
+  }
{code}

{code:title=ResourceLocalizationService.java|borderStyle=solid}
       @Override
       public void onDirsChanged() {
         checkAndInitializeLocalDirs();
+        List<String> dirsTocheck =
+            new ArrayList<String>(dirsHandler.getLocalDirs());
+        dirsTocheck.addAll(dirsHandler.getDiskFullLocalDirs());
+        // checks if resources are present in the dirsTocheck
+        publicRsrc.checkLocalizedResources(dirsTocheck);
         for (LocalResourcesTracker tracker : privateRsrc.values()) {
+          tracker.checkLocalizedResources(dirsTocheck);
+        }
+        List<String> newRepairedDirs = dirsHandler.getDiskNewRepairedDirs();
+        // Delete any resources found in the newly repaired Dirs.
+        for (String dir : newRepairedDirs) {
+          cleanUpLocalDir(lfs, delService, dir);
         }
+        // Add code here to add errordirs to statestore.
       }
     };
{code}

{code:title=DirectoryCollection.java|borderStyle=solid}
  synchronized List<String> getErrorDirs() {
    return Collections.unmodifiableList(errorDirs);
  }
{code}

We can use getErrorDirs() and keep its result in the NM state as suggested, and on startup we can run cleanUpLocalDir on the error dirs.
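
To make the bookkeeping in checkDirs() easier to follow, here is a minimal standalone sketch of how the two new lists can be derived from the pre-check and post-check directory sets. The class DirStateDiff and its method names are hypothetical and are not part of the patch. It also assumes the repaired test runs over the dirs that were in the error list before the check, which the truncated hunk above does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the YARN patch. */
final class DirStateDiff {

  /** Dirs that were good before the check and are in the error set after it. */
  static List<String> newErrorDirs(List<String> preCheckGoodDirs,
      Set<String> postCheckErrorDirs) {
    List<String> result = new ArrayList<>();
    for (String dir : preCheckGoodDirs) {
      if (postCheckErrorDirs.contains(dir)) {
        result.add(dir);
      }
    }
    return result;
  }

  /**
   * Dirs that were in the error set before the check (assumption) and are
   * either good or merely full after it.
   */
  static List<String> newRepairedDirs(List<String> preCheckErrorDirs,
      Set<String> postCheckGoodDirs, Set<String> postCheckFullDirs) {
    List<String> result = new ArrayList<>();
    for (String dir : preCheckErrorDirs) {
      if (postCheckGoodDirs.contains(dir) || postCheckFullDirs.contains(dir)) {
        result.add(dir);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up directory names: /d2 fails during the check, /d3 recovers.
    List<String> preGood = Arrays.asList("/d1", "/d2");
    List<String> preError = Arrays.asList("/d3");
    Set<String> postGood = new HashSet<>(Arrays.asList("/d1", "/d3"));
    Set<String> postError = new HashSet<>(Arrays.asList("/d2"));
    Set<String> postFull = new HashSet<>();

    System.out.println("new error dirs:    "
        + newErrorDirs(preGood, postError));
    System.out.println("new repaired dirs: "
        + newRepairedDirs(preError, postGood, postFull));
  }
}
```

Keeping the two lists as a per-check delta lets onDirsChanged() clean up only the dirs whose state actually changed, instead of rescanning every configured dir.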
> Resource Localisation on a bad disk causes subsequent containers failure
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
>
>
> This happens when a resource is localised on a disk and the disk goes bad
> afterwards. The NM keeps the paths of localised resources in memory. At the
> time of a resource request, isResourcePresent(rsrc) is called, which calls
> file.exists() on the localised path.
> In some cases, when the disk has gone bad, inodes are still cached and
> file.exists() returns true. But at the time of reading, the file will not open.
> Note: file.exists() actually calls stat64 natively, which returns true because
> it was able to find inode information from the OS.
> A proposal is to call file.list() on the parent path of the resource, which
> will call open() natively. If the disk is good, it should return an array of
> paths with length at least 1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
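
For reference, the file.list() proposal from the issue description could look roughly like the sketch below. The class ResourcePresenceCheck and the method isPresentByListing are hypothetical names, not the NM's actual isResourcePresent implementation. Matching the resource name inside the listing is an extra step beyond the proposal, which only asks for a listing of length at least 1.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

/** Hypothetical sketch of the proposed check; not the actual NM code. */
final class ResourcePresenceCheck {

  /**
   * Lists the parent directory instead of relying on File.exists() alone.
   * File.list() returns null when the path is not a directory or an I/O
   * error occurs, so a null or empty listing is treated as "not present".
   */
  static boolean isPresentByListing(File resource) {
    File parent = resource.getParentFile();
    if (parent == null) {
      return false;
    }
    String[] children = parent.list();
    if (children == null || children.length == 0) {
      return false;
    }
    // Extra step (assumption): also require the resource name in the listing.
    for (String child : children) {
      if (child.equals(resource.getName())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    // Uses a temp directory with a made-up file name for demonstration.
    File dir = Files.createTempDirectory("rsrc").toFile();
    File rsrc = new File(dir, "resource.jar");
    rsrc.createNewFile();
    System.out.println("present: " + isPresentByListing(rsrc));
    rsrc.delete();
    System.out.println("present after delete: " + isPresentByListing(rsrc));
    dir.delete();
  }
}
```

Whether listing the parent reliably surfaces a bad disk depends on the OS and filesystem, so this would need to be verified on an actual failed disk.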