[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

Lavkesh Lahngir (JIRA) Wed, 03 Jun 2015 03:56:17 -0700

    [ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570652#comment-14570652 ]


Lavkesh Lahngir commented on YARN-3591:
---------------------------------------

Thanks [~sunilg] and [~zxu] for the comments and review. I took a slightly 
different approach: I added newRepairedDirs and newErrorDirs to 
DirectoryCollection. 
 
In this version, checkLocalizedResources(dirsTocheck) takes the list of usable 
dirs, that is, the good local dirs plus the disk-full local dirs.

{code:title=DirectoryCollection.java|borderStyle=solid}
+  private List<String> newErrorDirs;
+  private List<String> newRepariedDirs;
 
   private int numFailures;
   
@@ -159,6 +161,8 @@ public DirectoryCollection(String[] dirs,
     localDirs = new CopyOnWriteArrayList<String>(dirs);
     errorDirs = new CopyOnWriteArrayList<String>();
     fullDirs = new CopyOnWriteArrayList<String>();
+    newErrorDirs = new CopyOnWriteArrayList<String>();
+    newRepariedDirs = new CopyOnWriteArrayList<String>();
 
     
@@ -213,6 +217,20 @@ synchronized int getNumFailures() {
   }
 
   /**
+   * @return Recently discovered error dirs
+   */
+  synchronized List<String> getNewErrorDirs() {
+    return newErrorDirs;
+  }
+
+  /**
+   * @return Recently discovered repaired dirs
+   */
+  synchronized List<String> getNewRepairedDirs() {
+    return newRepariedDirs;
+  }
+

@@ -259,6 +277,8 @@ synchronized boolean checkDirs() {
     localDirs.clear();
     errorDirs.clear();
     fullDirs.clear();
+    newRepariedDirs.clear();
+    newErrorDirs.clear();
 
     for (Map.Entry<String, DiskErrorInformation> entry : dirsFailedCheck
       .entrySet()) {
@@ -292,6 +312,11 @@ synchronized boolean checkDirs() {
     }
     Set<String> postCheckFullDirs = new HashSet<String>(fullDirs);
     Set<String> postCheckOtherDirs = new HashSet<String>(errorDirs);
+    for (String dir : preCheckGoodDirs) {
+      if (postCheckOtherDirs.contains(dir)) {
+        newErrorDirs.add(dir);
+      }
+    }
     for (String dir : preCheckFullDirs) {
       if (postCheckOtherDirs.contains(dir)) {
         LOG.warn("Directory " + dir + " error "
@@ -304,6 +329,9 @@ synchronized boolean checkDirs() {
         LOG.warn("Directory " + dir + " error "
             + dirsFailedCheck.get(dir).message);
       }
+      if (localDirs.contains(dir) || postCheckFullDirs.contains(dir)) {
+        newRepariedDirs.add(dir);
+      }
     }
{code}
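
The bookkeeping in the diff above amounts to comparing directory states before and after one disk check. Below is a minimal standalone sketch of that comparison. The class DirTransitions and its method names are hypothetical and not part of the patch, and the sketch assumes the repaired-dir check runs over the dirs that were in the error list before the check.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not in the patch): diffs directory states across one disk check. */
final class DirTransitions {
  private DirTransitions() {}

  /** Dirs that were good before the check and are in the error list after it. */
  static List<String> newErrorDirs(Collection<String> preCheckGoodDirs,
      Collection<String> postCheckErrorDirs) {
    Set<String> postErrors = new HashSet<>(postCheckErrorDirs);
    List<String> result = new ArrayList<>();
    for (String dir : preCheckGoodDirs) {
      if (postErrors.contains(dir)) {
        result.add(dir);
      }
    }
    return result;
  }

  /** Dirs that were in the error list before the check and are good or full after it. */
  static List<String> newRepairedDirs(Collection<String> preCheckErrorDirs,
      Collection<String> postCheckGoodDirs,
      Collection<String> postCheckFullDirs) {
    // A full dir is readable again, so it counts as repaired (assumption taken from the diff).
    Set<String> usable = new HashSet<>(postCheckGoodDirs);
    usable.addAll(postCheckFullDirs);
    List<String> result = new ArrayList<>();
    for (String dir : preCheckErrorDirs) {
      if (usable.contains(dir)) {
        result.add(dir);
      }
    }
    return result;
  }
}
```

Note that a dir moving from full to error is not reported by newErrorDirs in this sketch, which mirrors the loop over preCheckGoodDirs in the diff.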

{code:title=LocalDirsHandlerService.java|borderStyle=solid}
+   * @return Recently added error dirs
+   */
+  public List<String> getDiskNewErrorDirs() {
+    return localDirs.getNewErrorDirs();
+  }
+
+  /**
+   * @return Recently added repaired dirs
+   */
+  public List<String> getDiskNewRepairedDirs() {
+    return localDirs.getNewRepairedDirs();
+  }
{code}

{code:title=ResourceLocalizationService.java|borderStyle=solid}
       @Override
       public void onDirsChanged() {
         checkAndInitializeLocalDirs();
+        List<String> dirsTocheck =
+            new ArrayList<String>(dirsHandler.getLocalDirs());
+        dirsTocheck.addAll(dirsHandler.getDiskFullLocalDirs());
+        // checks if resources are present in the dirsTocheck
+        publicRsrc.checkLocalizedResources(dirsTocheck);
         for (LocalResourcesTracker tracker : privateRsrc.values()) {
+          tracker.checkLocalizedResources(dirsTocheck);
+        }
+        List<String> newRepairedDirs = dirsHandler.getDiskNewRepairedDirs();
+        // Delete any resources found in the newly repaired Dirs.
+        for (String dir : newRepairedDirs) {
+          cleanUpLocalDir(lfs, delService, dir);
         }
+        // Add code here to add errordirs to statestore.
       }
     };
{code}
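
The body of checkLocalizedResources is not shown in this comment. As a rough illustration only, one way such a check could work is to drop every tracked resource whose path is not under any dir in dirsTocheck. The class LocalizedResourceCheck, its method, and the map of resource key to local path are all hypothetical stand-ins, not the tracker's real data structures.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: forget localized resources that are not under a usable dir. */
final class LocalizedResourceCheck {
  private LocalizedResourceCheck() {}

  /**
   * Removes entries whose local path is not under any dir in dirsToCheck.
   * Returns the keys of the removed entries.
   */
  static List<String> dropResourcesOutside(Map<String, String> localizedPaths,
      List<String> dirsToCheck) {
    List<Path> roots = new ArrayList<>();
    for (String dir : dirsToCheck) {
      roots.add(Paths.get(dir).normalize());
    }
    List<String> removed = new ArrayList<>();
    Iterator<Map.Entry<String, String>> it = localizedPaths.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, String> entry = it.next();
      Path path = Paths.get(entry.getValue()).normalize();
      boolean underUsableDir = false;
      for (Path root : roots) {
        // Path.startsWith compares whole name elements, so /d1x is not under /d1.
        if (path.startsWith(root)) {
          underUsableDir = true;
          break;
        }
      }
      if (!underUsableDir) {
        removed.add(entry.getKey());
        it.remove();
      }
    }
    return removed;
  }
}
```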

{code:title=DirectoryCollection.java|borderStyle=solid}
  synchronized List<String> getErrorDirs() {
    return Collections.unmodifiableList(errorDirs);
  }
{code}
We can use getErrorDirs() and keep the result in the NM state store, as 
suggested, and on startup we can call cleanUpLocalDir on the error dirs.
 

> Resource Localisation on a bad disk causes subsequent containers failure 
> -------------------------------------------------------------------------
>
>                 Key: YARN-3591
>                 URL: https://issues.apache.org/jira/browse/YARN-3591
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Lavkesh Lahngir
>            Assignee: Lavkesh Lahngir
>         Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
> YARN-3591.2.patch, YARN-3591.3.patch, YARN-3591.4.patch
>
>
> This happens when a resource is localised on a disk and that disk goes bad 
> after localisation. The NM keeps the paths of localised resources in memory. 
> When a resource is requested, isResourcePresent(rsrc) is called, which calls 
> file.exists() on the localised path.
> In some cases, when the disk has gone bad, the inodes are still cached and 
> file.exists() returns true, but the file cannot be opened for reading.
> Note: file.exists() actually calls stat64 natively, which returns true because 
> it was able to find the inode information from the OS.
> The proposal is to call file.list() on the parent path of the resource, which 
> will call open() natively. If the disk is good, it should return an array of 
> paths with length at least 1.
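
For reference, the check proposed in the description above could look roughly like the sketch below. The class name ResourcePresenceCheck is hypothetical, and the method takes a java.io.File instead of the NM's resource type to keep the example self-contained. It relies on File.list() returning null when the path is not a directory or an I/O error occurs.

```java
import java.io.File;

/** Hypothetical sketch of the proposed presence check; not the NM's actual code. */
final class ResourcePresenceCheck {
  private ResourcePresenceCheck() {}

  static boolean isResourcePresent(File localizedPath) {
    // exists() may be answered from cached inode data, so it is only a first filter.
    if (!localizedPath.exists()) {
      return false;
    }
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return false;
    }
    // list() has to read the directory; it returns null on an I/O error.
    String[] entries = parent.list();
    return entries != null && entries.length >= 1;
  }
}
```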



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
