[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143974#comment-14143974
 ] 

Jason Lowe commented on YARN-90:
--------------------------------

Thanks, Varun!  Comments on the latest patch:

It's a bit odd to have a hash map to map disk error types to lists of 
directories, fill them all in, but we only in practice actually look at one 
type in the map and that's DISK_FULL.  It'd be simpler (and faster and less 
space since there's no hashmap involved) to just track full disks as a separate 
collection like we already do for localDirs and failedDirs.

Nit: DISK_ERROR_CAUSE should be DiskErrorCause (if we keep the enum) to match 
the style of other enum types in the code.

In verifyDirUsingMkdir, if an error occurs during the finally clause then that 
exception will mask the original exception

isDiskUsageUnderPercentageLimit is named backwards.  Disk usage being under the 
configured limit shouldn't be a full disk error, and the error message is 
inconsistent with the method name (method talks about being under but error 
message says its above).
{code}
        if (isDiskUsageUnderPercentageLimit(testDir)) {
          msg =
              "used space above threshold of "
                  + diskUtilizationPercentageCutoff
                  + "%, removing from the list of valid directories.";
{code}

We should only call getDisksHealthReport() once in the following code:
{code}
+    String report = getDisksHealthReport();
+    if (!report.isEmpty()) {
+      LOG.info("Disk(s) failed. " + getDisksHealthReport());
{code}

Should updateDirsAfterTest always say "Disk(s) failed" if the report isn't 
empty?  Thinking of the case where two disks go bad, then one later is 
restored.  The health report will still have something, but that last update is 
a disk turning good not failing.  Before this code was only called when a new 
disk failed, and now that's not always the case.  Maybe it should just be 
something like "Disk health update: " instead?

Is it really necessary to stat a directory before we try to delete it?  Seems 
like we can just try to delete it.

The idiom of getting the directories and adding the full directories seems 
pretty common.  Might be good to have dirhandler methods that already do this, 
like getLocalDirsForCleanup or getLogDirsForCleanup.

I'm a bit worried that getInitializedLocalDirs could potentially try to delete 
an entire directory tree for a disk.  If this fails in some sector-specific way 
but other containers are currently using their files from other sectors just 
fine on the same disk, removing these files from underneath active containers 
could be very problematic and difficult to debug.

> NodeManager should identify failed disks becoming good back again
> -----------------------------------------------------------------
>
>                 Key: YARN-90
>                 URL: https://issues.apache.org/jira/browse/YARN-90
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Ravi Gummadi
>            Assignee: Varun Vasudev
>         Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to