[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-10-31 Thread Hou Song (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810961#comment-13810961
 ] 

Hou Song commented on YARN-90:
--

Hi guys, I have been using my patch for this issue for a very long time. It 
enables the NM to reuse failed disks after they come back, and tt also adds a 
new metric of the number of failed directories so people have a clearer view 
from outside.
For unit tests, I added a test to TestLocalDirsHandlerService that mimics disk 
failure with "chmod 000 failed_dir" and disk repair with "chmod 000 
failed_dir". 
If anyone is interested, I can post the patch here.

> NodeManager should identify failed disks becoming good back again
> ------------------------------------------------------------------
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
> Reporter: Ravi Gummadi
> Attachments: YARN-90.1.patch, YARN-90.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), the NodeManager needs a restart. This JIRA is to improve the 
> NodeManager to reuse good disks (which could have been bad some time back).





[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-10-31 Thread Hou Song (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810976#comment-13810976
 ] 

Hou Song commented on YARN-90:
--

Sorry for the last comment, I meant: the test mimics disk failure with 
"chmod 000 failed_dir" and disk repair with "chmod 755 failed_dir". 



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-01 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811612#comment-13811612
 ] 

Ravi Prakash commented on YARN-90:
--

Hi Song! Thanks a lot for your offer to contribute. It would be great if you 
could please also share your patch. Could you please also clarify what "tt" you 
are referring to in "tt also adds a new metric of the number"?
I will go ahead and test the pre-existing patch anyway. 



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-01 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811724#comment-13811724
 ] 

Ravi Prakash commented on YARN-90:
--

Apart from the DirectoryCollection changes, I think we should also update 
LocalDirAllocator.AllocatorPerContext. Maybe we should handle that in a 
separate JIRA.

Anyway, I noticed that after this patch, although DirectoryCollection recovered 
the repaired directories, they were not actually used. I wonder whether 
something is wrong with my test procedure or whether we need more changes.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813183#comment-13813183
 ] 

Vinod Kumar Vavilapalli commented on YARN-90:
-

Thanks for the patch, Song! Some quick comments:
 - Because you are changing the semantics of checkDirs(), more changes are 
needed.
  -- updateDirsAfterFailure() -> updateConfAfterDirListChange?
  -- The log message in updateDirsAfterFailure: "Disk(s) failed. " should be 
changed to something like "Disk-health report changed: ".
 - The Web UI and web services are fine for now, I think; nothing to do there.
 - Drop the extraneous "System.out.println" lines throughout the patch.
 - Let's drop the metrics changes. We need to expose this end-to-end and not 
just as metrics - client-side reports, JMX, and metrics. Worth tracking that 
effort separately.
 - Test:
-- testAutoDir() -> testDisksGoingOnAndOff?
-- Can you also validate the health report both when disks go off and when 
they come back again?
-- Also, just throw unwanted exceptions instead of catching them and 
printing the stack trace.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-05 Thread Hou Song (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813811#comment-13813811
 ] 

Hou Song commented on YARN-90:
--

Thanks for the suggestions. I'm modifying my patch and will upload it soon.
However, I don't quite understand what you mean by "expose this end-to-end and 
not just metrics". We have been using the failed-disk metric in our production 
cluster for a year, and it's good enough for our rapid disk repair. Enlighten 
me if you have a better way. 



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-05 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814090#comment-13814090
 ] 

Vinod Kumar Vavilapalli commented on YARN-90:
-

bq.  However, I don't quite understand what you mean by "expose this 
end-to-end and not just metrics". We have been using the failed-disk metric in 
our production cluster for a year, and it's good enough for our rapid disk 
repair. Enlighten me if you have a better way. 
I meant that it should be part of the client-side RPC report and JMX, as well 
as the metrics. Doing only one of those is incomplete, so I was suggesting that 
we do all of that in a separate JIRA.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-11-07 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816232#comment-13816232
 ] 

Ravi Prakash commented on YARN-90:
--

Thanks for updating the patch, Song! With almost the same changes as Nigel, I 
was able to get the originally invalid directories to be used again. So the 
src/main code looks good to me. The one nit I had was that the following catch 
blocks can simply be removed:
{code}
} catch (IOException e2) {
  Assert.fail("should not throw an exception");
  Shell.execCommand(Shell.getSetPermissionCommand("755", false,
      testDir.getAbsolutePath()));
  throw e2;
}
{code}
{code}
catch (InterruptedException e1) {
}
{code}
{code}
} catch (IOException e2) {
  Assert.fail("should not throw an exception");
  throw e2;
}
{code}
{code}
} catch (IOException e) {
  Assert.fail("Service should have thrown an exception while closing");
  throw e;
}
{code}

Other than that, the patch looks good to me. +1. Thanks a lot, Nigel and Song!



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-03 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918405#comment-13918405
 ] 

Varun Vasudev commented on YARN-90:
---

Ravi, are you still working on this ticket? Do you mind if I take over?



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-03 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918439#comment-13918439
 ] 

Ravi Prakash commented on YARN-90:
--

I'm not working on it. Please feel free to take it over. Thanks, Varun.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925905#comment-13925905
 ] 

Hadoop QA commented on YARN-90:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12633727/apache-yarn-90.0.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3310//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3310//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13926199#comment-13926199
 ] 

Xuan Gong commented on YARN-90:
---

+1 LGTM



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932539#comment-13932539
 ] 

Hadoop QA commented on YARN-90:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12634255/apache-yarn-90.1.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3337//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-03-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932976#comment-13932976
 ] 

Hadoop QA commented on YARN-90:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12634358/apache-yarn-90.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3344//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3344//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140596#comment-14140596
 ] 

Hadoop QA commented on YARN-90:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12669994/apache-yarn-90.3.patch
  against trunk revision 6fe5c6b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5044//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5044//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5044//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140663#comment-14140663
 ] 

Hadoop QA commented on YARN-90:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12669998/apache-yarn-90.4.patch
  against trunk revision bf27b9c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5045//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5045//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-22 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143974#comment-14143974
 ] 

Jason Lowe commented on YARN-90:


Thanks, Varun!  Comments on the latest patch:

It's a bit odd to have a hash map from disk error types to lists of 
directories, fill them all in, but in practice only look at one type in the 
map, DISK_FULL.  It'd be simpler (and faster, using less space, since there's 
no hash map involved) to just track full disks as a separate collection, like 
we already do for localDirs and failedDirs.
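
(A minimal sketch of the shape being suggested here - separate lists per 
category instead of a Map keyed by error type. The class and method names are 
illustrative assumptions, not the actual DirectoryCollection code.)
{code}
import java.util.ArrayList;
import java.util.List;

class DirTrackingSketch {
  private final List<String> localDirs = new ArrayList<String>(); // currently good
  private final List<String> errorDirs = new ArrayList<String>(); // failed, not due to space
  private final List<String> fullDirs  = new ArrayList<String>(); // failed because the disk is full

  // Hypothetical classification standing in for DiskChecker/utilization checks.
  void classify(String dir, boolean usable, boolean full) {
    localDirs.remove(dir);
    errorDirs.remove(dir);
    fullDirs.remove(dir);
    if (usable) {
      localDirs.add(dir);
    } else if (full) {
      fullDirs.add(dir);
    } else {
      errorDirs.add(dir);
    }
  }

  List<String> getFullDirs() {
    return new ArrayList<String>(fullDirs);
  }
}
{code}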

Nit: DISK_ERROR_CAUSE should be DiskErrorCause (if we keep the enum) to match 
the style of other enum types in the code.

In verifyDirUsingMkdir, if an error occurs during the finally clause, that 
exception will mask the original exception.
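
(For illustration only, a generic Java sketch of the masking problem and one 
way to avoid it; this is not the patch's verifyDirUsingMkdir code.)
{code}
import java.io.IOException;

public class FinallyMasking {
  static void write() throws IOException { throw new IOException("original failure"); }
  static void cleanup() throws IOException { throw new IOException("cleanup failure"); }

  public static void main(String[] args) {
    try {
      try {
        write();      // the exception we actually care about
      } finally {
        cleanup();    // this one wins: the write() failure is masked
      }
    } catch (IOException e) {
      System.out.println("caller sees: " + e.getMessage()); // "cleanup failure"
    }

    try {
      try {
        write();
      } finally {
        try {
          cleanup();
        } catch (IOException suppressed) {
          // log and swallow so the original write() failure propagates
        }
      }
    } catch (IOException e) {
      System.out.println("caller sees: " + e.getMessage()); // "original failure"
    }
  }
}
{code}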

isDiskUsageUnderPercentageLimit is named backwards.  Disk usage being under the 
configured limit shouldn't be a full-disk error, and the error message is 
inconsistent with the method name (the method talks about being under the 
limit, but the error message says it's above).
{code}
if (isDiskUsageUnderPercentageLimit(testDir)) {
  msg =
  "used space above threshold of "
  + diskUtilizationPercentageCutoff
  + "%, removing from the list of valid directories.";
{code}

We should only call getDisksHealthReport() once in the following code:
{code}
+String report = getDisksHealthReport();
+if (!report.isEmpty()) {
+  LOG.info("Disk(s) failed. " + getDisksHealthReport());
{code}

Should updateDirsAfterTest always say "Disk(s) failed" if the report isn't 
empty?  Thinking of the case where two disks go bad and one is later restored: 
the health report will still have something in it, but that last update is a 
disk turning good, not failing.  Before, this code was only called when a new 
disk failed, and now that's not always the case.  Maybe it should just say 
something like "Disk health update: " instead?

Is it really necessary to stat a directory before we try to delete it?  Seems 
like we can just try to delete it.

The idiom of getting the directories and adding the full directories seems 
pretty common.  Might be good to have dirhandler methods that already do this, 
like getLocalDirsForCleanup or getLogDirsForCleanup.
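
(Sketch of the kind of "dirs for cleanup" helper being suggested: good dirs 
plus currently-full dirs, since a full disk can still hold files worth 
deleting. The field names are illustrative, not the actual dir handler code.)
{code}
import java.util.ArrayList;
import java.util.List;

class DirsForCleanupSketch {
  private final List<String> localDirs = new ArrayList<String>();
  private final List<String> fullDirs = new ArrayList<String>();

  List<String> getLocalDirsForCleanup() {
    // Size the result up front to avoid reallocations while concatenating.
    List<String> dirs = new ArrayList<String>(localDirs.size() + fullDirs.size());
    dirs.addAll(localDirs);
    dirs.addAll(fullDirs);
    return dirs;
  }
}
{code}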

I'm a bit worried that getInitializedLocalDirs could potentially try to delete 
an entire directory tree for a disk.  If the failure is sector-specific but 
other containers are currently using their files on other sectors of the same 
disk just fine, removing those files from underneath active containers could be 
very problematic and difficult to debug.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146937#comment-14146937
 ] 

Hadoop QA commented on YARN-90:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12671047/apache-yarn-90.5.patch
  against trunk revision 9fa5a89.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1270 javac 
compiler warnings (more than the trunk's current 1265 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.TestNonAggregatingLogHandler
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5107//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5107//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5107//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147090#comment-14147090
 ] 

Hadoop QA commented on YARN-90:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12671081/apache-yarn-90.6.patch
  against trunk revision 3cde37c.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5109//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5109//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152307#comment-14152307
 ] 

Jason Lowe commented on YARN-90:


Thanks for updating the patch, Varun.

bq. I've changed it to "Disk(s) health report: ". My only concern with this is 
that there might be scripts looking for the "Disk(s) failed" log line for 
monitoring. What do you think?

If that's true, then the code should diff the old disk list against the new 
one, logging which disks turned bad using the "Disk(s) failed" line and which 
disks became healthy with some other log message.
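
(A self-contained sketch of the kind of diff being described; the directory 
names and log strings are illustrative assumptions, not the patch's code.)
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DiskListDiff {
  public static void main(String[] args) {
    // Hypothetical before/after snapshots of the failed-dir list.
    List<String> oldFailed = Arrays.asList("/grid/0/yarn", "/grid/1/yarn");
    List<String> newFailed = Arrays.asList("/grid/1/yarn", "/grid/2/yarn");

    Set<String> turnedBad = new HashSet<String>(newFailed);
    turnedBad.removeAll(oldFailed);   // newly failed dirs
    Set<String> turnedGood = new HashSet<String>(oldFailed);
    turnedGood.removeAll(newFailed);  // dirs that recovered

    if (!turnedBad.isEmpty()) {
      // keeps the line that monitoring scripts may already grep for
      System.out.println("Disk(s) failed: " + turnedBad);
    }
    if (!turnedGood.isEmpty()) {
      System.out.println("Disk(s) turned good: " + turnedGood);
    }
  }
}
{code}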

bq. Directories are only cleaned up during startup. The code tests for 
existence of the directories and the correct permissions. This does mean that 
container directories left behind for any reason won't get cleaned up until the 
NodeManager is restarted. Is that ok?

This could still be problematic for the NM work-preserving restart case, as we 
could try to delete an entire disk tree with active containers on it due to a 
hiccup when the NM restarts.  I think a better approach is a periodic cleanup 
scan that looks for directories under yarn-local and yarn-logs that shouldn't 
be there.  This could be part of the health check scan or done separately.  
That way we don't have to wait for a disk to turn good or bad to catch leaked 
entities on the disk due to some hiccup.  Sorta like an fsck for the NM state 
on disk.  That is best done as a separate JIRA, as I think this functionality 
is still an incremental improvement without it.

Other comments:

checkDirs unnecessarily calls union(errorDirs, fullDirs) twice.

isDiskFreeSpaceOverLimt is now named backwards, as the code returns true if the 
free space is under the limit.

getLocalDirsForCleanup and getLogDirsForCleanup should have javadoc comments 
like the other methods.

Nit: The union utility function doesn't technically perform a union but rather 
a concatenation, and it'd be a little clearer if the name reflected that.  Also, 
the function knows how big the ArrayList will be after the operations, so it 
should pass that size hint to the constructor to avoid reallocations.




[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153750#comment-14153750
 ] 

Hadoop QA commented on YARN-90:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12672125/apache-yarn-90.7.patch
  against trunk revision 9582a50.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5187//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5187//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-30 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154211#comment-14154211
 ] 

Ming Ma commented on YARN-90:
-

Thanks, Varun and Jason. A couple of comments:

1. What if a dir transitions from the DISK_FULL state to the OTHER state? 
DirectoryCollection.checkDirs doesn't seem to update errorDirs and fullDirs 
properly. We could use a small state machine for each dir and make sure every 
transition is covered (see the sketch after this list).

2. The DISK_FULL state is counted toward the error-disk threshold by 
LocalDirsHandlerService.areDisksHealthy; later the RM could mark the NM 
NODE_UNUSABLE. If we believe DISK_FULL is mostly a temporary issue, should we 
consider disks healthy if they only stay in DISK_FULL for a short period of 
time?

3. In AppLogAggregatorImpl.java, "(Path[]) localAppLogDirs.toArray(new 
Path[localAppLogDirs.size()]).". It seems the (Path[]) cast isn't necessary.

4. What is the intention of numFailures? Method getNumFailures isn't used.

5. Nit: It is better to expand "import java.util.*;" in 
DirectoryCollection.java and LocalDirsHandlerService.java.
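
(Per item 1, a minimal sketch of per-directory state tracking; the state names 
and the update method are illustrative assumptions, not code from the patch.)
{code}
import java.util.HashMap;
import java.util.Map;

class DirHealthSketch {
  enum State { GOOD, FULL, OTHER_ERROR }

  private final Map<String, State> states = new HashMap<String, State>();

  // Recompute a directory's state from the latest check result and return the
  // previous state, so that every transition (including FULL -> OTHER_ERROR)
  // can be handled explicitly when updating errorDirs and fullDirs.
  State update(String dir, boolean usable, boolean full) {
    State next = usable ? State.GOOD : (full ? State.FULL : State.OTHER_ERROR);
    State prev = states.put(dir, next);
    return prev == null ? State.GOOD : prev;
  }
}
{code}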



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155745#comment-14155745
 ] 

Hadoop QA commented on YARN-90:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12672436/apache-yarn-90.8.patch
  against trunk revision dd1b8f2.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5213//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5213//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5213//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-01 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155747#comment-14155747
 ] 

Varun Vasudev commented on YARN-90:
---

The release audit warning is unrelated to the patch.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-03 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158643#comment-14158643
 ] 

Ming Ma commented on YARN-90:
-

Thanks, Varun.

The main question about the UNHEALTHY state is whether this patch might make it 
more likely for a node to become unhealthy, given that "full disk" has been 
added as one of the conditions. Given that [~jira.shegalov]'s YARN-1996 and 
[~sjlee0]'s MAPREDUCE-5817 have suggestions to mitigate the impact of UNHEALTHY 
nodes on existing containers and MR task scheduling, this might not be an issue.

Nit: For "Set postCheckFullDirs = new HashSet(fullDirs);", it doesn't have to 
create postCheckFullDirs; it can directly refer to fullDirs later.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164214#comment-14164214
 ] 

zhihai xu commented on YARN-90:
---

I looked at the patch. Some nits I found:

1. We can change
{code}
if (!postCheckFullDirs.contains(dir) && postCheckOtherDirs.contains(dir)) {
{code}
to
{code}
if (postCheckOtherDirs.contains(dir)) {
{code}
because postCheckFullDirs and postCheckOtherDirs are mutually exclusive sets.

2. Same as item 1: change
{code}
if (!postCheckOtherDirs.contains(dir) && postCheckFullDirs.contains(dir)) {
{code}
to
{code}
if (postCheckFullDirs.contains(dir)) {
{code}

3. In verifyDirUsingMkdir: can we append a counter to the file name to avoid 
looping forever (although the chance is very small), like the following?
{code}
long i = 0L;
while (target.exists()) {
  randomDirName = RandomStringUtils.randomAlphanumeric(5) + i++;
  target = new File(dir, randomDirName);
}
{code}

4. In disksTurnedBad: can we add a break in the loop when disksFailed is true, 
so we exit the loop earlier?
{code}
if (!preCheckDirs.contains(dir)) {
  disksFailed = true;
  break;
}
{code}

5. In disksTurnedGood, same as item 4: can we add a break in the loop when 
disksTurnedGood is true?

thanks



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-10 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167222#comment-14167222
 ] 

zhihai xu commented on YARN-90:
---

In function verifyDirUsingMkdir, the sequence of target.exists(), 
target.mkdir() and FileUtils.deleteQuietly(target) is not atomic. What happens 
if another thread tries to create the same directory (target)?
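
(For illustration, a generic way to sidestep the exists()/mkdir() race: rely on 
mkdir()'s return value, or let Files.createTempDirectory pick a fresh name 
atomically. This is a sketch under those assumptions, not the patch's code.)
{code}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AtomicProbeDir {
  public static void main(String[] args) throws IOException {
    Path parent = Paths.get("/tmp"); // hypothetical directory under test

    // Option 1: let the filesystem pick a unique name; no exists() pre-check.
    Path probe = Files.createTempDirectory(parent, "disk-probe-");
    Files.delete(probe);

    // Option 2: treat mkdir() returning false (name taken or not creatable)
    // as "try another name" instead of checking exists() first.
    File target = new File(parent.toFile(), "disk-probe-" + System.nanoTime());
    boolean created = false;
    for (int attempt = 0; attempt < 10 && !created; attempt++) {
      created = target.mkdir();
      if (!created) {
        target = new File(parent.toFile(), "disk-probe-" + System.nanoTime());
      }
    }
    if (created) {
      target.delete();
    }
  }
}
{code}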





[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170711#comment-14170711
 ] 

Hadoop QA commented on YARN-90:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12674722/apache-yarn-90.9.patch
  against trunk revision 5faaba0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5386//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5386//console

This message is automatically generated.



[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-14 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172050#comment-14172050
 ] 

Ming Ma commented on YARN-90:
-

Thanks Varun.

You and Jason discussed about disk clean up scenario. It will be useful to 
clarify if the following scenario will be resolved by this jira or a separate 
jira is necessary.

1. A disk became ready only. So DiskChecker will mark it as 
DiskErrorCause.OTHER.
2. Later the disk was repaired and became good. There are still data left on 
the disk.
3. Given these data are from old containers which have finished, who will clean 
up these data?

Nit: disksTurnedBad's parameter name preCheckDirs, it is better to name it 
preFailedDirs.

In getDisksHealthReport, people can't tell whether a disk is unusable because 
it is full or because it has failed; it might be useful to distinguish the two 
cases (see the sketch below).
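
Something along these lines, for example. The parameter names fullDirs and 
errorDirs are placeholders for this sketch, not fields from the patch:

{code:java}
import java.util.List;

// Illustration only: a health report string that separates "disk is full"
// from "disk has errors" so operators can tell the two cases apart.
public class DisksHealthReportSketch {

  public static String buildReport(List<String> fullDirs, List<String> errorDirs) {
    StringBuilder report = new StringBuilder();
    if (!fullDirs.isEmpty()) {
      report.append(fullDirs.size())
            .append(" dir(s) unusable because the disk is full: ")
            .append(String.join(", ", fullDirs))
            .append(". ");
    }
    if (!errorDirs.isEmpty()) {
      report.append(errorDirs.size())
            .append(" dir(s) unusable because of disk errors: ")
            .append(String.join(", ", errorDirs))
            .append(".");
    }
    return report.toString();
  }
}
{code}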

Is verifyDirUsingMkdir necessary, given that DiskChecker.checkDir will already 
check it?
 

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, 
> apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, 
> apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-17 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175175#comment-14175175
 ] 

Ming Ma commented on YARN-90:
-

Thanks Varun. The latest patch LGTM.

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-10-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175191#comment-14175191
 ] 

Hadoop QA commented on YARN-90:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12675491/apache-yarn-90.10.patch
  against trunk revision 3687431.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5434//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5434//console

This message is automatically generated.

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, 
> apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, 
> apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118193#comment-14118193
 ] 

Hadoop QA commented on YARN-90:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12634358/apache-yarn-90.2.patch
  against trunk revision 258c7d0.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4799//console

This message is automatically generated.

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-02 Thread Xu Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119194#comment-14119194
 ] 

Xu Yang commented on YARN-90:
-

Hi, [~vvasudev]. Thanks a lot for your patch. Is it finished? I think this 
feature is very useful. If it hasn't been committed yet, maybe I need to merge 
the patch manually.

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2014-09-03 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119795#comment-14119795
 ] 

Varun Vasudev commented on YARN-90:
---

[~yxls123123] the patch needs to be rebased. It's currently causing a merge 
conflict. Give me a couple of days and I should be able to sort it out.

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>Assignee: Varun Vasudev
> Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, 
> YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, 
> apache-yarn-90.2.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-08-06 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730850#comment-13730850
 ] 

Ravi Prakash commented on YARN-90:
--

Do we know what we need to do for this JIRA? I can see that in 
DirectoryCollection we need to be able to remove entries from failedDirs, and 
LocalDirsHandlerService needs to be able to recognize that change. Would 
anything else need to be done?

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-09-24 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776113#comment-13776113
 ] 

nijel commented on YARN-90:
---

To handle this we can check the failed dirs first in 
DirectoryCollection.checkDirs() and add them back to localDirs if the 
directories have recovered from the error.
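
Roughly along these lines (illustration only, not the real DirectoryCollection 
code; the simplified "is it usable again" test below stands in for 
DiskChecker.checkDir):

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustration only: re-test previously failed dirs on every checkDirs() pass
// and move the ones that are usable again back into the good list.
public class DirRecheckSketch {

  private final List<String> localDirs = new ArrayList<>();
  private final List<String> failedDirs = new ArrayList<>();

  public void checkFailedDirsFirst() {
    Iterator<String> it = failedDirs.iterator();
    while (it.hasNext()) {
      String dir = it.next();
      File f = new File(dir);
      // Simplified stand-in for DiskChecker.checkDir(f).
      boolean usableAgain = f.isDirectory() && f.canRead() && f.canWrite()
          && f.canExecute();
      if (usableAgain) {
        it.remove();
        localDirs.add(dir); // LocalDirsHandlerService then sees it as good again
      }
    }
  }
}
{code}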


> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-09-24 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777082#comment-13777082
 ] 

Ravi Prakash commented on YARN-90:
--

Hi nijel!

Welcome to the community and thanks for your contribution. A few comments:
1. Nit: Some lines are over 80 characters long.
2. numFailures is no longer incremented when a directory fails, so 
getNumFailures() would return the wrong result (see the sketch below).
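
For illustration (the names here are assumptions, not the actual 
DirectoryCollection code), the counter would still need to be bumped whenever a 
directory is newly added to the failed list:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustration only: keep the failure metric in sync with the failed list.
public class FailureCounterSketch {

  private final List<String> failedDirs = new ArrayList<>();
  private int numFailures = 0;

  public void markFailedIfBad(String dir) {
    File f = new File(dir);
    boolean usable = f.isDirectory() && f.canWrite();
    if (!usable && !failedDirs.contains(dir)) {
      failedDirs.add(dir);
      numFailures++; // still incremented when a dir newly fails
    }
  }

  public int getNumFailures() {
    return numFailures;
  }
}
{code}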

Could you please also tell us how you tested the patch? There seem to be a lot 
of unit tests which use LocalDirsHandlerService. Did you run them all and 
ensure that they still all pass?

Thanks again

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
> Attachments: YARN-90.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-09-25 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777530#comment-13777530
 ] 

nijel commented on YARN-90:
---

Hi Ravi,
Thanks for the comments.

The patch has been updated to address them.

Regarding testing, I executed the test cases in the node manager project 
locally. All but two pass, and those failures are not related to the directory 
service.

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
> Attachments: YARN-90.1.patch, YARN-90.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again

2013-09-30 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781905#comment-13781905
 ] 

Ravi Prakash commented on YARN-90:
--

Hi nijel!

For testing I would like to configure a USB drive to be one of the local + log 
dirs. We can then simulate failure by unplugging the USB drive. When we plug it 
back in, the NM should start using the "recovered" disk. Did you experience 
this behaviour yourself? I'll also try to test this as soon as I get some 
cycles.

 

> NodeManager should identify failed disks becoming good back again
> -
>
> Key: YARN-90
> URL: https://issues.apache.org/jira/browse/YARN-90
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Ravi Gummadi
> Attachments: YARN-90.1.patch, YARN-90.patch
>
>
> MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes 
> down, it is marked as failed forever. To reuse that disk (after it becomes 
> good), NodeManager needs restart. This JIRA is to improve NodeManager to 
> reuse good disks(which could be bad some time back).



--
This message was sent by Atlassian JIRA
(v6.1#6144)