[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-29 Thread Mahadev konar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahadev konar updated MAPREDUCE-3121:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Just committed this. Thanks Ravi! 

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
>Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 
> 3121.v2.patch, 3121.v3.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-21 Thread Ravi Gummadi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-3121:


Status: Patch Available  (was: Open)

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
>Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 
> 3121.v2.patch, 3121.v3.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-21 Thread Ravi Gummadi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-3121:


Attachment: 3121.v3.patch

Attaching new patch. Addressed all the review comments except the integration 
tests for other components. Will raise a separate JIRA for them as the current 
patch is already having tests that check the basic functionality of the patch.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
>Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 
> 3121.v2.patch, 3121.v3.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-16 Thread Ravi Gummadi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-3121:


Attachment: 3121.v2.patch

Attaching new patch.
Refactored and cleaned up code, added more testcases.

NodeHealthCheckerService is a composite service containing 2 services 
NodeHealthScriptRunner and DiskHandlerService.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
>Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 3121.v2.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-15 Thread Vinod Kumar Vavilapalli (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3121:
---

Status: Open  (was: Patch Available)

Cancelling patch for progress.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-14 Thread Ravi Gummadi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-3121:


Status: Patch Available  (was: Open)

Just going through submitting patch. Not for commit.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-14 Thread Ravi Gummadi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-3121:


Attachment: 3121.v1.1.patch

Attaching new patch with a minor fix to earlier patch.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-14 Thread Ravi Gummadi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-3121:


Attachment: 3121.v1.patch

Attaching new patch incorporating most of the review comments. Will discuss the 
remaining minor comments with Vinod soon and upload another patch.

Added a new configuration property for "minimum fraction of number of disks 
that are to be healthy for considering a node to be healthy interms of disks". 
It is yarn.nodemanager.disk-health-checker.min-healthy-disks. It's 
default value is 0.05. i.e. By default, a node is considered unhealthy if there 
are only less than 5% of disks are healthy.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-11-07 Thread Vinod Kumar Vavilapalli (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3121:
---

Status: Open  (was: Patch Available)

My comments on the patch.

*Major:*
DiskHealthCheckerService
 - Let us not use the configuration property to read the dir-list anywhere. I 
already see a couple of places where this can be trouble-some. Make sure that 
YarnConfiguration.NM_LOCAL_DIRS and YarnConfiguration.NM_LOG_DIRS aren't read 
anywhere after the initial NM-start.
   -- After this, you should definitely drop 
DiskHealthCheckerService.updateDirsInConfiguration().
 - ResourceLocalizationService.localDirsSelector doesn't look at healthy disks, 
same for ShuffleHandler.
 - Let's have a single API to return local-dirs either Array or a list. Let the 
caller take care of any conversions they need. Ditto about 
getLocalDirsString()/getLogDirsString().
 - Same as above for the LocalStorage class and getNumDirs() methods in there.
 - Instead of setting the disksInfo in NodePage, we can set it directly in the 
health-report via NodeStatusUpdater. That way, RM will also get the report.

bq. IMO, one approach we could take is to do a check at the top level. For 
example, ContainersLaunch as per the patch does a sanity check on available 
disks and bails if there is an issue. It could create a snapshot of available 
disks at this point.
Agreed, we should check at ContainersLauncherImpl level before launching every 
container and send out a container-failed event in case all disks have gone bad.

*Minor:*
DiskHealthCheckerService
 - DisksHealthMonitorExecutor can be static and final.
 - Drop the constructor DiskHealthCheckerService(Configuration conf) and use 
init() life-cycle directly.
 - init(Configuration conf): call super.init() at the end so that it can set 
the service-state correctly. Same for start(), stop()
 - Declare diskHealthChecker as a DisksHealthMonitorExecutor instead of 
TimerTask so as to avoid unnecessary casting in get{Log|Local}Dirs()
 - LocalStorage is way too general. Let's rename LocalStorage to 
LocalDirectories?

Let's have multiple local and log-dirs by default in MiniYarnCluster.

TestDiskFailures:
 - No need for hard-coded 2 secs sleep, MiniYarnCluster already has code to 
wait for starting.
 - waitForDiskHealthCheck() is error-prone. You can set an inline 
DiskHealthChecker and invoke explicit checking of disks.

*Miscellaneous:*
Like I summarized in my first comment, there are more pending items. Let me 
know if you want to tackle them separately:
 - We need a configuration for the max-percentage of disks that can fail 
without shutting down NM.
 - If all or a major percentage of disks turn bad at startup, NodeManager 
should either fail to start or get started, and report the RM that it is bad 
and needs operator intervention.
 - If all or a major percentage of disks turn bad during runtime, NodeManager 
can update the RM and ask for a halt in the scheduling.


> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
> Fix For: 0.23.1
>
> Attachments: 3121.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-10-30 Thread Ravi Gummadi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-3121:


Status: Patch Available  (was: Open)

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
> Fix For: 0.23.0
>
> Attachments: 3121.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-10-30 Thread Ravi Gummadi (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Gummadi updated MAPREDUCE-3121:


Attachment: 3121.patch

Attaching patch that makes NodeManager to identify disk failures (for 
nodemanager-local-dirs and nodemanager-log-dirs) and makes sure to use good 
local-dirs and good log-dirs for the containers to be launched after 
disk-failure-identification.

Please review and provide your comments.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
> Fix For: 0.23.0
>
> Attachments: 3121.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-10-05 Thread Vinod Kumar Vavilapalli (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3121:
---

Priority: Major  (was: Blocker)

Damn, wrong JIRA.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
> Fix For: 0.23.0
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAPREDUCE-3121) NodeManager should handle disk-failures

2011-10-05 Thread Vinod Kumar Vavilapalli (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3121:
---

Priority: Blocker  (was: Major)

Let's get this in for 0.23.

> NodeManager should handle disk-failures
> ---
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv2, nodemanager
>Affects Versions: 0.23.0
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Ravi Gummadi
>Priority: Blocker
> Fix For: 0.23.0
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira