[ https://issues.apache.org/jira/browse/MAPREDUCE-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ravi Gummadi updated MAPREDUCE-2413:
------------------------------------

    Attachment: MR-2413.v0.3.patch

Attaching a new patch that removes an unused field from the LocalStorage class.

> TaskTracker should handle disk failures at both startup and runtime
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2413
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2413
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task-controller, tasktracker
>    Affects Versions: 0.20.204.0
>            Reporter: Bharath Mundlapudi
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.204.0
>
>         Attachments: MR-2413.v0.1.patch, MR-2413.v0.2.patch,
> MR-2413.v0.3.patch, MR-2413.v0.patch
>
>
> At present, the TaskTracker (TT) does not handle disk failures properly,
> either at startup or at runtime.
> (1) Currently the TaskTracker does not come up if any of the
> mapred-local-dirs is on a bad disk. It should instead ignore that
> mapred-local-dir, start up, and use only the remaining good mapred-local-dirs.
> (2) If a disk goes bad while the TaskTracker is running, the TaskTracker
> currently does nothing special. This leads to one of two outcomes:
> (a) The TaskTracker keeps trying to use the bad disk. This causes many task
> failures, possibly job failures (when multiple TTs have bad disks), and
> eventually these TTs being graylisted for all jobs. Recovery requires a
> manual restart of the TT with mapred-local-dirs reconfigured to exclude the
> bad disk. Or:
> (b) The health check script identifies the disk as bad and the TT is
> blacklisted. This also requires a manual restart of the TT with
> mapred-local-dirs reconfigured to exclude the bad disk.
> This JIRA aims to make the TaskTracker more tolerant of disk failures by
> solving (1) and (2). That is, the TT should start as long as at least one of
> the mapred-local-dirs is on a good disk, and it should adjust its in-memory
> list of mapred-local-dirs to stop using bad ones.
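As a rough illustration of the behavior requested in (1) and (2), the Java sketch below filters the configured directories at startup (failing only if none are usable) and prunes directories that fail a re-check at runtime. It is not taken from the attached patches: the class `LocalDirsSketch`, its methods, and the "exists, readable, writable, executable" health check are all hypothetical. A real TaskTracker would more likely rely on Hadoop's own `DiskChecker` utility and its actual configuration plumbing.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical sketch (not the MR-2413 patch): keeps an in-memory list of
 * usable mapred-local-dirs and drops the ones that go bad.
 */
final class LocalDirsSketch {
    private final List<String> goodDirs = new ArrayList<>();

    /** Startup, point (1): keep only usable dirs; fail only if none survive. */
    LocalDirsSketch(List<String> configuredDirs) throws IOException {
        for (String dir : configuredDirs) {
            if (isUsable(dir)) {
                goodDirs.add(dir);
            }
        }
        if (goodDirs.isEmpty()) {
            throw new IOException("No usable mapred-local-dirs among " + configuredDirs);
        }
    }

    /** Assumed health check: the path is (or can be made) a directory we can read, write and traverse. */
    static boolean isUsable(String path) {
        File f = new File(path);
        if (!f.isDirectory() && !f.mkdirs() && !f.isDirectory()) {
            return false;
        }
        return f.canRead() && f.canWrite() && f.canExecute();
    }

    /** Runtime, point (2): re-check each dir and remove failed ones. Returns the removed dirs. */
    synchronized List<String> checkDirs() {
        List<String> failed = new ArrayList<>();
        for (String dir : goodDirs) {
            if (!isUsable(dir)) {
                failed.add(dir);
            }
        }
        goodDirs.removeAll(failed);
        return failed;
    }

    /** Snapshot of the dirs that are currently considered good. */
    synchronized List<String> getGoodDirs() {
        return Collections.unmodifiableList(new ArrayList<>(goodDirs));
    }
}
```

A caller would build it from the configured mapred-local-dirs and call `checkDirs()` periodically (hooking it into a heartbeat loop is an assumption here), treating an empty `getGoodDirs()` as fatal. Keeping the list in memory is what lets the TT avoid the manual restart described in (a) and (b).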
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira