[ http://issues.apache.org/jira/browse/HADOOP-610?page=all ]
Owen O'Malley updated HADOOP-610:
---------------------------------

    Attachment: lost-tt.patch

This patch:
  1. Refactors the Task Tracker's offerService loop into a handful of routines.
  2. Adds exception handlers inside the offerService loop.
  3. Moves the Phase enum into the TaskStatus class, rather than leaving it free-floating at the top of TaskStatus.java, where it was harder to find.
  4. Adds generic arguments to some of the data structures in TaskTracker.
  5. Pulls the Running Job code introduced in caching into methods.
  6. Removes the old TaskTracker.getCallStacks, which used killall -QUIT java.
  7. Adds some new stack-gathering code to ReflectionUtils.
  8. Adds a short-circuit to enoughFreeSpace to handle the case where the required size is 0.
  9. Adds equals/hashCode to TaskTracker.TaskInProgress to make instances hashable.
  10. Adds a new switch to enable contention tracking in the TaskTracker (tasktracker.contention.tracking).

> Task Tracker offerService does not adequately protect from exceptions
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-610
>                 URL: http://issues.apache.org/jira/browse/HADOOP-610
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.1
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.8.0
>
>         Attachments: lost-tt.patch
>
>
> The TaskTracker's offerService loop doesn't handle exceptions, such as
> timeouts, well and will reset the task tracker. I believe this is the cause of
> most of the lost task trackers. The scenario looks like:
>   1. an rpc timeout in offerService
>   2. the task tracker cleans up (which takes 30 minutes with the task tracker
> locked up)
>   3. the task tracker is declared lost for not providing its heartbeat
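
To illustrate the idea of handling exceptions inside the offerService loop, here is a minimal sketch. It is not taken from lost-tt.patch: the class OfferServiceSketch, the Heartbeat interface, and the Stats counters are all hypothetical names invented for this example. It only shows the general pattern of catching a failure per iteration so that a single rpc timeout does not escape the loop and force a full reset.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class OfferServiceSketch {

    /** Hypothetical stand-in for the heartbeat rpc; not a Hadoop API. */
    interface Heartbeat {
        void send(int attempt) throws IOException;
    }

    /** Hypothetical counters so the loop's behaviour can be inspected. */
    static final class Stats {
        int succeeded;
        int failed;
    }

    /**
     * Runs a fixed number of heartbeat iterations. The try/catch sits
     * inside the loop, so an I/O failure in one iteration is counted
     * and the next iteration still runs.
     */
    static Stats offerService(Heartbeat heartbeat, int iterations) {
        Stats stats = new Stats();
        for (int i = 0; i < iterations; i++) {
            try {
                heartbeat.send(i);
                stats.succeeded++;
            } catch (IOException e) {
                // Record the failure and keep going instead of letting
                // the exception propagate out of the loop.
                stats.failed++;
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // Simulate an rpc timeout on the second heartbeat only.
        Stats stats = offerService(attempt -> {
            if (attempt == 1) {
                throw new SocketTimeoutException("simulated rpc timeout");
            }
        }, 4);
        System.out.println("succeeded=" + stats.succeeded
                + " failed=" + stats.failed);
    }
}
```

In this sketch the simulated timeout is counted as one failed heartbeat and the remaining iterations still run. A real loop would presumably also log the exception and decide which failures are recoverable; those details are left out here.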