[ http://issues.apache.org/jira/browse/HADOOP-547?page=all ]
Sanjay Dahiya updated HADOOP-547:
---------------------------------

    Attachment: Hadoop-547.patch

Here is a patch for review. It makes sure that the reduce task sends a
heartbeat/progress update when none of the copy tasks finishes within
"mapred.task.timeout". It replaces the unconditional wait with a wait that
has a timeout of (mapred.task.timeout)/2. (We could make it 3/4 of this
timeout as well.)

> ReduceTaskRunner can miss sending heartbeats if no map output copy finishes
> within "mapred.task.timeout"
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-547
>                 URL: http://issues.apache.org/jira/browse/HADOOP-547
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.6.2
>            Reporter: Sanjay Dahiya
>         Assigned To: Sanjay Dahiya
>         Attachments: Hadoop-547.patch
>
>
> In ReduceTaskRunner, the main loop that sends heartbeats waits on
> copyResults, which is released only when a copy thread finishes copying.
> This can cause good reduce tasks that are still copying data to fail if no
> map task output is copied within "mapred.task.timeout".
> ReduceTaskRunner.java:490
>   try {
>     copyResults.wait();   <=========== Calls unconditional wait.
>   } catch (InterruptedException e) { }
> wait() should be called with a timeout, possibly taskTimeout/2, after which
> the loop should send a heartbeat and go back to waiting.
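
For illustration, the timed wait proposed for HADOOP-547 could look roughly
like the sketch below. This is not the attached patch. Only the copyResults
name, the taskTimeout/2 interval and the standard Object.wait(long) call come
from the issue; the class TimedWaitSketch, the sendHeartbeat() hook and the
addResult()/awaitResult() methods are hypothetical stand-ins for the real
ReduceTaskRunner code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the heartbeat loop in ReduceTaskRunner.
class TimedWaitSketch {
    // Plays the role of copyResults: copier threads add to it and notify.
    private final List<String> copyResults = new ArrayList<>();
    private int heartbeats = 0;

    // Assumed heartbeat hook; the real runner would report progress here.
    private void sendHeartbeat() {
        heartbeats++;
    }

    int getHeartbeats() {
        return heartbeats;
    }

    // Called by a copier thread when one map output copy has finished.
    void addResult(String result) {
        synchronized (copyResults) {
            copyResults.add(result);
            copyResults.notifyAll();
        }
    }

    // Blocks until a copy result is available. Instead of an unconditional
    // wait(), it waits at most taskTimeoutMs / 2 at a time and sends a
    // heartbeat whenever it wakes up and still finds no result.
    String awaitResult(long taskTimeoutMs) throws InterruptedException {
        long waitMs = Math.max(1, taskTimeoutMs / 2);
        synchronized (copyResults) {
            while (copyResults.isEmpty()) {
                copyResults.wait(waitMs);
                if (copyResults.isEmpty()) {
                    // Timed out (or woke spuriously): report that we are alive.
                    sendHeartbeat();
                }
            }
            return copyResults.remove(0);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TimedWaitSketch runner = new TimedWaitSketch();
        // Simulate a slow copier that takes longer than the task timeout.
        Thread copier = new Thread(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            runner.addResult("map_0");
        });
        copier.start();
        String result = runner.awaitResult(100);
        copier.join();
        System.out.println(result + " copied, heartbeats sent: "
                + runner.getHeartbeats());
    }
}
```

With a 100 ms timeout and a copy that takes about 300 ms, the loop wakes up
several times before the result arrives and sends a heartbeat each time, so
the task is not considered dead while the copy is still in progress. A
spurious wakeup only causes one extra heartbeat, which is harmless.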