[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617471#comment-13617471
 ] 

Karthik Kambatla commented on MAPREDUCE-5110:
---------------------------------------------

Thanks Arun. Agreed that we can't guarantee a single task attempt in the face of 
a transient network partition. That said, I think there is merit in solving 
the part we can. For instance, users could have their own SLAs (time-based, 
percentile-based, or plain hardware-based) to guard against inconsistencies due 
to network partitions.

bq. I think MAPREDUCE-2217 made an important improvement and we should keep it. 
However, I'm very scared of trying to implement MAPREDUCE-2217 via TT-side 
changes, particularly, when we are adding complexity to already squiggly code 
on the TT.

Agreed, MAPREDUCE-2217 addresses the hung TT case, but only for UNASSIGNED 
tasks. RUNNING/COMMIT_PENDING tasks are still handled by the TT. In other words, 
the rationale for monitoring the progress of RUNNING/COMMIT_PENDING tasks in the 
TT instead of the JT applies to this case too. If anything, the proposed patch 
only makes the behavior consistent.

All that said, if you are uncomfortable with the JT changes, I can restrict the 
changes to the TT.
                
> Long task launch delays can lead to multiple parallel attempts of the task
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5110
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5110
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 1.1.2
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>         Attachments: expose-mr-5110.patch, mr-5110.patch, mr-5110.patch
>
>
> If a task takes too long to launch, the JT expires the task and schedules 
> another attempt. The earlier attempt can then start after the later attempt, 
> leading to two attempts of the same task running in parallel. This is 
> particularly an issue if the user turns off speculation and expects at most 
> one attempt of a task to run at any point in time.
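
To make the race in the description concrete, here is a minimal sketch of the
kind of TT-side check that could refuse to start an attempt whose launch has
been delayed past the expiry interval. It is not taken from the attached
patches. The class LaunchGuard, the method mayLaunch, and the parameter names
are all hypothetical, and the sketch assumes both timestamps are read from the
same node's clock. It does not cover the network partition case discussed
above.

```java
// Hypothetical sketch; names are illustrative and not from mr-5110.patch.
final class LaunchGuard {
    // Assumed to match the interval after which the JT expires an
    // attempt that has not launched and schedules another one.
    private final long launchExpiryMillis;

    LaunchGuard(long launchExpiryMillis) {
        if (launchExpiryMillis <= 0) {
            throw new IllegalArgumentException("expiry must be positive");
        }
        this.launchExpiryMillis = launchExpiryMillis;
    }

    /**
     * Returns true if an attempt assigned at assignedAtMillis may still be
     * started at nowMillis. Once the launch delay reaches the expiry
     * interval, the JT may already have scheduled a replacement attempt,
     * so starting this one could produce two parallel attempts.
     */
    boolean mayLaunch(long assignedAtMillis, long nowMillis) {
        return nowMillis - assignedAtMillis < launchExpiryMillis;
    }
}
```

In this sketch the check would run right before the attempt is actually
started, so a late attempt is dropped instead of running alongside its
replacement.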
