[ https://issues.apache.org/jira/browse/MAPREDUCE-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446528#comment-13446528 ]

Bikas Saha commented on MAPREDUCE-4607:
---------------------------------------

Attaching a patch with the following fixes:
1) Rename MapRetroactive* to Retroactive*, because race conditions between 
task attempts can make these transitions execute for reduce tasks as well as 
maps.
2) Change the RetroactiveKilled transition to first check whether the killed 
attempt is the same as the succeeded attempt, and only check for map tasks 
after that case. This eliminates race conditions between concurrent reduce 
attempts (a sketch of the reordered checks follows this list). Tests for this 
are added in TestTaskImpl.java.
3) Leave the check and internalError() unchanged in the Retroactive* 
transitions for reduce tasks. Currently it is a system invariant that we do 
not fail/kill successful reduce tasks, because their outputs are already safe 
in HDFS. If this check fires, it means the invariant has been broken by a 
change in logic or by a bug.
4) There is currently a bug/race condition in which, for the same reduce task, 
successful-completion and killed events may both be in flight. This is fixed 
by making sure that successful reduce attempts do not kill themselves, since 
that is inefficient and incorrect behavior (a sketch of that guard also 
follows this list). Test coverage is added to an existing test case in 
TestMRApp.java.
                
> Race condition in ReduceTask completion can result in Task being incorrectly 
> failed
> -----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4607
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4607
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.1.0-alpha
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: MAPREDUCE-4607.1.patch, MAPREDUCE-4607.2.patch
>
>
> Problem reported by chackaravarthy in MAPREDUCE-4252
> This problem has been handled for the case where a speculative attempt is 
> launched for a map task and the other attempt fails (rather than being killed).
> Can a similar scenario happen in the case of a reduce task?
> Consider the following scenario for a reduce task with speculation (one 
> attempt gets killed):
> 1. A task attempt is started.
> 2. A speculative task attempt for the same task is started.
> 3. The first task attempt completes and causes the task to transition to 
> SUCCEEDED.
> 4. The speculative task attempt is then killed because the first attempt 
> completed.
> As a result, an internal error is raised for this attempt (in 
> TaskImpl.MapRetroactiveKilledTransition), and hence the task attempt failure 
> leads to job failure.
> TaskImpl.MapRetroactiveKilledTransition:
> if (!TaskType.MAP.equals(task.getType())) {
>   LOG.error("Unexpected event for REDUCE task " + event.getType());
>   task.internalError(event.getType());
> }
> So, do we need to have the following code in MapRetroactiveKilledTransition 
> as well, just like in MapRetroactiveFailureTransition?
> if (event instanceof TaskTAttemptEvent) {
>   TaskTAttemptEvent castEvent = (TaskTAttemptEvent) event;
>   if (task.getState() == TaskState.SUCCEEDED &&
>       !castEvent.getTaskAttemptID().equals(task.successfulAttempt)) {
>     // don't allow a different task attempt to override a previous
>     // succeeded state
>     return TaskState.SUCCEEDED;
>   }
> }
> Please check whether this is a valid case and give your suggestions.
