GitHub user hthuynh2 opened a pull request:

    https://github.com/apache/spark/pull/21653

    [SPARK-13343] speculative tasks that didn't commit shouldn't be marked as 
success

    **Description**
    Currently, speculative tasks that didn't commit can show up as success or
failure (depending on the timing of the commit). This is a bit confusing
because the task didn't really succeed in the sense that it didn't write
anything. I think these tasks should be marked as KILLED, or something that
makes it more obvious to the user exactly what happened. If a task happens to
hit the timing where it gets a commit-denied exception, it shows up as failed
and counts against your task failures; it shouldn't count against task
failures since that failure really doesn't matter.
    MapReduce handles this situation, so perhaps we can look there for a model.
    
    <img width="1420" alt="unknown" src="https://user-images.githubusercontent.com/15680678/42013170-99db48c2-7a61-11e8-8c7b-ef94c84e36ea.png">
    
    **How can this issue happen?**
    When both attempts of a task finish before the driver sends the command to
kill one of them, both attempts send a FINISHED status update to the driver.
The driver calls TaskSchedulerImpl to handle one successful task at a time.
When it handles the first successful task, it sends the command to kill the
other copy of the task; however, because that task has already finished, the
executor ignores the command. After finishing the first attempt, the driver
processes the second one. Although all actions on the result of this attempt
are skipped, it is still marked as SUCCESS. As a result, even though this
issue does not affect the result of the job, it can confuse users because
both attempts appear to be successful.
    
    **How does this PR fix the issue?**
    A simple way to fix this issue is that when TaskSetManager handles a
successful task, it checks whether any other attempt of the same task has
already succeeded. If so, it calls handleFailedTask with state==KILLED and
reason==TaskKilled("another attempt succeeded") to handle this attempt as
being killed.
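    For illustration only, here is a minimal, self-contained sketch of that
check. This is not the actual patch: the helper names (`successful`,
`taskInfos`, `addRunningAttempt`) and the TaskKilled message are simplified
stand-ins for the real TaskSetManager internals.

```scala
// Simplified sketch of "mark a late duplicate attempt as killed, not successful".
object SpeculativeTaskSketch {

  sealed trait TaskEndReason
  case object Success extends TaskEndReason
  case class TaskKilled(reason: String) extends TaskEndReason

  case class TaskInfo(taskId: Long, index: Int, var finished: Boolean = false)

  class TaskSetManagerSketch(numTasks: Int) {
    // One flag per partition index: has any attempt of this task succeeded yet?
    private val successful = Array.fill(numTasks)(false)
    private val taskInfos = scala.collection.mutable.Map.empty[Long, TaskInfo]

    def addRunningAttempt(tid: Long, index: Int): Unit =
      taskInfos(tid) = TaskInfo(tid, index)

    // Called when the driver receives a FINISHED status update for attempt `tid`.
    def handleSuccessfulTask(tid: Long): TaskEndReason = {
      val info = taskInfos(tid)
      if (successful(info.index)) {
        // Another attempt of the same partition already succeeded, so treat
        // this attempt as killed instead of recording a second SUCCESS.
        handleFailedTask(tid, TaskKilled("another attempt succeeded"))
      } else {
        successful(info.index) = true
        info.finished = true
        Success
      }
    }

    private def handleFailedTask(tid: Long, reason: TaskEndReason): TaskEndReason = {
      taskInfos(tid).finished = true
      reason
    }
  }
}
```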
    
    **How was this patch tested?**
    I tested this manually by running applications that previously triggered
the issue a few times, and observed that the issue no longer occurs. Also, I
added a unit test in TaskSetManagerSuite to verify that if we call
handleSuccessfulTask to handle status updates for two copies of a task, only
the one handled first is marked as SUCCESS.
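    As a rough illustration of what that test exercises, here is a small check
against the simplified sketch above (the real test drives an actual
TaskSetManager; the names below are from the sketch, not from Spark):

```scala
// Two attempts of the same partition both report FINISHED; only the one
// handled first should end up as SUCCESS.
object SpeculativeTaskSketchCheck {
  import SpeculativeTaskSketch._

  def main(args: Array[String]): Unit = {
    val manager = new TaskSetManagerSketch(numTasks = 1)
    // Two attempts (task ids 1 and 2) of the same partition (index 0).
    manager.addRunningAttempt(tid = 1, index = 0)
    manager.addRunningAttempt(tid = 2, index = 0)

    assert(manager.handleSuccessfulTask(1) == Success)
    assert(manager.handleSuccessfulTask(2) == TaskKilled("another attempt succeeded"))
    println("second attempt treated as killed, not successful")
  }
}
```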


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hthuynh2/spark SPARK_13343

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21653.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21653
    
----
commit 8f7d98177816e11659cf79a2b28f96bd4b7173d5
Author: Hieu Huynh <hieu.huynh@...>
Date:   2018-06-28T04:19:14Z

    Fixed issue and added unit test

----


---
