GitHub user ericl opened a pull request:

    https://github.com/apache/spark/pull/14932

    [SPARK-17371] Resubmitted shuffle outputs can get deleted by zombie map 
tasks

    ## What changes were proposed in this pull request?
    
    It seems that old ("zombie") shuffle map tasks still running after a stage 
resubmit will delete the resubmitted attempt's shuffle output files in stop(), 
causing downstream stages to fail even after the resubmit completes 
successfully. This can happen easily if the prior map task is waiting on a 
network timeout when its stage is resubmitted.
    
    This can cause unnecessary stage resubmits, sometimes multiple times, and 
very confusing FetchFailure messages that report shuffle index files missing 
from the local disk.
    
    Given that IndexShuffleBlockResolver commits data atomically, it seems 
unnecessary to ever delete committed task output: even in the rare case that a 
task fails after it finishes committing shuffle output, it should be safe 
to retain that output.
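
    As a rough illustration of the reasoning above (not Spark's actual code), 
the sketch below models a map task writer that publishes its output with an 
atomic rename and whose stop() only removes its own uncommitted temp file. The 
class `MapOutputWriter`, the function `simulate`, and the file name 
`shuffle_0_0_0.data` are hypothetical names chosen for this example. It also 
simplifies commit to "last rename wins".

```python
import os
import tempfile


class MapOutputWriter:
    """Toy stand-in for a shuffle map task's writer (hypothetical, not a Spark API)."""

    def __init__(self, final_path):
        self.final_path = final_path
        self.tmp_path = None

    def write(self, data):
        # Write to a private temp file in the same directory as the final file,
        # so the later rename stays on one filesystem.
        dir_name = os.path.dirname(self.final_path)
        fd, self.tmp_path = tempfile.mkstemp(dir=dir_name)
        with os.fdopen(fd, "wb") as f:
            f.write(data)

    def commit(self):
        # Publish the output with a single rename; readers see either the old
        # file or the new one, never a partial write.
        os.replace(self.tmp_path, self.final_path)
        self.tmp_path = None

    def stop(self, success):
        # Clean up only this attempt's uncommitted temp file. The committed
        # final_path is never deleted here, even when success is False.
        if self.tmp_path is not None and os.path.exists(self.tmp_path):
            os.remove(self.tmp_path)
        self.tmp_path = None


def simulate(workdir):
    final_path = os.path.join(workdir, "shuffle_0_0_0.data")

    # The old attempt starts writing, then stalls (e.g. on a network timeout).
    zombie = MapOutputWriter(final_path)
    zombie.write(b"old attempt")

    # The stage is resubmitted; the new attempt writes and commits.
    retry = MapOutputWriter(final_path)
    retry.write(b"new attempt")
    retry.commit()

    # The zombie finally fails and is stopped. It must not remove final_path.
    zombie.stop(success=False)

    with open(final_path, "rb") as f:
        return f.read(), sorted(os.listdir(workdir))
```

    Calling `simulate` on an empty scratch directory should leave only the 
committed file in place. If stop() instead deleted `final_path` on failure, the 
zombie would remove the resubmitted attempt's output, which is the failure mode 
described above.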
    
    ## How was this patch tested?
    
    Prior to the fix proposed in https://github.com/apache/spark/pull/14931, I 
was able to reproduce this behavior by killing slaves. After this patch, stages 
were no longer resubmitted multiple times due to shuffle index loss.
    
    cc @JoshRosen @vanzin 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark dont-remove-committed-files

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14932.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14932
    