GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/17543

    [SPARK-20230] FetchFailedExceptions should invalidate file caches in 
MapOutputTracker even if newer stages are launched

    ## What changes were proposed in this pull request?
    
    If you lose instances that have shuffle outputs, you will start observing 
messages like:
    ```
    17/03/24 11:49:23 WARN TaskSetManager: Lost task 0.0 in stage 64.1 (TID 
3849, 172.128.196.240, executor 0): FetchFailed(BlockManagerId(4, 
172.128.200.157, 4048, None), shuffleId=16, mapId=2, reduceId=3, message=
    ```
    Generally, these messages are followed by:
    ```
    17/03/24 11:49:23 INFO DAGScheduler: Executor lost: 4 (epoch 20)
    17/03/24 11:49:23 INFO BlockManagerMasterEndpoint: Trying to remove 
executor 4 from BlockManagerMaster.
    17/03/24 11:49:23 INFO BlockManagerMaster: Removed 4 successfully in 
removeExecutor
    17/03/24 11:49:23 INFO DAGScheduler: Shuffle files lost for executor: 4 
(epoch 20)
    17/03/24 11:49:23 INFO ShuffleMapStage: ShuffleMapStage 63 is now 
unavailable on executor 4 (73/89, false)
    ```
    which is great: Spark resubmits tasks for the data that has been lost. However, if you have cascading instance failures, you may instead come across:
    ```
    17/03/24 11:48:39 INFO DAGScheduler: Ignoring fetch failure from 
ResultTask(64, 46) as it's from ResultStage 64 attempt 0 and there is a more 
recent attempt for that stage (attempt ID 1) running
    ```
    This path does not invalidate the cached map output locations. In later retries of the stage, Spark will attempt to fetch files from machines that no longer exist, and after 4 failed attempts it will give up on the stage. If the fetch failure had not been ignored and the cache had been invalidated, most of the lost files could have been recomputed during one of the earlier retries.
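    The idea above can be sketched roughly as follows. This is a minimal illustrative model, not Spark's actual code: the class and method names (`SimpleMapOutputTracker`, `handleFetchFailure`) are hypothetical, and the real `DAGScheduler`/`MapOutputTracker` interaction is far more involved.

    ```scala
    // Illustrative sketch only: names and structure are assumptions, not Spark internals.
    import scala.collection.mutable

    case class BlockManagerId(executorId: String, host: String)

    class SimpleMapOutputTracker {
      // shuffleId -> (mapId -> location of that map output)
      private val outputs = mutable.Map[Int, mutable.Map[Int, BlockManagerId]]()

      def register(shuffleId: Int, mapId: Int, loc: BlockManagerId): Unit =
        outputs.getOrElseUpdate(shuffleId, mutable.Map()) += (mapId -> loc)

      // Drop every cached output that lives on the failed block manager.
      def unregisterOutputsOn(bmId: BlockManagerId): Unit =
        outputs.values.foreach(_.filterInPlace { case (_, loc) => loc != bmId })

      def locationOf(shuffleId: Int, mapId: Int): Option[BlockManagerId] =
        outputs.get(shuffleId).flatMap(_.get(mapId))
    }

    // Before the fix, a fetch failure reported by a stale stage attempt was
    // dropped entirely. The proposed behavior still skips resubmission for the
    // stale attempt, but invalidates the cached output locations either way,
    // so later retries do not try to fetch from a machine that is gone.
    def handleFetchFailure(tracker: SimpleMapOutputTracker,
                           failedLoc: BlockManagerId,
                           staleAttempt: Boolean): Unit = {
      if (!staleAttempt) {
        // normal path: mark the stage for resubmission (elided here)
      }
      // invalidate the cache regardless of whether the attempt is stale
      tracker.unregisterOutputsOn(failedLoc)
    }
    ```

    For example, registering shuffle 16 / map 2 on a lost executor and then reporting a fetch failure from a stale attempt would still clear that cached location, which is the behavior change this PR proposes.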
    
    ## How was this patch tested?
    
    Will add tests based on feedback

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark SPARK-20230

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17543.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17543
    
----
commit bc80aab919cabe93c59a312e7e72ed1ed453906d
Author: Burak Yavuz <brk...@gmail.com>
Date:   2017-04-05T19:14:22Z

    invalidate files

----

