Burak Yavuz created SPARK-20230:
-----------------------------------

             Summary: FetchFailedExceptions should invalidate file caches in MapOutputTracker even if newer stages are launched
                 Key: SPARK-20230
                 URL: https://issues.apache.org/jira/browse/SPARK-20230
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Burak Yavuz


If you lose instances that have shuffle outputs, you will start observing 
messages like:

{code}
17/03/24 11:49:23 WARN TaskSetManager: Lost task 0.0 in stage 64.1 (TID 3849, 172.128.196.240, executor 0): FetchFailed(BlockManagerId(4, 172.128.200.157, 4048, None), shuffleId=16, mapId=2, reduceId=3, message=
{code}

Generally, these messages are followed by:

{code}
17/03/24 11:49:23 INFO DAGScheduler: Executor lost: 4 (epoch 20)
17/03/24 11:49:23 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/03/24 11:49:23 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
17/03/24 11:49:23 INFO DAGScheduler: Shuffle files lost for executor: 4 (epoch 20)
17/03/24 11:49:23 INFO ShuffleMapStage: ShuffleMapStage 63 is now unavailable on executor 4 (73/89, false)
{code}
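
For context, the invalidation that makes this work looks roughly like the following (a simplified sketch of the Spark 2.1 DAGScheduler's executor-loss handling; method and field names are approximate, not the exact source):

{code}
// Sketch of DAGScheduler's executor-loss path in Spark 2.1 (names
// approximate). When shuffle files are lost, every map output
// registered on the dead executor is dropped from the
// MapOutputTracker so the next stage attempt recomputes it instead
// of trying to fetch it.
private def handleExecutorLost(execId: String, filesLost: Boolean): Unit = {
  logInfo(s"Executor lost: $execId")
  blockManagerMaster.removeExecutor(execId)
  if (filesLost) {
    logInfo(s"Shuffle files lost for executor: $execId")
    for ((shuffleId, mapStage) <- shuffleIdToMapStage) {
      mapStage.removeOutputsOnExecutor(execId)
      mapOutputTracker.registerMapOutputs(
        shuffleId, mapStage.outputLocInMapOutputTrackerFormat(), changeEpoch = true)
    }
  }
}
{code}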

This is exactly what we want: Spark invalidates the lost shuffle outputs and resubmits tasks for the data that has been lost. However, if you have cascading instance failures, you may instead come across:

{code}
17/03/24 11:48:39 INFO DAGScheduler: Ignoring fetch failure from ResultTask(64, 46) as it's from ResultStage 64 attempt 0 and there is a more recent attempt for that stage (attempt ID 1) running
{code}

These ignored fetch failures don't invalidate the cached map outputs. On later retries of the stage, Spark attempts to fetch shuffle files from machines that no longer exist, and after 4 failed attempts it gives up on the stage entirely. Had the fetch failure not been ignored, and the cache been invalidated, most of the lost outputs could have been recomputed during one of the earlier retries.
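
Concretely, here is a minimal sketch of the FetchFailed branch of DAGScheduler.handleTaskCompletion (Spark 2.1 shape, names approximate), with the change this issue asks for marked in comments:

{code}
// Sketch of the FetchFailed branch in DAGScheduler.handleTaskCompletion
// (Spark 2.1 shape, names approximate).
case FetchFailed(bmAddress, shuffleId, mapId, _, _) =>
  val failedStage = stageIdToStage(task.stageId)
  val mapStage = shuffleIdToMapStage(shuffleId)
  if (failedStage.latestInfo.attemptId != task.stageAttemptId) {
    // Today: the failure is ignored entirely because a newer attempt
    // of the stage is already running.
    logInfo(s"Ignoring fetch failure from $task as it's from $failedStage " +
      s"attempt ${task.stageAttemptId} and there is a more recent attempt " +
      s"for that stage (attempt ID ${failedStage.latestInfo.attemptId}) running")
    // Proposed: even here, invalidate the output that could not be
    // fetched, so the running attempt recomputes it rather than
    // retrying the fetch against a machine that no longer exists:
    //   mapStage.removeOutputLoc(mapId, bmAddress)
    //   mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
  } else {
    // Existing path: mark the map output as lost and resubmit stages.
    if (mapId != -1) {
      mapStage.removeOutputLoc(mapId, bmAddress)
      mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    }
    if (bmAddress != null) {
      handleExecutorLost(bmAddress.executorId, filesLost = true)
    }
  }
{code}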


