Thomas Graves created SPARK-24909:
-------------------------------------

             Summary: Spark scheduler can hang with fetch failures and executor 
lost and multiple stage attempts
                 Key: SPARK-24909
                 URL: https://issues.apache.org/jira/browse/SPARK-24909
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 2.3.1
            Reporter: Thomas Graves


The DAGScheduler can hang if an executor is lost (due to a fetch failure) and 
all the tasks in the task sets are marked as completed. 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)

The task scheduler never creates new task attempts, but the DAGScheduler 
still has pendingPartitions.

18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 
44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 
bytes)

18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
(repartition at Lift.scala:191) as failed due to a fetch failure from 
ShuffleMapStage 42 (map at foo.scala:27)
18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 
(map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due 
to fetch failure
....

18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
18/07/22 08:30:56 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 
33 (epoch 18)

18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
(MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
parents
18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 
59955 tasks

18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 
44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)


18/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
ShuffleMapTask(44, 55769) completion from executor 33

 

In the logs above, task 55769.0 finished after the executor was lost and after 
a new task set (44.1) was started. The DAGScheduler logs "Ignoring possibly 
bogus ... completion" and leaves the partition pending, but the TaskSetManager 
side has already marked those tasks as completed for all stage attempts, so 
they are never rerun.
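
To make the mismatch concrete, here is a minimal toy model of the two views of 
the same partition. It is only a sketch, not Spark code: ToyTaskSet, 
ToyDagScheduler, finish_task and is_hung are hypothetical names invented for 
illustration, and the real scheduler logic (epochs, output locations, 
speculation) is left out. It assumes only what the logs show: a late success 
from a lost executor is rejected on the DAG side while every stage attempt 
records the partition as done.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTaskSet:
    """Hypothetical stand-in for one stage attempt's task set."""
    attempt: int
    partitions: set
    completed: set = field(default_factory=set)

    def runnable(self):
        # Partitions this attempt would still launch tasks for.
        return self.partitions - self.completed


@dataclass
class ToyDagScheduler:
    """Hypothetical stand-in for the DAG scheduler's view of one stage."""
    pending_partitions: set
    lost_executors: set = field(default_factory=set)

    def on_task_success(self, partition, executor):
        if executor in self.lost_executors:
            # Corresponds to "Ignoring possibly bogus ... completion":
            # the result is not trusted, so the partition stays pending.
            return False
        self.pending_partitions.discard(partition)
        return True


def finish_task(task_sets, dag, partition, executor):
    # Task-scheduler side: mark the partition completed in every attempt.
    for ts in task_sets:
        if partition in ts.partitions:
            ts.completed.add(partition)
    # DAG-scheduler side: may reject the very same completion.
    return dag.on_task_success(partition, executor)


def is_hung(task_sets, dag):
    # Hung: the DAG side still waits on partitions that no attempt will run.
    runnable = set().union(*(ts.runnable() for ts in task_sets))
    return bool(dag.pending_partitions - runnable)


# Replay of the sequence in the logs, reduced to one partition.
dag = ToyDagScheduler(pending_partitions={55769})
attempt0 = ToyTaskSet(attempt=0, partitions={55769})  # stage 44.0
dag.lost_executors.add("33")                          # "Executor lost: 33"
attempt1 = ToyTaskSet(attempt=1, partitions={55769})  # task set 44.1 added
accepted = finish_task([attempt0, attempt1], dag, 55769, "33")
print(accepted, attempt1.runnable(), is_hung([attempt0, attempt1], dag))
```

In this model the late completion is rejected, yet attempt 44.1 has nothing 
left to run, which is the state described above: pendingPartitions is 
non-empty and no new task attempts are created.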

 



