[
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Graves updated SPARK-24909:
--
Description:
The DAGScheduler can hang if an executor is lost (due to fetch failure) and
all the tasks in the task sets are marked as completed
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265).
It never creates new task attempts in the task scheduler, but the DAGScheduler
still has pendingPartitions.
18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage
44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874
bytes)
18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44
(repartition at bar.scala:191) as failed due to a fetch failure from
ShuffleMapStage 42 (map at foo.scala:27)
18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42
(map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due
to fetch failure
18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
18/07/22 08:30:56 INFO scheduler.DAGScheduler: Shuffle files lost for executor:
33 (epoch 18)
18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44
(MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing
parents
18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with
59955 tasks
18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage
44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
18/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus
ShuffleMapTask(44, 55769) completion from executor 33
In the logs above you can see that task 55769.0 finished after the executor
was lost and a new task set (44.1) had been started. The DAGScheduler logs
"Ignoring possibly bogus", but on the TaskSetManager side those tasks have
been marked as completed for all stage attempts, so no new attempt is ever
launched. The DAGScheduler then hangs: a heap dump of the process shows that
55769 is still in the DAGScheduler's pendingPartitions list while the
TaskSetManagers are all complete.
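The mismatch described above can be sketched in a few lines of Scala. This is a
hypothetical, heavily simplified model (HangSketch, taskCompleted, and the epoch
check are invented for illustration, not Spark's actual classes): the
TaskSetManager side marks a completed partition successful in every stage
attempt, while the DAGScheduler side drops the completion because it came from
an executor lost in an earlier epoch, so pendingPartitions never drains.

```scala
// Hypothetical model of the hang; names and logic are simplified
// illustrations, not the real DAGScheduler/TaskSetManager code.
object HangSketch {
  // DAGScheduler side: partitions the scheduler is still waiting on.
  var pendingPartitions: Set[Int] = Set(55769)

  // TaskSetManager side: per-stage-attempt sets of successful partitions
  // (attempt 0 is the original task set 44.0, attempt 1 is 44.1).
  val successfulByAttempt: Map[Int, collection.mutable.Set[Int]] =
    Map(0 -> collection.mutable.Set.empty[Int],
        1 -> collection.mutable.Set.empty[Int])

  val currentEpoch = 18 // executor 33 was lost at epoch 18

  def taskCompleted(partition: Int, taskEpoch: Int): Unit = {
    // The TaskSetManager marks the partition done in ALL stage attempts...
    successfulByAttempt.values.foreach(_ += partition)
    // ...but the DAGScheduler ignores the "possibly bogus" completion,
    // because it came from an executor lost in an earlier epoch.
    if (taskEpoch >= currentEpoch) pendingPartitions -= partition
  }

  def main(args: Array[String]): Unit = {
    // Task 55769.0 finishes on the already-lost executor (stale epoch 17):
    taskCompleted(55769, taskEpoch = 17)
    // Every attempt now reports the partition complete, so no retry is
    // scheduled, yet the DAGScheduler still waits on it: the hang.
    println(s"all attempts done: " +
      s"${successfulByAttempt.values.forall(_.contains(55769))}")
    println(s"pendingPartitions: $pendingPartitions")
  }
}
```

The sketch shows why neither side makes progress: the TaskSetManagers have
nothing left to run, and the DAGScheduler never receives a completion it is
willing to count.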
> Spark scheduler can hang with fetch failures and executor lost and multiple
> stage attempts
> --
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.3.1
> Reporter: Thomas Graves
> Priority: Critical
>