[ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
-------------------------------

    Summary: DAGScheduler resubmit the stage into an infinite loop  (was: 
DAGScheduler resubmit the task into an infinite loop)

> DAGScheduler resubmit the stage into an infinite loop
> -----------------------------------------------------
>
>                 Key: SPARK-2947
>                 URL: https://issues.apache.org/jira/browse/SPARK-2947
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0, 1.0.2
>            Reporter: Guoqiang Li
>            Priority: Blocker
>             Fix For: 1.1.0, 1.0.3
>
>
> A stage is resubmitted more than 50000 times.
> This seems to be caused by {{FetchFailed.bmAddress}} being null.
> I don't know how to reproduce it.
> master log:
> {noformat}
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
> TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
> 3060 bytes in 0 ms
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
> TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
> 3060 bytes in 0 ms
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
> 1.189:141)
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
>  ------------------ 50000 times -------------------
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
> 1.189, whose tasks have all completed, from pool 
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 
> ms on jilin (progress: 280/280)
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 
> 269)
> 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
> 2.1, whose tasks have all completed, from pool 
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
> DealCF.scala:207) finished in 129.544 s
> {noformat}
> worker log:
> {noformat}
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
> computing it
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
> computing it
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18017
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18151
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
> computing it
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
> computing it
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18285
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18419
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
> computing it
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
> computing it
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18535
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 18535
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18669
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 18669
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_68 not found, 
> computing it
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_202 not found, 
> computing it
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18787
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 18787
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18921
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 18921
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_52 not found, 
> computing it
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_186 not found, 
> computing it
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 19012
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 19012
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 19146
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 19146
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_9 not found, 
> computing it
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_143 not found, 
> computing it
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 19351
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 19351
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 19484
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 19484
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_80 not found, 
> computing it
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_213 not found, 
> computing it
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 19668
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 19668
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 19801
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 19801
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_128 not found, 
> computing it
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_261 not found, 
> computing it
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 19826
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 19826
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 19958
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 19958
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_17 not found, 
> computing it
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_149 not found, 
> computing it
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 20129
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 20129
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 20262
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 20262
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_184 not found, 
> computing it
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_51 not found, 
> computing it
> 14/08/09 21:49:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 20386
> 14/08/09 21:49:44 INFO executor.Executor: Running task ID 20386
> 14/08/09 21:49:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 20520
> 14/08/09 21:49:44 INFO executor.Executor: Running task ID 20520
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:44 INFO spark.CacheManager: Partition rdd_23_173 not found, 
> computing it
> 14/08/09 21:49:44 INFO spark.CacheManager: Partition rdd_23_39 not found, 
> computing it
> 14/08/09 21:49:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 20618
> 14/08/09 21:49:44 INFO executor.Executor: Running task ID 20618
> 14/08/09 21:49:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 20752
> 14/08/09 21:49:44 INFO executor.Executor: Running task ID 20752
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:44 INFO spark.CacheManager: Partition rdd_23_135 not found, 
> computing it
> {noformat}
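
To check whether a driver log shows the same resubmission loop, one option is to count the DAGScheduler "Marking Stage N ... for resubmision" lines per stage. The following is only a minimal sketch, not part of Spark: the function name count_resubmissions and the sample text are illustrative assumptions, and the pattern is taken from the message format in the master log quoted above (including the log's own spelling "resubmision").

```python
import re
from collections import Counter

# Pattern based on the DAGScheduler message in the master log above.
# "resubmi\w*" accepts both the log's spelling "resubmision" and "resubmission".
RESUBMIT_RE = re.compile(
    r"Marking Stage (\d+) \(.*?\) for resubmi\w* due to a fetch failure"
)

def count_resubmissions(log_text):
    """Return a Counter mapping stage id -> number of resubmission marks."""
    # Log lines in the email are wrapped (and may carry a "> " quote prefix),
    # so join them into one line before matching.
    joined = re.sub(r"\s*\n>?\s*", " ", log_text)
    return Counter(int(m.group(1)) for m in RESUBMIT_RE.finditer(joined))

# Hypothetical sample, copied from the excerpt above.
sample = """\
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from
Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at
DealCF.scala:215) for resubmision due to a fetch failure
"""

if __name__ == "__main__":
    for stage, n in count_resubmissions(sample).items():
        print(f"Stage {stage}: marked for resubmission {n} time(s)")
```

A count in the tens of thousands for a single stage within the same second, as in the log above, would point to this loop rather than to ordinary fetch-failure retries.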



--
This message was sent by Atlassian JIRA
(v6.2#6252)

