[ https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guoqiang Li updated SPARK-2947:
-------------------------------
    Summary: DAGScheduler resubmit the task into an infinite loop (was: DAGScheduler scheduling infinite loop)

> DAGScheduler resubmit the task into an infinite loop
> ----------------------------------------------------
>
> Key: SPARK-2947
> URL: https://issues.apache.org/jira/browse/SPARK-2947
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0, 1.0.2
> Reporter: Guoqiang Li
> Priority: Blocker
> Fix For: 1.1.0, 1.0.3
>
>
> A stage was resubmitted more than 50000 times.
> This seems to be caused by {{FetchFailed.bmAddress}} being null.
> I don't know how to reproduce it.
> master log:
> {noformat}
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 3060 bytes in 0 ms
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 3060 bytes in 0 ms
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141)
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> ------------------ 50000 times -------------------
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 1.189, whose tasks have all completed, from pool
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 ms on jilin (progress: 280/280)
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 269)
> 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 2.1, whose tasks have all completed, from pool
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at DealCF.scala:207) finished in 129.544 s
> {noformat}
> worker log:
> {noformat}
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, computing it
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, computing it
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18017
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18151
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, computing it
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, computing it
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18285
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18419
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, computing it
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, computing it
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18535
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 18535
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18669
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 18669
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_68 not found, computing it
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_202 not found, computing it
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18787
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 18787
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18921
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 18921
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_52 not found, computing it
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_186 not found, computing it
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 19012
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 19012
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 19146
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 19146
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_9 not found, computing it
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_143 not found, computing it
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 19351
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 19351
> 14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 19484
> 14/08/09 21:49:42 INFO executor.Executor: Running task ID 19484
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_80 not found, computing it
> 14/08/09 21:49:42 INFO spark.CacheManager: Partition rdd_23_213 not found, computing it
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 19668
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 19668
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 19801
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 19801
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_128 not found, computing it
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_261 not found, computing it
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 19826
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 19826
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 19958
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 19958
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_17 not found, computing it
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_149 not found, computing it
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 20129
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 20129
> 14/08/09 21:49:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 20262
> 14/08/09 21:49:43 INFO executor.Executor: Running task ID 20262
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:43 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_184 not found, computing it
> 14/08/09 21:49:43 INFO spark.CacheManager: Partition rdd_23_51 not found, computing it
> 14/08/09 21:49:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 20386
> 14/08/09 21:49:44 INFO executor.Executor: Running task ID 20386
> 14/08/09 21:49:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 20520
> 14/08/09 21:49:44 INFO executor.Executor: Running task ID 20520
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:44 INFO spark.CacheManager: Partition rdd_23_173 not found, computing it
> 14/08/09 21:49:44 INFO spark.CacheManager: Partition rdd_23_39 not found, computing it
> 14/08/09 21:49:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 20618
> 14/08/09 21:49:44 INFO executor.Executor: Running task ID 20618
> 14/08/09 21:49:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 20752
> 14/08/09 21:49:44 INFO executor.Executor: Running task ID 20752
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:44 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:44 INFO spark.CacheManager: Partition rdd_23_135 not found, computing it
> {noformat}
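
One way to check for the resubmission loop described in the report is to count the DAGScheduler resubmission messages per stage in the raw master log. The following is a minimal Python sketch and not part of the original report: the function names are hypothetical, the patterns are taken from the log messages quoted in the issue (including the "resubmision" spelling), and it assumes each log entry sits on a single line, as in the unwrapped log file.

```python
import re
from collections import Counter

# Pattern copied from the master log quoted in the issue.
# "resubmision" is spelled exactly as it appears in that log.
RESUBMIT_RE = re.compile(
    r"scheduler\.DAGScheduler: Marking Stage (\d+) \(.*?\) for resubmision"
)


def count_resubmissions(lines):
    """Return a Counter mapping stage id -> number of resubmission messages."""
    counts = Counter()
    for line in lines:
        match = RESUBMIT_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def count_null_fetch_failures(lines):
    """Count TaskSetManager warnings reporting a fetch failure from a null address."""
    return sum(1 for line in lines if "fetch failure from null" in line)
```

To use it, pass any iterable of log lines, for example an open file handle on the master log: `count_resubmissions(open(path))`, where `path` is wherever that log is stored. A stage whose count is far above the number of its tasks would be consistent with the behaviour reported here.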