[ https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guoqiang Li updated SPARK-2947:
-------------------------------
    Description:
A stage is resubmitted more than 50000 times.
This seems to be caused by {{FetchFailed.bmAddress}} being null.
I don't know how to reproduce it.

log:
{noformat}
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 3060 bytes in 0 ms
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 3060 bytes in 0 ms
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141)
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
------------------ 50000 times -------------------
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 1.189, whose tasks have all completed, from pool
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 ms on jilin (progress: 280/280)
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 269)
14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 2.1, whose tasks have all completed, from pool
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at DealCF.scala:207) finished in 129.544 s
{noformat}

  was:
A stage is resubmitted more than 50000 times.
This seems to be caused by {{FetchFailed.bmAddress}} being null.
I don't know how to reproduce it.

log:
{noformat}
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 3060 bytes in 0 ms
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 3060 bytes in 0 ms
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141)
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
{noformat}

> DAGScheduler scheduling dead cycle
> ----------------------------------
>
>                 Key: SPARK-2947
>                 URL: https://issues.apache.org/jira/browse/SPARK-2947
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0, 1.0.2
>            Reporter: Guoqiang Li
>            Priority: Blocker
>             Fix For: 1.1.0, 1.0.3
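
To gauge how often a stage is being marked for resubmission in a driver log like the one above, the messages can be tallied per stage. The following is only a minimal sketch in Python, not part of the report: the function names and the `sample` list are hypothetical, and the patterns simply mirror the message text shown in the log excerpt (including its "resubmision" spelling).

```python
import re
from collections import Counter

# Pattern copied from the DAGScheduler message in the log excerpt;
# "resubmision" is spelled the way the log prints it.
RESUBMIT = re.compile(
    r"Marking Stage (\d+) \(.*?\) for resubmision due to a fetch failure"
)
# Message printed by TaskSetManager when the fetch-failure source is null.
NULL_FETCH = "Loss was due to fetch failure from null"


def count_resubmissions(lines):
    """Return a Counter mapping stage id -> number of resubmission marks."""
    counts = Counter()
    for line in lines:
        for match in RESUBMIT.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


def count_null_fetch_failures(lines):
    """Return how many times a fetch failure from a null address was logged."""
    return sum(line.count(NULL_FETCH) for line in lines)


# Hypothetical sample built from lines that appear in the log excerpt.
sample = [
    "14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null",
    "14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure",
    "14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission",
    "14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure",
]

resubmits = count_resubmissions(sample)
null_failures = count_null_fetch_failures(sample)
print(resubmits, null_failures)
```

For a real log, the same functions could be fed an open file object instead of `sample`, since they only iterate over lines.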