[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-14485.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-14485
>                 URL: https://issues.apache.org/jira/browse/SPARK-14485
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1, 1.5.2
>            Reporter: iward
>             Fix For: 2.0.0
>
>
> Now, when executor is removed by driver with heartbeats timeout, driver will 
> re-queue the task on this executor and send a kill command to cluster to kill 
> this executor.
> But, in a situation, the running task of this executor is finished and return 
> result to driver before this executor killed by kill command sent by driver. 
> At this situation, driver will accept the task finished event and ignore  
> speculative task and re-queued task. But, as we know, this executor has 
> removed by driver, the result of this finished task can not save in driver 
> because the *BlockManagerId* has also removed from *BlockManagerMaster* by 
> driver. So, the result data of this stage is not complete, and then, it will 
> cause fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 250000 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> Driver will check the stage's result is not complete, and submit missing 
> task, but this time, the next stage has run because previous stage has finish 
> for its task is all finished although its result is not complete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:172)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.shuffle.sort.SortShuffleReader.computeShuffleBlocks(SortShuffleReader.scala:301)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.shuffle.sort.SortShuffleReader.read(SortShuffleReader.scala:111)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.shuffle.sort.MixedShuffleReader.read(MixedShuffleReader.scala:41)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 2015-12-31 04:40:13 INFO at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
> 2015-12-31 04:40:13 INFO at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 2015-12-31 04:40:13 INFO at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 2015-12-31 04:40:13 INFO at java.lang.Thread.run(Thread.java:745)
> 2015-12-31 04:40:13 INFO 
> {noformat}
> As the task log show, in this situation, it will casue FetchFailedException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to