[jira] [Updated] (SPARK-4609) Job can not finish if there is one bad slave in clusters

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4609:
-
Target Version/s:   (was: 1.3.0)

 Job can not finish if there is one bad slave in clusters
 

 Key: SPARK-4609
 URL: https://issues.apache.org/jira/browse/SPARK-4609
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu

 If there is one bad machine in the cluster, its executors will keep dying (for 
 example, because the disk is out of space). A task may be scheduled to that 
 machine multiple times, and the job then fails after several failures of a 
 single task.
 {code}
 14/11/26 00:34:57 INFO TaskSetManager: Starting task 39.0 in stage 3.0 (TID 
 1255, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:34:57 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 1255, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 60 
 lost)
 14/11/26 00:35:02 INFO TaskSetManager: Starting task 39.1 in stage 3.0 (TID 
 1256, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:03 WARN TaskSetManager: Lost task 39.1 in stage 3.0 (TID 1256, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 61 
 lost)
 14/11/26 00:35:08 INFO TaskSetManager: Starting task 39.2 in stage 3.0 (TID 
 1257, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:08 WARN TaskSetManager: Lost task 39.2 in stage 3.0 (TID 1257, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 62 
 lost)
 14/11/26 00:35:13 INFO TaskSetManager: Starting task 39.3 in stage 3.0 (TID 
 1258, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:14 WARN TaskSetManager: Lost task 39.3 in stage 3.0 (TID 1258, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 
 lost)
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 39 in 
 stage 3.0 failed 4 times, most recent failure: Lost task 39.3 in stage 3.0 
 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure 
 (executor 63 lost)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1207)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1196)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1195)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1195)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1413)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
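 The abort above is governed by the task-failure limit (spark.task.maxFailures, 
 default 4): all four attempts of task 39 land on the same bad host, so the 
 stage is aborted. As a rough workaround sketch, raising the limit only delays 
 the abort, since nothing forces retries onto a different host:
 {code}
 import org.apache.spark.SparkConf
 
 // Workaround sketch only: allow more retries per task. This gives no
 // guarantee of host diversity -- every retry can still hit the bad machine.
 val conf = new SparkConf()
   .setAppName("example")
   .set("spark.task.maxFailures", "8")
 {code}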
 A task should not be scheduled to the same machine more than once. Also, if a 
 machine loses an executor, it should be blacklisted for some time and only 
 tried again later.
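 A minimal sketch of that blacklisting idea (hypothetical helper, not an 
 existing Spark API): record the time of each executor loss per host, skip the 
 host while the entry is fresh, and let the entry expire so the host is retried:
 {code}
 import scala.collection.mutable
 
 // Hypothetical time-based host blacklist; the class name and timeoutMs
 // parameter are illustrative, not part of Spark.
 class HostBlacklist(timeoutMs: Long) {
   private val expiry = mutable.Map.empty[String, Long]
 
   // Record an executor loss on `host`: exclude it until now + timeoutMs.
   def executorLost(host: String): Unit =
     expiry(host) = System.currentTimeMillis() + timeoutMs
 
   // The host becomes schedulable again once its entry has expired.
   def isBlacklisted(host: String): Boolean =
     expiry.get(host).exists(_ > System.currentTimeMillis())
 }
 {code}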
 cc [~kayousterhout] [~matei]




