We have an instance of Spark running on top of Mesos and GlusterFS. Because the latest releases fix some bugs we had also run into, we installed the current versions: Spark 1.0.0-rc9 (spark-1.0.0-bin-2.0.5-alpha, Java 1.6.0_27) and Mesos 0.18.1. Since then, moderate-sized tasks (10-20 GB) cannot complete.
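For context, the failing jobs boil down to reading a dataset off the GlusterFS mount and counting it. A minimal sketch of that shape (the app name and path are placeholders, not our actual code):

    from pyspark import SparkConf, SparkContext

    # Rough shape of the job: read the input off the GlusterFS mount and
    # count the records. count() is the only action we invoke.
    conf = SparkConf().setAppName("gluster-count")        # placeholder app name
    sc = SparkContext(conf=conf)

    records = sc.textFile("file:///mnt/gluster/dataset")  # placeholder path
    print(records.count())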
I notice on the Mesos UI that, for a failed task and the consequently killed context, many tasks appear to keep on running; 20 minutes later some (data-reading) tasks are still in the 'Running' state. The specific error I get is the following (note: we never call 'collect' explicitly, only count(), but the execution has definitely not reached that point yet):

    An error occurred while calling o127.collect.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 4.0:212 failed 4 times, most recent failure: TID 594 on host slave3.domain.com failed for unknown reason

At the same time, task 4.0:212 is still shown as running on at least one other slave:

    537  task 4.0:212  RUNNING  2014-05-21T17:12:19+0300  2014-05-21T17:12:19+0300  slave1.domain.com  Sandbox

yet that executor's stderr shows it actually finished and sent its result to the driver:

    14/05/21 17:12:19 INFO Executor: Running task ID 537
    [...]
    14/05/21 17:12:24 INFO Executor: Serialized size of result for 537 is 1100
    14/05/21 17:12:24 INFO Executor: Sending result for 537 directly to driver
    14/05/21 17:12:24 INFO Executor: Finished task ID 537

At the same time, INFO messages like this one appear:

    14/05/22 15:06:40 INFO TaskSetManager: Ignorning task-finished event for TID 621 because task 157 has already completed successfully

Additional errors include:

    14/05/22 15:06:37 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask completion from 20140516-155535-170164746-5050-22001-5

and, more importantly:

    W0522 15:06:33.621423 12899 sched.cpp:901] Attempting to launch task 559 with an unknown offer 20140516-155535-170164746-5050-22001-114535

My assumption based on the above is that certain tasks complete but some part of the system is never notified of it, so the task gets rescheduled and, after 4 attempts, the context exits.

Thank you in advance!
-Orestis
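P.S. The "failed 4 times" above presumably corresponds to spark.task.maxFailures, which defaults to 4. Purely as a debugging aid (and assuming the property is honoured the same way under Mesos fine-grained mode), here is a sketch of how we could raise the limit so the context survives a few more retries while we gather logs:

    from pyspark import SparkConf, SparkContext

    # Delay the stage abort while debugging: spark.task.maxFailures defaults
    # to 4, matching the "failed 4 times" abort above. Raising it does not
    # fix the lost task-finished notifications, it only postpones the abort.
    conf = (SparkConf()
            .setAppName("gluster-count-debug")             # placeholder app name
            .setMaster("mesos://master.domain.com:5050")   # placeholder Mesos master
            .set("spark.task.maxFailures", "8"))

    sc = SparkContext(conf=conf)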