We have an instance of Spark running on top of Mesos and GlusterFS. To pick up 
fixes for some bugs that we had also run into, we installed the latest versions: 
Spark 1.0.0-rc9 (spark-1.0.0-bin-2.0.5-alpha, Java 1.6.0_27) and Mesos 0.18.1. 
Since then, moderately sized tasks (10-20GB) cannot complete.

I notice on the Mesos UI that after a task fails and the context is 
consequently killed, many tasks appear to keep running. 20 minutes later, some 
(data reading) tasks are still in the 'Running' state.

Furthermore, regarding the specific error I get (note that we never call 
'collect' explicitly, only count(), but the execution has definitely not reached 
that point yet):
: An error occurred while calling o127.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
4.0:212 failed 4 times, most recent failure: TID 594 on host slave3.domain.com 
failed for unknown reason
I notice that task 4.0:212 is still running on at least one other slave:
537  task 4.0:212  RUNNING  2014-05-21T17:12:19+0300  2014-05-21T17:12:19+0300  slave1.domain.com  Sandbox

And also from stderr:
14/05/21 17:12:19 INFO Executor: Running task ID 537
[...]
14/05/21 17:12:24 INFO Executor: Serialized size of result for 537 is 1100
14/05/21 17:12:24 INFO Executor: Sending result for 537 directly to driver
14/05/21 17:12:24 INFO Executor: Finished task ID 537

At the same time, INFO messages like this one appear:
14/05/22 15:06:40 INFO TaskSetManager: Ignorning task-finished event for TID 
621 because task 157 has already completed successfully

Additional errors include:
14/05/22 15:06:37 INFO DAGScheduler: Ignoring possibly bogus ShuffleMapTask 
completion from 20140516-155535-170164746-5050-22001-5

and more importantly:
W0522 15:06:33.621423 12899 sched.cpp:901] Attempting to launch task 559 with 
an unknown offer 20140516-155535-170164746-5050-22001-114535


My assumption, based on the above, is that certain tasks complete but some part 
of the system is not notified about it. The task then gets rescheduled, and 
after 4 failed attempts the context exits.
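
One rough way to test this assumption would be to cross-check the logs: collect 
the task IDs that an executor reports as finished and compare them with the TIDs 
the driver reports as failed. The Python sketch below assumes the executor 
stderr and the driver output are available as lists of lines and that they use 
the same message formats as the excerpts quoted above. The function names are 
made up for illustration and are not part of Spark or Mesos.

```python
import re

# Patterns based on the log excerpts quoted in this message (assumed formats).
FINISHED_RE = re.compile(r"INFO Executor: Finished task ID (\d+)")
FAILED_RE = re.compile(r"most recent failure: TID (\d+)")


def finished_tids(executor_lines):
    """Return the set of task IDs an executor logged as finished."""
    tids = set()
    for line in executor_lines:
        match = FINISHED_RE.search(line)
        if match:
            tids.add(int(match.group(1)))
    return tids


def failed_tids(driver_lines):
    """Return the set of TIDs the driver reported as the most recent failure."""
    tids = set()
    for line in driver_lines:
        match = FAILED_RE.search(line)
        if match:
            tids.add(int(match.group(1)))
    return tids


def finished_but_failed(executor_lines, driver_lines):
    """TIDs that an executor finished but the driver still counted as failed."""
    return sorted(finished_tids(executor_lines) & failed_tids(driver_lines))
```

Any TID returned by finished_but_failed() would be a task whose completion was 
logged on the slave but apparently never reached the driver, which is what the 
assumption predicts. An empty result would not disprove it, since the executor 
logs from all slaves would need to be included.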

Thank you in advance!
-Orestis
