Hello all, I have been facing a weird issue for the last couple of days running Spark on top of Mesos and I need your help. I am running Mesos in a private cluster and have managed to deploy HDFS, Cassandra, Marathon and Play successfully, but Spark is not working for some reason. So far I have tried: different Java versions (1.6 and 1.7, Oracle and OpenJDK), different spark-env configurations, different Spark versions (from 0.8.8 to 1.3.1), different HDFS versions (Hadoop 5.1 and 4.6), and updating the pom dependencies.
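For completeness, the setup follows the standard Spark-on-Mesos instructions; a rough sketch of the configuration and the submit command is below (hostnames are masked, and the libmesos path and the SparkPi argument are placeholders rather than my exact values):

-------------------------------------------------------------------
# conf/spark-env.sh on the driver machine (paths are placeholders)
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://XXXXXXXX:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz

# submitting the SparkPi example against the Mesos master
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://XXXXXXXX:5050 \
  lib/spark-examples-*.jar 100
-------------------------------------------------------------------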
More specifically, while local-mode jobs complete fine, in cluster mode all tasks get lost (with both spark-shell and spark-submit).

From the worker log I see something like this:

-------------------------------------------------------------------
I0519 02:36:30.475064 12863 fetcher.cpp:214] Fetching URI 'hdfs:/XXXXXXXX:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:30.747372 12863 fetcher.cpp:99] Fetching URI 'hdfs://XXXXXXXXX:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' using Hadoop Client
I0519 02:36:30.747546 12863 fetcher.cpp:109] Downloading resource from 'hdfs://XXXXXXXX:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' to '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:34.205878 12863 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' into '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3'
*Error: Could not find or load main class two*
-------------------------------------------------------------------

And from the Spark terminal:

-------------------------------------------------------------------
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
15/05/19 02:36:39 INFO scheduler.DAGScheduler: Failed to run reduce at SparkPi.scala:35
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 26, XXXXXXXX): ExecutorLostFailure (executor lost)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    ......
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
-------------------------------------------------------------------

Any help will be greatly appreciated!

Regards,
Panagiotis