Hello all, I have been facing a weird issue for the last couple of days running Spark on top of Mesos and I need your help. I am running Mesos in a private cluster and have managed to deploy HDFS, Cassandra, Marathon and Play successfully, but Spark is not working for some reason. So far I have tried: different Java versions (1.6 and 1.7, Oracle and OpenJDK), different spark-env configurations, different Spark versions (from 0.8.8 to 1.3.1), different HDFS versions (Hadoop 5.1 and 4.6), and updating the pom dependencies.
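For completeness, the setup follows the standard Spark-on-Mesos instructions; a rough sketch of the configuration and the submit command is below (hostnames are masked, and the libmesos path and the SparkPi argument are placeholders rather than my exact values):

-------------------------------------------------------------------
# conf/spark-env.sh on the driver machine (paths are placeholders)
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://XXXXXXXX:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz

# submitting the SparkPi example against the Mesos master
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://XXXXXXXX:5050 \
  lib/spark-examples-*.jar 100
-------------------------------------------------------------------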
More specifically, while local-mode jobs complete fine, in cluster mode all tasks get lost (with both spark-shell and spark-submit).

From the worker log I see something like this:

-------------------------------------------------------------------
I0519 02:36:30.475064 12863 fetcher.cpp:214] Fetching URI 'hdfs:/XXXXXXXX:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:30.747372 12863 fetcher.cpp:99] Fetching URI 'hdfs://XXXXXXXXX:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' using Hadoop Client
I0519 02:36:30.747546 12863 fetcher.cpp:109] Downloading resource from 'hdfs://XXXXXXXX:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' to '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:34.205878 12863 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' into '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3'
*Error: Could not find or load main class two*
-------------------------------------------------------------------

And from the Spark terminal:

-------------------------------------------------------------------
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
15/05/19 02:36:39 INFO scheduler.DAGScheduler: Failed to run reduce at SparkPi.scala:35
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 26, XXXXXXXX): ExecutorLostFailure (executor lost)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    ......
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
-------------------------------------------------------------------

Any help will be greatly appreciated!

Regards,
Panagiotis