Re: Mesos Spark Tasks - Lost
Can you share your exact spark-submit command line? Also, cluster mode is not released yet (it is coming in 1.4) and does not support spark-shell, so unless you are running the latest master I think you are just using client mode.

Tim

On Tue, May 19, 2015 at 8:57 AM, Panagiotis Garefalakis panga...@gmail.com wrote:

Hello all,

I have been facing a weird issue for the last couple of days running Spark on top of Mesos, and I need your help. I am running Mesos in a private cluster and have successfully deployed HDFS, Cassandra, Marathon, and Play, but Spark is not working for some reason. So far I have tried: different Java versions (1.6 and 1.7, Oracle and OpenJDK), different spark-env configurations, different Spark versions (from 0.8.8 to 1.3.1), different HDFS versions (Hadoop 5.1 and 4.6), and updating the pom dependencies. More specifically, while local tasks complete fine, in cluster mode all the tasks get lost (using both spark-shell and spark-submit).

From the worker log I see something like this:

---
I0519 02:36:30.475064 12863 fetcher.cpp:214] Fetching URI 'hdfs:/:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:30.747372 12863 fetcher.cpp:99] Fetching URI 'hdfs://X:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' using Hadoop Client
I0519 02:36:30.747546 12863 fetcher.cpp:109] Downloading resource from 'hdfs://:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' to '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:34.205878 12863 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' into '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3'
*Error: Could not find or load main class two*
---

And from the Spark terminal:

---
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
15/05/19 02:36:39 INFO scheduler.DAGScheduler: Failed to run reduce at SparkPi.scala:35
15/05/19 02:36:39 INFO scheduler.DAGScheduler: Failed to run reduce at SparkPi.scala:35
Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 26, ): ExecutorLostFailure (executor lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
..
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
---

Any help will be greatly appreciated!

Regards,
Panagiotis
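For context, a typical client-mode submission against a Mesos master in the Spark 1.x era looked roughly like the sketch below. The master address, executor URI, and jar path are placeholders for illustration, not values from this thread:

```shell
# Hypothetical client-mode submission to a Mesos master (host names and paths
# are placeholders). spark.executor.uri tells each Mesos agent where to fetch
# the Spark distribution that will run the executors.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://mesos-master.example.com:5050 \
  --conf spark.executor.uri=hdfs://namenode.example.com:8020/spark-1.3.1-bin-hadoop2.4.tgz \
  lib/spark-examples-1.3.1-hadoop2.4.0.jar 100
```

A mismatch between the tarball named in spark.executor.uri and what the agents can actually fetch and launch is one common way executors are lost immediately after startup.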
Re: Mesos Spark Tasks - Lost
Tim, thanks for your reply.

I am following this quite clear Mesos-Spark tutorial: https://docs.mesosphere.com/tutorials/run-spark-on-mesos/

So mainly I tried running spark-shell, which works fine locally, but when the jobs are submitted through Mesos something goes wrong! My question is: is there some extra configuration needed for the workers that is not mentioned in the tutorial? The "Executor Lost" message I get is really generic, so I don't know what's going on. Please check the attached Mesos execution event log.

Thanks again,
Panagiotis

On Wed, May 20, 2015 at 8:21 AM, Tim Chen t...@mesosphere.io wrote:
Can you share your exact spark-submit command line? Also, cluster mode is not released yet (it is coming in 1.4) and does not support spark-shell, so unless you are running the latest master I think you are just using client mode.
Tim
On Tue, May 19, 2015 at 8:57 AM, Panagiotis Garefalakis panga...@gmail.com wrote:
[original message, quoted in full below]

Attachment: sparklogs-spark-shell-1431993674182-EVENT_LOG_1 (binary data)
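On the question of extra worker-side configuration: tutorials of that era generally required the Mesos native library and the executor URI to be set on the driver host before launching spark-shell. The sketch below shows the usual shape of conf/spark-env.sh; the library path and HDFS URL are placeholder assumptions, not values confirmed by the thread:

```shell
# Hypothetical conf/spark-env.sh on the host that launches spark-shell or
# spark-submit (paths are placeholders).
# The Mesos native library must be loadable by the driver JVM:
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
# Where every Mesos agent fetches the Spark distribution from:
export SPARK_EXECUTOR_URI=hdfs://namenode.example.com:8020/spark-1.3.1-bin-hadoop2.4.tgz
```

If the executor URI points at a tarball built against a different Hadoop version than the cluster's HDFS, the fetch can succeed while the executor still fails to start, which surfaces only as a generic "Executor Lost" on the driver side.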
Mesos Spark Tasks - Lost
Hello all,

I have been facing a weird issue for the last couple of days running Spark on top of Mesos, and I need your help. I am running Mesos in a private cluster and have successfully deployed HDFS, Cassandra, Marathon, and Play, but Spark is not working for some reason. So far I have tried: different Java versions (1.6 and 1.7, Oracle and OpenJDK), different spark-env configurations, different Spark versions (from 0.8.8 to 1.3.1), different HDFS versions (Hadoop 5.1 and 4.6), and updating the pom dependencies. More specifically, while local tasks complete fine, in cluster mode all the tasks get lost (using both spark-shell and spark-submit).

From the worker log I see something like this:

---
I0519 02:36:30.475064 12863 fetcher.cpp:214] Fetching URI 'hdfs:/:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:30.747372 12863 fetcher.cpp:99] Fetching URI 'hdfs://X:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' using Hadoop Client
I0519 02:36:30.747546 12863 fetcher.cpp:109] Downloading resource from 'hdfs://:8020/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' to '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz'
I0519 02:36:34.205878 12863 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3/spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz' into '/tmp/mesos/slaves/20150515-164602-2877535122-5050-32131-S2/frameworks/20150517-162701-2877535122-5050-28705-0084/executors/20150515-164602-2877535122-5050-32131-S2/runs/660d78ec-e2f4-4d38-881b-7209cbd3c5c3'
*Error: Could not find or load main class two*
---

And from the Spark terminal:

---
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/05/19 02:36:39 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
15/05/19 02:36:39 INFO scheduler.DAGScheduler: Failed to run reduce at SparkPi.scala:35
15/05/19 02:36:39 INFO scheduler.DAGScheduler: Failed to run reduce at SparkPi.scala:35
Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 26, ): ExecutorLostFailure (executor lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
..
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
---

Any help will be greatly appreciated!

Regards,
Panagiotis
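A "Could not find or load main class" error in the worker sandbox often means the executor launch command was assembled incorrectly rather than that the fetch failed. One way to narrow this down is to unpack the same tarball the fetcher downloaded and inspect it by hand on one agent. The commands below are a hedged sketch; the tarball name comes from the log above, but the checks themselves are only a suggested diagnostic, not verified steps:

```shell
# Hypothetical manual check on a Mesos agent, run in a scratch directory
# after copying the tarball out of HDFS:
# list the archive contents to confirm the top-level directory name
tar -tzf spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz | head
# unpack it and confirm the executor launch scripts are present
tar -xzf spark-1.1.0-bin-2.0.0-cdh4.7.0.tgz
ls spark-1.1.0-bin-2.0.0-cdh4.7.0/bin/ spark-1.1.0-bin-2.0.0-cdh4.7.0/sbin/
```

If the archive is intact, the next place to look is how the framework builds the executor command (for example, whether a multi-word option ended up split so that a stray word like "two" was passed to the JVM as the main class).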