Re: Tasks randomly stall when running on mesos
Hi, I just configured my cluster to run with 1.4.0-rc2; alas, the dependency jungle does not let one just download, configure and start. Instead I will have to fiddle with sbt settings for the next couple of nights:

2015-05-26 14:50:52,686 WARN a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://driverPropsFetcher@app03:44805] has failed, address is now gated for [5000] ms. Reason is: [org.apache.spark.rpc.akka.AkkaMessage].
2015-05-26 14:52:55,707 ERROR Remoting - org.apache.spark.rpc.akka.AkkaMessage
java.lang.ClassNotFoundException: org.apache.spark.rpc.akka.AkkaMessage
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:626)
        at akka.util.ClassLoaderObjectInputStream.resolveClass(ClassLoaderObjectInputStream.scala:19)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
        at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
        at scala.util.Try$.apply(Try.scala:161)
        at akka.serialization.Serialization.deserialize(Serialization.scala:98)
        at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
        at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
        at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
        at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
        at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

kind regards
reinis

On 25.05.2015 23:09, Reinis Vicups wrote:
> [quoted thread trimmed]
Tasks randomly stall when running on mesos
Hello, I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with zookeeper, running on a cluster with 3 nodes on 64-bit Ubuntu. My application is compiled with spark 1.3.1 (apparently with a mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and akka 2.3.10. Only with this combination have I succeeded in running spark-jobs on mesos at all; different versions cause class loader issues. I am submitting spark jobs with spark-submit with mesos://zk://.../mesos.

About 50% of all jobs stall forever (or until I kill the spark driver). The error occurs randomly on different slave-nodes. Sometimes 4 spark-jobs in a row run without problems and then the problem suddenly occurs. I am always testing the same set of 5 different jobs, alone and combined, and the error always occurs in a different job/node/stage/task combination. Whenever a slave-node stalls, this message appears in the sandbox-log of the failing slave:

10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null

Any hints on how to address this issue are greatly appreciated.

kind regards
reinis

A job that stalls shows the following in the spark-driver log (as one can see, task 1.0 is never finished):

10:01:25,620 INFO o.a.s.s.DAGScheduler - Submitting 4 missing tasks from Stage 0 (MapPartitionsRDD[1] at groupBy at ImportExtensionFieldsSparkJob.scala:57)
10:01:25,621 INFO o.a.s.s.TaskSchedulerImpl - Adding task set 0.0 with 4 tasks
10:01:25,656 INFO o.a.s.s.TaskSetManager - Starting task 0.0 in stage 0.0 (TID 0, app03, PROCESS_LOCAL, 1140 bytes)
10:01:25,660 INFO o.a.s.s.TaskSetManager - Starting task 1.0 in stage 0.0 (TID 1, app01, PROCESS_LOCAL, 1140 bytes)
10:01:25,661 INFO o.a.s.s.TaskSetManager - Starting task 2.0 in stage 0.0 (TID 2, app02, PROCESS_LOCAL, 1140 bytes)
10:01:25,662 INFO o.a.s.s.TaskSetManager - Starting task 3.0 in stage 0.0 (TID 3, app03, PROCESS_LOCAL, 1140 bytes)
10:01:36,842 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app02 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S1, app02, 59622)
10:01:36,862 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app03 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S2, app03, 39420)
10:01:36,917 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app01 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S3, app01, 45605)
10:01:38,701 INFO o.a.s.s.BlockManagerInfo - Added broadcast_2_piece0 in memory on app03 (size: 2.6 KB, free: 88.3 MB)
10:01:38,702 INFO o.a.s.s.BlockManagerInfo - Added broadcast_2_piece0 in memory on app02 (size: 2.6 KB, free: 88.3 MB)
10:01:41,400 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 0.0 (TID 0) in 15721 ms on app03 (1/4)
10:01:41,539 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 0.0 (TID 2) in 15870 ms on app02 (2/4)
10:01:41,697 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 0.0 (TID 3) in 16029 ms on app03 (3/4)

The sandbox log of slave-node app01 (the one that stalls) shows the following:

10:01:25.815506 35409 fetcher.cpp:214] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:26.497764 35409 fetcher.cpp:99] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client
10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:32.877717 35409 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
*10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
10:01:34 INFO SecurityManager: Changing view acls to...
10:01:35 INFO Slf4jLogger: Slf4jLogger started
10:01:35 INFO Remoting: Starting remoting
10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx]
10:01:35 INFO Utils:
Re: Tasks randomly stall when running on mesos
On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups sp...@orbit-x.de wrote:
> Hello, I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with zookeeper and running on a cluster with 3 nodes on 64bit ubuntu. My application is compiled with spark 1.3.1 (apparently with mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and akka 2.3.10. Only with this combination I have succeeded to run spark-jobs on mesos at all. Different versions are causing class loader issues. I am submitting spark jobs with spark-submit with mesos://zk://.../mesos.

Are you using coarse grained or fine grained mode?

> sandbox log of slave-node app01 (the one that stalls) shows following:
> [fetcher log trimmed]
> 10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
> 10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
> *10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
> 10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
> 10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus

It looks like an inconsistent state on the Mesos scheduler. It tries to launch a task on a given slave before the executor has registered. This code was improved/refactored in 1.4, could you try 1.4.0-RC1?

iulian

> 10:01:34 INFO SecurityManager: Changing view acls to...
> 10:01:35 INFO Slf4jLogger: Slf4jLogger started
> 10:01:35 INFO Remoting: Starting remoting
> 10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx]
> 10:01:35 INFO Utils: Successfully started service 'sparkExecutor' on port xxx.
> 10:01:35 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@dev-web01/user/MapOutputTracker
> 10:01:35 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@dev-web01/user/BlockManagerMaster
> 10:01:36 INFO DiskBlockManager: Created local directory at /tmp/spark-52a6585a-f9f2-4ab6-bebc-76be99b0c51c/blockmgr-e6d79818-fe30-4b5c-bcd6-8fbc5a201252
> 10:01:36 INFO MemoryStore: MemoryStore started with capacity 88.3 MB
> 10:01:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10:01:36 INFO AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@dev-web01/user/OutputCommitCoordinator
> 10:01:36 INFO Executor: Starting executor ID 20150511-150924-3410235146-5050-1903-S3 on host app01
> 10:01:36 INFO NettyBlockTransferService: Server created on XXX
> 10:01:36 INFO BlockManagerMaster: Trying to register BlockManager
> 10:01:36 INFO BlockManagerMaster: Registered BlockManager
> 10:01:36 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@dev-web01/user/HeartbeatReceiver
>
> As soon as the spark-driver is aborted, the following log entries are added to the sandbox log of slave-node app01:
> 10:17:29.559433 35772 exec.cpp:379] Executor asked to shutdown
> 10:17:29 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]
>
> A successful job shows instead the following in the spark-driver log:
> 08:03:19,862 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 1.0 (TID 7) in 1688 ms on app01 (1/4)
> 08:03:19,869 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 4) in 1700 ms on app03 (2/4)
> 08:03:19,874 INFO o.a.s.s.TaskSetManager - Finished task 1.0 in stage 1.0 (TID 5) in 1703 ms on app02 (3/4)
> 08:03:19,878 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 1.0 (TID 6) in 1706 ms on app02 (4/4)
> 08:03:19,878 INFO o.a.s.s.DAGScheduler - Stage 1 (saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90) finished in 1.718 s
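The inconsistent state described above (launchTask delivered before the executor object exists) can be sketched roughly as follows. This is an illustration of the race only, not Spark's actual source; the class names are invented for the sketch:

```scala
// Sketch of the MesosExecutorBackend race: the executor field is only
// assigned in registered(), so a launchTask() callback that arrives first
// finds null and the task is silently dropped -- the stage then stalls.
class BackendSketch {
  @volatile private var executor: RunnerSketch = null  // hypothetical field

  def registered(): Unit = {
    // created only once Mesos confirms registration
    executor = new RunnerSketch
  }

  def launchTask(taskId: Long): Unit = {
    if (executor == null)
      Console.err.println("ERROR Received launchTask but executor was null")
    else
      executor.run(taskId)
  }
}

class RunnerSketch {
  def run(taskId: Long): Unit = println(s"running task $taskId")
}
```

Calling launchTask before registered on this sketch reproduces the exact log line seen on app01; calling them in the other order runs the task normally.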
Re: Tasks randomly stall when running on mesos
Hello, I assume I am running spark in fine-grained mode, since I haven't changed the default here. One question regarding 1.4.0-RC1 - is there a mvn snapshot repository I could use for my project config? (I know that I have to download the source and run make-distribution for the executor as well.)

thanks
reinis

On 25.05.2015 17:07, Iulian Dragoș wrote:
> [quoted thread trimmed]
Re: Tasks randomly stall when running on mesos
Great hints, you guys! Yes, spark-shell worked fine with mesos as master. I haven't tried to execute multiple rdd actions in a row though (I did a couple of successful counts on the hbase tables I am working with in several experiments, but nothing that would compare to the stuff my spark jobs are doing), but I will check whether the shell stalls on some decent rdd action. Also, thanks a bunch for the links to the binaries. This will literally save me hours!

kind regards
reinis

On 25.05.2015 21:00, Dean Wampler wrote:
> [quoted thread trimmed]
Re: Tasks randomly stall when running on mesos
Here is a link for builds of 1.4 RC2: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/

For a mvn repo, I believe the RC2 artifacts are here: https://repository.apache.org/content/repositories/orgapachespark-1104/

A few experiments you might try:

1. Does spark-shell work? It might start fine, but make sure you can create an RDD and use it, e.g., something like:

   val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
   rdd foreach println

2. Try coarse-grained mode, which has different logic for executor management. You can set it in the $SPARK_HOME/conf/spark-defaults.conf file:

   spark.mesos.coarse   true

Or, per http://spark.apache.org/docs/latest/running-on-mesos.html, set the property (string key and value) on the SparkConf object used to construct the SparkContext:

   conf.set("spark.mesos.coarse", "true")

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Mon, May 25, 2015 at 12:06 PM, Reinis Vicups sp...@orbit-x.de wrote:
> [quoted thread trimmed]
> *10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
> 10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
> 10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
>
> It looks like an inconsistent state on the Mesos scheduler. It tries to launch a task on a given slave before the executor has registered. This code was improved/refactored in 1.4, could you try 1.4.0-RC1?

Yes and note the second message after the error you highlighted; that's when the executor would be registered with Mesos and the local object created.

iulian
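Putting the coarse-grained experiment from this thread together, a minimal self-contained driver might look like the sketch below. The object name, app name and zk:// quorum hosts are placeholders I made up, not values from the thread; only spark.mesos.coarse and the SparkConf/SparkContext calls come from the messages above and the Spark 1.x Mesos docs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of a driver that enables Mesos coarse-grained mode programmatically.
object CoarseModeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("coarse-mode-test")                        // placeholder name
      .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")     // placeholder quorum
      .set("spark.mesos.coarse", "true")                     // string key and value
    val sc = new SparkContext(conf)

    // The sanity check from suggestion 1: a trivial RDD action that must finish.
    val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
    println(rdd.reduce(_ + _))
    sc.stop()
  }
}
```

If a job of this size already stalls in fine-grained mode but completes here, that points at the fine-grained executor-registration path rather than at the application code.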