Re: Tasks randomly stall when running on mesos
Hi, I just configured my cluster to run with 1.4.0-rc2; alas, the dependency jungle does not let one just download, configure and start. Instead I will have to fiddle with sbt settings for the next couple of nights:

2015-05-26 14:50:52,686 WARN a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://driverPropsFetcher@app03:44805] has failed, address is now gated for [5000] ms. Reason is: [org.apache.spark.rpc.akka.AkkaMessage].
2015-05-26 14:52:55,707 ERROR Remoting - org.apache.spark.rpc.akka.AkkaMessage
java.lang.ClassNotFoundException: org.apache.spark.rpc.akka.AkkaMessage
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:626)
        at akka.util.ClassLoaderObjectInputStream.resolveClass(ClassLoaderObjectInputStream.scala:19)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
        at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
        at scala.util.Try$.apply(Try.scala:161)
        at akka.serialization.Serialization.deserialize(Serialization.scala:98)
        at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
        at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
        at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
        at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
        at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

kind regards
reinis

On 25.05.2015 23:09, Reinis Vicups wrote:
> [quoted thread trimmed]
Tasks randomly stall when running on mesos
Hello, I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with zookeeper, running on a cluster with 3 nodes on 64-bit Ubuntu. My application is compiled with spark 1.3.1 (apparently with a mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and akka 2.3.10. Only with this combination have I succeeded in running spark-jobs on mesos at all; different versions cause class loader issues. I am submitting spark jobs with spark-submit with mesos://zk://.../mesos.

About 50% of all jobs stall forever (or until I kill the spark driver). The error occurs randomly on different slave-nodes. Sometimes 4 spark-jobs in a row run without problems and then the problem suddenly occurs. I am always testing the same set of 5 different jobs, alone and combined, and the error always occurs in a different job/node/stage/task combination. Whenever a slave-node stalls, this message appears in the sandbox-log of the failing slave:

10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null

Any hints on how to address this issue are greatly appreciated.

kind regards
reinis

A job that stalls shows the following in the spark-driver log (as one can see, task 1.0 is never finished):

10:01:25,620 INFO o.a.s.s.DAGScheduler - Submitting 4 missing tasks from Stage 0 (MapPartitionsRDD[1] at groupBy at ImportExtensionFieldsSparkJob.scala:57)
10:01:25,621 INFO o.a.s.s.TaskSchedulerImpl - Adding task set 0.0 with 4 tasks
10:01:25,656 INFO o.a.s.s.TaskSetManager - Starting task 0.0 in stage 0.0 (TID 0, app03, PROCESS_LOCAL, 1140 bytes)
10:01:25,660 INFO o.a.s.s.TaskSetManager - Starting task 1.0 in stage 0.0 (TID 1, app01, PROCESS_LOCAL, 1140 bytes)
10:01:25,661 INFO o.a.s.s.TaskSetManager - Starting task 2.0 in stage 0.0 (TID 2, app02, PROCESS_LOCAL, 1140 bytes)
10:01:25,662 INFO o.a.s.s.TaskSetManager - Starting task 3.0 in stage 0.0 (TID 3, app03, PROCESS_LOCAL, 1140 bytes)
10:01:36,842 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app02 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S1, app02, 59622)
10:01:36,862 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app03 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S2, app03, 39420)
10:01:36,917 INFO o.a.s.s.BlockManagerMasterActor - Registering block manager app01 with 88.3 MB RAM, BlockManagerId(20150511-150924-3410235146-5050-1903-S3, app01, 45605)
10:01:38,701 INFO o.a.s.s.BlockManagerInfo - Added broadcast_2_piece0 in memory on app03 (size: 2.6 KB, free: 88.3 MB)
10:01:38,702 INFO o.a.s.s.BlockManagerInfo - Added broadcast_2_piece0 in memory on app02 (size: 2.6 KB, free: 88.3 MB)
10:01:41,400 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 0.0 (TID 0) in 15721 ms on app03 (1/4)
10:01:41,539 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 0.0 (TID 2) in 15870 ms on app02 (2/4)
10:01:41,697 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 0.0 (TID 3) in 16029 ms on app03 (3/4)

The sandbox log of slave-node app01 (the one that stalls) shows the following:

10:01:25.815506 35409 fetcher.cpp:214] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:26.497764 35409 fetcher.cpp:99] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client
10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:32.877717 35409 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
*10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
10:01:34 INFO SecurityManager: Changing view acls to...
10:01:35 INFO Slf4jLogger: Slf4jLogger started
10:01:35 INFO Remoting: Starting remoting
10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx]
10:01:35 INFO Utils:
Re: Tasks randomly stall when running on mesos
On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups sp...@orbit-x.de wrote:
> Hello, I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with zookeeper and running on a cluster with 3 nodes on 64bit ubuntu. My application is compiled with spark 1.3.1 (apparently with mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and akka 2.3.10. Only with this combination I have succeeded to run spark-jobs on mesos at all. Different versions are causing class loader issues. I am submitting spark jobs with spark-submit with mesos://zk://.../mesos.

Are you using coarse grained or fine grained mode?

> sandbox log of slave-node app01 (the one that stalls) shows following:
> [fetcher log trimmed]
> 10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
> 10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
> *10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
> 10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
> 10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus

It looks like an inconsistent state on the Mesos scheduler. It tries to launch a task on a given slave before the executor has registered. This code was improved/refactored in 1.4, could you try 1.4.0-RC1?

iulian

> 10:01:34 INFO SecurityManager: Changing view acls to...
> 10:01:35 INFO Slf4jLogger: Slf4jLogger started
> 10:01:35 INFO Remoting: Starting remoting
> 10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx]
> 10:01:35 INFO Utils: Successfully started service 'sparkExecutor' on port xxx.
> 10:01:35 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@dev-web01/user/MapOutputTracker
> 10:01:35 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@dev-web01/user/BlockManagerMaster
> 10:01:36 INFO DiskBlockManager: Created local directory at /tmp/spark-52a6585a-f9f2-4ab6-bebc-76be99b0c51c/blockmgr-e6d79818-fe30-4b5c-bcd6-8fbc5a201252
> 10:01:36 INFO MemoryStore: MemoryStore started with capacity 88.3 MB
> 10:01:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10:01:36 INFO AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@dev-web01/user/OutputCommitCoordinator
> 10:01:36 INFO Executor: Starting executor ID 20150511-150924-3410235146-5050-1903-S3 on host app01
> 10:01:36 INFO NettyBlockTransferService: Server created on XXX
> 10:01:36 INFO BlockManagerMaster: Trying to register BlockManager
> 10:01:36 INFO BlockManagerMaster: Registered BlockManager
> 10:01:36 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@dev-web01/user/HeartbeatReceiver
>
> As soon as the spark-driver is aborted, the following log entries are added to the sandbox log of slave-node app01:
> 10:17:29.559433 35772 exec.cpp:379] Executor asked to shutdown
> 10:17:29 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]
>
> A successful job shows instead the following in the spark-driver log:
> 08:03:19,862 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 1.0 (TID 7) in 1688 ms on app01 (1/4)
> 08:03:19,869 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 4) in 1700 ms on app03 (2/4)
> 08:03:19,874 INFO o.a.s.s.TaskSetManager - Finished task 1.0 in stage 1.0 (TID 5) in 1703 ms on app02 (3/4)
> 08:03:19,878 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 1.0 (TID 6) in 1706 ms on app02 (4/4)
> 08:03:19,878 INFO o.a.s.s.DAGScheduler - Stage 1 (saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90) finished in 1.718 s
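The inconsistent state described above (launchTask delivered before the executor object exists) can be sketched roughly as follows. This is an illustration of the race only, not Spark's actual source; the class names are invented for the sketch:

```scala
// Sketch of the MesosExecutorBackend race: the executor field is only
// assigned in registered(), so a launchTask() callback that arrives first
// finds null and the task is silently dropped -- the stage then stalls.
class BackendSketch {
  @volatile private var executor: RunnerSketch = null  // hypothetical field

  def registered(): Unit = {
    // created only once Mesos confirms registration
    executor = new RunnerSketch
  }

  def launchTask(taskId: Long): Unit = {
    if (executor == null)
      Console.err.println("ERROR Received launchTask but executor was null")
    else
      executor.run(taskId)
  }
}

class RunnerSketch {
  def run(taskId: Long): Unit = println(s"running task $taskId")
}
```

Calling launchTask before registered on this sketch reproduces the exact log line seen on app01; calling them in the other order runs the task normally.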
Re: Tasks randomly stall when running on mesos
Hello, I assume I am running spark in fine-grained mode, since I haven't changed the default here. One question regarding 1.4.0-RC1 - is there a mvn snapshot repository I could use for my project config? (I know that I have to download the source and run make-distribution for the executor as well.)

thanks
reinis

On 25.05.2015 17:07, Iulian Dragoș wrote:
> [quoted thread trimmed]
Re: Tasks randomly stall when running on mesos
Great hints, you guys! Yes, spark-shell worked fine with mesos as master. I haven't tried to execute multiple rdd actions in a row though (I did a couple of successful counts on the hbase tables I am working with in several experiments, but nothing that would compare to the stuff my spark jobs are doing), but I will check whether the shell stalls on some decent rdd action. Also, thanks a bunch for the links to the binaries. This will literally save me hours!

kind regards
reinis

On 25.05.2015 21:00, Dean Wampler wrote:
> [quoted thread trimmed]
Re: Tasks randomly stall when running on mesos
Here is a link for builds of 1.4 RC2: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/

For a mvn repo, I believe the RC2 artifacts are here: https://repository.apache.org/content/repositories/orgapachespark-1104/

A few experiments you might try:

1. Does spark-shell work? It might start fine, but make sure you can create an RDD and use it, e.g., something like:

   val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
   rdd foreach println

2. Try coarse-grained mode, which has different logic for executor management. You can set it in the $SPARK_HOME/conf/spark-defaults.conf file:

   spark.mesos.coarse   true

Or, per http://spark.apache.org/docs/latest/running-on-mesos.html, set the property (string key and value) on the SparkConf object used to construct the SparkContext:

   conf.set("spark.mesos.coarse", "true")

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Mon, May 25, 2015 at 12:06 PM, Reinis Vicups sp...@orbit-x.de wrote:
> [quoted thread trimmed]
> *10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
> 10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
> 10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
>
> It looks like an inconsistent state on the Mesos scheduler. It tries to launch a task on a given slave before the executor has registered. This code was improved/refactored in 1.4, could you try 1.4.0-RC1?

Yes and note the second message after the error you highlighted; that's when the executor would be registered with Mesos and the local object created.

iulian
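Putting the coarse-grained experiment from this thread together, a minimal self-contained driver might look like the sketch below. The object name, app name and zk:// quorum hosts are placeholders I made up, not values from the thread; only spark.mesos.coarse and the SparkConf/SparkContext calls come from the messages above and the Spark 1.x Mesos docs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of a driver that enables Mesos coarse-grained mode programmatically.
object CoarseModeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("coarse-mode-test")                        // placeholder name
      .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")     // placeholder quorum
      .set("spark.mesos.coarse", "true")                     // string key and value
    val sc = new SparkContext(conf)

    // The sanity check from suggestion 1: a trivial RDD action that must finish.
    val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
    println(rdd.reduce(_ + _))
    sc.stop()
  }
}
```

If a job of this size already stalls in fine-grained mode but completes here, that points at the fine-grained executor-registration path rather than at the application code.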