Hi, with the hints from Gerard I was able to get my locally working Spark code running on Mesos. Thanks!
Basically, on my local dev machine, I use "sbt assembly" to create a fat jar (which is actually not so fat since I use "... % 'provided'" in my sbt file for the Spark dependencies), upload it to my cluster and run it using java -cp myApplicationCode.jar:spark-assembly-1.0.0-SNAPSHOT.jar mypackage.MainClass I can see in my Mesos master web interface how the tasks are added and distributed to the slaves and in the driver program I can see the final results, that is very nice. Now, as the next step, I wanted to get Spark Streaming running. That worked out by now, but I have various questions. I'd be happy if someone could help me out with some answers. 1. I wrongly assumed that when using ssc.socketTextStream(), the driver would connect to the specified server. It does not; apparently one of the slaves does ;-) Does that mean that before any DStream processing can be done, all the received data needs to be sent to the other slaves? What about the extreme case dstream.filter(x => false); would all the data be transferred to other hosts, just to be discarded there? 2. How can I reduce the logging? It seems like for every chunk received from the socketTextStream, I get a line "INFO BlockManagerInfo: Added input-0-1400739888200 in memory on ...", that's very noisy. Also, when the foreachRDD() is processed every N seconds, I get a lot of output. 3. In my (non-production) cluster, I have six slaves, two of which have 2G of RAM, the other four just 512M. So far, I have not seen Mesos ever give a job to one of the four low-mem machines. Is 512M just not enough for *any* task, or is there a rationale like "they are not cool enough to play with the Big Guys" built into Mesos? 4. I don't have any HDFS or shared disk space. What does this mean for Spark Streaming's default storage level MEMORY_AND_DISK_SER_2? 5. My prototype example for Spark Streaming is a simple word count: val wordCounts = ssc.socketTextStream(...).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) wordCounts.print() However, (with a batchDuration of five seconds) this only works correctly if I run the application in Mesos "coarse mode". In the default "fine-grained mode", I will always receive 0 as word count (that is, a wrong result), and a lot of warnings like W0522 06:57:23.578400 12824 sched.cpp:901] Attempting to launch task 7 with an unknown offer 20140520-102159-2154735808-5050-1108-7891 Can anyone explain this behavior? Thanks, Tobias