Hi,

with the hints from Gerard I was able to get my locally working Spark
code running on Mesos. Thanks!

Basically, on my local dev machine, I use "sbt assembly" to create a
fat jar (which is actually not so fat, since I use "... % 'provided'"
in my sbt file for the Spark dependencies), upload it to my cluster,
and run it using
  java -cp myApplicationCode.jar:spark-assembly-1.0.0-SNAPSHOT.jar mypackage.MainClass
In the Mesos master web interface I can see how the tasks are added
and distributed to the slaves, and in the driver program I can see
the final results. That is very nice.
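For reference, the relevant part of my sbt file looks roughly like
this (a sketch; the project name and version numbers here are
illustrative, not copied from my actual file):

```scala
// build.sbt (sketch; names and versions are illustrative)
name := "my-application"

scalaVersion := "2.10.4"

// Marking the Spark dependencies as "provided" keeps them out of the
// assembly jar, since spark-assembly-*.jar is put on the classpath
// at runtime anyway.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
)
```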

Now, as the next step, I wanted to get Spark Streaming running. That
is working now, but I have various questions. I'd be happy if
someone could help me out with some answers.

1. I wrongly assumed that when using ssc.socketTextStream(), the
driver would connect to the specified server. It does not; apparently
one of the slaves does ;-) Does that mean that before any DStream
processing can be done, all the received data needs to be sent to the
other slaves? What about the extreme case dstream.filter(x => false):
would all the data be transferred to other hosts, just to be
discarded there?

2. How can I reduce the logging? For every chunk received from the
socketTextStream, I seem to get a line like "INFO BlockManagerInfo:
Added input-0-1400739888200 in memory on ...", which is very noisy.
Also, when foreachRDD() is processed every N seconds, I get a lot of
output.
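I suspect the knob is log4j; something along the lines of the
following in a log4j.properties on the classpath might do it, but I
haven't verified that this is the right place to configure it:

```
# log4j.properties (sketch, unverified): raise Spark's log level to WARN
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark=WARN
```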

3. In my (non-production) cluster, I have six slaves, two of which
have 2G of RAM, the other four just 512M. So far, I have never seen
Mesos give a task to any of the four low-mem machines. Is 512M
just not enough for *any* task, or is there a rationale like "they are
not cool enough to play with the Big Guys" built into Mesos?

4. I don't have any HDFS or shared disk space. What does this mean for
Spark Streaming's default storage level MEMORY_AND_DISK_SER_2?

5. My prototype example for Spark Streaming is a simple word count:
  val wordCounts = ssc.socketTextStream(...)
    .flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
  wordCounts.print()
However (with a batchDuration of five seconds), this only works
correctly if I run the application in Mesos "coarse mode". In the
default "fine-grained mode", I always receive 0 as the word count
(that is, a wrong result), along with a lot of warnings like
  W0522 06:57:23.578400 12824 sched.cpp:901] Attempting to launch task 7 with an unknown offer 20140520-102159-2154735808-5050-1108-7891
Can anyone explain this behavior?
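For comparison, the same pipeline applied to a plain Scala collection
(no Spark involved, with groupBy standing in for reduceByKey) produces
the counts I would expect:

```scala
// Plain-Scala analogue of the streaming word count above.
val lines = List("to be or not", "to be")
val wordCounts = lines
  .flatMap(_.split(" "))      // split lines into words
  .map((_, 1))                // pair each word with a count of 1
  .groupBy(_._1)              // group pairs by word
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// wordCounts: Map(to -> 2, be -> 2, or -> 1, not -> 1)
```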

Thanks,
Tobias
