I am attempting to use Spark Streaming to summarize event data streaming in and save it to a MySQL table. The input data is stored in 4 Kafka topics, each with 12 partitions. This approach worked both in local development and in a simulated load-testing environment, but I cannot get it working against our production source. I'm having a hard time figuring out what is going on because I'm not seeing any obvious errors in the logs; the first batch just never finishes processing. I believe it's a data-rate problem (the most active topic clocks in around 4k messages per second, the least active around 0.5 msg/s), but I'm completely stuck on the best way to resolve this, and maybe I'm not following best practices.
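If the receiver simply can't keep up with the data rate, one thing I'm considering (based on the Spark configuration docs; I haven't yet verified it helps here, and the value below is just a guess for our load) is capping how fast each receiver ingests, e.g. in spark-defaults.conf:

```
# Cap each receiver at N records/sec so a batch can finish within the
# batch interval instead of falling further and further behind.
# (2000 is a guess for our ~4k msg/s topic split across receivers.)
spark.streaming.receiver.maxRate  2000
```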
Here is a gist with the essentials of my program. I use an updateStateByKey approach to keep around the MySQL id of each piece of data (so if we've already seen a particular piece, we just update the existing total in MySQL with the total Spark computed in the current window): https://gist.github.com/maddenpj/74a4c8ce372888ade92d

One thing I have noticed is that my Kafka receiver runs on only one machine, and I have not yet tried to increase the parallelism of reading out of Kafka, e.g. with multiple receivers unioned together as in this solution: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Kafka-Receivers-and-Union-td14901.html That's next on my list, but I'm still in need of insight into how to figure out what's going on. When I watch the stages execute in the web UI, I see occasional activity (a map stage processing), but most of the time it looks like I'm stuck in some arbitrary stage (e.g. take and runJob will be active for the entire life of the program with 0 tasks ever completing). This is contrary to what I see when I watch the Kafka topics being consumed: the program is always consuming messages, it just gives no indication that it's doing any actual processing on them.

On a somewhat related note, how does everyone capacity-plan when building out a Spark cluster? So far I've just used trial and error, but I still haven't found the right number of nodes that can handle our 4k/s topic. I've tried up to 6 Amazon m3.larges (2 cores, 7.5 GB memory each), but even that feels excessive to me, since we currently process this data load on a single-node MapReduce cluster, an m3.xlarge (4 cores, 15 GB memory).
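For concreteness, the update function behind my updateStateByKey call is shaped roughly like this (the names here are illustrative, not the exact code from the gist): the state per key carries the MySQL row id, if one has been assigned, alongside the running total, so the output side can decide between INSERT and UPDATE.

```scala
// Illustrative sketch of the updateStateByKey update function
// (names are hypothetical; see the gist for the real code).
// State per key: the MySQL row id, if we've already inserted that key,
// plus the running total computed so far.
case class EventState(mysqlId: Option[Long], total: Long)

def updateTotals(newCounts: Seq[Long],
                 state: Option[EventState]): Option[EventState] = {
  val prev = state.getOrElse(EventState(None, 0L))
  // Keep the id so the write-out side can UPDATE the existing row
  // instead of INSERTing a duplicate.
  Some(prev.copy(total = prev.total + newCounts.sum))
}
```

This would be wired in as `events.updateStateByKey(updateTotals)`, with a `foreachRDD` that inserts or updates MySQL depending on whether `mysqlId` is set.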
Thanks,
Patrick

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-unable-to-handle-production-Kafka-load-tp15077.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.