I am attempting to use Spark Streaming to summarize event data streaming in and save it to a MySQL table. The input data is stored in 4 Kafka topics, each with 12 partitions. This approach worked both in local development and in a simulated load-testing environment, but I cannot get it working against our production source. I'm having a hard time figuring out what is going on because I'm not seeing any obvious errors in the logs; the first batch just never finishes processing. I believe it's a data-rate problem (the most active topic clocks in around 4k messages per second, the least active around 0.5 msg/s), but I'm completely stuck on the best way to resolve this, and maybe I'm not following best practices.
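If the receiver simply can't keep up with the data rate, one thing I'm considering (based on the Spark configuration docs; I haven't yet verified it helps here, and the value below is just a guess for our load) is capping how fast each receiver ingests, e.g. in spark-defaults.conf:

```
# Cap each receiver at N records/sec so a batch can finish within the
# batch interval instead of falling further and further behind.
# (2000 is a guess for our ~4k msg/s topic split across receivers.)
spark.streaming.receiver.maxRate  2000
```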
Here is a gist with the essentials of my program. I use an updateStateByKey approach to keep around the MySQL id of each piece of data (so if we've already seen a particular piece, we just update the existing total in MySQL with the total Spark computed in the current window): https://gist.github.com/maddenpj/74a4c8ce372888ade92d

One thing I have noticed is that my Kafka receiver runs on only one machine, and I have not yet tried to increase the parallelism of reading out of Kafka, e.g. with multiple receivers unioned together as in this solution: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Kafka-Receivers-and-Union-td14901.html That's next on my list, but I'm still in need of insight into how to figure out what's going on. When I watch the stages execute in the web UI, I see occasional activity (a map stage processing), but most of the time it looks like I'm stuck in some arbitrary stage (e.g. take and runJob will be active for the entire life of the program with 0 tasks ever completing). This is contrary to what I see when I watch the Kafka topics being consumed: the program is always consuming messages, it just gives no indication that it's doing any actual processing on them.

On a somewhat related note, how does everyone capacity-plan when building out a Spark cluster? So far I've just used trial and error, but I still haven't found the right number of nodes that can handle our 4k/s topic. I've tried up to 6 Amazon m3.larges (2 cores, 7.5 GB memory each), but even that feels excessive to me, since we currently process this data load on a single-node MapReduce cluster, an m3.xlarge (4 cores, 15 GB memory).
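For concreteness, the update function behind my updateStateByKey call is shaped roughly like this (the names here are illustrative, not the exact code from the gist): the state per key carries the MySQL row id, if one has been assigned, alongside the running total, so the output side can decide between INSERT and UPDATE.

```scala
// Illustrative sketch of the updateStateByKey update function
// (names are hypothetical; see the gist for the real code).
// State per key: the MySQL row id, if we've already inserted that key,
// plus the running total computed so far.
case class EventState(mysqlId: Option[Long], total: Long)

def updateTotals(newCounts: Seq[Long],
                 state: Option[EventState]): Option[EventState] = {
  val prev = state.getOrElse(EventState(None, 0L))
  // Keep the id so the write-out side can UPDATE the existing row
  // instead of INSERTing a duplicate.
  Some(prev.copy(total = prev.total + newCounts.sum))
}
```

This would be wired in as `events.updateStateByKey(updateTotals)`, with a `foreachRDD` that inserts or updates MySQL depending on whether `mysqlId` is set.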
Thanks,
Patrick

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-unable-to-handle-production-Kafka-load-tp15077.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.