It’s really hard to answer this, as the comparison is not really fair: Storm is much lower level than Spark and has less overhead when dealing with stateless operations. I’d be curious how your colleague is implementing the average on a “batch”, and what the Storm equivalent of a batch is.
That aside, the Storm implementation seems in the ballpark (we were clocking ~80K events/sec/node in a well-tuned Storm app). For a Spark app, it really depends on the lineage, checkpointing, level of parallelism, partitioning, number of tasks, scheduling overhead, etc. Unless we confirm that the job operates at full capacity, I can’t say for sure whether it’s good or bad. Some questions to isolate the issue:

* Can you try a manual average operation? (e.g. reduce the events to a tuple with count and sum, then divide them.) Looking at the DoubleRDD.mean implementation, the stats module computes lots of stuff on the side, not just the mean. Not sure if it matters, but I’m assuming on the Storm side you’re not doing that.
* Can you confirm that the operation in question is indeed computed in parallel? (It should have 48 tasks.)
* If yes, how many records per second do you get on average, for this operation alone? You can find this out in the Spark UI by dividing the number of records allocated to one of the executors by the total task time for that executor.
* Are the executors well balanced? Is each of them processing 16 partitions in approximately equal time? If not, you can have multiple issues here:
  * Kafka cluster imbalance: happens from time to time; you can monitor this from the command line with kafka-topics.sh --describe.
  * Kafka key partitioning scheme: I’m assuming round-robin distribution, just checking. You should see this in the 1st stage; all the executors should be allocated an equal number of tasks/partitions to process, with an equal number of messages.

Given that you have 7 Kafka brokers and 3 Spark executors, you should try a .repartition(48) on the Kafka DStream. At the one-time cost of a shuffle you redistribute the data evenly across all nodes/cores and avoid most of the issues above; at least you then have a guarantee that the load is spread evenly across the cluster.
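As a rough illustration of the manual average suggested above: reduce each value to a (sum, count) pair and divide at the end, instead of calling DoubleRDD.mean (which builds a full StatCounter on the side). This is a minimal plain-Scala sketch of the same associative combine you would pass to RDD.reduce; the object and method names are just for illustration, and the commented RDD lines are an untested sketch, not a verified Spark snippet:

```scala
// Sketch: manual average via a single reduce to a (sum, count) tuple.
// On an RDD, the equivalent (hypothetical) pipeline would be:
//   val (sum, count) = values.map(v => (v, 1L))
//                            .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
//   val avg = sum / count
object ManualAverage {
  // Plain-Scala stand-in for the RDD reduce; the combine function is
  // associative and commutative, so it parallelizes the same way.
  def average(values: Seq[Double]): Double = {
    val (sum, count) = values
      .map(v => (v, 1L))                          // pair each value with a count of 1
      .reduce((a, b) => (a._1 + b._1, a._2 + b._2)) // merge partial sums and counts
    sum / count
  }

  def main(args: Array[String]): Unit = {
    println(ManualAverage.average(Seq(1.0, 2.0, 3.0, 4.0))) // 2.5
  }
}
```

Because the reduce produces only one small tuple per partition, it avoids the extra per-element bookkeeping (variance, min, max) that the stats-based mean carries along.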
Hope this helps,
-adrian

From: "Young, Matthew T"
Date: Thursday, September 24, 2015 at 11:47 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Cc: "Krumm, Greg"
Subject: Reasonable performance numbers?

Hello,

I am doing performance testing with Spark Streaming. I want to know if the throughput numbers I am encountering are reasonable for the power of my cluster and Spark’s performance characteristics. My job has the following processing steps:

1. Read 600-byte JSON strings from a 7-broker / 48-partition Kafka cluster via the Kafka direct API
2. Parse the JSON with play-json or lift-json (no significant performance difference)
3. Read one integer value out of the JSON
4. Compute the average of this integer value across all records in the batch with DoubleRDD.mean
5. Write the average for the batch back to a different Kafka topic

I have tried 2-, 4-, and 10-second batch intervals. The best throughput I can sustain is about 75,000 records/second for the whole cluster.

The Spark cluster is in a VM environment with 3 VMs. Each VM has 32 GB of RAM and 16 cores. The systems are networked with 10 Gb NICs. I started testing with Spark 1.3.1 and switched to Spark 1.5 to see if there was improvement (none significant).

When I look at the event timeline in the WebUI, I see that the majority of the processing time for each batch is “Executor Computing Time” in the foreachRDD that computes the average, not the transform that does the JSON parsing. CPU utilization hovers around 40% across the cluster, RAM has plenty of free space remaining, and the network comes nowhere close to being saturated.

My colleague implementing similar functionality in Storm is able to exceed a quarter million records per second with the same hardware. Is 75K records/second reasonable for a cluster of this size? What kind of performance would you expect for this job?

Thanks,
-- Matthew