It’s really hard to answer this, as the comparison is not really fair – Storm 
is much lower level than Spark and has less overhead when dealing with 
stateless operations. I’d be curious how your colleague is implementing the 
average on a “batch”, and what the Storm equivalent of a batch is.

That aside, the Storm implementation seems in the ballpark (we were clocking 
~80K events/sec/node in a well-tuned Storm app).
For a Spark app, it really depends on the lineage, checkpointing, level of 
parallelism, partitioning, number of tasks, scheduling overhead, etc. – unless 
we confirm that the job operates at full capacity, I can’t say for sure 
whether it’s good or bad.

Some questions to isolate the issue:

  *   Can you try with a manual average operation? (e.g. reduce the events to 
a tuple of sum and count, then divide them; see the sketch after this list)
     *   Looking at the DoubleRDD.mean implementation, the stats module 
computes a lot of extra statistics on the side, not just the mean
     *   Not sure if it matters, but I’m assuming you’re not doing that on the 
Storm side
  *   Can you confirm that the operation in question is indeed computed in 
parallel (it should have 48 tasks)?
     *   If yes, how many records per second do you get on average, just for 
this operation? You can find this out in the Spark UI by dividing the number 
of records allocated to one of the executors by the total task time for that 
executor
  *   Are the executors well balanced? Is each of them processing 16 partitions 
in approximately equal time? If not, you can have multiple issues here:
     *   Kafka cluster imbalance – happens from time to time; you can monitor 
this from the command line with kafka-topics.sh --describe
     *   Kafka key partitioning scheme – I’m assuming round-robin distribution, 
just checking – you should see this in the first stage, where all the executors 
should be allocated an equal number of tasks/partitions to process, with an 
equal number of messages each
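
For reference, here’s a minimal sketch of what I mean by a manual average. It 
assumes the parsed values end up in a DStream[Double] called values (the name 
is just for illustration):

  // Manual mean: reduce each batch to a (sum, count) pair, then divide.
  // This skips DoubleRDD.stats(), which also computes variance, min, max, etc.
  values.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val (sum, count) = rdd
        .map(v => (v, 1L))
        .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      val mean = sum / count
      // write `mean` to the output Kafka topic here
    }
  }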

Given that you have 7 Kafka brokers and 3 Spark executors, you should try a 
.repartition(48) on the Kafka DStream – at the cost of an extra shuffle per 
batch you redistribute the data evenly across all nodes/cores and avoid most 
of the issues above; at least you then have a guarantee that the load is 
spread evenly across the cluster.
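
A minimal sketch, assuming the direct stream from your step 1 is bound to a 
variable (kafkaStream below is a placeholder name):

  // The direct stream starts with 48 partitions (one per Kafka partition).
  // A shuffle per batch spreads the records evenly across the 3 executors /
  // 48 cores before the parsing and averaging work runs.
  val rebalanced = kafkaStream.repartition(48)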

Hope this helps,

-adrian

From: "Young, Matthew T"
Date: Thursday, September 24, 2015 at 11:47 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Cc: "Krumm, Greg"
Subject: Reasonable performance numbers?

Hello,

I am doing performance testing with Spark Streaming. I want to know if the 
throughput numbers I am encountering are reasonable for the power of my cluster 
and Spark’s performance characteristics.

My job has the following processing steps:

1. Read 600-byte JSON strings from a 7-broker / 48-partition Kafka cluster 
via the Kafka Direct API

2. Parse the JSON with play-json or lift-json (no significant performance 
difference)

3. Read one integer value out of the JSON

4. Compute the average of this integer value across all records in the 
batch with DoubleRDD.mean

5. Write the average for the batch back to a different Kafka topic
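
In code, the job looks roughly like the following (field and topic names are 
placeholders, ssc and kafkaParams are assumed to be set up already, and the 
Kafka write-back is reduced to a stub):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils
  import play.api.libs.json.Json

  // Step 1: one Spark partition per Kafka partition (48) via the direct API
  val messages = KafkaUtils
    .createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("input-topic"))

  // Steps 2-3: parse each JSON string and extract a single integer field
  val values = messages.map { case (_, json) =>
    (Json.parse(json) \ "value").as[Int].toDouble
  }

  // Steps 4-5: average the field over the batch, write the result back out
  values.foreachRDD { rdd =>
    val avg = rdd.mean()  // DoubleRDD.mean via the implicit DoubleRDDFunctions
    writeToKafka("output-topic", avg.toString)  // stand-in for the producer call
  }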

I have tried 2, 4, and 10 second batch intervals. The best throughput I can 
sustain is about 75,000 records/second for the whole cluster.

The Spark cluster is in a VM environment with 3 VMs. Each VM has 32 GB of RAM 
and 16 cores. The systems are networked with 10 GbE NICs. I started testing with 
Spark 1.3.1 and switched to Spark 1.5 to see if there was improvement (none 
significant). When I look at the event timeline in the WebUI I see that the 
majority of the processing time for each batch is “Executor Computing Time” in 
the foreachRDD that computes the average, not the transform that does the JSON 
parsing.

CPU util hovers around 40% across the cluster, and RAM has plenty of free space 
remaining as well. Network comes nowhere close to being saturated.

My colleague implementing similar functionality in Storm is able to exceed a 
quarter million records per second with the same hardware.

Is 75K records/seconds reasonable for a cluster of this size? What kind of 
performance would you expect for this job?


Thanks,

-- Matthew
