Hi,

We've recently started testing Spark on Kubernetes and have found some odd
performance regressions. In particular, it's almost an order of magnitude
slower pulling data from Kafka than the same job in our Mesos cluster.

We've tested a few set-ups:

Baseline: Spark 2.3.0 on Mesos with host networking (~5 million records
processed per batch, ~12-15s per batch)

Spark 2.3.0 on k8s 1.10 EKS with the AWS CNI plugin
Spark 2.3.0 on k8s 1.10 EKS with the Calico CNI plugin
Spark 2.4.0 on k8s 1.10.x with the Cilium CNI plugin (~5 million records per
batch, 80-100s per batch)

All of them show the same performance decrease, though we don't have good
numbers for the EKS clusters (those tests were run by a different group). On
the 2.4.0/Cilium cluster that I run, I've seen roughly an 8x performance
decrease compared to the equivalent job in our Mesos cluster.

Our production Mesos job runs with 20 executors, 2 cores each, and 6g of
memory. I set up the Kubernetes equivalent in the same fashion and removed
CPU limits so that executors could burst up if they needed more CPU, but
none of that brought it to the performance level of our Mesos setup.
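For reference, this is roughly how I'm submitting the Kubernetes equivalent; it's a sketch, and the master URL, image name, and namespace are placeholders rather than our real values:

```shell
# Hypothetical spark-submit mirroring the Mesos sizing: 20 executors x 2 cores, 6g.
# Leaving spark.kubernetes.executor.limit.cores unset means the executor pods
# get no CPU limit, which is the "burst" configuration described above.
spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=20 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=6g \
  --conf spark.kubernetes.executor.request.cores=2 \
  --conf spark.kubernetes.container.image=my-registry/spark:2.4.0 \
  --conf spark.kubernetes.namespace=spark-jobs \
  ...
```

(`spark.kubernetes.executor.request.cores` only exists in Spark 2.4+, so it doesn't apply to the 2.3.0 clusters.)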

The Mesos cluster does not use any CNI, so it's all host-based networking,
but I wouldn't expect an overlay network to slow our jobs down by an order
of magnitude.

I was able to compare by running a simple consumer that just reads from
Kafka and measures how fast it can go, and found that running in the overlay
network was only about 8% slower than running in the host network stack. So
the numbers don't line up with an order-of-magnitude slowdown.
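In case anyone wants to reproduce the comparison, here's a minimal sketch of the kind of standalone throughput check described above. The timing harness is generic; the topic and broker names in the comment are placeholders, not our real setup:

```python
import time

def measure_throughput(records, max_records=5_000_000, max_seconds=60.0):
    """Drain `records` (any iterable of messages) and report how fast it goes.

    Stops after max_records messages or max_seconds, whichever comes first.
    Returns (count, elapsed_seconds, records_per_second).
    """
    start = time.monotonic()
    count = 0
    for _ in records:
        count += 1
        if count >= max_records or time.monotonic() - start >= max_seconds:
            break
    elapsed = time.monotonic() - start
    return count, elapsed, (count / elapsed) if elapsed > 0 else 0.0

# In the real test the iterable would be a Kafka consumer, e.g. with
# kafka-python (topic and bootstrap server below are placeholders):
#
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer("my-topic",
#                            bootstrap_servers="broker:9092",
#                            auto_offset_reset="earliest")
#   count, elapsed, rate = measure_throughput(consumer)
#
# Run it once inside a pod (overlay network) and once in a pod with
# hostNetwork: true to compare the two network stacks directly.
```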

Does anyone else have experience running production streaming workloads on
Kubernetes, and have you run into performance issues? Are there any settings
I could be missing?

I know most Kubernetes clusters are set up wildly differently, but any tips
or insights on where to look would be greatly appreciated.

Thanks,
Kalvin
