Hi all!
We are using Spark as a constantly running SQL interface to Parquet data on HDFS and GCS from our in-house app. We use autoscaling (dynamic allocation) with the k8s backend. Sometimes (approximately once a day) something nasty happens and Spark stops scaling down, staying at the maximum number of executors.
I've checked the graphs (https://imgur.com/a/6h3MfPa) and found a few strange things:
- When this happens, numberTargetExecutors and numberMaxNeededExecutors increase drastically and remain large even though there may be no requests at all (I've tried removing the driver from the backend pool; this did not help it scale down even with no requests for ~20 minutes).
- There are also lots of dropped events from the executorManagement queue.
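In case the way we collect these numbers matters: as far as I understand, numberTargetExecutors and numberMaxNeededExecutors come from the driver's ExecutorAllocationManager metrics source, and we export them through the Spark metrics system roughly like this (the Graphite sink and host below are just placeholders, not our exact setup):

# metrics.properties on the driver (illustrative)
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
# placeholder host/port
driver.sink.graphite.host=graphite.internal
driver.sink.graphite.port=2003
driver.sink.graphite.period=10
driver.sink.graphite.unit=seconds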
I've tried increasing the executorManagement queue size to 30000, but this did not help.
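To be precise, I'm not sure which knob is the right one here: I raised the global listener bus capacity (also listed with the options below), and the Spark 3.1 config docs seem to also have a per-queue override that I haven't tried yet:

# what I set (also in the option list below)
spark.scheduler.listenerbus.eventqueue.capacity=30000
# per-queue override from the docs, not tried yet
spark.scheduler.listenerbus.eventqueue.executorManagement.capacity=30000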
Is this a bug or expected behavior? Should I increase the queue size even more, or is there something else to adjust?
Thank you.
spark: 3.1.1
jvm: openjdk-11-jre-headless:amd64 11.0.10+9-0ubuntu1~18.04
k8s provider: gke
Some related Spark options:
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=50
spark.dynamicAllocation.executorIdleTimeout=120s
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
spark.dynamicAllocation.shuffleTracking.timeout=120s
spark.dynamicAllocation.executorAllocationRatio=0.5
spark.dynamicAllocation.schedulerBacklogTimeout=2s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=1s
spark.scheduler.listenerbus.eventqueue.capacity=30000
--
Regards, Alex.