Hi all!
We are using Spark as a constantly running SQL interface to Parquet data on HDFS and GCS from our in-house app. We use autoscaling (dynamic allocation) with the k8s backend. Sometimes (approximately once a day) something nasty happens and Spark stops scaling down, staying at the maximum number of executors.
I've checked the graphs (https://imgur.com/a/6h3MfPa) and found a few strange things:
- When this happens, numberTargetExecutors and numberMaxNeededExecutors increase drastically and remain large even though there may be no requests at all (I've tried removing the driver from the backend pool; this did not help it scale down even with no requests for ~20 minutes).
- There are also lots of dropped events from the executorManagement queue.
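In case the way we collect these numbers matters: as far as I understand, numberTargetExecutors and numberMaxNeededExecutors come from the driver's ExecutorAllocationManager metrics source, and we export them through the Spark metrics system roughly like this (the Graphite sink and host below are just placeholders, not our exact setup):

# metrics.properties on the driver (illustrative)
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
# placeholder host/port
driver.sink.graphite.host=graphite.internal
driver.sink.graphite.port=2003
driver.sink.graphite.period=10
driver.sink.graphite.unit=seconds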
I've tried increasing the executorManagement queue size to 30000, but this did not help.
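To be precise, I'm not sure which knob is the right one here: I raised the global listener bus capacity (also listed with the options below), and the Spark 3.1 config docs seem to also have a per-queue override that I haven't tried yet:

# what I set (also in the option list below)
spark.scheduler.listenerbus.eventqueue.capacity=30000
# per-queue override from the docs, not tried yet
spark.scheduler.listenerbus.eventqueue.executorManagement.capacity=30000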
Is this a bug or expected behavior? Should I increase the queue size even more, or is there something else to adjust?
Thank you.
spark: 3.1.1
jvm: openjdk-11-jre-headless:amd64 11.0.10+9-0ubuntu1~18.04
k8s provider: gke
Some related Spark options:
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=50
spark.dynamicAllocation.executorIdleTimeout=120s
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
spark.dynamicAllocation.shuffleTracking.timeout=120s
spark.dynamicAllocation.executorAllocationRatio=0.5
spark.dynamicAllocation.schedulerBacklogTimeout=2s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=1s
spark.scheduler.listenerbus.eventqueue.capacity=30000
--
Regards, Alex.