This has been brought up a few times. I will focus on Spark Structured
Streaming.

Autoscaling does not support Spark Structured Streaming (SSS). Why?
Because streaming jobs are typically long-running jobs that need to
maintain state across micro-batches. Autoscaling is designed to scale
Spark clusters up and down in response to workload changes. However,
this causes problems for Spark Structured Streaming jobs because they
would lose their state. These jobs continuously process incoming data
and update their state incrementally (see the checkpoint directory).
Autoscaling, which can dynamically add or remove worker nodes, would
disrupt this stateful processing. Although Spark itself supports dynamic
allocation (i.e. it can add or remove executors based on demand), that
is not the same as cluster-level autoscaling in a cloud provider such as
GCP, whether on Kubernetes or on managed clusters. For now you need to
plan your SSS workload accordingly.
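
To illustrate the point about state, here is a minimal sketch (Scala) of
a stateful Structured Streaming query with an explicit checkpoint
directory. The bootstrap servers, topic name, paths and trigger interval
are illustrative assumptions, not from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StatefulStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StatefulStreamSketch")
      .getOrCreate()
    import spark.implicits._

    // Kafka source (needs the spark-sql-kafka package on the classpath)
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumption
      .option("subscribe", "events")                        // assumption
      .load()

    // A running aggregation: its state is carried across micro-batches
    // and recovered from the checkpoint directory on restart. Removing
    // workers mid-flight disrupts exactly this kind of processing.
    val counts = df.selectExpr("CAST(key AS STRING) AS k")
      .groupBy($"k")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/events") // state lives here
      .trigger(Trigger.ProcessingTime("60 seconds"))
      .start()

    query.awaitTermination()
  }
}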

My general advice: the usual thing to watch from the Spark UI is

Processing Time (Process Rate) + Reserved Capacity < Batch Interval
(Batch Duration)

If your sink has an issue absorbing data in a timely manner as per the
formula above, you will see the effect on the Processing Rate.
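
If you prefer to watch this programmatically rather than from the UI,
here is a sketch using Spark's StreamingQueryListener. The 60-second
interval and the 1.7 multiplier (the 70% headroom heuristic below) are
my illustrative assumptions:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class BatchDurationListener(batchIntervalMs: Long) extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // "triggerExecution" covers the whole micro-batch, including the sink
    // write, so a slow sink shows up here as well as in the Process Rate.
    val processingMs =
      Option(p.durationMs.get("triggerExecution")).map(_.toLong).getOrElse(0L)
    val projectedMs = (processingMs * 1.7).toLong // allow for ~70% growth
    if (projectedMs > batchIntervalMs) {
      println(s"WARN batch ${p.batchId}: ${processingMs} ms now, " +
        s"~${projectedMs} ms projected, exceeds the ${batchIntervalMs} ms interval")
    }
  }
}

// Registration, assuming an existing SparkSession called spark:
// spark.streams.addListener(new BatchDurationListener(60000L))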

Batch Interval is driven by the rate at which the upstream source sends
messages, through Kafka or another source. We can start by assuming that
growth in the number of messages to process, and hence in processing
time, will require *additional reserved capacity*. As a heuristic,
anticipate a 70% (~1 SD) increase in processing time; in theory you
should then still be able to handle all this work within the batch
interval.
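
To make that concrete with illustrative numbers: with a 60-second batch
interval, a micro-batch currently taking 30 seconds would take about 51
seconds (30 x 1.7) after the anticipated 70% increase, still within the
interval. A batch already taking 40 seconds would be projected at
roughly 68 seconds, breaching the interval, so that job needs more
capacity (or a longer trigger interval) before the growth materialises.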

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 10 Oct 2023 at 16:11, Kiran Biswal <biswalki...@gmail.com> wrote:

> Hello Experts
>
> Is there any true auto scaling option for spark? The dynamic auto scaling
> works only for batch. Any guidelines on spark streaming  autoscaling and
> how that will be tied to any cluster level autoscaling solutions?
>
> Thanks
>
