How many topics and checkpoint directories are you dealing with?

Does each topic have its own checkpoint on S3?

All these checkpoint writes are sequential, so even an SSD would not really
help.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
https://en.everybodywiki.com/Mich_Talebzadeh


On Sat, 6 Jan 2024 at 08:19, Andrzej Zera <andrzejz...@gmail.com> wrote:

> Hey,
>
> I'm running a few Structured Streaming jobs (with Spark 3.5.0) that require
> near-real-time accuracy, with trigger intervals on the order of 5-10
> seconds. I usually run 3-6 streaming queries as part of the job, and each
> query includes at least one stateful operation (usually two or more). My
> checkpoint location is an S3 bucket and I use RocksDB as the state store.
> Unfortunately, checkpointing costs are quite high: they are the main cost
> item of the system, roughly 4-5 times the cost of compute.
>
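For reference, a minimal sketch of one such query, assuming a Kafka source
and Spark 3.5.0; the broker, topic, bucket and column names are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder()
      .appName("near-real-time-job")
      // RocksDB instead of the default HDFS-backed state store.
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()

    // Placeholder Kafka source.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // One stateful operation: a windowed count with a watermark.
    val counts = events
      .withWatermark("timestamp", "30 seconds")
      .groupBy(window(col("timestamp"), "10 seconds"), col("key"))
      .count()

    // Checkpoint to S3, committing every 5 seconds.
    counts.writeStream
      .format("console")
      .option("checkpointLocation", "s3://some-bucket/checkpoints/query-1")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()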
> To save on checkpointing costs, the following things are usually
> recommended (a sketch follows the list):
>
>    - increase trigger interval (as mentioned, I don't have much space
>    here)
>    - decrease the number of shuffle partitions (I have 2x the number of
>    workers)
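
Concretely, the second knob is a one-line setting; the value below is only
illustrative (e.g. 2x a 4-worker cluster), and it assumes the `spark` session
from the sketch above:

    // Fewer shuffle partitions -> fewer state store instances -> fewer
    // files written per commit. Note that a stateful query freezes this
    // number into its checkpoint on first run, so changing it later
    // requires starting from a fresh checkpoint location.
    spark.conf.set("spark.sql.shuffle.partitions", "8")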
>
> I'm looking for other recommendations I can use to save on checkpointing
> costs. I saw that most requests are LIST requests. Can we cut them down
> somehow? I'm using Databricks. If I replace the S3 bucket with DBFS, will
> it help in any way?
>
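A few settings worth testing against LIST-heavy checkpoints; the config keys
exist in Spark 3.5.0, but the values here are only illustrative:

    // Spark 3.4+: upload per-batch RocksDB changelogs instead of a full
    // snapshot on every commit, cutting per-batch S3 traffic.
    spark.conf.set(
      "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
      "true")

    // Keep fewer old micro-batch commits around (default 100), so that
    // cleanup and recovery have fewer files to list.
    spark.conf.set("spark.sql.streaming.minBatchesToRetain", "20")

    // Run state store maintenance (snapshotting/cleanup) less often than
    // the default 60s, reducing background LIST/DELETE churn.
    spark.conf.set("spark.sql.streaming.stateStore.maintenanceInterval", "300s")

On the DBFS question: dbfs:/ paths are typically backed by the same cloud
object store, so pointing the checkpoint there alone may not change the
request pattern much.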
> Thank you!
> Andrzej
>
>
