Re: [Structured Streaming] Keeping checkpointing cost under control

Andrzej Zera Sun, 07 Jan 2024 00:08:36 -0800

Usually one or two topics per query. Each query has its own checkpoint
directory. Each topic has a few partitions.
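
For reference, the layout described above (one checkpoint directory per query) can be sketched roughly as follows. This is only an illustration, not my actual job code: the helper names, the base path and the "noop" sink are hypothetical, and `stream_df` is assumed to be a streaming DataFrame built elsewhere.

```python
def checkpoint_location(base_path: str, query_name: str) -> str:
    """Return a checkpoint directory that is unique to one query."""
    return f"{base_path.rstrip('/')}/{query_name}"


def start_query(stream_df, query_name: str, base_path: str,
                trigger_seconds: int, sink_format: str = "noop"):
    """Start one streaming query with its own checkpoint directory.

    stream_df is assumed to be a streaming pyspark.sql.DataFrame.
    sink_format defaults to "noop" purely as a placeholder sink.
    """
    return (
        stream_df.writeStream
        .queryName(query_name)
        .format(sink_format)
        # Each query writes offsets, commits and state under its own prefix.
        .option("checkpointLocation", checkpoint_location(base_path, query_name))
        # Processing-time trigger, e.g. every 5-10 seconds.
        .trigger(processingTime=f"{trigger_seconds} seconds")
        .start()
    )
```

With 3-6 queries per job, this means 3-6 independent checkpoint prefixes in the bucket, each receiving writes on every trigger.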


Performance-wise, I don't experience any checkpointing bottlenecks. The
problem is purely the number of requests (including a high number of LIST
requests) and the associated cost.
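
To make the request-count concern concrete, here is a back-of-the-envelope sketch of how requests scale. It rests on a simplified assumption: every micro-batch commits state once per (stateful operator, shuffle partition) pair, and each commit costs some number of object-store requests. `requests_per_state_commit` and the price per 1,000 requests are hypothetical inputs that should be replaced with measured values and current pricing. The model ignores the offset/commit log writes and any background maintenance.

```python
from dataclasses import dataclass
from typing import Iterable

SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


@dataclass
class QueryProfile:
    trigger_seconds: float            # trigger interval of the query
    shuffle_partitions: int           # one state store per partition
    stateful_operators: int           # stateful operations in the query
    requests_per_state_commit: float  # hypothetical; measure from bucket metrics


def monthly_requests(q: QueryProfile) -> float:
    """Estimate object-store requests per month for one query."""
    batches = SECONDS_PER_MONTH / q.trigger_seconds
    commits_per_batch = q.shuffle_partitions * q.stateful_operators
    return batches * commits_per_batch * q.requests_per_state_commit


def monthly_cost_usd(queries: Iterable[QueryProfile],
                     usd_per_1000_requests: float) -> float:
    """Estimate monthly request cost across all queries of a job."""
    total = sum(monthly_requests(q) for q in queries)
    return total / 1000 * usd_per_1000_requests


# Illustrative numbers only: 10 s trigger, 8 partitions, 2 stateful operators.
example = QueryProfile(trigger_seconds=10, shuffle_partitions=8,
                       stateful_operators=2, requests_per_state_commit=1.0)
print(monthly_requests(example))
```

Under this model the request count is proportional to partitions x operators / trigger interval, which is why those two knobs are the usual recommendations and why a 5-10 second trigger gets expensive quickly.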

On Sat, 6 Jan 2024 at 13:30, Mich Talebzadeh <[email protected]> wrote:

> How many topics and checkpoint directories are you dealing with?
>
> Does each topic have its own checkpoint on S3?
>
> All these checkpoints are sequential writes, so even SSD would not really
> help.
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
> On Sat, 6 Jan 2024 at 08:19, Andrzej Zera <[email protected]> wrote:
>
>> Hey,
>>
>> I'm running a few Structured Streaming jobs (with Spark 3.5.0) that
>> require near-real-time accuracy, with trigger intervals on the order of
>> 5-10 seconds. I usually run 3-6 streaming queries as part of the job, and
>> each query includes at least one stateful operation (and usually two or
>> more). My checkpoint location is an S3 bucket and I use RocksDB as the
>> state store. Unfortunately, checkpointing costs are quite high. It's the
>> main cost item of the system, roughly 4-5 times the cost of compute.
>>
>> To save on checkpointing costs, the following things are usually
>> recommended:
>>
>>    - increase the trigger interval (as mentioned, I don't have much room
>>    here)
>>    - decrease the number of shuffle partitions (I have 2x the number of
>>    workers)
>>
>> I'm looking for other recommendations that I can use to save on
>> checkpointing costs. I saw that most requests are LIST requests. Can we cut
>> them down somehow? I'm using Databricks. If I replace the S3 bucket with
>> DBFS, will it help in any way?
>>
>> Thank you!
>> Andrzej
>>
>>
