If you use the RocksDB state store provider, you can turn on changelog
checkpointing so that only a single changelog file is written per partition
per batch. With changelog checkpointing disabled, Spark uploads the newly
created SST files and some log files on every commit, and if compaction has
happened, most SST files have to be re-uploaded. With changelog
checkpointing enabled, full snapshots are uploaded far less frequently,
which also gives better commit latency.
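
For reference, a minimal sketch of how this could be enabled on Spark 3.5+
(double-check the option names against your Spark/Databricks runtime):

  // Minimal sketch; assumes the built-in RocksDB state store provider.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("changelog-checkpointing-example")
    // Use RocksDB as the state store provider.
    .config("spark.sql.streaming.stateStore.providerClass",
      "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    // Upload per-batch changelog files instead of full SST snapshots on commit.
    .config("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
      "true")
    .getOrCreate()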

On Sat, Jan 6, 2024 at 7:40 PM Andrzej Zera <andrzejz...@gmail.com> wrote:

> Hey,
>
> I'm running a few Structured Streaming jobs (with Spark 3.5.0) that
> require near-real-time accuracy, with trigger intervals on the order of
> 5-10 seconds. I usually run 3-6 streaming queries as part of the job and
> each query includes at least one stateful operation (and usually two or
> more). My checkpoint location is an S3 bucket and I use RocksDB as the
> state store. Unfortunately, checkpointing costs are quite high. They are
> the main cost item of the system, roughly 4-5 times the cost of compute.
>
> To save on checkpointing costs, the following things are usually recommended:
>
>    - increase trigger interval (as mentioned, I don't have much space
>    here)
>    - decrease the number of shuffle partitions (I have 2x the number of
>    workers)
>
> I'm looking for other recommendations that I can use to save on
> checkpointing costs. I saw that most requests are LIST requests. Can we
> cut them down somehow? I'm using Databricks. If I replace the S3 bucket
> with DBFS, will it help in any way?
>
> Thank you!
> Andrzej
>
>
