Hey, I'm running a few Structured Streaming jobs (on Spark 3.5.0) that require near-real-time accuracy, with trigger intervals on the order of 5-10 seconds. I usually run 3-6 streaming queries per job, and each query includes at least one stateful operation (usually two or more). My checkpoint location is an S3 bucket and I use RocksDB as the state store. Unfortunately, checkpointing costs are quite high: they are the main cost item of the system, roughly 4-5 times the cost of compute.
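For context, here's a minimal sketch of the relevant configuration (the paths, partition count, and query layout are illustrative, not my exact job):

```python
# Sketch of the setup described above; values are placeholders.
spark_conf = {
    # RocksDB-backed state store for stateful operators (available since Spark 3.2)
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    # Each shuffle partition gets its own state store instance to checkpoint,
    # so this directly scales the number of checkpoint files per batch.
    "spark.sql.shuffle.partitions": "8",  # ~2x my worker count (hypothetical)
}

# Hypothetical checkpoint path; each query gets its own location.
checkpoint_location = "s3://my-bucket/checkpoints/query-1"

# End-to-end latency requirement forces a short trigger interval.
trigger_interval = "5 seconds"
```

Each of the 3-6 queries writes to its own checkpoint location at every trigger, which is where the S3 request volume comes from.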
To reduce these costs, the following things are usually recommended:

- increase the trigger interval (as mentioned, I don't have much room here)
- decrease the number of shuffle partitions (I already use 2x the number of workers)

I'm looking for other recommendations to cut checkpointing costs. I see that most of the requests are S3 LIST requests. Can we reduce them somehow? I'm using Databricks. If I replace the S3 bucket with DBFS, will that help in any way?

Thank you!
Andrzej