Usually one or two topics per query. Each query has its own checkpoint directory. Each topic has a few partitions.
Performance-wise, I don't see any checkpointing bottlenecks. It's all about the number of requests (including a high number of LIST requests) and the associated cost.

On Sat, 6 Jan 2024 at 13:30, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> How many topics and checkpoint directories are you dealing with?
>
> Does each topic have its own checkpoint on S3?
>
> All these checkpoints are sequential writes, so even SSD would not really
> help.
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
> On Sat, 6 Jan 2024 at 08:19, Andrzej Zera <andrzejz...@gmail.com> wrote:
>
>> Hey,
>>
>> I'm running a few Structured Streaming jobs (with Spark 3.5.0) that
>> require near-real-time accuracy, with trigger intervals on the order of
>> 5-10 seconds. I usually run 3-6 streaming queries as part of the job, and
>> each query includes at least one stateful operation (and usually two or
>> more). My checkpoint location is an S3 bucket and I use RocksDB as the
>> state store. Unfortunately, checkpointing costs are quite high. It's the
>> main cost item of the system and it's roughly 4-5 times the cost of
>> compute.
>>
>> To save on compute costs, the following things are usually recommended:
>>
>> - increase the trigger interval (as mentioned, I don't have much room
>>   here)
>> - decrease the number of shuffle partitions (I have 2x the number of
>>   workers)
>>
>> I'm looking for some other recommendations that I can use to save on
>> checkpointing costs. I saw that most requests are LIST requests. Can we
>> cut them down somehow? I'm using Databricks. If I replace the S3 bucket
>> with DBFS, will it help in any way?
>>
>> Thank you!
>> Andrzej
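
For anyone following the thread, here is a rough back-of-the-envelope sketch in Python of why the trigger interval and the number of shuffle partitions drive the request bill. It is only a simplified model, not how Spark actually counts requests: the per-partition and per-batch request counts and the price per 1,000 requests are all placeholder assumptions that you would replace with numbers from your own S3 request metrics and current pricing. Only the 5-10 second trigger interval and the 3-6 queries come from the thread.

```python
SECONDS_PER_DAY = 86_400


def batches_per_day(trigger_interval_s: float) -> float:
    """Number of micro-batches one query runs per day at a fixed trigger interval."""
    return SECONDS_PER_DAY / trigger_interval_s


def requests_per_batch(shuffle_partitions: int,
                       stateful_ops: int,
                       requests_per_state_partition: int,
                       fixed_requests: int) -> int:
    """Assumed model: every state partition of every stateful operator issues
    a fixed number of storage requests per batch, plus a fixed per-batch
    overhead (offset and commit log files, listings). All counts are
    hypothetical inputs, not measured Spark behaviour."""
    return (shuffle_partitions * stateful_ops * requests_per_state_partition
            + fixed_requests)


def daily_requests(queries: int, trigger_interval_s: float, per_batch: int) -> float:
    """Total storage requests per day across all queries in the job."""
    return queries * batches_per_day(trigger_interval_s) * per_batch


def daily_cost(requests: float, price_per_1000: float) -> float:
    """Request cost per day; price_per_1000 is whatever your provider charges."""
    return requests / 1000 * price_per_1000


if __name__ == "__main__":
    # Hypothetical inputs: 8 shuffle partitions, 2 stateful operators,
    # 3 requests per state partition, 10 fixed requests per batch.
    per_batch = requests_per_batch(8, 2, 3, 10)
    for interval in (5, 10):
        total = daily_requests(queries=4, trigger_interval_s=interval,
                               per_batch=per_batch)
        # 0.005 per 1,000 requests is an assumed price, check current pricing.
        print(interval, total, daily_cost(total, price_per_1000=0.005))
```

Under this model the request count scales linearly with the number of queries and inversely with the trigger interval, so going from 5 to 10 seconds halves it, and fewer shuffle partitions shrink the per-batch term. That matches the two usual recommendations quoted above, but it says nothing about how to reduce LIST requests specifically.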