Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Mich Talebzadeh
Catching up a bit late on this. I mentioned optimising RocksDB as below in my earlier thread, specifically: # Add RocksDB configurations here spark.conf.set("spark.sql.streaming.stateStore.providerClass",
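The truncated snippet above can be sketched out as follows. This is a minimal, illustrative configuration, assuming Spark 3.2+ where the RocksDB provider is built in; the optional tuning values are placeholder assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rocksdb-state-store").getOrCreate()

# Switch the state store backend to RocksDB (available since Spark 3.2).
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)

# Optional knobs (values here are illustrative assumptions only).
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.compactOnCommit", "false")
spark.conf.set("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
```

These settings apply per SparkSession and take effect for streaming queries started afterwards.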

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Jungtaek Lim
If you use the RocksDB state store provider, you can turn on changelog checkpointing so that only a single changelog file per partition per batch is uploaded. With changelog checkpointing disabled, Spark uploads newly created SST files and some log files, and if compaction has happened, most SST files have to be re-uploaded.
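A sketch of enabling this, assuming Spark 3.4+ (where changelog checkpointing was introduced) and the RocksDB provider already selected:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("changelog-checkpointing").getOrCreate()

# Changelog checkpointing requires the RocksDB state store provider.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)
# Upload a per-partition changelog per batch instead of SST snapshots
# (Spark 3.4+ only; ignored on earlier versions).
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
    "true",
)
```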

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Yes, I agree. But apart from maintaining this state internally (in memory, or memory+disk as in the case of RocksDB), on every trigger it saves some information about this state to the checkpoint location. I'm afraid we can't do much about this checkpointing operation. I'll continue looking for

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
Hi, You may have a point on scenario 2. Caching Streaming DataFrames: In Spark Streaming, each batch of data is processed incrementally, and it may not fit the typical caching we discussed. Instead, Spark Streaming has its mechanisms to manage and optimize the processing of streaming data. Case

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Hey, Yes, that's how I understood it (scenario 1). However, I'm not sure if scenario 2 is possible. I think caching a streaming DataFrame is supported only inside foreachBatch (in which it's actually no longer a streaming DF). Wed, 10 Jan 2024 at 15:01 Mich Talebzadeh wrote: > Hi, > > With

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
Hi, With regard to your point - Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, then you can try to cache batch ones to save on reads. But I'm not sure if it's what you mean, and I don't know how to apply what

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Thank you very much for your suggestions. Yes, my main concern is checkpointing costs. I went through your suggestions and here are my comments: - Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, then you can try

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Mich Talebzadeh
OK, I assume that your main concern is checkpointing costs. - Caching: If your queries read the same data multiple times, caching the data might reduce the amount of data that needs to be checkpointed. - Optimize checkpointing frequency. - Consider changelog checkpointing with RocksDB.
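On the checkpointing-frequency point: the trigger interval directly controls how often a micro-batch commits and therefore how often the checkpoint location is written. A minimal sketch, where `events_df`, the paths, and the 30-second interval are all hypothetical placeholders:

```python
# Sketch: a longer trigger interval means fewer micro-batch commits,
# and therefore fewer checkpoint writes (and object-store requests).
# `events_df` and the s3:// paths below are hypothetical placeholders.
query = (
    events_df.writeStream
    .trigger(processingTime="30 seconds")  # vs. the default as-fast-as-possible
    .option("checkpointLocation", "s3://bucket/checkpoints/query1")
    .format("parquet")
    .option("path", "s3://bucket/output/query1")
    .start()
)
```

The trade-off is latency: lengthening the interval cuts request volume but delays results by up to one interval.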

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Andrzej Zera
Usually one or two topics per query. Each query has its own checkpoint directory. Each topic has a few partitions. Performance-wise I don't experience any bottlenecks in terms of checkpointing. It's all about the number of requests (including a high number of LIST requests) and the associated
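To see why request counts rather than throughput dominate the bill at short trigger intervals, a back-of-the-envelope estimate helps. Every number below is a hypothetical assumption to plug your own job's figures into, not data from this thread:

```python
# Rough estimate of object-store requests driven by checkpointing.
# All inputs are hypothetical assumptions; substitute your own values.
queries = 4                 # streaming queries in the job
partitions = 200            # state partitions per query
files_per_partition = 2     # e.g. one changelog file + one metadata write
trigger_seconds = 10        # trigger interval

triggers_per_day = 24 * 3600 // trigger_seconds
writes_per_day = queries * partitions * files_per_partition * triggers_per_day

print(triggers_per_day)   # 8640 micro-batches per day
print(writes_per_day)     # 13824000 PUT-class requests per day
```

At typical per-request object-store pricing, millions of small writes per day can easily outweigh the cost of the bytes stored, which matches the observation above that LIST/PUT request counts, not throughput, are the bottleneck.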

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-06 Thread Mich Talebzadeh
How many topics and checkpoint directories are you dealing with? Does each topic have its own checkpoint on S3? All these checkpoints are sequential writes, so even SSDs would not really help. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom

[Structured Streaming] Keeping checkpointing cost under control

2024-01-05 Thread Andrzej Zera
Hey, I'm running a few Structured Streaming jobs (on Spark 3.5.0) that require near-real-time accuracy, with trigger intervals on the order of 5-10 seconds. I usually run 3-6 streaming queries as part of the job, and each query includes at least one stateful operation (and usually two or more).
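For readers unfamiliar with the shape of job being described, a minimal sketch of one such query follows. The Kafka topic, broker address, paths, window, and watermark are all hypothetical placeholders; only the overall structure (stateful windowed aggregation, short processing-time trigger, checkpoint location) mirrors the setup described above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("near-real-time-job").getOrCreate()

# Hypothetical Kafka source; topic and broker are placeholders.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# One stateful operation: a watermarked, windowed count.
counts = (
    events.select("timestamp", F.col("value").cast("string"))
    .withWatermark("timestamp", "1 minute")
    .groupBy(F.window("timestamp", "30 seconds"))
    .count()
)

# Short trigger interval => frequent commits to the checkpoint location.
query = (
    counts.writeStream.outputMode("update")
    .trigger(processingTime="5 seconds")
    .option("checkpointLocation", "s3://bucket/checkpoints/counts")
    .format("console")
    .start()
)
```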