Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
Hi, do you think there is any chance of this issue getting resolved? Should I create another bug report? As mentioned in my message, there is one open already: https://issues.apache.org/jira/browse/SPARK-45637 but it covers only one of the problems. Andrzej On Tue, 27 Feb 2024 at 09:58, Andrzej Zera

Re: Bugs with joins and SQL in Structured Streaming

2024-02-27 Thread Andrzej Zera
anteed. It is essential to note > that, as with any advice, quote "one test result is worth one-thousand > expert opinions (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". > > > O

Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Andrzej Zera
Hey all, I've been using Structured Streaming in production for almost a year and I want to share the bugs I have found in that time. I created a test for each of the issues and put them all here: https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala I split the issues

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-24 Thread Andrzej Zera
helps. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Thu, Jan 11, 2024 at 6:13 AM Andrzej Zera > wrote: > >> I'm struggling with the following issue in Spark >=3.4, related to >> multiple stateful operations. >> >> When spark.sql.strea

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
pBy(...).count() >>> intermediate_df.cache() >>> # Use cached intermediate_df for further transformations or actions >>> >>> HTH >>> >>> Mich Talebzadeh, >>> Dad | Technologist | Solutions Architect | Engineer >>> London >

[Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-10 Thread Andrzej Zera
I'm struggling with the following issue in Spark >=3.4, related to multiple stateful operations. When spark.sql.streaming.statefulOperator.allowMultiple is enabled, Spark keeps track of two types of watermarks: eventTimeWatermarkForEviction and eventTimeWatermarkForLateEvents. Introducing them
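For readers unfamiliar with how watermark timing can hold results back by a micro-batch, here is a minimal plain-Python sketch (not Spark code). It assumes the documented rule that a watermark is the maximum event time seen so far minus a delay threshold, and that a value computed at the end of one micro-batch first takes effect in the next. The function name, the sample data, and the single-watermark simplification are illustrative assumptions; the sketch does not model the two watermarks named above.

```python
from typing import List


def watermarks_per_batch(batches: List[List[int]], delay: int) -> List[int]:
    """Return the watermark in effect while each micro-batch runs.

    Assumption: watermark = max event time seen so far - delay, and a
    value computed at the end of batch N is first used by batch N + 1.
    """
    in_effect = []
    watermark = 0        # nothing seen yet
    max_event_time = 0
    for events in batches:
        in_effect.append(watermark)  # value used during this batch
        if events:
            max_event_time = max(max_event_time, max(events))
        # Updated only after the batch finishes; never moves backwards.
        watermark = max(watermark, max_event_time - delay)
    return in_effect


# Hypothetical event times (in seconds) arriving in three micro-batches.
sample_batches = [[10, 12], [25], [26]]
print(watermarks_per_batch(sample_batches, delay=5))  # [0, 7, 20]
```

In this sample the event at time 25 arrives in the second batch, but the watermark it produces (20) is only in effect during the third batch.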

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
s://en.everybodywiki.com/Mich_Talebzadeh > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The a

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Andrzej Zera
arising from > such loss, damage or destruction. > > > > > On Sat, 6 Jan 2024 at 08:19, Andrzej Zera wrote: > >> Hey, >> >> I'm running a few Structured Streaming jobs (with Spark 3.5.0) that >> require near-real time accuracy with trigger inter

[Structured Streaming] Keeping checkpointing cost under control

2024-01-05 Thread Andrzej Zera
Hey, I'm running a few Structured Streaming jobs (with Spark 3.5.0) that require near-real-time accuracy, with trigger intervals on the order of 5-10 seconds. I usually run 3-6 streaming queries as part of the job and each query includes at least one stateful operation (and usually two or more).
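To give a sense of scale for the figures above, here is a rough back-of-the-envelope sketch in plain Python. It only multiplies out the trigger intervals and query counts mentioned in the message. `writes_per_batch` is a hypothetical placeholder to be measured for a given job, not a Spark constant, and the result is an upper bound that assumes every trigger actually fires.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def batches_per_day(trigger_seconds: int, num_queries: int) -> int:
    """Upper bound on micro-batches per day across all queries."""
    return num_queries * (SECONDS_PER_DAY // trigger_seconds)


def checkpoint_writes_per_day(
    trigger_seconds: int, num_queries: int, writes_per_batch: int
) -> int:
    """Scale the batch count by a measured writes-per-batch figure.

    writes_per_batch is an assumption supplied by the caller.
    """
    return batches_per_day(trigger_seconds, num_queries) * writes_per_batch


# Low and high ends of the ranges quoted in the message.
print(batches_per_day(trigger_seconds=10, num_queries=3))  # 25920
print(batches_per_day(trigger_seconds=5, num_queries=6))   # 103680
```

Any per-write cost then scales linearly with these counts, which is why short trigger intervals combined with several queries add up quickly.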

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-27 Thread Andrzej Zera
> > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Fri, Oct 27, 2023 at 5:22 AM Andrzej Zera > wrote: > >> Hey All, >> >> I'm trying to reproduce the following streaming operation: "Time window >> aggregation in separate streams followed by stream-stream jo

[Structured Streaming] Joins after aggregation don't work in streaming

2023-10-26 Thread Andrzej Zera
Hey All, I'm trying to reproduce the following streaming operation: "Time window aggregation in separate streams followed by stream-stream join". According to the documentation, this should be possible in Spark 3.5.0, but I had no success despite several attempts. Here is a documentation snippet I'm
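As a point of reference for what that operation is expected to compute, here is a small plain-Python, batch-mode sketch (not Spark code, and not the documentation snippet referred to above). It counts events per tumbling window in two separate inputs and then inner-joins the counts on the window. The function names, the 60-second window width, and the sample event times are all illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def window_counts(event_times: Iterable[int], width: int) -> Dict[int, int]:
    """Count events per tumbling window, keyed by window start time."""
    return dict(Counter((t // width) * width for t in event_times))


def join_on_window(
    left: Dict[int, int], right: Dict[int, int]
) -> Dict[int, Tuple[int, int]]:
    """Inner join: keep only windows that appear on both sides."""
    return {w: (left[w], right[w]) for w in sorted(left.keys() & right.keys())}


# Hypothetical event times in seconds for the two streams.
clicks = [5, 65, 70, 130]
impressions = [10, 20, 61, 200]

joined = join_on_window(window_counts(clicks, 60), window_counts(impressions, 60))
print(joined)  # {0: (1, 2), 60: (2, 1)}
```

A streaming version would be expected to emit the same joined rows once the relevant windows are finalized, which gives a concrete result to compare a failing query against.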