Hi,

These are all on Spark 3.5, correct?
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London, United Kingdom

View my LinkedIn profile:
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand expert
opinions" (Wernher von Braun,
https://en.wikipedia.org/wiki/Wernher_von_Braun).

On Mon, 26 Feb 2024 at 22:18, Andrzej Zera <andrzejz...@gmail.com> wrote:

> Hey all,
>
> I've been using Structured Streaming in production for almost a year
> now, and I want to share the bugs I found during that time. I created
> a test for each of the issues and put them all here:
> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala
>
> I split the issues into three groups: outer joins on event time,
> interval joins, and Spark SQL.
>
> Issues related to outer joins:
>
> - When joining three or more input streams on event time, if two or
>   more streams don't contain an event for a join key (which is event
>   time), no row is output even if other streams contain an event for
>   that join key. Tests that check for this:
>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L86
>   and
>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L169
> - When joining a stream aggregated from raw events with a stream of
>   already-aggregated events (aggregation done outside of Spark), no
>   row is output if the second stream doesn't contain a corresponding
>   event.
>   Test that checks for this:
>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L266
> - When joining two aggregated streams (aggregated in Spark), no
>   result is produced. Test that checks for this:
>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L341.
>   I've already reported this one here:
>   https://issues.apache.org/jira/browse/SPARK-45637 but it hasn't
>   been handled yet.
>
> Issues related to interval joins:
>
> - When joining three streams (A, B, C) using an interval join on
>   event time, such that B.eventTime is conditioned on A.eventTime and
>   C.eventTime is also conditioned on A.eventTime, and then doing a
>   window aggregation based on A's event time, the result is output
>   only after the watermark crosses window end + interval(A, B) +
>   interval(A, C). However, I'd expect results to be output sooner,
>   i.e. when the watermark crosses window end + MAX(interval(A, B),
>   interval(A, C)). If event B can happen up to 3 minutes after event
>   A, and event C up to 5 minutes after A, there is no point in
>   holding back the output for 8 minutes (3 + 5) after the end of the
>   window when we know that no more events can be matched later than 5
>   minutes after the window end (assuming the window end is based on
>   A's event time). Test that checks for this:
>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/IntervalJoinTest.scala#L32
>
> SQL issues:
>
> - The WITH clause (in contrast to a subquery) seems to create a
>   static DataFrame that can't be used in streaming joins. Test that
>   checks for this:
>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/SqlSyntaxTest.scala#L31
> - Two subqueries, each aggregating data using the window() function,
>   break the output schema.
>   Test that checks for this:
>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/SqlSyntaxTest.scala#L122
>
> I'm a beginner with Scala (I use Structured Streaming with PySpark),
> so I won't be able to provide fixes. But I hope the test cases I
> provided can be of some help.
>
> Regards,
> Andrzej
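For readers following the interval-join timing point above without running
the linked tests: the difference between the observed and the expected
behaviour is just arithmetic on the watermark threshold. A minimal,
hypothetical sketch in plain Python (no Spark; function names and the
3/5-minute figures are illustrative, taken from the report):

```python
def observed_emission_time(window_end, interval_ab, interval_ac):
    """What the tests reportedly observe: the result is held until the
    watermark passes window end + interval(A, B) + interval(A, C)."""
    return window_end + interval_ab + interval_ac


def expected_emission_time(window_end, interval_ab, interval_ac):
    """What the reporter expects: no B can match later than interval(A, B)
    after A, and no C later than interval(A, C), so only the larger of the
    two intervals should delay the output."""
    return window_end + max(interval_ab, interval_ac)


# Using the figures from the report (minutes relative to the window end):
# B can arrive up to 3 minutes after A, C up to 5 minutes after A.
print(observed_emission_time(0, 3, 5))  # 8 -> output held 8 minutes
print(expected_emission_time(0, 3, 5))  # 5 -> could be emitted after 5
```

The gap grows with each additional interval-joined stream, since the
observed delay sums the intervals while only the maximum is needed.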
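Similarly, the multi-stream outer-join expectation can be stated as a
plain-Python reference model: in a full outer join, a join key present in
at least one stream should still yield a row, with nulls for the streams
that have no matching event. This is a hypothetical sketch of the expected
semantics (not of Spark's implementation); the report says Spark 3.5 drops
the row when two or more streams miss the key:

```python
def full_outer_join(*streams):
    """Reference semantics for a full outer join of several streams.

    Each stream is modelled as a dict mapping join key (here, event
    time) to a value. Any key present in at least one stream produces a
    row; streams without an event for that key contribute None.
    """
    all_keys = set().union(*(s.keys() for s in streams))
    return {k: tuple(s.get(k) for s in streams) for k in sorted(all_keys)}


a = {1: "a1", 2: "a2"}
b = {1: "b1"}  # no event for key 2
c = {}         # no event for key 2 either
rows = full_outer_join(a, b, c)
# Key 2 should still produce a row: ("a2", None, None). Per the report,
# Spark 3.5 emits nothing when two or more streams miss the join key.
```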