Hi,

Yes, I tested all of them on Spark 3.5.
Regards,
Andrzej

Mon, 26 Feb 2024 at 23:24 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> These are all on Spark 3.5, correct?
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
> view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Mon, 26 Feb 2024 at 22:18, Andrzej Zera <andrzejz...@gmail.com> wrote:
>
>> Hey all,
>>
>> I've been using Structured Streaming in production for almost a year
>> already, and I want to share the bugs I found in that time. I created a
>> test for each of the issues and put them all here:
>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala
>>
>> I split the issues into three groups: outer joins on event time,
>> interval joins, and Spark SQL.
>>
>> Issues related to outer joins:
>>
>> - When joining three or more input streams on event time, if two or
>>   more streams don't contain an event for a join key (which is event
>>   time), no row is output even if the other streams contain an event
>>   for this join key.
>>   Tests that check for this:
>>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L86
>>   and
>>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L169
>> - When joining a stream aggregated from raw events with a stream of
>>   already-aggregated events (aggregation done outside of Spark), no row
>>   is output if that second stream doesn't contain a corresponding
>>   event. Test that checks for this:
>>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L266
>> - When joining two aggregated streams (both aggregated in Spark), no
>>   result is produced. Test that checks for this:
>>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L341
>>   I've already reported this one here:
>>   https://issues.apache.org/jira/browse/SPARK-45637 but it hasn't been
>>   handled yet.
>>
>> Issues related to interval joins:
>>
>> - When joining three streams (A, B, C) using an interval join on event
>>   time, such that B.eventTime is conditioned on A.eventTime and
>>   C.eventTime is also conditioned on A.eventTime, and then doing a
>>   window aggregation based on A's event time, the result is output only
>>   after the watermark crosses the window end + interval(A, B) +
>>   interval(A, C). However, I'd expect results to be output sooner, i.e.
>>   when the watermark crosses the window end + MAX(interval(A, B),
>>   interval(A, C)). If our case is that event B can happen up to 3
>>   minutes after event A and event C can happen up to 5 minutes after A,
>>   there is no point in suspending the output for 8 minutes (3+5) after
>>   the end of the window if we know that no more events can be matched
>>   after 5 minutes from the window end (assuming the window end is based
>>   on A's event time).
>>   Test that checks for this:
>>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/IntervalJoinTest.scala#L32
>>
>> SQL issues:
>>
>> - A WITH clause (in contrast to a subquery) seems to create a static
>>   DataFrame that can't be used in streaming joins. Test that checks for
>>   this:
>>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/SqlSyntaxTest.scala#L31
>> - Two subqueries, each aggregating data using the window() function,
>>   break the output schema. Test that checks for this:
>>   https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/SqlSyntaxTest.scala#L122
>>
>> I'm a beginner with Scala (I'm using Structured Streaming with PySpark)
>> so I won't be able to provide fixes. But I hope the test cases I
>> provided can be of some help.
>>
>> Regards,
>> Andrzej
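[As a footnote for readers skimming the thread: the first outer-join report above is about expected full-outer-join semantics. This is a toy, non-Spark model in plain Python of what one would expect a three-way event-time outer join to produce when only one stream has an event for a key; all names here are illustrative and nothing below is Spark API.]

```python
# Toy model (NOT Spark): expected semantics of a three-way full outer join
# keyed on event time. Each stream is modeled as a dict of event_time -> value.

def full_outer_join_3(a: dict, b: dict, c: dict) -> dict:
    """Return one row per key seen in ANY stream; missing sides become None."""
    keys = set(a) | set(b) | set(c)
    return {k: (a.get(k), b.get(k), c.get(k)) for k in keys}

# One event in stream A, nothing matching in B or C for that event time.
a = {"12:00": 1}
b = {}
c = {}

# Expected: a row with nulls on the missing sides, not a dropped row.
print(full_outer_join_3(a, b, c))  # {'12:00': (1, None, None)}
```

The reported bug is that Structured Streaming emits no row at all in this situation, whereas full-outer-join semantics suggest the row with nulls shown above.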
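[The interval-join timing argument above (8 minutes observed vs. 5 minutes expected) can be sketched in plain Python. This is only the arithmetic of the claim, not Spark code; the function names are illustrative and the "observed" behavior is as reported in the message.]

```python
# Sketch of the emission-delay argument for the interval-join issue.
# Delays are in minutes past the window end (window end based on A's event time).

def observed_emission_delay(interval_ab: int, interval_ac: int) -> int:
    """Delay reportedly applied by Spark 3.5: the SUM of both join intervals."""
    return interval_ab + interval_ac

def expected_emission_delay(interval_ab: int, interval_ac: int) -> int:
    """Delay that would suffice: the MAX of the intervals, since no event can
    join a window once the longer interval has elapsed past the window end."""
    return max(interval_ab, interval_ac)

# B can arrive up to 3 minutes after A; C up to 5 minutes after A.
print(observed_emission_delay(3, 5))  # 8 -> output held for 8 minutes
print(expected_emission_delay(3, 5))  # 5 -> 5 minutes would already be safe
```

The gap between the two grows with the number and size of the intervals, which is why the reported behavior matters for latency-sensitive pipelines.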