Re: Spark structured streaming leftOuter join not working as I expect

2019-06-11 Thread Jungtaek Lim
Got the point. If you would like to get "correct" output, you may need to set the global watermark policy to "min", because the watermark is used not only for evicting rows from state, but also for discarding input rows that are late relative to the watermark. Here you may want to be aware that there are two stateful operators which will
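
To make the "min" remark concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not Spark's code: the function names, the integer timestamps and the `<=` lateness comparison are all made up for illustration. In Spark 2.4 the corresponding setting is, to my knowledge, `spark.sql.streaming.multipleWatermarkPolicy` with the values `min` and `max`.

```python
# Toy model (NOT Spark internals): how one global watermark could be derived
# from per-stream watermarks, and how it decides which input rows are dropped.
# All names and numbers here are hypothetical.

def global_watermark(per_stream_watermarks, policy="min"):
    """Combine per-stream watermarks into a single global watermark."""
    if policy == "min":
        return min(per_stream_watermarks.values())
    if policy == "max":
        return max(per_stream_watermarks.values())
    raise ValueError("unknown policy: " + policy)

def is_dropped_as_late(event_time, watermark):
    """Assumed rule: a row whose event time is at or behind the watermark is discarded."""
    return event_time <= watermark

# Streams A and B progress quickly, stream C lags far behind.
watermarks = {"A": 100, "B": 95, "C": 40}

wm_min = global_watermark(watermarks, "min")  # follows the slowest stream
wm_max = global_watermark(watermarks, "max")  # follows the fastest stream

# A row from the slow stream C with event time 60:
kept_under_min = not is_dropped_as_late(60, wm_min)
kept_under_max = not is_dropped_as_late(60, wm_max)
```

In this model the row from the slow stream survives under "min" but is discarded under "max", which is the trade-off described above: "min" keeps late data from slow streams at the cost of holding state longer.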

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-10 Thread Joe Ammann
Hi all, it took me some time to extract the issues into a piece of standalone code. I created the following gist: https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17 It has messages for 4 topics A/B/C/D and a simple Python program which shows 6 use cases, with my expectations

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-05 Thread Jungtaek Lim
Nice to hear you're investigating the issue deeply. Btw, if attaching code is not easy, maybe you could share the logical/physical plan of any batch: "detail" in the SQL tab shows the plan as a string. Plans from sequential batches would be very helpful - and streaming query status in these batch

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-05 Thread Joe Ammann
Hi Jungtaek, thanks for your response! I actually have set watermarks on all the streams A/B/C with the respective event time column A/B/C_LAST_MOD. So I think this should not be the reason. Of course, the event time on the C stream (the "optional one") progresses much slower than on the other

Re: Spark structured streaming leftOuter join not working as I expect

2019-06-05 Thread Jungtaek Lim
I would suspect that rows are never evicted from state in the second join. To determine whether a row is NOT matched on the other side, Spark has to check whether the row was ever matched before it is evicted. You need to set a watermark on either B_LAST_MOD or C_LAST_MOD. If you already did but did not show it here,
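
The eviction argument above can be illustrated with a small plain-Python model. This is only a sketch: the class `LeftOuterState`, its methods and the `max_delay` parameter are hypothetical, and real Spark keeps state for both sides of the join and derives the eviction time from the join condition.

```python
# Toy model (NOT Spark internals) of when a left outer join can emit an
# unmatched left row: only at the moment the row is evicted from state.

class LeftOuterState:
    def __init__(self, max_delay):
        # How long a left row may wait for a match (hypothetical parameter).
        self.max_delay = max_delay
        self.rows = {}  # key -> [event_time, matched_flag]

    def add_left(self, key, event_time):
        self.rows[key] = [event_time, False]

    def add_right(self, key):
        """Mark the buffered left row as matched; return True if one was found."""
        if key in self.rows:
            self.rows[key][1] = True
            return True
        return False

    def advance_watermark(self, watermark):
        """Evict expired rows; emit (key, None) for those that never matched."""
        emitted = []
        for key in list(self.rows):
            event_time, matched = self.rows[key]
            if event_time + self.max_delay < watermark:
                del self.rows[key]
                if not matched:
                    emitted.append((key, None))
        return emitted

state = LeftOuterState(max_delay=10)
state.add_left("b1", 100)
state.add_left("b2", 100)
state.add_right("b1")  # only b1 finds a partner

stalled = state.advance_watermark(50)      # watermark behind the rows: nothing evicted
progressed = state.advance_watermark(120)  # watermark passed 100 + 10: rows evicted
```

As long as the watermark stays behind the buffered rows, the unmatched row `b2` produces no output at all; the `(b2, None)` result appears only once the watermark moves past its eviction time.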

Spark structured streaming leftOuter join not working as I expect

2019-06-04 Thread Joe Ammann
Hi all, sorry, tl;dr: I'm on my first Python Spark structured streaming app, in the end joining messages from ~10 different Kafka topics. I've recently upgraded to Spark 2.4.3, which has resolved all the issues with time handling (watermarks, join windows) that I had before with Spark 2.3.2. My