Got the point. If you would like to get "correct" output, you may need to
set global watermark as "min", because watermark is not only used for
evicting rows in state, but also discarding input rows later than
watermark. Here you may want to be aware that there're two stateful
operators which will
Hi all
it took me some time to get the issues extracted into a piece of standalone
code. I created the following gist
https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17
I has messages for 4 topics A/B/C/D and a simple Python program which shows 6
use cases, with my expectations a
Nice to hear you're investigating the issue deeply.
Btw, if attaching code is not easy, maybe you could share logical/physical
plan on any batch: "detail" in SQL tab would show up the plan as string.
Plans from sequential batches would be much helpful - and streaming query
status in these batch (e
Hi Jungtaek
Thanks for your response!
I actually have set watermarks on all the streams A/B/C with the respective
event time
column A/B/C_LAST_MOD. So I think this should not be the reason.
Of course, the event time on the C stream (the "optional one") progresses much
slower
than on the other
I would suspect that rows are never evicted in state in second join. To
determine whether the row is NOT matched to other side, Spark should check
whether the row is ever matched before evicted. You need to set watermark
either B_LAST_MOD or C_LAST_MOD.
If you already did but not exposed to here,