Re: structured streaming join of streaming dataframe with static dataframe performance

2022-08-04 Thread Koert Kuipers
that's a good point about skewness and potential join optimizations. i will try turning off all skew optimizations and forcing a sort-merge join, and see if it then re-uses shuffle files on the static side. unfortunately my static side is too large to broadcast. the streaming side can be broadcast
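
For anyone wanting to try the same experiment, here is a rough sketch of the settings involved. The configuration keys are standard Spark SQL options. The helper name `apply_conf` and the dict name are made up for illustration. Whether these settings actually make Spark re-use the static side's shuffle files across microbatches is exactly what the experiment would have to show; nothing in the thread confirms it.

```python
# Sketch only: settings to try when disabling skew handling and steering
# Spark towards a sort-merge join. Not confirmed to enable shuffle re-use.
SORT_MERGE_NO_SKEW_CONF = {
    # turn off adaptive query execution and its skew-join handling
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.adaptive.skewJoin.enabled": "false",
    # -1 disables automatic broadcast joins
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    # prefer sort-merge join over shuffled hash join
    "spark.sql.join.preferSortMergeJoin": "true",
}


def apply_conf(spark, conf=SORT_MERGE_NO_SKEW_CONF):
    """Apply each setting to an existing SparkSession (hypothetical helper)."""
    for key, value in conf.items():
        spark.conf.set(key, value)
    return spark
```

With a real session this would be called as `apply_conf(spark)` before starting the query. On Spark 3.0 or later a per-join hint such as `stream_df.join(static_df.hint("merge"), "key")` is another way to request a sort-merge join, where `stream_df`, `static_df` and `"key"` are placeholder names.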

Re: structured streaming join of streaming dataframe with static dataframe performance

2022-08-04 Thread kant kodali
I suspect it is because the incoming rows, when joined with the static frame, can lead to a variable degree of skew over time, and if so it is probably better to employ different join strategies at run time. But if you know your Dataset, I believe you can just do a broadcast join for your

structured streaming join of streaming dataframe with static dataframe performance

2022-07-17 Thread Koert Kuipers
i was surprised to find out that if a streaming dataframe is joined with a static dataframe, the static dataframe is re-shuffled for every microbatch, which adds considerable overhead. wouldn't it make more sense to re-use the shuffle files? or if that is not possible then load the static