adriangb commented on PR #17529: URL: https://github.com/apache/datafusion/pull/17529#issuecomment-3301353213
Very cool! Incidentally, we were just discussing today with @gabotechs and @robtandy how to make HashJoin dynamic filter pushdown more compatible with distributed datafusion, and how to eliminate the latency of waiting for the full build side before creating filters.

One idea that came up was to push something like: `(hash(join_key_col) % parts != 0) or join_key_col >= 1 and join_key_col <= 2`, where `0` is the partition number, `1` is the min value of `join_key_col` in that partition, and `2` is the max value. We can push these down "as they are ready", and once all partitions are ready we can simplify this to our current `join_key_col >= 1 and join_key_col <= 2`.

I bring this up because the main sticking point with that approach is that we add a third computation of the hash to the existing two (in the hash join and in repartition/shuffle), which might have a performance impact. That led us to file https://github.com/apache/datafusion/issues/17599, which is a broader issue that DataFusion has. But for this PR, the big question in my mind is: is the cost of the extra evaluation of the hash worth it?
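
To make the per-partition idea concrete, here is a rough Python sketch. It is not DataFusion code: `partition_of`, `PartialFilter`, `report`, `passes`, and `simplified` are hypothetical names, the hash is a toy stand-in for whatever hash repartitioning actually uses, and empty build partitions are ignored. It only illustrates the logic: a row is kept if its partition has not reported bounds yet, or if its key falls inside that partition's reported `[min, max]`.

```python
from typing import Dict, Optional, Tuple


def partition_of(key: int, parts: int) -> int:
    """Toy stand-in for the repartitioning hash (an assumption, NOT DataFusion's hash)."""
    mixed = (key * 2654435761) & 0xFFFFFFFF  # simple multiplicative mix
    return mixed % parts


class PartialFilter:
    """Hypothetical helper: join-key bounds reported one build partition at a time."""

    def __init__(self, parts: int) -> None:
        self.parts = parts
        # partition number -> (min, max) of the join key seen in that partition
        self.bounds: Dict[int, Tuple[int, int]] = {}

    def report(self, partition: int, lo: int, hi: int) -> None:
        """Record the bounds of a build partition as soon as it is ready."""
        self.bounds[partition] = (lo, hi)

    def passes(self, key: int) -> bool:
        """AND over ready partitions of: (hash % parts != p) or (lo <= key <= hi)."""
        p = partition_of(key, self.parts)
        if p not in self.bounds:
            # Partition not ready yet: we cannot prune, so the row passes.
            return True
        lo, hi = self.bounds[p]
        return lo <= key <= hi

    def simplified(self) -> Optional[Tuple[int, int]]:
        """Once every partition has reported, return the hash-free global (min, max)."""
        if len(self.bounds) < self.parts:
            return None
        los, his = zip(*self.bounds.values())
        return min(los), max(his)
```

Note that in this sketch the simplified global range is looser than the per-partition clauses (a key can fall inside the global range but outside its own partition's range), but evaluating it needs no hash at all, which is the trade-off in question.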