adriangb commented on PR #17529:
URL: https://github.com/apache/datafusion/pull/17529#issuecomment-3301353213

   Very cool!
   
   Incidentally we were just discussing today with @gabotechs and @robtandy how 
to make HashJoin dynamic filter pushdown more compatible with distributed 
datafusion and how to eliminate the latency associated with waiting until we 
have the full build side to create filters.
   
   One idea that came up was to push something like:
   `(hash(join_key_col) % parts != 0) or (join_key_col >= 1 and join_key_col <= 
2)` where `0` is the partition number, `1` is the min value of `join_key_col` in 
that partition, and `2` is the max value.
   We can push these down "as they are ready", and once all partitions 
are ready we can simplify this to our current `join_key_col >= 1 and 
join_key_col <= 2`.
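   
   To make the idea concrete, here is a rough sketch of how such a filter could behave. It is only an illustration under assumptions: `Bounds`, `partition_of`, `keep_row` and `simplified` are hypothetical names, and `DefaultHasher` stands in for whatever hash the repartitioning actually uses. None of this is DataFusion's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Min/max of the join key seen by one build-side partition (hypothetical type).
#[derive(Clone, Copy, Debug)]
struct Bounds {
    min: i64,
    max: i64,
}

/// Stand-in for the repartitioning hash. This is an assumption: the real
/// filter would have to use the same hash function as the shuffle.
fn partition_of(key: i64, parts: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % parts
}

/// Evaluates the conjunction, over every partition `p` that has reported, of
/// `(hash(key) % parts != p) or (key >= min_p and key <= max_p)`.
/// A row that hashes to a partition without bounds yet is always kept.
fn keep_row(key: i64, bounds: &[Option<Bounds>]) -> bool {
    if bounds.is_empty() {
        return true; // no partitions known, nothing to filter on
    }
    let p = partition_of(key, bounds.len() as u64) as usize;
    match bounds[p] {
        Some(b) => b.min <= key && key <= b.max,
        None => true,
    }
}

/// Once every partition has reported, collapses the per-partition bounds into
/// a single range, so the hash no longer needs to be evaluated.
/// Returns `None` while any partition is still pending.
fn simplified(bounds: &[Option<Bounds>]) -> Option<Bounds> {
    let mut merged: Option<Bounds> = None;
    for b in bounds {
        let b = (*b)?; // bail out if this partition has not reported yet
        merged = Some(match merged {
            Some(m) => Bounds {
                min: m.min.min(b.min),
                max: m.max.max(b.max),
            },
            None => b,
        });
    }
    merged
}
```

   In this sketch the per-partition form can only discard a row whose own partition has already reported, which is why it is safe to push down before the whole build side is available. The collapsed range is cheaper to evaluate but can be looser than the per-partition checks.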
   I bring this up because the main sticking point with that approach is that 
we add a third computation of the hash to the existing two (in the hash join and 
in the repartition/shuffle), which might have a performance impact.
   That led us to file https://github.com/apache/datafusion/issues/17599 which 
is a broader issue that DataFusion has.
   
   But for this PR the big question in my mind is going to be: is the cost of 
the extra evaluation of the hash worth it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

