Re: [I] Push down entire hash table from HashJoinExec into scans [datafusion]

via GitHub Sat, 20 Sep 2025 16:23:53 -0700


adriangb commented on issue #17171:
URL: https://github.com/apache/datafusion/issues/17171#issuecomment-3280509469


   > Probe Phase
   > For each tuple in S, hash its join key and check to see whether there is a match for each tuple in corresponding bucket in the hash table constructed for R. If inputs were partitioned, then assign each thread a unique partition. Otherwise, synchronize their access to the cursor on S.
   > Bloom Filter: Create a Bloom Filter during the build phase when the key is likely to not exist in the hash table [4]. Threads check the filter before probing the hash table. This will be faster since the filter will fit in CPU caches. Sometimes called sideways information passing.
   
   But fair enough yeah.
   
   I think the best way to figure this out is to cook up the implementation(s), put them behind feature flags, and have folks like @LiaCastaneda report their results.
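
For reference, the build/probe flow described in the quote could be sketched roughly as below. This is only an illustration under stated assumptions: join keys are plain `u64`, rows are `(key, payload)` pairs, and the filter is sized at about 10 bits per build row with 4 hash probes. `BloomFilter`, `build_side`, and `probe_side` are hypothetical names made up for this sketch; none of this is DataFusion's `HashJoinExec` code or API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter over u64 join keys (illustrative, not DataFusion code).
pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl BloomFilter {
    pub fn new(num_bits: u64, num_hashes: u64) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes: num_hashes.max(1) }
    }

    /// Two base hashes; the i-th bit position is derived as a + i * b (double hashing).
    fn hash_pair(key: u64) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (key, 0x9e37_79b9_7f4a_7c15u64).hash(&mut h2);
        (h1.finish(), h2.finish() | 1)
    }

    pub fn insert(&mut self, key: u64) {
        let (a, b) = Self::hash_pair(key);
        for i in 0..self.num_hashes {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            self.bits[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }

    /// False means "definitely absent"; true means "possibly present".
    pub fn may_contain(&self, key: u64) -> bool {
        let (a, b) = Self::hash_pair(key);
        (0..self.num_hashes).all(|i| {
            let bit = a.wrapping_add(i.wrapping_mul(b)) % self.num_bits;
            self.bits[(bit / 64) as usize] & (1u64 << (bit % 64)) != 0
        })
    }
}

/// Build phase: populate the hash table for R and the filter in one pass.
pub fn build_side(r: &[(u64, u32)]) -> (HashMap<u64, Vec<u32>>, BloomFilter) {
    let mut table: HashMap<u64, Vec<u32>> = HashMap::new();
    // Assumed sizing: roughly 10 bits per build row, 4 hash probes.
    let mut filter = BloomFilter::new((r.len() as u64) * 10, 4);
    for &(key, payload) in r {
        table.entry(key).or_default().push(payload);
        filter.insert(key);
    }
    (table, filter)
}

/// Probe phase: check the filter first, then the hash table only on a possible hit.
pub fn probe_side(
    s: &[(u64, u32)],
    table: &HashMap<u64, Vec<u32>>,
    filter: &BloomFilter,
) -> Vec<(u64, u32, u32)> {
    let mut out = Vec::new();
    for &(key, s_payload) in s {
        if !filter.may_contain(key) {
            continue; // definitely no match, skip the hash table lookup
        }
        if let Some(matches) = table.get(&key) {
            for &r_payload in matches {
                out.push((key, r_payload, s_payload));
            }
        }
    }
    out
}
```

The filter can return false positives but never false negatives, so the hash table lookup stays as the source of truth and the join output is the same with or without the filter. Whether the extra check pays off depends on how selective the join is, which is the kind of thing the feature-flagged experiments would measure.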



