Re: [PR] Hash join buffering on probe side [datafusion]

via GitHub Mon, 19 Jan 2026 06:49:50 -0800


gabotechs commented on PR #19761:
URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3768701234


   I think for this to be optimal further complex logic will need to be 
introduced so that buffering can properly interact with dynamic filtering.
   
   However, even if we decide to not buffer eagerly in case there's a dynamic 
filter below a `BufferExec`, it does show some small speedups even in 
benchmarks that pull data from local files, which means that speedups in more 
IO bounded environments (files in S3) will be even more noticeable.
   
   I want to propose moving forward with this, but with buffering disabled by 
default. That way, users can decide to opt into it until we find a smarter rule 
of when to apply buffering automatically based on the query shape and where are 
dynamic filters placed.
   
   The steps I want to propose are:
   - https://github.com/apache/datafusion/pull/19759: review and ship this one 
first in isolation
   - https://github.com/apache/datafusion/pull/19760: just add the `BufferExec` 
node without any wireup to the rest of DataFusion
   - Remove all the `ExecutionPlan` modifications from this PR and just move 
forward with the physical optimizer rule, leaving it disabled by default
   - Do some follow ups in order to ship any smarter logic that opens the door 
to apply buffering automatically
   
   @Dandandan what do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Hash join buffering on probe side [datafusion]

Reply via email to