gabotechs commented on PR #19761: URL: https://github.com/apache/datafusion/pull/19761#issuecomment-3768701234
I think for this to be optimal further complex logic will need to be introduced so that buffering can properly interact with dynamic filtering. However, even if we decide to not buffer eagerly in case there's a dynamic filter below a `BufferExec`, it does show some small speedups even in benchmarks that pull data from local files, which means that speedups in more IO bounded environments (files in S3) will be even more noticeable. I want to propose moving forward with this, but with buffering disabled by default. That way, users can decide to opt into it until we find a smarter rule of when to apply buffering automatically based on the query shape and where are dynamic filters placed. The steps I want to propose are: - https://github.com/apache/datafusion/pull/19759: review and ship this one first in isolation - https://github.com/apache/datafusion/pull/19760: just add the `BufferExec` node without any wireup to the rest of DataFusion - Remove all the `ExecutionPlan` modifications from this PR and just move forward with the physical optimizer rule, leaving it disabled by default - Do some follow ups in order to ship any smarter logic that opens the door to apply buffering automatically @Dandandan what do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
