gabotechs commented on PR #19760: URL: https://github.com/apache/datafusion/pull/19760#issuecomment-3833440299
> This seems in many ways quite similar to what RepartitionExec w/ spilling does. Have you had a chance to poke at that code?

Yes, in fact a small chunk of the code there still shows my name in the git blame. It is indeed similar in the sense that there is some per-partition buffering, but that code is in a more difficult situation: it needs to be able to buffer potentially indefinitely due to the unbounded nature of RepartitionExec (correct me if I'm wrong, it's been a while since I looked at that code), whereas the code in this PR can afford to use bounded channels. At first sight I don't see many opportunities for reusing code in both places, given the different requirements, but I'm happy to listen to ideas.

> Maybe both are needed though? As in: you want buffering and prefetching.

Another difference from RepartitionExec is that BufferExec eagerly polls its children regardless of whether its own stream has been polled, while RepartitionExec waits for the first poll before it starts doing work. This means that RepartitionExec does not prefetch, but BufferExec does (a rough sketch of this pattern follows below).

> The advantage I see of buffering at the Parquet level is that the reader can do fancy things like planning to fetch a larger contiguous chunk of data from object storage

👍 I can see this being beneficial. My intention was to first use this in https://github.com/apache/datafusion/pull/19761, but the BufferExec node is something you should be able to place wherever you want. In fact, we already use it in more scenarios at DataDog.
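For illustration, here is a minimal, hypothetical sketch of the eager-prefetch-into-a-bounded-channel pattern described above, not the actual BufferExec implementation from this PR. The function name `prefetch_stream` and the choice of tokio's `mpsc` channel are assumptions made for the example:

```rust
use futures::{Stream, StreamExt};
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

/// Wraps `input` so that it is drained eagerly into a *bounded* channel.
/// Hypothetical helper for illustration only.
fn prefetch_stream<T: Send + 'static>(
    mut input: impl Stream<Item = T> + Send + Unpin + 'static,
    capacity: usize,
) -> impl Stream<Item = T> {
    // Bounded channel: once `capacity` items are buffered, the producer
    // task suspends on `send`, capping memory use (in contrast to the
    // potentially unbounded buffering RepartitionExec must support).
    let (tx, rx) = mpsc::channel(capacity);

    // Spawned eagerly: polling of `input` begins here, at construction
    // time, not on the first poll of the returned stream. This is the
    // prefetching behavior attributed to BufferExec above.
    tokio::spawn(async move {
        while let Some(item) = input.next().await {
            if tx.send(item).await.is_err() {
                break; // consumer dropped the stream; stop producing
            }
        }
    });

    ReceiverStream::new(rx)
}

#[tokio::main]
async fn main() {
    // Demo: the producer side starts running as soon as prefetch_stream
    // returns, even before `out` is polled for the first time.
    let input = futures::stream::iter(0..10);
    let mut out = prefetch_stream(input, 4);
    while let Some(v) = out.next().await {
        println!("{v}");
    }
}
```

The key contrast with the lazy approach is that the spawn happens at construction time; a lazy implementation in the style of RepartitionExec would defer it until the first `poll_next` call on the output stream.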
