BlakeOrth commented on PR #18146: URL: https://github.com/apache/datafusion/pull/18146#issuecomment-3487838551
> Thus, what I suggest as action item is:
>
> 1. We get the CI green (by moving the test into core_integration)
> 2. Merge this PR (after we have branched for https://github.com/apache/datafusion/issues/17558 )
>
> As follow on PRs then
>
> 1. Implement caching of LIST results (tracked by https://github.com/apache/datafusion/issues/17211)
> 2. Try and be smarter about the prefixes used in LISTs when we have equality predicates on partition columns

@alamb This path forward sounds good to me. For the follow-on PRs, I believe the changes needed to implement the smarter prefixed listing are relatively simple, and they will have implications for how caching is ultimately handled. I would recommend implementing that small optimization before implementing caching. Does that sound like a reasonable implementation order?

> This is likely to work well for tables with fewer than 1000 files (the maximum number of results that comes back in a single LIST request) However, when there are many more files this PR will likely take longer as it will list ALL files present with sequential LIST operations whereas main will issue concurrent LIST operations)

I agree with this assessment of the performance implications. I think there is also a subtle additional performance improvement here: this implementation allows better downstream concurrency in all cases. The previous implementation effectively removed any benefit of the files coming back as a stream, because it had to fully complete at least the initial list operation(s) before yielding any elements on the stream. This implementation will (in most cases) begin yielding elements on the stream as soon as the first request returns.
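To make the streaming point concrete, here is a minimal sketch of the difference between buffering LIST results and yielding them as they arrive. It is a simplified model, not DataFusion or `object_store` code: `PagedStore`, `pages`, and the two `requests_before_first_*` functions are hypothetical names, the store is an in-memory list split into fixed-size pages, and the "eager" variant buffers every page, which is a stronger assumption than the "at least the initial list operation(s)" behavior described above.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an object store whose LIST results come back in pages.
struct PagedStore {
    files: Vec<String>,
    page_size: usize,
    /// Number of LIST requests issued so far.
    requests: Cell<usize>,
}

impl PagedStore {
    fn new(n_files: usize, page_size: usize) -> Self {
        let files = (0..n_files).map(|i| format!("part-{i}.parquet")).collect();
        Self { files, page_size, requests: Cell::new(0) }
    }

    /// Lazily yields one page per pull; each pulled page counts as one LIST request.
    fn pages(&self) -> impl Iterator<Item = Vec<String>> + '_ {
        self.files.chunks(self.page_size).map(move |chunk| {
            self.requests.set(self.requests.get() + 1);
            chunk.to_vec()
        })
    }
}

/// Eager model: every page is fetched and buffered before the first file
/// is handed to the downstream consumer.
fn requests_before_first_eager(store: &PagedStore) -> usize {
    let all: Vec<String> = store.pages().flatten().collect();
    let _first = all.first();
    store.requests.get()
}

/// Streaming model: the first file is available as soon as the first page returns.
fn requests_before_first_streaming(store: &PagedStore) -> usize {
    let mut files = store.pages().flatten();
    let _first = files.next();
    store.requests.get()
}
```

With 2500 files and a page size of 1000, the eager model issues all three requests before the consumer sees a file, while the streaming model issues one. The total number of requests needed to read the whole listing is the same in both models; only the time to the first element differs.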
