alamb commented on PR #18146: URL: https://github.com/apache/datafusion/pull/18146#issuecomment-3491578094
> @alamb This path forward sounds good to me. For the follow-on PRs I believe the changes needed to implement the smarter prefixed listing are actually relatively simple, and will have implications for how caching is ultimately handled. I would personally recommend we implement that small optimization prior to implementing caching. Does that sound like a reasonable implementation order?

Yes for sure. Note you don't have to be the only one implementing this -- I think we could implement caching in parallel with smarter use of partitioning values for `LIST`ing

> > This is likely to work well for tables with fewer than 1000 files (the maximum number of results that comes back in a single LIST request) However, when there are many more files this PR will likely take longer as it will list ALL files present with sequential LIST operations whereas main will issue concurrent LIST operations)
>
> I agree with this assessment of the performance implications. I think there is an additional subtle performance improvement here, where this implementation allows better downstream concurrency in all cases. The previous implementation effectively removed any benefits of the files coming back as a stream because it had to complete at least initial list operation(s) fully prior to yielding any elements on the stream, whereas this implementation will (in most cases) begin yielding elements on the stream at the first request.

That is an interesting point -- though this PR won't interleave IO and CPU the way the previous one did -- though realistically the amount of processing per response is pretty small

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
