BlakeOrth commented on PR #18146: URL: https://github.com/apache/datafusion/pull/18146#issuecomment-3423770018
> Thank you @BlakeOrth
>
> > tl;dr of the issue: normalizing the access pattern(s) for objects for partitioned tables should not only reduce the number of requests to a backing object store, but will also allow any existing and/or future caching mechanisms to apply equally to both directory-partitioned and flat tables.
>
> I don't fully understand this. Is the idea that the current code will do something like
>
> ```
> LIST path/to/table/a=1/b=2/c=3/
> ```
>
> But if we aren't more clever the basic cache will just have a list like
>
> ```
> LIST path/to/table/
> ```
>
> (and thus not be able to satisfy the request)?
>
> It seems to me that we may have to implement prefix listing on the files cache as well, to avoid causing regressions in existing functionality.

@alamb So in the current code, a table that is listed like

```
LIST path/to/table/a=1/b=2/c=3/
```

cannot take advantage of any list files caching (at least as implemented) because the cache mechanisms don't exist for tables with partition columns. However, the current code _can_ reduce the number of `LIST` operations for this table given appropriate query filters.

The code in this PR would enable a simple implementation of the list files cache to store a key for _all_ objects under

```
LIST path/to/table/
```

and continue to appropriately filter cached results based on query filters. However, it would (again, as written) remove the ability to list specific prefixes based on query filters.

> It seems to me that we may have to implement prefix listing on the files cache as well, to avoid causing regressions in existing functionality.

If we implemented the ability to list a specific prefix in a table, I think any cache would also need to be "prefix aware". Otherwise we've more or less just made a lateral move where caching may apply to flat tables but not directory-partitioned tables.

Does that help clarify this a bit? I hope I understood your question correctly.
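To make "prefix aware" a bit more concrete, here is a minimal sketch of the idea: cache one full listing under the table root, then answer a request for a partition prefix by filtering that cached listing instead of issuing another `LIST`. Everything here is hypothetical and only for illustration: `ListFilesCache`, `put`, `list_with_prefix`, and `partition_prefix` are made-up names and are not DataFusion's actual cache API, and plain `String` paths stand in for real object metadata.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a list-files cache (not DataFusion's real API).
#[derive(Default)]
struct ListFilesCache {
    /// Maps a table root (e.g. "path/to/table/") to every object path
    /// returned by one full `LIST` of that root.
    entries: HashMap<String, Vec<String>>,
}

impl ListFilesCache {
    /// Stores the result of a single `LIST <table_root>`.
    fn put(&mut self, table_root: &str, objects: Vec<String>) {
        self.entries.insert(table_root.to_string(), objects);
    }

    /// "Prefix aware" lookup: if some cached root contains `prefix`, return
    /// the cached objects under `prefix`; otherwise report a miss with `None`.
    fn list_with_prefix(&self, prefix: &str) -> Option<Vec<&str>> {
        self.entries
            .iter()
            .find(|(root, _)| prefix.starts_with(root.as_str()))
            .map(|(_, objects)| {
                objects
                    .iter()
                    .map(String::as_str)
                    .filter(|path| path.starts_with(prefix))
                    .collect()
            })
    }
}

/// Builds a directory prefix from leading partition values, e.g.
/// ("path/to/table/", [("a", "1"), ("b", "2")]) -> "path/to/table/a=1/b=2/".
fn partition_prefix(table_root: &str, partitions: &[(&str, &str)]) -> String {
    let mut prefix = table_root.to_string();
    for (column, value) in partitions {
        prefix.push_str(&format!("{column}={value}/"));
    }
    prefix
}
```

With a cache populated from `LIST path/to/table/`, a query that filters on `a=1 AND b=2 AND c=3` would call `list_with_prefix` with `path/to/table/a=1/b=2/c=3/` and be served from memory, while a prefix outside any cached root would fall through to the object store. The sketch assumes at most one cached root matches a given prefix; a real implementation would also need to handle invalidation and partial listings.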
If we need more clarification on something I can probably put together and annotate some queries against a hypothetical table to help make this all a bit more clear.
