BlakeOrth commented on PR #18146:
URL: https://github.com/apache/datafusion/pull/18146#issuecomment-3434241866
@alamb
> I went through this PR again and I am pretty sure it will cause
regressions in performance for people that have non trivially sized partitions.
Yes, I'm almost sure there is more than one case where there would be a
regression. Similarly, there are likely cases where this PR would improve
performance even without caching (for tables with a relatively low number of
object, but a deep partition structure, for instance). Based on the PR you
referenced in a PR comment above, I think it might be best to see if we can
cook up a synthetic benchmark to help us better understand where the
performance tradeoffs might be. Does that sound reasonable? If so, can you
point me to good existing benchmarks that target methods (as opposed to an
integration level bench)?
> However, I am not sure I fully understand the implications of this change
fully (nor the existing code) and since all the tests continue to pass in this
PR it suggests to me there is not particularly good test coverage either.
I was similarly surprised the existing tests passed. I did review the
existing unit tests and to my surprise they seemed to address most test
scenarios I could think of (and several I had not yet thought of).
Additionally, unless I'm mistaken, this code path should be exercised in many
`sqllogic` tests, which have cases specifically looking at partitioned tables.
I did have to make a couple of bug fixes from my initial implementation to get
all the existing unit and `sqllogic` tests to pass.
> It is not at all clear to me how this helps enabling the ListFilesCache.
Does it make it easier to thread through the references to the cache?
Currently the `ListFilesCache` is only invoked in the
[`list_all_files`](https://github.com/apache/datafusion/blob/774b6fee0b8a33b48e28ef35ac5242e80312900b/datafusion/datasource/src/url.rs#L237)
method. The existing code never calls this method when a table has partition
columns. This PR uses `list_all_files` regardless of whether or not a table has
partition columns, making the cache infrastructure available to all
`ListingTables`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]