BlakeOrth commented on PR #18146:
URL: https://github.com/apache/datafusion/pull/18146#issuecomment-3434241866

   @alamb 
   
   > I went through this PR again and I am pretty sure it will cause 
regressions in performance for people that have non trivially sized partitions.
   
   Yes, I'm almost sure there is more than one case where there would be a 
regression. Similarly, there are likely cases where this PR would improve 
performance even without caching (for tables with a relatively low number of 
object, but a deep partition structure, for instance). Based on the PR you 
referenced in a PR comment above, I think it might be best to see if we can 
cook up a synthetic benchmark to help us better understand where the 
performance tradeoffs might be. Does that sound reasonable? If so, can you 
point me to good existing benchmarks that target methods (as opposed to an 
integration level bench)?
   
    > However, I am not sure I fully understand the implications of this change 
fully (nor the existing code) and since all the tests continue to pass in this 
PR it suggests to me there is not particularly good test coverage either.
    
    I was similarly surprised the existing tests passed. I did review the 
existing unit tests and to my surprise they seemed to address most test 
scenarios I could think of (and several I had not yet thought of). 
Additionally, unless I'm mistaken, this code path should be exercised in many 
`sqllogic` tests, which have cases specifically looking at partitioned tables. 
I did have to make a couple of bug fixes from my initial implementation to get 
all the existing unit and `sqllogic` tests to pass.
   
   > It is not at all clear to me how this helps enabling the ListFilesCache. 
Does it make it easier to thread through the references to the cache?
   
   Currently the `ListFilesCache` is only invoked in the 
[`list_all_files`](https://github.com/apache/datafusion/blob/774b6fee0b8a33b48e28ef35ac5242e80312900b/datafusion/datasource/src/url.rs#L237)
 method. The existing code never calls this method when a table has partition 
columns. This PR uses `list_all_files` regardless of whether or not a table has 
partition columns, making the cache infrastructure available to all 
`ListingTables`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to