alamb commented on code in PR #17031: URL: https://github.com/apache/datafusion/pull/17031#discussion_r2254953311
########## Cargo.toml: ########## @@ -153,6 +153,7 @@ hex = { version = "0.4.3" } indexmap = "2.10.0" itertools = "0.14" log = "^0.4" +lru = "0.16.0" Review Comment: I spoke with @XiangpengHao this afternoon and we have a proposal: 1. Use a FIFO queue -- aka evict the oldest elements first (e.g. store `Path`s in a `VecDeque` or something and remove them from the HashSet in order when the cache is full) Benefits of this approach 1. Simple to implement 2. Predictable, easy to explain behavior Downsides: 1. Can evict item that are frequently used 2. Doesn't account for how costly an item is to keep in the cache However, I think this is an appropriate tradeoff for the default implementation because 1. a properly sized cache will have no evictions (so the eviction strategy for the simple case doesn't matter) 1. users can supply their own implementations for more advanced usecases I argue against including anything more complicated in the initial implementation because of the many tradeoffs. A cache eviction strategy needs to weigh multiple different factors, for example 1. The cost (eg. size in bytes) of keeping elements in the cache 3. The benefit (e.g. how much will the cached metadata improve future queries) 4. The cost of next miss (e.g. the cost to reload metadata from a local disk is likely much lower than from S3) The relative tradeoffs between these factors likely varies substantially from system to sytem Users with more advanced needs can implement whatever strategy is most appropriate to their system via the traits -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org