BlakeOrth commented on code in PR #16971: URL: https://github.com/apache/datafusion/pull/16971#discussion_r2246474646
########## datafusion/execution/src/cache/cache_unit.rs: ########## @@ -157,9 +158,79 @@ impl CacheAccessor<Path, Arc<Vec<ObjectMeta>>> for DefaultListFilesCache { } } +/// Collected file embedded metadata cache. +/// The metadata for some file is invalided when the file size or last modification time have been +/// changed. +#[derive(Default)] +pub struct DefaultFilesMetadataCache { + metadata: DashMap<Path, (ObjectMeta, Arc<FileMetadata>)>, +} + +impl CacheAccessor<Path, Arc<FileMetadata>> for DefaultFilesMetadataCache { + type Extra = ObjectMeta; + + fn get(&self, _k: &Path) -> Option<Arc<FileMetadata>> { + panic!("get in DefaultFilesMetadataCache is not supported, please use get_with_extra") + } + + fn get_with_extra(&self, k: &Path, e: &Self::Extra) -> Option<Arc<FileMetadata>> { + self.metadata + .get(k) + .map(|s| { + let (extra, metadata) = s.value(); + if extra.size != e.size || extra.last_modified != e.last_modified { + None + } else { + Some(Arc::clone(metadata)) + } + }) + .unwrap_or(None) + } + + fn put(&self, _key: &Path, _value: Arc<FileMetadata>) -> Option<Arc<FileMetadata>> { + panic!("put in DefaultFilesMetadataCache is not supported, please use put_with_extra") Review Comment: Preface: I'm a user looking forward to this work being implemented, so don't let anything said here impact the timeline for this PR. As a user who has recently been looking into (seemingly) IO related performance using remote parquet backing listing tables, I came across the `cache_unit.rs` source the other day and similarly felt a bit concerned that methods in the default cache implementation code could panic. Obviously an error that can be handled is better than a panic; is this a situation that could be reasonably handled at compile time? This implementation makes use of only `_with_extra` implementations, but the `DefaultListFilesCache` makes use of only the normal `get` and `put`. Returning a `Result` would already be API breaking, perhaps the `CacheAccessor` trait could be broken into several parts indicating whether the implementer intends to support just `get` and `put` or the `_with_extra` variants. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org