BlakeOrth commented on code in PR #16971:
URL: https://github.com/apache/datafusion/pull/16971#discussion_r2246474646
##########
datafusion/execution/src/cache/cache_unit.rs:
##########
@@ -157,9 +158,79 @@ impl CacheAccessor<Path, Arc<Vec<ObjectMeta>>> for
DefaultListFilesCache {
}
}
+/// Collected file embedded metadata cache.
+/// The metadata for some file is invalided when the file size or last
modification time have been
+/// changed.
+#[derive(Default)]
+pub struct DefaultFilesMetadataCache {
+ metadata: DashMap<Path, (ObjectMeta, Arc<FileMetadata>)>,
+}
+
+impl CacheAccessor<Path, Arc<FileMetadata>> for DefaultFilesMetadataCache {
+ type Extra = ObjectMeta;
+
+ fn get(&self, _k: &Path) -> Option<Arc<FileMetadata>> {
+ panic!("get in DefaultFilesMetadataCache is not supported, please use
get_with_extra")
+ }
+
+ fn get_with_extra(&self, k: &Path, e: &Self::Extra) ->
Option<Arc<FileMetadata>> {
+ self.metadata
+ .get(k)
+ .map(|s| {
+ let (extra, metadata) = s.value();
+ if extra.size != e.size || extra.last_modified !=
e.last_modified {
+ None
+ } else {
+ Some(Arc::clone(metadata))
+ }
+ })
+ .unwrap_or(None)
+ }
+
+ fn put(&self, _key: &Path, _value: Arc<FileMetadata>) ->
Option<Arc<FileMetadata>> {
+ panic!("put in DefaultFilesMetadataCache is not supported, please use
put_with_extra")
Review Comment:
Preface: I'm a user looking forward to this work being implemented, so don't
let anything said here impact the timeline for this PR.
As a user who has recently been looking into (seemingly) IO related
performance using remote parquet backing listing tables, I came across the
`cache_unit.rs` source the other day and similarly felt a bit concerned that
methods in the default cache implementation code could panic. Obviously an
error that can be handled is better than a panic; is this a situation that
could be reasonably handled at compile time? This implementation makes use of
only `_with_extra` implementations, but the `DefaultListFilesCache` makes use
of only the normal `get` and `put`. Returning a `Result` would already be API
breaking, perhaps the `CacheAccessor` trait could be broken into several parts
indicating whether the implementer intends to support just `get` and `put` or
the `_with_extra` variants.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]