Re: [PR] refactor: make file-statistics cache keys schema-aware [datafusion]

via GitHub Sun, 28 Jun 2026 21:15:35 -0700


kosiew commented on code in PR #23201:
URL: https://github.com/apache/datafusion/pull/23201#discussion_r3489285370



##########
datafusion/execution/src/cache/mod.rs:
##########
@@ -165,3 +168,145 @@ impl Display for TableScopedPath {
         }
     }
 }
+
+/// A fingerprint of the `file_schema` used to compute a file's statistics.
+///
+/// Captures exactly the attributes that determine the layout and meaning of
+/// `Statistics::column_statistics`: each column's name, data type and
+/// nullability, in order. It deliberately excludes field/schema metadata, 
which
+/// cannot affect statistics — including it would needlessly fragment the 
cache.
+#[derive(Clone, Debug)]
+pub struct SchemaFingerprint {
+    columns: Vec<(String, DataType, bool)>,
+    /// Precomputed hash of `columns`, so hashing a key on every cache lookup 
is
+    /// O(1) rather than O(schema width). Computed once in `from_schema` with a
+    /// fixed-seed hasher so it is stable across keys; `PartialEq` still 
compares
+    /// `columns` exactly, so a hash collision can never make two different
+    /// schemas share a cache entry.
+    hash: u64,
+}
+
+impl SchemaFingerprint {
+    /// Builds a fingerprint from the `file_schema` used to compute statistics
+    /// (the schema of the columns physically read, not the full table schema —
+    /// partition columns and their statistics are handled separately).
+    pub fn from_schema(file_schema: &Schema) -> Self {
+        let columns: Vec<(String, DataType, bool)> = file_schema
+            .fields()
+            .iter()
+            .map(|f| (f.name().clone(), f.data_type().clone(), 
f.is_nullable()))
+            .collect();
+        let mut hasher = DefaultHasher::new();
+        columns.hash(&mut hasher);
+        Self {
+            columns,
+            hash: hasher.finish(),
+        }
+    }
+}
+
+impl PartialEq for SchemaFingerprint {
+    fn eq(&self, other: &Self) -> bool {
+        // Cheap hash gate first, then an exact comparison so collisions are 
safe.
+        self.hash == other.hash && self.columns == other.columns
+    }
+}
+
+impl Eq for SchemaFingerprint {}
+
+impl Hash for SchemaFingerprint {
+    fn hash<H: Hasher>(&self, state: &mut H) {
+        state.write_u64(self.hash);
+    }
+}
+
+impl DFHeapSize for SchemaFingerprint {
+    fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+        self.columns.heap_size(ctx)
+    }
+}
+
+/// Cache key for the file-statistics cache.
+///
+/// Like [`TableScopedPath`] it is scoped by table and path, but it 
additionally
+/// carries a [`SchemaFingerprint`]. File statistics are computed against a
+/// specific `file_schema`, so the same path read under different schemas must
+/// not share an entry; the fingerprint keeps those entries distinct while a
+/// repeated read of the same schema still reuses its entry.
+#[derive(PartialEq, Eq, Hash, Clone, Debug)]
+pub struct FileStatisticsCacheKey {
+    pub table: Option<TableReference>,
+    pub path: Path,
+    // `Arc` so building a key per file is a cheap refcount bump rather than a
+    // deep clone of the fingerprint. `Arc`'s `Eq`/`Hash` compare the inner 
value,
+    // so keying remains by schema contents (not pointer identity).
+    pub schema: Arc<SchemaFingerprint>,
+}
+
+impl DFHeapSize for FileStatisticsCacheKey {
+    fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+        self.path.as_ref().heap_size(ctx)
+            + self.table.heap_size(ctx)
+            + self.schema.as_ref().heap_size(ctx)

Review Comment:
   `FileStatisticsCacheKey::heap_size` appears to deep-count the shared 
`SchemaFingerprint` for every cached file key. Since `ListingTable` now shares 
one `Arc<SchemaFingerprint>` across all files for the same table and schema, 
this could overstate cache memory for wide schemas with many files and lead to 
earlier eviction than necessary.
   
   Could we count only the incremental cache-owned cost here, or add a small 
test that documents the intended accounting tradeoff?



##########
datafusion/execution/src/cache/mod.rs:
##########
@@ -165,3 +168,145 @@ impl Display for TableScopedPath {
         }
     }
 }
+
+/// A fingerprint of the `file_schema` used to compute a file's statistics.
+///
+/// Captures exactly the attributes that determine the layout and meaning of
+/// `Statistics::column_statistics`: each column's name, data type and
+/// nullability, in order. It deliberately excludes field/schema metadata, 
which
+/// cannot affect statistics — including it would needlessly fragment the 
cache.
+#[derive(Clone, Debug)]
+pub struct SchemaFingerprint {

Review Comment:
   The schema-aware key looks correct, and I think this fixes the bug. That 
said, the implementation feels a bit more involved than this cache path needs.
   
   Could we simplify `SchemaFingerprint` to a small derived newtype over 
something like `Vec<(String, DataType, bool)>`, or an equivalent 
representation, and rely on derived `Hash` and `Eq`? The precomputed hash plus 
custom `PartialEq` collision handling adds some cleverness that feels hard to 
justify here unless profiling shows schema-key hashing is material.
   
   Another possible approach would be to avoid caching anonymous 
explicit-schema reads, but if we keep the schema-aware key, I would prefer the 
simpler version for maintainability.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] refactor: make file-statistics cache keys schema-aware [datafusion]

Reply via email to