kosiew commented on code in PR #23201:
URL: https://github.com/apache/datafusion/pull/23201#discussion_r3489285369
##########
datafusion/execution/src/cache/mod.rs:
##########
@@ -165,3 +168,145 @@ impl Display for TableScopedPath {
}
}
}
+
+/// A fingerprint of the `file_schema` used to compute a file's statistics.
+///
+/// Captures exactly the attributes that determine the layout and meaning of
+/// `Statistics::column_statistics`: each column's name, data type and
+/// nullability, in order. It deliberately excludes field/schema metadata,
which
+/// cannot affect statistics — including it would needlessly fragment the
cache.
+#[derive(Clone, Debug)]
+pub struct SchemaFingerprint {
Review Comment:
The schema-aware key looks correct, and I think this fixes the bug. That
said, the implementation feels a bit more involved than this cache path needs.
Could we simplify `SchemaFingerprint` to a small derived newtype over
something like `Vec<(String, DataType, bool)>`, or an equivalent
representation, and rely on derived `Hash` and `Eq`? The precomputed hash plus
custom `PartialEq` collision handling adds some cleverness that feels hard to
justify here unless profiling shows schema-key hashing is material.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]