adriangb commented on code in PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#discussion_r3190962992
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -807,11 +827,24 @@ impl MetadataLoadedParquetOpen {
let needs_rewrite = prepared.predicate.is_some()
|| prepared.logical_file_schema != physical_file_schema;
if needs_rewrite {
+ // When virtual columns are requested, augment the logical and
+ // physical schemas passed to the rewriter/simplifier with those
+ // fields. The rewriter identity-rewrites references found in both
+ // schemas, keeping virtual-column references as `Column` rather
+ // than replacing them with null literals; the simplifier needs
+ // them present so it can resolve their data types while walking
+ // expression trees. We keep `physical_file_schema` itself as the
+ // pure file schema so downstream predicate pushdown, pruning, and
+ // row filter construction stay unaffected.
Review Comment:
What if there is a filter on e.g. `row_number % 2 = 0 OR column = 123`?
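The concern can be illustrated with a toy sketch (not DataFusion's actual `Expr` or rewriter, just a minimal stand-in): if any arm of an OR references a virtual column, the whole disjunction must be evaluated after the virtual column is materialized, because substituting null for `row_number` would incorrectly drop rows whose `column = 123` branch is true.

```rust
// Toy expression model (assumption: a simplified stand-in for DataFusion's
// `Expr`) to illustrate why an OR mixing virtual and file columns cannot be
// pushed down piecewise.
#[derive(Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    BinaryOp {
        left: Box<Expr>,
        op: String,
        right: Box<Expr>,
    },
}

/// Does this expression reference any of the given virtual columns?
fn references_virtual(expr: &Expr, virtual_cols: &[&str]) -> bool {
    match expr {
        Expr::Column(name) => virtual_cols.contains(&name.as_str()),
        Expr::Literal(_) => false,
        Expr::BinaryOp { left, right, .. } => {
            references_virtual(left, virtual_cols)
                || references_virtual(right, virtual_cols)
        }
    }
}

/// A predicate is only safe to hand to the file reader as a row filter if no
/// part of it touches a virtual column: for
/// `row_number % 2 = 0 OR column = 123`, replacing `row_number` with a null
/// literal would change the result of the OR.
fn pushable_to_reader(pred: &Expr, virtual_cols: &[&str]) -> bool {
    !references_virtual(pred, virtual_cols)
}
```

Under this model the example predicate from the comment is not pushable, while the pure file-column predicate `column = 123` on its own is.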
##########
datafusion/datasource/src/table_schema.rs:
##########
@@ -63,6 +70,15 @@ pub struct TableSchema {
/// this field holds that schema.
file_schema: SchemaRef,
+ /// Virtual columns that are generated by the reader rather than read from
+ /// the data files or the directory structure.
+ ///
+ /// For example, a Parquet reader may inject a `row_number` column whose
+ /// values are produced per file by the reader. Virtual column fields must
+ /// carry an arrow extension type (e.g. `RowNumber`, `RowGroupIndex`) so the
+ /// file reader can recognize them.
+ virtual_columns: Arc<Vec<FieldRef>>,
Review Comment:
Can this specify if they are appended at the front or back of the other
columns / projection?
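Per the doc comment later in this PR, virtual columns go at the back, after partition columns. A minimal ordering sketch (using plain column names in place of arrow `FieldRef`s; the function name is hypothetical, mirroring this PR's `build_table_schema`):

```rust
// Assumed ordering: file columns, then partition columns, then virtual
// columns appended at the back.
fn build_table_schema_order(
    file_cols: &[&str],
    partition_cols: &[&str],
    virtual_cols: &[&str],
) -> Vec<String> {
    file_cols
        .iter()
        .chain(partition_cols)
        .chain(virtual_cols)
        .map(|s| s.to_string())
        .collect()
}
```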
##########
datafusion/datasource/src/table_schema.rs:
##########
@@ -149,9 +172,38 @@ impl TableSchema {
);
table_partition_cols.extend(partition_cols);
}
- let mut builder = SchemaBuilder::from(self.file_schema.as_ref());
- builder.extend(self.table_partition_cols.iter().cloned());
- self.table_schema = Arc::new(builder.finish());
+ self.table_schema = build_table_schema(
+ &self.file_schema,
+ self.table_partition_cols.as_ref(),
+ self.virtual_columns.as_ref(),
+ );
+ self
+ }
+
+ /// Add virtual columns to an existing TableSchema, returning a new
+ /// instance.
+ ///
+ /// Virtual columns are produced by the file reader (e.g. a Parquet
+ /// `row_number` column) rather than being stored in the files or derived
+ /// from partition paths. Each field must carry an arrow virtual extension
+ /// type so the reader can recognize it; `ParquetOpener` forwards these
+ /// fields to
+ /// `parquet::arrow::arrow_reader::ArrowReaderOptions::with_virtual_columns`.
+ ///
+ /// Virtual columns are appended at the end of the table schema, after any
+ /// partition columns.
+ pub fn with_virtual_columns(mut self, virtual_columns: Vec<FieldRef>) -> Self {
Review Comment:
Should we put an enum of supported virtual columns somewhere instead of
using `FieldRef` + metadata? We could implement `TryFrom<FieldRef>` or
something. Even if the information is passed to arrow as a `FieldRef` with
specific metadata, it would be nice to enforce the contract in the type
system as much as possible.
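A hypothetical sketch of what such an enum could look like. The variant set and extension-name strings below are assumptions for illustration, not the PR's actual API, and a plain `&str` stands in for reading the extension type off a `FieldRef`'s metadata:

```rust
// Hypothetical closed enum of the virtual columns the reader supports.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VirtualColumn {
    RowNumber,
    RowGroupIndex,
}

impl VirtualColumn {
    /// Extension type name that would be carried in the field metadata
    /// (assumed naming, not the real extension type identifiers).
    fn extension_name(self) -> &'static str {
        match self {
            VirtualColumn::RowNumber => "arrow.virtual.row_number",
            VirtualColumn::RowGroupIndex => "arrow.virtual.row_group_index",
        }
    }
}

impl TryFrom<&str> for VirtualColumn {
    type Error = String;

    /// In the real API this would be `TryFrom<FieldRef>`, inspecting the
    /// field's extension type metadata; a plain string stands in here.
    fn try_from(name: &str) -> Result<Self, Self::Error> {
        match name {
            "arrow.virtual.row_number" => Ok(VirtualColumn::RowNumber),
            "arrow.virtual.row_group_index" => Ok(VirtualColumn::RowGroupIndex),
            other => Err(format!("not a supported virtual column: {other}")),
        }
    }
}
```

The round trip through `TryFrom` is what enforces the contract: an unrecognized metadata value becomes an `Err` at construction time instead of a silently ignored field at read time.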
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]