friendlymatthew commented on code in PR #20854:
URL: https://github.com/apache/datafusion/pull/20854#discussion_r2918863264
##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -251,15 +251,26 @@ impl FilterCandidateBuilder {
             return Ok(None);
         };
+        let schema_descr = metadata.file_metadata().schema_descr();
         let root_indices: Vec<_> =
             required_columns.required_columns.into_iter().collect();
-        let leaf_indices = leaf_indices_for_roots(
-            &root_indices,
-            metadata.file_metadata().schema_descr(),
+        let mut leaf_indices = leaf_indices_for_roots(&root_indices, schema_descr);
+
+        let struct_leaf_indices = resolve_struct_field_leaves(
+            &required_columns.struct_field_accesses,
+            &self.file_schema,
+            schema_descr,
         );
+        leaf_indices.extend_from_slice(&struct_leaf_indices);
+        leaf_indices.sort_unstable();
Review Comment:
The sort and dedup are necessary because we concatenate the outputs of
`leaf_indices_for_roots` and `resolve_struct_field_leaves`. The first walks
parquet leaves 0..N collecting those that belong to regular (non-struct)
columns, and the second does the same for struct field accesses. Each produces
individually sorted output, but when a struct column appears before a regular
column in the schema, the struct's leaf indices are numerically lower, so
appending them after the regular-column indices leaves the combined vector
unsorted.
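
As a toy illustration (the schema, column names, and leaf indices below are
made up for this comment, not taken from the PR):

```rust
fn main() {
    // Hypothetical schema: s: Struct<val, tag> -> parquet leaves 0 and 1,
    //                      a: Int64            -> parquet leaf  2
    // Hypothetical filter: get_field(s, 'val') > 5 AND a = 1
    let regular_leaves = vec![2usize]; // what leaf_indices_for_roots would yield for `a`
    let struct_leaves = vec![0usize];  // what resolve_struct_field_leaves would yield for s.val

    let mut leaf_indices = regular_leaves;
    leaf_indices.extend_from_slice(&struct_leaves);
    assert_eq!(leaf_indices, vec![2, 0]); // concatenation is out of order

    leaf_indices.sort_unstable();
    assert_eq!(leaf_indices, vec![0, 2]); // sorted again after sort_unstable
}
```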
Dedup is needed because the same struct field can appear multiple times in a
filter expression like `get_field(s, 'val') > 5 and get_field(s, 'val') < 100`,
producing duplicate entries in `struct_field_accesses`. Without dedup, we would
double-count the compressed size of that column.
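
A minimal sketch of the dedup case, again with a made-up leaf index: both
`get_field(s, 'val')` predicates resolve to the same leaf, and `dedup` after
the sort collapses them so the leaf's size is only counted once:

```rust
fn main() {
    // Both get_field(s, 'val') predicates resolve to the same parquet leaf
    // (hypothetically leaf 0), so the raw list contains it twice.
    let mut leaf_indices = vec![0usize, 0];
    leaf_indices.sort_unstable();
    leaf_indices.dedup(); // without this, leaf 0's compressed size is counted twice
    assert_eq!(leaf_indices, vec![0]);
}
```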