etseidl commented on code in PR #8312:
URL: https://github.com/apache/arrow-rs/pull/8312#discussion_r2345522146
##########
parquet/src/schema/types.rs:
##########
@@ -1137,11 +1160,13 @@ fn build_tree<'a>(
Type::PrimitiveType { .. } => {
let mut path: Vec<String> = vec![];
path.extend(path_so_far.iter().copied().map(String::from));
+ let column_path = ColumnPath::new(path);
+ leaf_to_idx.insert(column_path.clone(), leaves.len());
Review Comment:
I can't seem to upload images in comments right now, so text will have to
do. I did some profiles with samply, and as I feared this change has a pretty
significant impact. Overall, `SchemaDescriptor::new` goes from 10.6% of
metadata parsing to 16.3%. Not quite apples-to-apples, but the number of
samples goes from 3317 to 6049. 1614 samples are pushing to the map, and 1214
performing this clone. In other words, this change makes the schema conversion
80% slower.
Now in terms of a complete read of a Parquet file this isn't really a big
deal, so maybe we don't care. But given the lengths the Parquet community is
going to to wring performance out of handling metadata, maybe we should think
twice, or at least see if we can improve on this implementation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]