etseidl commented on code in PR #8312:
URL: https://github.com/apache/arrow-rs/pull/8312#discussion_r2345522146


##########
parquet/src/schema/types.rs:
##########
@@ -1137,11 +1160,13 @@ fn build_tree<'a>(
         Type::PrimitiveType { .. } => {
             let mut path: Vec<String> = vec![];
             path.extend(path_so_far.iter().copied().map(String::from));
+            let column_path = ColumnPath::new(path);
+            leaf_to_idx.insert(column_path.clone(), leaves.len());

Review Comment:
   I can't seem to upload images in comments right now, so text will have to 
do. I did some profiles with samply, and as I feared this change has a pretty 
significant impact. Overall, `SchemaDescriptor::new` goes from 10.6% of 
metadata parsing to 16.3%. Not quite apples-to-apples, but the number of 
samples goes from 3317 to 6049. 1614 samples are pushing to the map, and 1214 
performing this clone. In other words, this change makes the schema conversion 
80% slower.
   
   Now in terms of a complete read of a Parquet file this isn't really a big 
deal, so maybe we don't care. But given the lengths the Parquet community is 
going to to wring performance out of handling metadata, maybe we should think 
twice, or at least see if we can improve on this implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to