alamb commented on code in PR #7922: URL: https://github.com/apache/arrow-rs/pull/7922#discussion_r2207907193
########## parquet-variant/benches/variant_builder.rs: ########## @@ -495,6 +495,18 @@ fn bench_iteration_performance(c: &mut Criterion) { group.finish(); } +fn bench_extend_metadata_builder(c: &mut Criterion) { + let list = (0..u32::MAX).map(|i| format!("id_{i}")).collect::<Vec<_>>(); Review Comment: FWIW this benchmark doesn't finish on my machine in a reasonable amount of time ``` time cargo bench --bench variant_builder -- bench_validate_large_nested_list ``` Probably b/c it creates 4B strings I think you could get the same effect using `1000` strings or something -- I don't think there is any need for 4 billion 🤔 ########## parquet-variant/benches/variant_builder.rs: ########## @@ -495,6 +495,18 @@ fn bench_iteration_performance(c: &mut Criterion) { group.finish(); } +fn bench_extend_metadata_builder(c: &mut Criterion) { + let list = (0..u32::MAX).map(|i| format!("id_{i}")).collect::<Vec<_>>(); Review Comment: I only waited 2 minutes before cancelling 😆 ``` andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ time cargo bench --bench variant_builder -- bench_validate_large_nested_list Compiling parquet-variant v0.1.0 (/Users/andrewlamb/Software/arrow-rs/parquet-variant) Finished `bench` profile [optimized] target(s) in 1.44s Running benches/variant_builder.rs (target/release/deps/variant_builder-5d3061d750cb04a4) ^C real 2m13.341s user 0m6.141s sys 0m0.293s ``` ########## parquet-variant/src/builder.rs: ########## @@ -402,6 +402,11 @@ impl<S: AsRef<str>> FromIterator<S> for MetadataBuilder { impl<S: AsRef<str>> Extend<S> for MetadataBuilder { fn extend<T: IntoIterator<Item = S>>(&mut self, iter: T) { + let iter = iter.into_iter(); + let (min, max) = iter.size_hint(); + + self.field_names.reserve(max.unwrap_or(min)); + for field_name in iter { self.upsert_field_name(field_name.as_ref()); Review Comment: One problem with using `extend` is that it requires an iterator of `String`s -- meaning you would have to allocate the String even if it wouldn't be put into the Set ########## parquet-variant/src/builder.rs: ########## @@ -760,6 +765,15 @@ impl VariantBuilder { self } + /// This method reserves capacity for field names in the Variant metadata, + /// which can improve performance when you know the approximate number of unique field + /// names that will be used across all objects in the [`Variant`]. + pub fn with_field_name_capacity(mut self, capacity: usize) -> Self { Review Comment: I recommend naming this `reserve` and have it take a `&mut self` to match the std::Vec APIs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org