alamb commented on code in PR #7833: URL: https://github.com/apache/arrow-rs/pull/7833#discussion_r2183382331
########## parquet-variant/src/builder.rs: ########## @@ -240,6 +240,18 @@ struct MetadataBuilder { } impl MetadataBuilder { + /// Pre-populates the list of field names + fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { + Self { + field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), + } + } + + /// Checks whether field names by insertion order is lexicographically sorted + fn is_sorted(&self) -> bool { + !self.field_names.is_empty() && self.field_names.iter().is_sorted() Review Comment: > This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an is_sorted flag instead? I am not sure it is all that expensive https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.is_sorted I think it simply checks all the elements pariwise ([code](https://doc.rust-lang.org/src/core/iter/traits/iterator.rs.html#3972)) I don't think it is any more/less expensive in theory to check the dictionary on write than to do the checks incrementally as the variant is written. In each case it needs to do N-1 comparisons where N is the number of dictionary entries ########## parquet-variant/src/builder.rs: ########## @@ -240,6 +240,18 @@ struct MetadataBuilder { } impl MetadataBuilder { + /// Pre-populates the list of field names + fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { + Self { + field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), + } + } + + /// Checks whether field names by insertion order is lexicographically sorted + fn is_sorted(&self) -> bool { + !self.field_names.is_empty() && self.field_names.iter().is_sorted() + } + /// Upsert field name to dictionary, return its ID fn upsert_field_name(&mut self, field_name: &str) -> u32 { Review Comment: One option could be to change the code to panic in this unlikely case and deal with it when / if we have an actual issue As you point out, I don't think this is related to this PR ########## parquet-variant/src/builder.rs: ########## @@ -240,6 +240,18 @@ struct MetadataBuilder { } impl MetadataBuilder { + /// Pre-populates the list of field names + fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { + Self { + field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), + } Review Comment: I agree switching to IntoIterator would be nicer as would implemeting `Extend` If we don't do it in this PR, let's file a ticket to fix follow on PRs ########## parquet-variant/src/builder.rs: ########## @@ -479,6 +493,25 @@ impl VariantBuilder { self } + /// This method pre-populates the field name directory in the Variant metadata with + /// the specific field names, in order. + /// + /// You can use this to pre-populate a [`VariantBuilder`] with a sorted dictionary if you + /// know the field names beforehand. Sorted dictionaries can accelerate field access when + /// reading [`Variant`]s. + pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = &'a str>) -> Self { + self.metadata_builder = MetadataBuilder::from_field_names(field_names); Review Comment: Replacing the metadata builder also invalidates any previously created objects which seems likely to be pretty bad I agree we should change this to append to (not replace) existing fields ``` pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = &'a str>) -> Self { for field_name in field_names { self.metadata_builder.upsert_field_name(field_name) } self } ``` The Extend is an excellent idea as well, but it is less clear to me that `Extend` on a VariantBuilder should only extend fields -- I would expect Extend to take Variants and logically extend the in-progress variant 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org