scovich commented on code in PR #7833: URL: https://github.com/apache/arrow-rs/pull/7833#discussion_r2180370340
########## parquet-variant/src/builder.rs: ########## @@ -240,6 +240,18 @@ struct MetadataBuilder { } impl MetadataBuilder { + /// Pre-populates the list of field names + fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { + Self { + field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), + } + } + + /// Checks whether field names by insertion order is lexicographically sorted + fn is_sorted(&self) -> bool { + !self.field_names.is_empty() && self.field_names.iter().is_sorted() + } + /// Upsert field name to dictionary, return its ID fn upsert_field_name(&mut self, field_name: &str) -> u32 { Review Comment: aside: On the off-chance somebody ever managed to insert more than `2**32-1` entries, this would silently produce wrong results. We should consider returning `usize` and require the caller to do the conversion, since they have a better chance of being able to handle the issue cleanly. On the other hand, we have two potential sources of "protection" already: 1. Most machines will run out of memory long before they hit u32 overflow, and that will panic. So it would be hard to trigger the bug in the first place. 2. `num_field_names` returns `usize`. So we could make the builder's `finish` method fallible and then check for u32 overflow only once at the end. Downside is, then we did all the work to insert those extra field names even tho we're doomed to fail later. ########## parquet-variant/src/builder.rs: ########## @@ -240,6 +240,18 @@ struct MetadataBuilder { } impl MetadataBuilder { + /// Pre-populates the list of field names + fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { + Self { + field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), + } Review Comment: Also -- would it make sense to provide an `impl FromIterator` for `MetadataBuilder`, instead of or in addition to this constructor? Then people could build a new instance in all the usual rustic ways, e.g. ```rust let builder = MetadataBuilder::from_iter(["a", "b", "c"]); ``` or ```rust self.metadata_builder = ["a", "b", "c"].into_iter().collect(); ``` ########## parquet-variant/src/builder.rs: ########## @@ -240,6 +240,18 @@ struct MetadataBuilder { } impl MetadataBuilder { + /// Pre-populates the list of field names + fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { + Self { + field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), + } + } + + /// Checks whether field names by insertion order is lexicographically sorted + fn is_sorted(&self) -> bool { + !self.field_names.is_empty() && self.field_names.iter().is_sorted() Review Comment: This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an `is_sorted` flag instead? Although the total number of operations is the same, it should be cheaper because the bytes of a freshly inserted field name will still be in CPU cache (having just been hashed). So the string comparison will be a _lot_ faster than if it takes a cache miss after not having been accessed for a long time. <details> When inserting a new field name: ```rust fn upsert_field_name(&mut self, field_name: &str) -> u32 { let (id, inserted) = self.field_names.insert_full(field_name.to_string()); if inserted { let n = self.field_names.len(); if n >= 2 && self.field_names[n-2] > self.field_names[n-1] { self.is_sorted = false; } } id as u32 } ``` And then when initializing from an iterator, just insert each entry manually in a loop ```rust let mut new_self = Self { field_names: IndexSet::new(), is_sorted: true, }; for field_name in field_names { let _ = new_self.upsert_field_name(field_name); } new_self ``` </details> ########## parquet-variant/src/builder.rs: ########## @@ -479,6 +493,25 @@ impl VariantBuilder { self } + /// This method pre-populates the field name directory in the Variant metadata with + /// the specific field names, in order. + /// + /// You can use this to pre-populate a [`VariantBuilder`] with a sorted dictionary if you + /// know the field names beforehand. Sorted dictionaries can accelerate field access when + /// reading [`Variant`]s. + pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = &'a str>) -> Self { + self.metadata_builder = MetadataBuilder::from_field_names(field_names); Review Comment: This would wipe out any previously added field names, which seems like an unnecessary footgun? ########## parquet-variant/src/builder.rs: ########## @@ -479,6 +493,25 @@ impl VariantBuilder { self } + /// This method pre-populates the field name directory in the Variant metadata with + /// the specific field names, in order. + /// + /// You can use this to pre-populate a [`VariantBuilder`] with a sorted dictionary if you + /// know the field names beforehand. Sorted dictionaries can accelerate field access when + /// reading [`Variant`]s. + pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = &'a str>) -> Self { + self.metadata_builder = MetadataBuilder::from_field_names(field_names); Review Comment: Also, similar to above -- would it make sense to `impl Extend` for `MetadataBuilder` and then `extend` here instead? ########## parquet-variant/src/builder.rs: ########## @@ -240,6 +240,18 @@ struct MetadataBuilder { } impl MetadataBuilder { + /// Pre-populates the list of field names + fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { + Self { + field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), + } Review Comment: `field_names`? Also, why `impl Iterator` instead of the more common `impl IntoIterator`, out of curiosity? Can also take items that `impl Into<String>`, for additional flexibility. ```suggestion fn from_field_names<'a>(field_names: impl IntoIterator<Item = impl Into<String>>) -> Self { let field_names = field_names.into_iter().map(Into::into).collect(); Self { field_names } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org