scovich commented on code in PR #7833: URL: https://github.com/apache/arrow-rs/pull/7833#discussion_r2180382588
########## parquet-variant/src/builder.rs: ########## @@ -240,6 +240,18 @@ struct MetadataBuilder { } impl MetadataBuilder { + /// Pre-populates the list of field names + fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self { + Self { + field_names: IndexSet::from_iter(field_name.map(|f| f.to_string())), + } + } + + /// Checks whether field names by insertion order is lexicographically sorted + fn is_sorted(&self) -> bool { + !self.field_names.is_empty() && self.field_names.iter().is_sorted() Review Comment: This is an expensive check, especially for a larger dictionary. Since the builder is append-only (can't change or remove existing entries), we could incrementally maintain an `is_sorted` flag instead? Although the total number of operations is the same, it should be cheaper because the bytes of a freshly inserted field name will still be in CPU cache (having just been initialized and/or hashed). So the string comparison will be a _lot_ faster than if it takes a cache miss after not having been accessed for a long time. <details> The builder would default to is_sorted=true, but switches to false if any newly inserted field name breaks ordering: ```rust fn upsert_field_name(&mut self, field_name: &str) -> u32 { let (id, inserted) = self.field_names.insert_full(field_name.to_string()); if inserted { let n = self.field_names.len(); if n >= 2 && self.field_names[n-2] > self.field_names[n-1] { self.is_sorted = false; } } id as u32 } ``` And then when initializing from an iterator, just insert each entry manually in a loop ```rust let mut new_self = Self::new(); for field_name in field_names { let _ = new_self.upsert_field_name(field_name); } new_self ``` Actually, if we `impl Extend` for `MetadataBuilder`, then the initialization becomes even simpler: ```rust let mut new_self = Self::new(); new_self.extend(field_names); new_self ``` </details> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org