scovich commented on code in PR #7833:
URL: https://github.com/apache/arrow-rs/pull/7833#discussion_r2180382588


##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
 }
 
 impl MetadataBuilder {
+    /// Pre-populates the list of field names
+    fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self 
{
+        Self {
+            field_names: IndexSet::from_iter(field_name.map(|f| 
f.to_string())),
+        }
+    }
+
+    /// Checks whether field names by insertion order is lexicographically 
sorted
+    fn is_sorted(&self) -> bool {
+        !self.field_names.is_empty() && self.field_names.iter().is_sorted()

Review Comment:
   This is an expensive check, especially for a larger dictionary. Since the 
builder is append-only (can't change or remove existing entries), we could 
incrementally maintain an `is_sorted` flag instead?
   
   Although the total number of operations is the same, it should be cheaper 
because the bytes of a freshly inserted field name will still be in CPU cache 
(having just been hashed). So the string comparison will be a _lot_ faster than 
if it takes a cache miss after not having been accessed for a long time.
   
   <details>
   
   The builder would default to is_sorted=true, but switches to false if any 
newly inserted field name breaks ordering:
   ```rust
   fn upsert_field_name(&mut self, field_name: &str) -> u32 {
       let (id, inserted) = 
self.field_names.insert_full(field_name.to_string());
       if inserted {
           let n = self.field_names.len();
           if n >= 2 && self.field_names[n-2] > self.field_names[n-1] {
               self.is_sorted = false;
           }
       }
       id as u32
   }
   ```
   
   And then when initializing from an iterator, just insert each entry manually 
in a loop
   ```rust
   let mut new_self = Self::new();
   for field_name in field_names {
       let _ = new_self.upsert_field_name(field_name);
   }
   new_self
   ```
   
   Actually, if we impl Extend for MetadataBuilder, then the initialization 
becomes even simpler:
   ```rust
   let mut new_self = Self::new();
   new_self.extend(field_names);
   new_self
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to