alamb commented on code in PR #7833:
URL: https://github.com/apache/arrow-rs/pull/7833#discussion_r2183382331


##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
 }
 
 impl MetadataBuilder {
+    /// Pre-populates the list of field names
+    fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self 
{
+        Self {
+            field_names: IndexSet::from_iter(field_name.map(|f| 
f.to_string())),
+        }
+    }
+
+    /// Checks whether field names by insertion order is lexicographically 
sorted
+    fn is_sorted(&self) -> bool {
+        !self.field_names.is_empty() && self.field_names.iter().is_sorted()

Review Comment:
   > This is an expensive check, especially for a larger dictionary. Since the 
builder is append-only (can't change or remove existing entries), we could 
incrementally maintain an is_sorted flag instead?
   
   I am not sure it is all that expensive 
   https://doc.rust-lang.org/core/iter/trait.Iterator.html#method.is_sorted
   
   I think it simply checks all the elements pariwise 
([code](https://doc.rust-lang.org/src/core/iter/traits/iterator.rs.html#3972))
   
   I don't think it is any more/less expensive in theory to check the 
dictionary on write than to do the checks incrementally as the variant is 
written. In each case it needs to do N-1 comparisons where N is the number of 
dictionary entries 



##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
 }
 
 impl MetadataBuilder {
+    /// Pre-populates the list of field names
+    fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self 
{
+        Self {
+            field_names: IndexSet::from_iter(field_name.map(|f| 
f.to_string())),
+        }
+    }
+
+    /// Checks whether field names by insertion order is lexicographically 
sorted
+    fn is_sorted(&self) -> bool {
+        !self.field_names.is_empty() && self.field_names.iter().is_sorted()
+    }
+
     /// Upsert field name to dictionary, return its ID
     fn upsert_field_name(&mut self, field_name: &str) -> u32 {

Review Comment:
   One option  could be to change the code to panic in this unlikely case and 
deal with it when / if we have an actual issue 
   
   As you point out, I don't think this is related to this PR



##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
 }
 
 impl MetadataBuilder {
+    /// Pre-populates the list of field names
+    fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self 
{
+        Self {
+            field_names: IndexSet::from_iter(field_name.map(|f| 
f.to_string())),
+        }

Review Comment:
   I agree switching to IntoIterator would be nicer as would implemeting 
`Extend` 
   
   If we don't do it in this PR, let's file a ticket to fix follow on PRs



##########
parquet-variant/src/builder.rs:
##########
@@ -479,6 +493,25 @@ impl VariantBuilder {
         self
     }
 
+    /// This method pre-populates the field name directory in the Variant 
metadata with
+    /// the specific field names, in order.
+    ///
+    /// You can use this to pre-populate a [`VariantBuilder`] with a sorted 
dictionary if you
+    /// know the field names beforehand. Sorted dictionaries can accelerate 
field access when
+    /// reading [`Variant`]s.
+    pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = 
&'a str>) -> Self {
+        self.metadata_builder = MetadataBuilder::from_field_names(field_names);

Review Comment:
   Replacing the metadata builder also invalidates any previously created 
objects which seems likely to be pretty bad
   
   I agree we should change this to append to (not replace) existing fields
   
   ```
       pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = 
&'a str>) -> Self {
           for field_name in field_names {
             self.metadata_builder.upsert_field_name(field_name)
           }
           self
        }
   ```
   
   The Extend is an excellent idea as well, but it is less clear to me that 
`Extend` on a VariantBuilder should only extend fields -- I would expect Extend 
to take Variants and logically extend the in-progress variant 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to