scovich commented on code in PR #7833:
URL: https://github.com/apache/arrow-rs/pull/7833#discussion_r2180370340


##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
 }
 
 impl MetadataBuilder {
+    /// Pre-populates the list of field names
+    fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self 
{
+        Self {
+            field_names: IndexSet::from_iter(field_name.map(|f| 
f.to_string())),
+        }
+    }
+
+    /// Checks whether field names by insertion order is lexicographically 
sorted
+    fn is_sorted(&self) -> bool {
+        !self.field_names.is_empty() && self.field_names.iter().is_sorted()
+    }
+
     /// Upsert field name to dictionary, return its ID
     fn upsert_field_name(&mut self, field_name: &str) -> u32 {

Review Comment:
   aside: On the off-chance somebody ever managed to insert more than `2**32-1` 
entries, this would silently produce wrong results. We should consider 
returning `usize` and require the caller to do the conversion, since they have 
a better chance of being able to handle the issue cleanly.
   
   On the other hand, we have two potential sources of "protection" already:
   1.  Most machines will run out of memory long before they hit u32 overflow, 
and that will panic. So it would be hard to trigger the bug in the first place.
   2. `num_field_names` returns `usize`. So we could make the builder's 
`finish` method fallible and then check for u32 overflow only once at the end. 
Downside is, then we did all the work to insert those extra field names even 
tho we're doomed to fail later.



##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
 }
 
 impl MetadataBuilder {
+    /// Pre-populates the list of field names
+    fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self 
{
+        Self {
+            field_names: IndexSet::from_iter(field_name.map(|f| 
f.to_string())),
+        }

Review Comment:
   Also -- would it make sense to provide an `impl FromIterator` for 
`MetadataBuilder`, instead of or in addition to this constructor? Then people 
could build a new instance in all the usual rustic ways, e.g.
   ```rust
   let builder = MetadataBuilder::from_iter(["a", "b", "c"]);
   ```
   or
   ```rust
   self.metadata_builder = ["a", "b", "c"].into_iter().collect();
   ```



##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
 }
 
 impl MetadataBuilder {
+    /// Pre-populates the list of field names
+    fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self 
{
+        Self {
+            field_names: IndexSet::from_iter(field_name.map(|f| 
f.to_string())),
+        }
+    }
+
+    /// Checks whether field names by insertion order is lexicographically 
sorted
+    fn is_sorted(&self) -> bool {
+        !self.field_names.is_empty() && self.field_names.iter().is_sorted()

Review Comment:
   This is an expensive check, especially for a larger dictionary. Since the 
builder is append-only (can't change or remove existing entries), we could 
incrementally maintain an `is_sorted` flag instead?
   
   Although the total number of operations is the same, it should be cheaper 
because the bytes of a freshly inserted field name will still be in CPU cache 
(having just been hashed). So the string comparison will be a _lot_ faster than 
if it takes a cache miss after not having been accessed for a long time.
   
   <details>
   
   When inserting a new field name:
   ```rust
   fn upsert_field_name(&mut self, field_name: &str) -> u32 {
       let (id, inserted) = 
self.field_names.insert_full(field_name.to_string());
       if inserted {
           let n = self.field_names.len();
           if n >= 2 && self.field_names[n-2] > self.field_names[n-1] {
               self.is_sorted = false;
           }
       }
       id as u32
   }
   ```
   
   And then when initializing from an iterator, just insert each entry manually 
in a loop
   ```rust
   let mut new_self = Self {
       field_names: IndexSet::new(),
       is_sorted: true,
   };
   for field_name in field_names {
       let _ = new_self.upsert_field_name(field_name);
   }
   new_self
   ```
   
   </details>



##########
parquet-variant/src/builder.rs:
##########
@@ -479,6 +493,25 @@ impl VariantBuilder {
         self
     }
 
+    /// This method pre-populates the field name directory in the Variant 
metadata with
+    /// the specific field names, in order.
+    ///
+    /// You can use this to pre-populate a [`VariantBuilder`] with a sorted 
dictionary if you
+    /// know the field names beforehand. Sorted dictionaries can accelerate 
field access when
+    /// reading [`Variant`]s.
+    pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = 
&'a str>) -> Self {
+        self.metadata_builder = MetadataBuilder::from_field_names(field_names);

Review Comment:
   This would wipe out any previously added field names, which seems like an 
unnecessary footgun?



##########
parquet-variant/src/builder.rs:
##########
@@ -479,6 +493,25 @@ impl VariantBuilder {
         self
     }
 
+    /// This method pre-populates the field name directory in the Variant 
metadata with
+    /// the specific field names, in order.
+    ///
+    /// You can use this to pre-populate a [`VariantBuilder`] with a sorted 
dictionary if you
+    /// know the field names beforehand. Sorted dictionaries can accelerate 
field access when
+    /// reading [`Variant`]s.
+    pub fn with_field_names<'a>(mut self, field_names: impl Iterator<Item = 
&'a str>) -> Self {
+        self.metadata_builder = MetadataBuilder::from_field_names(field_names);

Review Comment:
   Also, similar to above -- would it make sense to `impl Extend` for 
`MetadataBuilder` and then `extend` here instead?



##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
 }
 
 impl MetadataBuilder {
+    /// Pre-populates the list of field names
+    fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self 
{
+        Self {
+            field_names: IndexSet::from_iter(field_name.map(|f| 
f.to_string())),
+        }

Review Comment:
   `field_names`?
   
   Also, why `impl Iterator` instead of the more common `impl IntoIterator`, 
out of curiosity? Can also take items that `impl Into<String>`, for additional 
flexibility.
   
   ```suggestion
       fn from_field_names<'a>(field_names: impl IntoIterator<Item = impl 
Into<String>>) -> Self {
           let field_names = field_names.into_iter().map(Into::into).collect();
           Self { field_names }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to