scovich commented on code in PR #8167:
URL: https://github.com/apache/arrow-rs/pull/8167#discussion_r2286033483


##########
parquet-variant/src/variant/metadata.rs:
##########
@@ -331,6 +331,11 @@ impl<'m> VariantMetadata<'m> {
         self.iter_try()
             .map(|result| result.expect("Invalid metadata dictionary entry"))
     }
+
+    /// Same as `Index::index`, but with the correct lifetime.
+    pub(crate) fn get_infallible(&self, i: usize) -> &'m str {

Review Comment:
   The`Index::index` method has the following signature:
   ```rust
   fn index(&self, i: usize) -> &Self::Output
   ```
   which is really 
   ```rust
   fn index<'a>(&'a self, i: usize) -> &'a Self::Output
   ```
   and because the (elided) `'a` is not related to `'m`, we can't use the 
return value in a place that requires `&'m str`.
   
   You're right that `self.get(i).unwrap()` would also work; I was just trying 
to preserve the `expect` at both call sites. Should we just `unwrap` everywhere 
and be done with it?
   
   NOTE: The `get_infallible` should be private, except then `Index` can't 
access it.



##########
parquet-variant/src/builder.rs:
##########
@@ -1428,54 +1415,61 @@ impl<'a> ObjectBuilder<'a> {
     }
 
     // Returns validate_unique_fields because we can no longer reference self 
once this method returns.
-    fn parent_state<'b>(&'b mut self, key: &'b str) -> (ParentState<'b>, bool) 
{
+    fn parent_state<'b>(
+        &'b mut self,
+        field_name: &'b str,
+    ) -> Result<(ParentState<'b>, bool), ArrowError> {
+        let saved_parent_buffer_offset = 
self.parent_state.saved_buffer_offset();
         let validate_unique_fields = self.validate_unique_fields;
-
         let (buffer, metadata_builder) = 
self.parent_state.buffer_and_metadata_builder();
-
-        let state = ParentState::Object {
+        let state = ParentState::object(
             buffer,
             metadata_builder,
-            fields: &mut self.fields,
-            field_name: key,
-            parent_value_offset_base: self.parent_value_offset_base,
-        };
-        (state, validate_unique_fields)
+            &mut self.fields,
+            saved_parent_buffer_offset,
+            field_name,
+            validate_unique_fields,
+        )?;
+        Ok((state, validate_unique_fields))
     }
 
     /// Returns an object builder that can be used to append a new (nested) 
object to this object.
     ///
+    /// Panics if the requested key cannot be inserted (e.g. because it is a 
duplicate).

Review Comment:
   > it could panic [if] the builder has read only metadata, right?
   
   Yep! And having the same method fail for both error cases is really 
appealing to me. 
   
   (I just wrote the code comment before adding the read-only builder to this 
PR)
   
   > We could avoid panic's [by deferring] errors until the final finish() is 
called
   
   The goal was to make `finish` signature infallible, now that it actually is 
infallible in practice. I just didn't want to bloat this PR with the extra 
churn that would cause.
   
   Plus, it always seemed weird to do extra work to defer an obvious known 
failure until later. Is there a disadvantage to failing immediately?



##########
parquet-variant/src/builder.rs:
##########
@@ -437,13 +397,76 @@ impl ValueBuffer {
     }
 }
 
+pub trait MetadataBuilder: std::fmt::Debug {
+    fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, 
ArrowError>;
+    fn field_name(&self, field_id: usize) -> &str;
+    fn num_field_names(&self) -> usize;
+    fn truncate_field_names(&mut self, new_size: usize);
+}

Review Comment:
   Needs comments for sure!



##########
parquet-variant/src/builder.rs:
##########
@@ -437,13 +397,76 @@ impl ValueBuffer {
     }
 }
 
+pub trait MetadataBuilder: std::fmt::Debug {
+    fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, 
ArrowError>;
+    fn field_name(&self, field_id: usize) -> &str;
+    fn num_field_names(&self) -> usize;
+    fn truncate_field_names(&mut self, new_size: usize);
+}
+
+impl MetadataBuilder for MetadataBuilderXX {
+    fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, 
ArrowError> {
+        Ok(self.upsert_field_name(field_name))
+    }
+    fn field_name(&self, field_id: usize) -> &str {
+        self.field_name(field_id)
+    }
+    fn num_field_names(&self) -> usize {
+        self.num_field_names()
+    }
+    fn truncate_field_names(&mut self, new_size: usize) {
+        self.field_names.truncate(new_size)
+    }
+}
+
+#[derive(Debug)]
+pub struct ReadOnlyMetadataBuilder<'m> {
+    metadata: VariantMetadata<'m>,
+    known_field_names: HashMap<&'m str, usize>,
+}
+
+impl<'m> ReadOnlyMetadataBuilder<'m> {
+    pub fn new(metadata: VariantMetadata<'m>) -> Self {
+        Self {
+            metadata,
+            known_field_names: HashMap::new(),
+        }
+    }
+}
+
+impl MetadataBuilder for ReadOnlyMetadataBuilder<'_> {
+    fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, 
ArrowError> {
+        if let Some(field_id) = self.known_field_names.get(field_name) {
+            return Ok(*field_id as u32);
+        }
+
+        // TODO: Be (a lot) smarter here!
+        let Some(field_id) = self.metadata.iter().position(|name| name == 
field_name) else {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "Field name '{field_name}' not found in metadata",
+            )));
+        };
+        self.known_field_names.insert(self.metadata.get_infallible(field_id), 
field_id);
+        Ok(field_id as u32)
+    }
+    fn field_name(&self, field_id: usize) -> &str {
+        &self.metadata[field_id]
+    }
+    fn num_field_names(&self) -> usize {
+        self.metadata.len()
+    }
+    fn truncate_field_names(&mut self, new_size: usize) {
+        debug_assert_eq!(self.metadata.len(), new_size);
+    }
+}
+
 /// Builder for constructing metadata for [`Variant`] values.
 ///
 /// This is used internally by the [`VariantBuilder`] to construct the metadata
 ///
 /// You can use an existing `Vec<u8>` as the metadata buffer by using the 
`from` impl.
 #[derive(Default, Debug)]
-struct MetadataBuilder {
+pub struct MetadataBuilderXX {

Review Comment:
   Yeah my other PR used `DefaultMetadataBuilder`, but I wasn't sure the name 
conveyed what it actually does differently than a "not default" (or "not 
owned") metadata builder might do? Maybe `BasicMetadataBuilder` would convey 
the fact that any other impl must be doing something fancy/special (= not 
basic)?



##########
parquet-variant/src/builder.rs:
##########
@@ -1575,22 +1536,30 @@ impl Drop for ObjectBuilder<'_> {
 pub trait VariantBuilderExt {
     fn append_value<'m, 'v>(&mut self, value: impl Into<Variant<'m, 'v>>);
 
-    fn new_list(&mut self) -> ListBuilder<'_>;
+    fn new_list(&mut self) -> ListBuilder<'_> {
+        self.try_new_list().unwrap()
+    }
+
+    fn new_object(&mut self) -> ObjectBuilder<'_> {
+        self.try_new_object().unwrap()
+    }
 
-    fn new_object(&mut self) -> ObjectBuilder<'_>;
+    fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>;
+
+    fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>;

Review Comment:
   The thing that really put me off of tracking deferred errors is, it became 
yet another thing that has to be rolled back if the offending builder never 
finishes. I'm pretty sure we weren't rolling it back properly before, and I 
didn't want to complicate the `ParentState` stuff even further by "fixing" it, 
if we can just eliminate it.



##########
parquet-variant-compute/src/variant_array_builder.rs:
##########
@@ -136,20 +140,15 @@ impl VariantArrayBuilder {
     pub fn append_null(&mut self) {
         self.nulls.append_null();
         // The subfields are expected to be non-nullable according to the 
parquet variant spec.
-        let metadata_offset = self.metadata_buffer.len();
-        let metadata_length = 0;
-        self.metadata_locations
-            .push((metadata_offset, metadata_length));
-        let value_offset = self.value_buffer.len();
-        let value_length = 0;
-        self.value_locations.push((value_offset, value_length));
+        self.metadata_offsets.push(self.metadata_builder.offset());
+        self.value_offsets.push(self.value_builder.offset());

Review Comment:
   For top-level (which this is), I believe you are correct. Good catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to