scovich commented on code in PR #8167: URL: https://github.com/apache/arrow-rs/pull/8167#discussion_r2286033483
########## parquet-variant/src/variant/metadata.rs: ########## @@ -331,6 +331,11 @@ impl<'m> VariantMetadata<'m> { self.iter_try() .map(|result| result.expect("Invalid metadata dictionary entry")) } + + /// Same as `Index::index`, but with the correct lifetime. + pub(crate) fn get_infallible(&self, i: usize) -> &'m str { Review Comment: The`Index::index` method has the following signature: ```rust fn index(&self, i: usize) -> &Self::Output ``` which is really ```rust fn index<'a>(&'a self, i: usize) -> &'a Self::Output ``` and because the (elided) `'a` is not related to `'m`, we can't use the return value in a place that requires `&'m str`. You're right that `self.get(i).unwrap()` would also work; I was just trying to preserve the `expect` at both call sites. Should we just `unwrap` everywhere and be done with it? NOTE: The `get_infallible` should be private, except then `Index` can't access it. ########## parquet-variant/src/builder.rs: ########## @@ -1428,54 +1415,61 @@ impl<'a> ObjectBuilder<'a> { } // Returns validate_unique_fields because we can no longer reference self once this method returns. - fn parent_state<'b>(&'b mut self, key: &'b str) -> (ParentState<'b>, bool) { + fn parent_state<'b>( + &'b mut self, + field_name: &'b str, + ) -> Result<(ParentState<'b>, bool), ArrowError> { + let saved_parent_buffer_offset = self.parent_state.saved_buffer_offset(); let validate_unique_fields = self.validate_unique_fields; - let (buffer, metadata_builder) = self.parent_state.buffer_and_metadata_builder(); - - let state = ParentState::Object { + let state = ParentState::object( buffer, metadata_builder, - fields: &mut self.fields, - field_name: key, - parent_value_offset_base: self.parent_value_offset_base, - }; - (state, validate_unique_fields) + &mut self.fields, + saved_parent_buffer_offset, + field_name, + validate_unique_fields, + )?; + Ok((state, validate_unique_fields)) } /// Returns an object builder that can be used to append a new (nested) object to this object. /// + /// Panics if the requested key cannot be inserted (e.g. because it is a duplicate). Review Comment: > it could panic [if] the builder has read only metadata, right? Yep! And having the same method fail for both error cases is really appealing to me. (I just wrote the code comment before adding the read-only builder to this PR) > We could avoid panic's [by deferring] errors until the final finish() is called The goal was to make `finish` signature infallible, now that it actually is infallible in practice. I just didn't want to bloat this PR with the extra churn that would cause. Plus, it always seemed weird to do extra work to defer an obvious known failure until later. Is there a disadvantage to failing immediately? ########## parquet-variant/src/builder.rs: ########## @@ -437,13 +397,76 @@ impl ValueBuffer { } } +pub trait MetadataBuilder: std::fmt::Debug { + fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError>; + fn field_name(&self, field_id: usize) -> &str; + fn num_field_names(&self) -> usize; + fn truncate_field_names(&mut self, new_size: usize); +} Review Comment: Needs comments for sure! ########## parquet-variant/src/builder.rs: ########## @@ -437,13 +397,76 @@ impl ValueBuffer { } } +pub trait MetadataBuilder: std::fmt::Debug { + fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError>; + fn field_name(&self, field_id: usize) -> &str; + fn num_field_names(&self) -> usize; + fn truncate_field_names(&mut self, new_size: usize); +} + +impl MetadataBuilder for MetadataBuilderXX { + fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError> { + Ok(self.upsert_field_name(field_name)) + } + fn field_name(&self, field_id: usize) -> &str { + self.field_name(field_id) + } + fn num_field_names(&self) -> usize { + self.num_field_names() + } + fn truncate_field_names(&mut self, new_size: usize) { + self.field_names.truncate(new_size) + } +} + +#[derive(Debug)] +pub struct ReadOnlyMetadataBuilder<'m> { + metadata: VariantMetadata<'m>, + known_field_names: HashMap<&'m str, usize>, +} + +impl<'m> ReadOnlyMetadataBuilder<'m> { + pub fn new(metadata: VariantMetadata<'m>) -> Self { + Self { + metadata, + known_field_names: HashMap::new(), + } + } +} + +impl MetadataBuilder for ReadOnlyMetadataBuilder<'_> { + fn try_upsert_field_name(&mut self, field_name: &str) -> Result<u32, ArrowError> { + if let Some(field_id) = self.known_field_names.get(field_name) { + return Ok(*field_id as u32); + } + + // TODO: Be (a lot) smarter here! + let Some(field_id) = self.metadata.iter().position(|name| name == field_name) else { + return Err(ArrowError::InvalidArgumentError(format!( + "Field name '{field_name}' not found in metadata", + ))); + }; + self.known_field_names.insert(self.metadata.get_infallible(field_id), field_id); + Ok(field_id as u32) + } + fn field_name(&self, field_id: usize) -> &str { + &self.metadata[field_id] + } + fn num_field_names(&self) -> usize { + self.metadata.len() + } + fn truncate_field_names(&mut self, new_size: usize) { + debug_assert_eq!(self.metadata.len(), new_size); + } +} + /// Builder for constructing metadata for [`Variant`] values. /// /// This is used internally by the [`VariantBuilder`] to construct the metadata /// /// You can use an existing `Vec<u8>` as the metadata buffer by using the `from` impl. #[derive(Default, Debug)] -struct MetadataBuilder { +pub struct MetadataBuilderXX { Review Comment: Yeah my other PR used `DefaultMetadataBuilder`, but I wasn't sure the name conveyed what it actually does differently than a "not default" (or "not owned") metadata builder might do? Maybe `BasicMetadataBuilder` would convey the fact that any other impl must be doing something fancy/special (= not basic)? ########## parquet-variant/src/builder.rs: ########## @@ -1575,22 +1536,30 @@ impl Drop for ObjectBuilder<'_> { pub trait VariantBuilderExt { fn append_value<'m, 'v>(&mut self, value: impl Into<Variant<'m, 'v>>); - fn new_list(&mut self) -> ListBuilder<'_>; + fn new_list(&mut self) -> ListBuilder<'_> { + self.try_new_list().unwrap() + } + + fn new_object(&mut self) -> ObjectBuilder<'_> { + self.try_new_object().unwrap() + } - fn new_object(&mut self) -> ObjectBuilder<'_>; + fn try_new_list(&mut self) -> Result<ListBuilder<'_>, ArrowError>; + + fn try_new_object(&mut self) -> Result<ObjectBuilder<'_>, ArrowError>; Review Comment: The thing that really put me off of tracking deferred errors is, it became yet another thing that has to be rolled back if the offending builder never finishes. I'm pretty sure we weren't rolling it back properly before, and I didn't want to complicate the `ParentState` stuff even further by "fixing" it, if we can just eliminate it. ########## parquet-variant-compute/src/variant_array_builder.rs: ########## @@ -136,20 +140,15 @@ impl VariantArrayBuilder { pub fn append_null(&mut self) { self.nulls.append_null(); // The subfields are expected to be non-nullable according to the parquet variant spec. - let metadata_offset = self.metadata_buffer.len(); - let metadata_length = 0; - self.metadata_locations - .push((metadata_offset, metadata_length)); - let value_offset = self.value_buffer.len(); - let value_length = 0; - self.value_locations.push((value_offset, value_length)); + self.metadata_offsets.push(self.metadata_builder.offset()); + self.value_offsets.push(self.value_builder.offset()); Review Comment: For top-level (which this is), I believe you are correct. Good catch! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org