klion26 commented on code in PR #7935:
URL: https://github.com/apache/arrow-rs/pull/7935#discussion_r2218045675


##########
parquet-variant/src/builder.rs:
##########
@@ -1118,8 +1181,8 @@ impl<'a> ObjectBuilder<'a> {
 
     /// Finalizes this object and appends it to its parent, which otherwise 
remains unmodified.
     pub fn finish(mut self) -> Result<(), ArrowError> {
-        let metadata_builder = self.parent_state.metadata_builder();
         if self.validate_unique_fields && !self.duplicate_fields.is_empty() {
+            let metadata_builder = self.parent_state.metadata_builder();

Review Comment:
   This wants to avoid the double mutable reference problem, but it's not a 
problem after the implementation has changed, I can revert this if needed. 



##########
parquet-variant/src/builder.rs:
##########
@@ -1028,18 +1078,29 @@ impl Drop for ListBuilder<'_> {
 pub struct ObjectBuilder<'a> {
     parent_state: ParentState<'a>,
     fields: IndexMap<u32, usize>, // (field_id, offset)
-    buffer: ValueBuffer,
+    /// the starting offset in the parent's buffer where this object starts
+    object_start_offset: usize,

Review Comment:
   Fixed



##########
parquet-variant/src/builder.rs:
##########
@@ -2357,8 +2673,8 @@ mod tests {
         // Only the second attempt should appear in the final variant
         let (metadata, value) = builder.finish();
         let metadata = VariantMetadata::try_new(&metadata).unwrap();
-        assert_eq!(metadata.len(), 1);
-        assert_eq!(&metadata[0], "name"); // not rolled back
+        assert_eq!(metadata.len(), 1); // rolled back
+        assert_eq!(&metadata[0], "name");

Review Comment:
   Fixed, my bad, this was unintentionally modified.



##########
parquet-variant/src/builder.rs:
##########
@@ -1028,18 +1078,29 @@ impl Drop for ListBuilder<'_> {
 pub struct ObjectBuilder<'a> {
     parent_state: ParentState<'a>,
     fields: IndexMap<u32, usize>, // (field_id, offset)
-    buffer: ValueBuffer,
+    /// the starting offset in the parent's buffer where this object starts

Review Comment:
   Fixed



##########
parquet-variant/src/builder.rs:
##########
@@ -1134,39 +1197,86 @@ impl<'a> ObjectBuilder<'a> {
             )));
         }
 
-        let data_size = self.buffer.offset();
-        let num_fields = self.fields.len();
-        let is_large = num_fields > u8::MAX as usize;
+        let metadata_builder = self.parent_state.metadata_builder();
 
         self.fields.sort_by(|&field_a_id, _, &field_b_id, _| {
-            let key_a = &metadata_builder.field_name(field_a_id as usize);
-            let key_b = &metadata_builder.field_name(field_b_id as usize);
-            key_a.cmp(key_b)
+            let field_a_name = metadata_builder.field_name(field_a_id as 
usize);
+            let field_b_name = metadata_builder.field_name(field_b_id as 
usize);
+            field_a_name.cmp(field_b_name)
         });
 
-        let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0);
+        // the length of the metadata's field names is a very cheap to compute 
the upper bound.
+        // it will almost always be a tight upper bound as well -- it would 
take a pretty
+        // carefully  crafted object to use only the early field ids of a 
large dictionary.
+        let max_id = metadata_builder.field_names.len();

Review Comment:
   IIUC, `metadata_builder.field_names.len()` returns the size of the 
underlying map, it may be bigger than the actual max `field_id`, changing to 
this here is that it can avoid one pass for the `field_ids` to calucalute the 
`max_id`.
   
   But this change can make test more diffict(more difficult to calculate the 
header size) -- such as the failed case 
`from_json::test::test_json_to_variant_object_very_large`  ([test code 
here](https://github.com/apache/arrow-rs/blob/82821e574df7e699c7a491da90c54429a5a439e9/parquet-variant-json/src/from_json.rs#L633)/[failed
 
ci](https://github.com/apache/arrow-rs/actions/runs/16401976375/job/46342711866?pr=7935))
   
   @alamb @viirya @scovich Maybe we can revert this to the original code(travel 
one pass from `filed_ids`) what do you think about this? thanks.



##########
parquet-variant/src/builder.rs:
##########
@@ -506,6 +506,7 @@ enum ParentState<'a> {
         metadata_builder: &'a mut MetadataBuilder,
         fields: &'a mut IndexMap<u32, usize>,
         field_name: &'a str,
+        object_start_offset: usize,

Review Comment:
   Fixed, `object_start_offset` wants to indicate that this is the offset of 
the current object start.



##########
parquet-variant/src/builder.rs:
##########
@@ -1028,18 +1078,29 @@ impl Drop for ListBuilder<'_> {
 pub struct ObjectBuilder<'a> {
     parent_state: ParentState<'a>,
     fields: IndexMap<u32, usize>, // (field_id, offset)
-    buffer: ValueBuffer,
+    /// the starting offset in the parent's buffer where this object starts
+    object_start_offset: usize,
+    /// the starting offset in the parent's metadata buffer where this object 
starts
+    /// used to truncate the written fields in `drop` if the current object 
has not been finished
+    object_meta_start_offset: usize,

Review Comment:
   Fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to