klion26 commented on code in PR #7935:
URL: https://github.com/apache/arrow-rs/pull/7935#discussion_r2221326885


##########
parquet-variant/src/builder.rs:
##########
@@ -1275,38 +1330,80 @@ impl<'a> ObjectBuilder<'a> {
             )));
         }
 
-        let data_size = self.buffer.offset();
-        let num_fields = self.fields.len();
-        let is_large = num_fields > u8::MAX as usize;
+        let metadata_builder = self.parent_state.metadata_builder();
 
         self.fields.sort_by(|&field_a_id, _, &field_b_id, _| {
-            let key_a = &metadata_builder.field_name(field_a_id as usize);
-            let key_b = &metadata_builder.field_name(field_b_id as usize);
-            key_a.cmp(key_b)
+            let field_a_name = metadata_builder.field_name(field_a_id as 
usize);
+            let field_b_name = metadata_builder.field_name(field_b_id as 
usize);
+            field_a_name.cmp(field_b_name)
         });
-
         let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0);

Review Comment:
   change current logic to `let max_id = self.fields.len() - 1` will fail the 
`parquet_varient_json/from_json#test_json_to_variant_object_very_large` test, 
~~I'm looking into it~~.
   
   The tests failed because of `self.fields.len() - 1` may be different from 
`max(field ids)`
   
   The difference comes from that there are objects inside objects. The schema 
looks like below(after some simplification)
   
   ```
   {
     "000": {
       "240": [0,...,127],
       "241": [0,...,127],
       "495":[0,...,127],
     },   <-- when the object finishes, there will be 256 field names in the 
metadatabuilder.field_names("240",.."495"), `self.fields.len()` equals to 256, 
the max_id will be 255
     "001": {
       "240": [0,...,127],
       "241": [0,...,127],
       "495":[0,...,127],
     },  <-- when the object finishes, there will be 257 field names in the 
metadatabuilder.field_names("240",.."495", **"000"**), `self.fields.len()` 
equals to 256, the max_id will be 255
     "002": {
       "240": [0,...,127],
       "241": [0,...,127],
       "495":[0,...,127],
     }  <-- when the object finishes, there will be 258 field names in the 
metadatabuilder.field_names("240",.."495", **"000", "001"**, 
"002"),`self.fields.len()` equals to 256, the max_id will be 255
   }  <-- when the object finishes, there will be 258 field names in the 
metadatabuilder.field_names("240",.."495", **"000", "001", "002"**), 
`self.fields.len()` equals to 3 ("000", "001", "002"), the max_id will be 258
   ```
   
   I'll dig into if there are any solutions we can optimize the max_id 
calculation logic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to