klion26 commented on code in PR #7935: URL: https://github.com/apache/arrow-rs/pull/7935#discussion_r2218119219
########## parquet-variant/src/builder.rs: ########## @@ -1134,39 +1197,86 @@ impl<'a> ObjectBuilder<'a> { ))); } - let data_size = self.buffer.offset(); - let num_fields = self.fields.len(); - let is_large = num_fields > u8::MAX as usize; + let metadata_builder = self.parent_state.metadata_builder(); self.fields.sort_by(|&field_a_id, _, &field_b_id, _| { - let key_a = &metadata_builder.field_name(field_a_id as usize); - let key_b = &metadata_builder.field_name(field_b_id as usize); - key_a.cmp(key_b) + let field_a_name = metadata_builder.field_name(field_a_id as usize); + let field_b_name = metadata_builder.field_name(field_b_id as usize); + field_a_name.cmp(field_b_name) }); - let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0); + // the length of the metadata's field names is a very cheap to compute the upper bound. + // it will almost always be a tight upper bound as well -- it would take a pretty + // carefully crafted object to use only the early field ids of a large dictionary. + let max_id = metadata_builder.field_names.len(); Review Comment: IIUC, `metadata_builder.field_names.len()` returns the size of the underlying map, it may be bigger than the actual max `field_id`, changing to this here is that it can avoid one pass for the `field_ids` to calucalute the `max_id`. But this change can make test more diffict(more difficult to calculate the header size) -- such as the failed case `from_json::test::test_json_to_variant_object_very_large` ([test code here](https://github.com/apache/arrow-rs/blob/82821e574df7e699c7a491da90c54429a5a439e9/parquet-variant-json/src/from_json.rs#L633)/[failed ci](https://github.com/apache/arrow-rs/actions/runs/16401976375/job/46342711866?pr=7935)) @alamb @viirya @scovich Maybe we can revert this to the original code(travel one pass from `filed_ids`) what do you think about this? thanks. -- added a separate commit which reverted this logic to see if any other tests failed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org