scovich commented on code in PR #7906:
URL: https://github.com/apache/arrow-rs/pull/7906#discussion_r2201728155


##########
parquet-variant/src/variant/metadata.rs:
##########
@@ -268,6 +261,16 @@ impl<'m> VariantMetadata<'m> {
                         "dictionary values are not unique and ordered".to_string(),
                     ));
                 }
+            } else {
+                // Validate offsets are in-bounds and monotonically increasing.
+                // Since shallow validation ensures the first and last offsets are in bounds, we can also verify all offsets
+                // are in-bounds by checking if offsets are monotonically increasing.
+                let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+                if !are_offsets_monotonic {

Review Comment:
   not sure the extra `let` is helpful?
   ```suggestion
                   if !offsets.is_sorted_by(|a, b| a < b) {
   ```
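
   For reference, a minimal sketch of why the monotonicity check doubles as a bounds check. The helper names and the `value_bytes_len` parameter are hypothetical, not the crate's API, and the check is written with `windows(2)` so it does not depend on which toolchain provides `is_sorted_by`:

   ```rust
   /// Returns true when every offset is strictly greater than its predecessor.
   /// Intended to have the same effect as `offsets.is_sorted_by(|a, b| a < b)`.
   fn offsets_strictly_increasing(offsets: &[usize]) -> bool {
       offsets.windows(2).all(|w| w[0] < w[1])
   }

   /// Hypothetical combined check: the first and last offsets are tested against
   /// the buffer length (the "shallow" part), and strict monotonicity then forces
   /// every interior offset to lie between them.
   fn offsets_valid(offsets: &[usize], value_bytes_len: usize) -> bool {
       match (offsets.first(), offsets.last()) {
           (Some(&first), Some(&last)) => {
               first <= value_bytes_len
                   && last <= value_bytes_len
                   && offsets_strictly_increasing(offsets)
           }
           // Assumption for this sketch: an empty offset list is rejected.
           _ => false,
       }
   }
   ```

   With a buffer of length 8, `[0, 9, 4]` passes the first/last bounds test but is rejected because 9 > 4 breaks monotonicity, so the out-of-bounds interior offset is still caught.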



##########
parquet-variant/src/variant/metadata.rs:
##########
@@ -237,22 +237,15 @@ impl<'m> VariantMetadata<'m> {
             let offsets =
                 map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();

Review Comment:
   Are we still tracking a TODO to eliminate this materialization?
   Once the comment below is addressed, I _think_ it's the only one left.
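
   A rough sketch of what a non-materializing check could look like, assuming the offsets can be consumed as a plain iterator. `decode_offsets` is a hypothetical stand-in for the crate's decoder (not its real signature), and `strictly_increasing` keeps only the previous offset instead of a `Vec`:

   ```rust
   /// Hypothetical stand-in for the offset decoder: lazily yields little-endian
   /// offsets of `width` bytes each. `width` must be non-zero, and any trailing
   /// partial chunk is ignored in this sketch.
   fn decode_offsets(bytes: &[u8], width: usize) -> impl Iterator<Item = usize> + '_ {
       bytes.chunks_exact(width).map(|chunk| {
           // Fold from the most significant byte (last in the chunk) down.
           chunk.iter().rev().fold(0usize, |acc, &b| (acc << 8) | b as usize)
       })
   }

   /// Checks strict monotonicity in a single pass with O(1) extra memory.
   fn strictly_increasing(mut offsets: impl Iterator<Item = usize>) -> bool {
       let Some(mut prev) = offsets.next() else {
           return true; // nothing to compare
       };
       offsets.all(|next| {
           let ok = prev < next;
           prev = next;
           ok
       })
   }
   ```

   Usage would be along the lines of `strictly_increasing(decode_offsets(offset_bytes, 1))`, with no intermediate `collect`.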



##########
parquet-variant/src/variant/object.rs:
##########
@@ -242,6 +242,8 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             } else {
                 // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
                 // to check lexicographical order
+                //
+                // Since we are probing the metadata dictionary by field id, this also verifies field ids are in-bounds
                 let are_field_names_sorted = field_ids
                     .iter()

Review Comment:
   We only make a single pass now, so we no longer need to collect field ids into a vec. The only non-trivial tweak is to request the last field id specifically for the field id bounds check -- O(1) cost, so no need to materialize a whole vec just for that.
   
   While you're at it, consider replacing the `collect` + `is_sorted` pair just below with a single `Iterator::is_sorted` call?
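
   Something like the following is what I have in mind, as a sketch only. Both function names are hypothetical, the dictionary is modeled as a plain `&[&str]`, and the assumption that field ids must be strictly increasing in the sorted-dictionary case is mine, so adjust to whatever the real invariants are:

   ```rust
   /// Unsorted-dictionary case: look up each field name lazily and check the
   /// names are in non-decreasing order (mirroring `is_sorted`). A failed lookup
   /// means the field id is out of bounds, so the bounds check comes for free.
   fn field_names_sorted(field_ids: impl Iterator<Item = usize>, dictionary: &[&str]) -> bool {
       let mut prev: Option<&str> = None;
       for id in field_ids {
           let Some(&name) = dictionary.get(id) else {
               return false; // field id out of bounds
           };
           if prev.is_some_and(|p| p > name) {
               return false; // names out of order
           }
           prev = Some(name);
       }
       true
   }

   /// Sorted-dictionary case (assumed invariant: ids strictly increasing):
   /// only the last, i.e. largest, id needs to be compared with the dictionary size.
   fn field_ids_valid(field_ids: impl Iterator<Item = usize>, dict_len: usize) -> bool {
       let mut last: Option<usize> = None;
       for id in field_ids {
           if last.is_some_and(|p| p >= id) {
               return false; // ids not strictly increasing
           }
           last = Some(id);
       }
       // An empty object has no ids to bounds-check.
       last.map_or(true, |id| id < dict_len)
   }
   ```

   Neither function allocates; each walks the field ids exactly once.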



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
