friendlymatthew commented on code in PR #7878:
URL: https://github.com/apache/arrow-rs/pull/7878#discussion_r2198274030


##########
parquet-variant/src/variant/object.rs:
##########
@@ -210,9 +209,80 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             // by value to all the children (who would otherwise re-validate 
it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to 
prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes 
are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let field_id_buffer = slice_from_slice(
+                self.value,
+                
self.header.field_ids_start_byte()..self.first_field_offset_byte,
+            )?;
+
+            let field_ids = map_bytes_to_offsets(field_id_buffer, 
self.header.field_id_size)
+                .collect::<Vec<_>>();
+
+            // Validate all field ids exist in the metadata dictionary and the 
corresponding field names are lexicographically sorted
+            if self.metadata.is_sorted() {
+                // Since the metadata dictionary has unique and sorted field 
names, we can also guarantee this object's field names
+                // are lexicographically sorted by their field id ordering
+                if !field_ids.is_sorted() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are sorted, if the last field is smaller 
than the dictionary size,
+                // we also know all field ids are smaller than the dictionary 
size and in-bounds.
+                if let Some(&last_field_id) = field_ids.last() {
+                    if last_field_id >= self.metadata.dictionary_size() {
+                        return Err(ArrowError::InvalidArgumentError(
+                            "field id is not valid".to_string(),
+                        ));
+                    }
+                }
+            } else {
+                // The metadata dictionary can't guarantee uniqueness or 
sortedness, so we have to parse out the corresponding field names

Review Comment:
   Hi @viirya, good question. From the specification:
   
   > If `sorted_strings` is set to 1, strings in the dictionary must be unique 
and sorted in lexicographic order. If the value is set to 0, readers may not 
make any assumptions about string order or uniqueness.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to