scovich commented on code in PR #7878: URL: https://github.com/apache/arrow-rs/pull/7878#discussion_r2197669159
########## parquet-variant/src/variant/list.rs: ##########
@@ -209,9 +208,35 @@ impl<'m, 'v> VariantList<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let offset_buffer = slice_from_slice(
+                self.value,
+                self.header.first_offset_byte()..self.first_value_byte,
+            )?;
+
+            let offsets =
+                map_bytes_to_offsets(offset_buffer, self.header.offset_size).collect::<Vec<_>>();
+
+            // Validate offsets are in-bounds and monotonically increasing.
+            // Since shallow verification checks whether the first and last offsets are in-bounds,
+            // we can also verify all offsets are in-bounds by checking if offsets are monotonically increasing.
+            let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+            if !are_offsets_monotonic {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offsets are not monotonically increasing".to_string(),
+                ));
+            }

Review Comment:
We don't need this check -- the loop below does `offsets[i-1]..offsets[i]` for every `i` in `1..offsets.len()`, and any non-monotonic offset would cause `slice_from_slice` to return an Err like:

> Tried to extract byte(s) 42..25 from 100-byte buffer

Is it worth making an extra pass over the offsets (which requires materializing them) just to have a slightly nicer error message?

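To make the "the slicing below already catches this" point concrete, here is a minimal standalone sketch, not the crate's actual `slice_from_slice` (the error text is only modeled on the message quoted above), showing that an inverted range surfaces as an error with no separate monotonicity pass:

```rust
use std::ops::Range;

// Stand-in for the crate's helper: slice::get rejects both out-of-bounds and
// inverted ranges (start > end), so the error surfaces without any pre-check.
fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], String> {
    bytes.get(range.clone()).ok_or_else(|| {
        format!(
            "Tried to extract byte(s) {}..{} from {}-byte buffer",
            range.start,
            range.end,
            bytes.len()
        )
    })
}

fn main() {
    let buf = [0u8; 100];
    assert!(slice_from_slice(&buf, 42..25).is_err()); // non-monotonic pair fails on its own
    assert!(slice_from_slice(&buf, 10..20).is_ok());
}
```
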
########## parquet-variant/src/variant/object.rs: ##########
@@ -210,9 +209,80 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let field_id_buffer = slice_from_slice(
+                self.value,
+                self.header.field_ids_start_byte()..self.first_field_offset_byte,
+            )?;
+
+            let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size)
+                .collect::<Vec<_>>();
+
+            // Validate all field ids exist in the metadata dictionary and the corresponding field names are lexicographically sorted
+            if self.metadata.is_sorted() {
+                // Since the metadata dictionary has unique and sorted field names, we can also guarantee this object's field names
+                // are lexicographically sorted by their field id ordering
+                if !field_ids.is_sorted() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are sorted, if the last field is smaller than the dictionary size,
+                // we also know all field ids are smaller than the dictionary size and in-bounds.
+                if let Some(&last_field_id) = field_ids.last() {
+                    if last_field_id >= self.metadata.dictionary_size() {
+                        return Err(ArrowError::InvalidArgumentError(
+                            "field id is not valid".to_string(),
+                        ));
+                    }
+                }
+            } else {
+                // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
+                // to check lexicographical order
+                let are_field_names_sorted = field_ids
+                    .iter()
+                    .map(|&i| self.metadata.get(i))
+                    .collect::<Result<Vec<_>, _>>()?
+                    .is_sorted();
+
+                if !are_field_names_sorted {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are not guaranteed to be sorted, scan over all field ids
+                // and check that field ids are less than dictionary size
+
+                let are_field_ids_in_bounds = field_ids
+                    .iter()
+                    .all(|&id| id < self.metadata.dictionary_size());
+
+                if !are_field_ids_in_bounds {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field id is not valid".to_string(),
+                    ));
+                }
+            }
+
+            // Validate whether values are valid variant objects
+            let field_offset_buffer = slice_from_slice(
+                self.value,
+                self.first_field_offset_byte..self.first_value_byte,
+            )?;
+            let num_offsets = field_offset_buffer.len() / self.header.field_offset_size();
+
+            let value_buffer = slice_from_slice(self.value, self.first_value_byte..)?;
+
+            map_bytes_to_offsets(field_offset_buffer, self.header.field_offset_size)
+                .take(num_offsets.saturating_sub(1))

Review Comment:
Shallow validation already ensured the last offset is in bounds, correct?

########## parquet-variant/src/variant/list.rs: ##########
@@ -209,9 +208,35 @@ impl<'m, 'v> VariantList<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let offset_buffer = slice_from_slice(
+                self.value,
+                self.header.first_offset_byte()..self.first_value_byte,
+            )?;
+
+            let offsets =
+                map_bytes_to_offsets(offset_buffer, self.header.offset_size).collect::<Vec<_>>();
+
+            // Validate offsets are in-bounds and monotonically increasing.
+            // Since shallow verification checks whether the first and last offsets are in-bounds,
+            // we can also verify all offsets are in-bounds by checking if offsets are monotonically increasing.
+            let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+            if !are_offsets_monotonic {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offsets are not monotonically increasing".to_string(),
+                ));
+            }

Review Comment:
The simplified version would be something like:
```rust
let mut offsets = map_bytes_to_offsets(offset_buffer, self.header.offset_size);
if let Some(mut start) = offsets.next() {
    for end in offsets {
        ... validate start..end ...
        start = end;
    }
}
```

########## parquet-variant/src/decoder.rs: ##########
@@ -200,6 +200,24 @@ impl OffsetSizeBytes {
     }
 }
 
+pub(crate) fn map_bytes_to_offsets(

Review Comment:
Needs a doc comment explaining what it does and the conditions under which it's panic-free?

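For reference, one possible shape for such a doc comment, sketched under the assumption that the helper decodes fixed-width little-endian chunks the same way the `chunks_exact` version quoted later in this thread does; the enum discriminants, signature, and body here are illustrative, not the PR's actual code:

```rust
/// Width in bytes of each offset/field-id entry (mirrors the enum in decoder.rs;
/// redefined here only so the sketch is self-contained).
#[derive(Clone, Copy)]
pub(crate) enum OffsetSizeBytes {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

/// Interprets `buffer` as a sequence of `offset_size`-wide little-endian unsigned
/// integers and yields each one as a `usize`.
///
/// Panic-freedom: `chunks_exact` only ever yields chunks of exactly `offset_size`
/// bytes, so the fixed-width indexing below cannot go out of bounds for any input.
/// Callers should still pass a `buffer` whose length is a multiple of `offset_size`,
/// otherwise the trailing partial chunk is silently ignored.
pub(crate) fn map_bytes_to_offsets(
    buffer: &[u8],
    offset_size: OffsetSizeBytes,
) -> impl Iterator<Item = usize> + '_ {
    buffer
        .chunks_exact(offset_size as usize)
        .map(move |chunk| match offset_size {
            OffsetSizeBytes::One => chunk[0] as usize,
            OffsetSizeBytes::Two => u16::from_le_bytes([chunk[0], chunk[1]]) as usize,
            OffsetSizeBytes::Three => {
                u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]) as usize
            }
            OffsetSizeBytes::Four => {
                u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
            }
        })
}
```
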
########## parquet-variant/src/variant/object.rs: ##########
@@ -210,9 +209,80 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let field_id_buffer = slice_from_slice(
+                self.value,
+                self.header.field_ids_start_byte()..self.first_field_offset_byte,
+            )?;
+
+            let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size)
+                .collect::<Vec<_>>();
+
+            // Validate all field ids exist in the metadata dictionary and the corresponding field names are lexicographically sorted
+            if self.metadata.is_sorted() {
+                // Since the metadata dictionary has unique and sorted field names, we can also guarantee this object's field names
+                // are lexicographically sorted by their field id ordering
+                if !field_ids.is_sorted() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are sorted, if the last field is smaller than the dictionary size,
+                // we also know all field ids are smaller than the dictionary size and in-bounds.
+                if let Some(&last_field_id) = field_ids.last() {
+                    if last_field_id >= self.metadata.dictionary_size() {
+                        return Err(ArrowError::InvalidArgumentError(
+                            "field id is not valid".to_string(),
+                        ));
+                    }
+                }
+            } else {
+                // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
+                // to check lexicographical order
+                let are_field_names_sorted = field_ids
+                    .iter()
+                    .map(|&i| self.metadata.get(i))
+                    .collect::<Result<Vec<_>, _>>()?
+                    .is_sorted();
+
+                if !are_field_names_sorted {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are not guaranteed to be sorted, scan over all field ids
+                // and check that field ids are less than dictionary size
+
+                let are_field_ids_in_bounds = field_ids
+                    .iter()
+                    .all(|&id| id < self.metadata.dictionary_size());
+
+                if !are_field_ids_in_bounds {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field id is not valid".to_string(),
+                    ));
+                }

Review Comment:
We don't need this check. If any field id were out of bounds, the `collect` at L245 would have already failed.

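A small self-contained illustration of why that extra pass can never fire; the `dictionary` and `field_ids` values here are hypothetical stand-ins, not the crate's API:

```rust
fn main() {
    let dictionary = ["a", "b", "c"]; // stand-in for the metadata dictionary
    let field_ids = [0usize, 1, 7]; // 7 is out of bounds

    // Mirrors the `map(...).collect::<Result<Vec<_>, _>>()?` above: the first
    // out-of-bounds id already produces an Err, so a later all-ids-in-bounds
    // check is unreachable.
    let field_names: Result<Vec<&str>, String> = field_ids
        .iter()
        .map(|&id| {
            dictionary
                .get(id)
                .copied()
                .ok_or_else(|| format!("field id {id} is not valid"))
        })
        .collect();

    assert!(field_names.is_err());
}
```
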
########## parquet-variant/src/variant/object.rs: ##########
@@ -210,9 +209,80 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let field_id_buffer = slice_from_slice(
+                self.value,
+                self.header.field_ids_start_byte()..self.first_field_offset_byte,
+            )?;
+
+            let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size)
+                .collect::<Vec<_>>();
+
+            // Validate all field ids exist in the metadata dictionary and the corresponding field names are lexicographically sorted
+            if self.metadata.is_sorted() {
+                // Since the metadata dictionary has unique and sorted field names, we can also guarantee this object's field names
+                // are lexicographically sorted by their field id ordering
+                if !field_ids.is_sorted() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are sorted, if the last field is smaller than the dictionary size,
+                // we also know all field ids are smaller than the dictionary size and in-bounds.
+                if let Some(&last_field_id) = field_ids.last() {
+                    if last_field_id >= self.metadata.dictionary_size() {
+                        return Err(ArrowError::InvalidArgumentError(
+                            "field id is not valid".to_string(),
+                        ));
+                    }
+                }
+            } else {
+                // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
+                // to check lexicographical order
+                let are_field_names_sorted = field_ids
+                    .iter()
+                    .map(|&i| self.metadata.get(i))
+                    .collect::<Result<Vec<_>, _>>()?
+                    .is_sorted();
+
+                if !are_field_names_sorted {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are not guaranteed to be sorted, scan over all field ids
+                // and check that field ids are less than dictionary size
+
+                let are_field_ids_in_bounds = field_ids
+                    .iter()
+                    .all(|&id| id < self.metadata.dictionary_size());
+
+                if !are_field_ids_in_bounds {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field id is not valid".to_string(),
+                    ));
+                }
+            }
+
+            // Validate whether values are valid variant objects
+            let field_offset_buffer = slice_from_slice(
+                self.value,
+                self.first_field_offset_byte..self.first_value_byte,
+            )?;
+            let num_offsets = field_offset_buffer.len() / self.header.field_offset_size();
+
+            let value_buffer = slice_from_slice(self.value, self.first_value_byte..)?;
+
+            map_bytes_to_offsets(field_offset_buffer, self.header.field_offset_size)
+                .take(num_offsets.saturating_sub(1))
+                .try_for_each(|offset| {
+                    let value_bytes = slice_from_slice(value_buffer, offset..)?;
+                    Variant::try_new_with_metadata(self.metadata, value_bytes)?;
+
+                    Ok::<_, ArrowError>(())

Review Comment:
does this not work for some reason?
```suggestion
                    Ok(())
```
(`try_for_each` _does_ seem to have a pretty weird `collect`-like signature, but I would have hoped that the previous `?` would have fixed the error type...)

########## parquet-variant/src/variant/metadata.rs: ##########
@@ -228,9 +225,47 @@ impl<'m> VariantMetadata<'m> {
     /// [validation]: Self#Validation
     pub fn with_full_validation(mut self) -> Result<Self, ArrowError> {
         if !self.validated {
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let offset_bytes = slice_from_slice(
+                self.bytes,
+                self.header.first_offset_byte()..self.first_value_byte,
+            )?;
+
+            let offsets =
+                map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();
+
+            // Validate offsets are in-bounds and monotonically increasing.
+            // Since shallow validation ensures the first and last offsets are in bounds, we can also verify all offsets
+            // are in-bounds by checking if offsets are monotonically increasing.
+            let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+            if !are_offsets_monotonic {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offsets not monotonically increasing".to_string(),
+                ));
+            }

Review Comment:
Again, I'm not sure it's worth paying an extra pass for a monotonicity check, when the slicing operations that follow will naturally fail in the presence of non-monotonic offsets?
```rust
let mut offsets = map_bytes_to_offsets(offset_bytes, self.header.offset_size);
if let Some(first_offset) = offsets.next() {
    // Create an iterator over the strings. We must consume the iterator to validate it.
    let mut strings = offsets.scan(first_offset, |start, end| {
        let s = slice_from_slice(value_buffer, *start..end);
        *start = end;
        Some(s)
    });
    if self.header.is_sorted {
        // verify the strings are sorted and unique, in addition to being individually valid
        if let Some(mut a) = strings.next().transpose()? {
            for b in strings {
                let b = b?;
                if a >= b {
                    return Err(... dictionary values are not unique and ordered ...);
                }
                a = b;
            }
        }
    } else {
        // Just verify the strings are all individually valid
        validate_fallible_iterator(strings)?;
    }
}
```
This time, an iterator scan works well because we naturally unpack the errors later.

Note: The above does require making `slice_from_slice` fully generic:
```rust
pub(crate) fn slice_from_slice<T, I: SliceIndex<[T]> + Clone + Debug>(
    bytes: &[T],
    index: I,
) -> Result<&I::Output, ArrowError> {
```

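If useful, a possible body behind that fully generic signature; this is an assumption sketched from the signature above, not the crate's current implementation, and the `arrow_schema::ArrowError` import path plus the error text are illustrative:

```rust
use std::fmt::Debug;
use std::slice::SliceIndex;

use arrow_schema::ArrowError;

/// Generic slicing helper: `get` accepts any `SliceIndex`, so the same function
/// can serve both single-element lookups and range extraction over `&[T]`.
pub(crate) fn slice_from_slice<T, I: SliceIndex<[T]> + Clone + Debug>(
    bytes: &[T],
    index: I,
) -> Result<&I::Output, ArrowError> {
    bytes.get(index.clone()).ok_or_else(|| {
        ArrowError::InvalidArgumentError(format!(
            "Tried to extract {index:?} from {}-element buffer",
            bytes.len()
        ))
    })
}
```
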
+ // As long as we guarantee the largest field id is < dictionary size, + // we can guarantee all field ids are valid metadata dictionaries + + // (2), (3) + let byte_range = self.header.field_ids_start_byte()..self.first_field_offset_byte; + let field_id_bytes = slice_from_slice(self.value, byte_range)?; + // let field_id = self.header.field_id_size.unpack_usize(field_id_bytes, i)?; + + let field_id_chunks = field_id_bytes.chunks_exact(self.header.field_id_size()); + assert!(field_id_chunks.remainder().is_empty()); // guaranteed to be none + + let field_ids = field_id_chunks + .map(|chunk| match self.header.field_id_size { + OffsetSizeBytes::One => chunk[0] as usize, + OffsetSizeBytes::Two => u16::from_le_bytes([chunk[0], chunk[1]]) as usize, + OffsetSizeBytes::Three => { + u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]) as usize + } + OffsetSizeBytes::Four => { + u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize + } + }) + .collect::<Vec<_>>(); Review Comment: > I could inline calls to map_bytes_to_offsets everywhere we loop over field_ids, but that doesn’t seem much better than doing a single allocation I only see one pass over `field_ids`? Two possible paths, but each is single-pass? (the not-sorted-metadata case makes a second pass, but I'm pretty sure it's redundant and will never find a failure -- see other comment) > I think in this case the extra allocation isn’t hurting perf very much Seems like something we could (should) quantify? For a not-small dictionary I would expect materialization cost to be quite noticeable. Anything big enough to fall out L1 cache would likely see a significant slowdown due to a wave of cache misses on every pass. And that includes not just the offset vec itself (easy enough to prefetch), but also the bytes of the strings and values we examine. > and it makes the code a lot simpler to reason about I don't think that's necessarily true for the metadata and list cases above (see other comments). For this case, it would look something like: ```rust let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size); if let Some(mut field_id) = field_ids.next() { if self.metadata.is_sorted() { // Since the metadata dictionary has unique and sorted field names... for next_field_id in field_ids { if next_field_id <= field_id { return Err(... field names not sorted ...); } field_id = next_field_id; } } else { // The metadata dictionary can't guarantee uniqueness... ... very similar to the metadata sortedness check: scan + validate_fallible_iterator ... ... probably could even factor out a helper function that both sites can call ... } if field_id >= self.metadata.dictionary_size() { return Err(... field id is not valid ...); } } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org