scovich commented on code in PR #7878: URL: https://github.com/apache/arrow-rs/pull/7878#discussion_r2197669159
########## parquet-variant/src/variant/list.rs: ##########
@@ -209,9 +208,35 @@ impl<'m, 'v> VariantList<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let offset_buffer = slice_from_slice(
+                self.value,
+                self.header.first_offset_byte()..self.first_value_byte,
+            )?;
+
+            let offsets =
+                map_bytes_to_offsets(offset_buffer, self.header.offset_size).collect::<Vec<_>>();
+
+            // Validate offsets are in-bounds and monotonically increasing.
+            // Since shallow verification checks whether the first and last offsets are in-bounds,
+            // we can also verify all offsets are in-bounds by checking if offsets are monotonically increasing.
+            let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+            if !are_offsets_monotonic {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offsets are not monotonically increasing".to_string(),
+                ));
+            }

Review Comment:
We don't need this check -- the loop below does `offsets[i-1]..offsets[i]` for every `i` in `1..offsets.len()`, and any non-monotonic offset would cause `slice_from_slice` to return an Err like:

> Tried to extract byte(s) 42..25 from 100-byte buffer

Is it worth making an extra pass over the offsets (which requires materializing them) just to have a slightly nicer error message?

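To make the "the slicing below already catches this" point concrete, here is a minimal standalone sketch, not the crate's actual `slice_from_slice` (the error text is only modeled on the message quoted above), showing that an inverted range surfaces as an error with no separate monotonicity pass:

```rust
use std::ops::Range;

// Stand-in for the crate's helper: slice::get rejects both out-of-bounds and
// inverted ranges (start > end), so the error surfaces without any pre-check.
fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], String> {
    bytes.get(range.clone()).ok_or_else(|| {
        format!(
            "Tried to extract byte(s) {}..{} from {}-byte buffer",
            range.start,
            range.end,
            bytes.len()
        )
    })
}

fn main() {
    let buf = [0u8; 100];
    assert!(slice_from_slice(&buf, 42..25).is_err()); // non-monotonic pair fails on its own
    assert!(slice_from_slice(&buf, 10..20).is_ok());
}
```
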
########## parquet-variant/src/variant/object.rs: ##########
@@ -210,9 +209,80 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let field_id_buffer = slice_from_slice(
+                self.value,
+                self.header.field_ids_start_byte()..self.first_field_offset_byte,
+            )?;
+
+            let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size)
+                .collect::<Vec<_>>();
+
+            // Validate all field ids exist in the metadata dictionary and the corresponding field names are lexicographically sorted
+            if self.metadata.is_sorted() {
+                // Since the metadata dictionary has unique and sorted field names, we can also guarantee this object's field names
+                // are lexicographically sorted by their field id ordering
+                if !field_ids.is_sorted() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are sorted, if the last field is smaller than the dictionary size,
+                // we also know all field ids are smaller than the dictionary size and in-bounds.
+                if let Some(&last_field_id) = field_ids.last() {
+                    if last_field_id >= self.metadata.dictionary_size() {
+                        return Err(ArrowError::InvalidArgumentError(
+                            "field id is not valid".to_string(),
+                        ));
+                    }
+                }
+            } else {
+                // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
+                // to check lexicographical order
+                let are_field_names_sorted = field_ids
+                    .iter()
+                    .map(|&i| self.metadata.get(i))
+                    .collect::<Result<Vec<_>, _>>()?
+                    .is_sorted();
+
+                if !are_field_names_sorted {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are not guaranteed to be sorted, scan over all field ids
+                // and check that field ids are less than dictionary size
+
+                let are_field_ids_in_bounds = field_ids
+                    .iter()
+                    .all(|&id| id < self.metadata.dictionary_size());
+
+                if !are_field_ids_in_bounds {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field id is not valid".to_string(),
+                    ));
+                }
+            }
+
+            // Validate whether values are valid variant objects
+            let field_offset_buffer = slice_from_slice(
+                self.value,
+                self.first_field_offset_byte..self.first_value_byte,
+            )?;
+            let num_offsets = field_offset_buffer.len() / self.header.field_offset_size();
+
+            let value_buffer = slice_from_slice(self.value, self.first_value_byte..)?;
+
+            map_bytes_to_offsets(field_offset_buffer, self.header.field_offset_size)
+                .take(num_offsets.saturating_sub(1))

Review Comment:
Shallow validation already ensured the last offset is in bounds, correct?

########## parquet-variant/src/variant/list.rs: ##########
@@ -209,9 +208,35 @@ impl<'m, 'v> VariantList<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let offset_buffer = slice_from_slice(
+                self.value,
+                self.header.first_offset_byte()..self.first_value_byte,
+            )?;
+
+            let offsets =
+                map_bytes_to_offsets(offset_buffer, self.header.offset_size).collect::<Vec<_>>();
+
+            // Validate offsets are in-bounds and monotonically increasing.
+            // Since shallow verification checks whether the first and last offsets are in-bounds,
+            // we can also verify all offsets are in-bounds by checking if offsets are monotonically increasing.
+            let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+            if !are_offsets_monotonic {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offsets are not monotonically increasing".to_string(),
+                ));
+            }

Review Comment:
The simplified version would be something like:
```rust
let mut offsets = map_bytes_to_offsets(offset_buffer, self.header.offset_size);
if let Some(mut start) = offsets.next() {
    for end in offsets {
        ... validate start..end ...
        start = end;
    }
}
```

########## parquet-variant/src/decoder.rs: ##########
@@ -200,6 +200,24 @@ impl OffsetSizeBytes {
     }
 }
 
+pub(crate) fn map_bytes_to_offsets(

Review Comment:
Needs a doc comment explaining what it does and the conditions under which it's panic-free?

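For reference, one possible shape for such a doc comment, sketched under the assumption that the helper decodes fixed-width little-endian chunks the same way the `chunks_exact` version quoted later in this thread does; the enum discriminants, signature, and body here are illustrative, not the PR's actual code:

```rust
/// Width in bytes of each offset/field-id entry (mirrors the enum in decoder.rs;
/// redefined here only so the sketch is self-contained).
#[derive(Clone, Copy)]
pub(crate) enum OffsetSizeBytes {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

/// Interprets `buffer` as a sequence of `offset_size`-wide little-endian unsigned
/// integers and yields each one as a `usize`.
///
/// Panic-freedom: `chunks_exact` only ever yields chunks of exactly `offset_size`
/// bytes, so the fixed-width indexing below cannot go out of bounds for any input.
/// Callers should still pass a `buffer` whose length is a multiple of `offset_size`,
/// otherwise the trailing partial chunk is silently ignored.
pub(crate) fn map_bytes_to_offsets(
    buffer: &[u8],
    offset_size: OffsetSizeBytes,
) -> impl Iterator<Item = usize> + '_ {
    buffer
        .chunks_exact(offset_size as usize)
        .map(move |chunk| match offset_size {
            OffsetSizeBytes::One => chunk[0] as usize,
            OffsetSizeBytes::Two => u16::from_le_bytes([chunk[0], chunk[1]]) as usize,
            OffsetSizeBytes::Three => {
                u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]) as usize
            }
            OffsetSizeBytes::Four => {
                u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
            }
        })
}
```
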
########## parquet-variant/src/variant/object.rs: ##########
@@ -210,9 +209,80 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let field_id_buffer = slice_from_slice(
+                self.value,
+                self.header.field_ids_start_byte()..self.first_field_offset_byte,
+            )?;
+
+            let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size)
+                .collect::<Vec<_>>();
+
+            // Validate all field ids exist in the metadata dictionary and the corresponding field names are lexicographically sorted
+            if self.metadata.is_sorted() {
+                // Since the metadata dictionary has unique and sorted field names, we can also guarantee this object's field names
+                // are lexicographically sorted by their field id ordering
+                if !field_ids.is_sorted() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are sorted, if the last field is smaller than the dictionary size,
+                // we also know all field ids are smaller than the dictionary size and in-bounds.
+                if let Some(&last_field_id) = field_ids.last() {
+                    if last_field_id >= self.metadata.dictionary_size() {
+                        return Err(ArrowError::InvalidArgumentError(
+                            "field id is not valid".to_string(),
+                        ));
+                    }
+                }
+            } else {
+                // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
+                // to check lexicographical order
+                let are_field_names_sorted = field_ids
+                    .iter()
+                    .map(|&i| self.metadata.get(i))
+                    .collect::<Result<Vec<_>, _>>()?
+                    .is_sorted();
+
+                if !are_field_names_sorted {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are not guaranteed to be sorted, scan over all field ids
+                // and check that field ids are less than dictionary size
+
+                let are_field_ids_in_bounds = field_ids
+                    .iter()
+                    .all(|&id| id < self.metadata.dictionary_size());
+
+                if !are_field_ids_in_bounds {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field id is not valid".to_string(),
+                    ));
+                }

Review Comment:
We don't need this check. If any field id were out of bounds, the `collect` at L245 would have already failed.

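A small self-contained illustration of why that extra pass can never fire; the `dictionary` and `field_ids` values here are hypothetical stand-ins, not the crate's API:

```rust
fn main() {
    let dictionary = ["a", "b", "c"]; // stand-in for the metadata dictionary
    let field_ids = [0usize, 1, 7]; // 7 is out of bounds

    // Mirrors the `map(...).collect::<Result<Vec<_>, _>>()?` above: the first
    // out-of-bounds id already produces an Err, so a later all-ids-in-bounds
    // check is unreachable.
    let field_names: Result<Vec<&str>, String> = field_ids
        .iter()
        .map(|&id| {
            dictionary
                .get(id)
                .copied()
                .ok_or_else(|| format!("field id {id} is not valid"))
        })
        .collect();

    assert!(field_names.is_err());
}
```
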
########## parquet-variant/src/variant/object.rs: ##########
@@ -210,9 +209,80 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
 
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let field_id_buffer = slice_from_slice(
+                self.value,
+                self.header.field_ids_start_byte()..self.first_field_offset_byte,
+            )?;
+
+            let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size)
+                .collect::<Vec<_>>();
+
+            // Validate all field ids exist in the metadata dictionary and the corresponding field names are lexicographically sorted
+            if self.metadata.is_sorted() {
+                // Since the metadata dictionary has unique and sorted field names, we can also guarantee this object's field names
+                // are lexicographically sorted by their field id ordering
+                if !field_ids.is_sorted() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are sorted, if the last field is smaller than the dictionary size,
+                // we also know all field ids are smaller than the dictionary size and in-bounds.
+                if let Some(&last_field_id) = field_ids.last() {
+                    if last_field_id >= self.metadata.dictionary_size() {
+                        return Err(ArrowError::InvalidArgumentError(
+                            "field id is not valid".to_string(),
+                        ));
+                    }
+                }
+            } else {
+                // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
+                // to check lexicographical order
+                let are_field_names_sorted = field_ids
+                    .iter()
+                    .map(|&i| self.metadata.get(i))
+                    .collect::<Result<Vec<_>, _>>()?
+                    .is_sorted();
+
+                if !are_field_names_sorted {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are not guaranteed to be sorted, scan over all field ids
+                // and check that field ids are less than dictionary size
+
+                let are_field_ids_in_bounds = field_ids
+                    .iter()
+                    .all(|&id| id < self.metadata.dictionary_size());
+
+                if !are_field_ids_in_bounds {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field id is not valid".to_string(),
+                    ));
+                }
+            }
+
+            // Validate whether values are valid variant objects
+            let field_offset_buffer = slice_from_slice(
+                self.value,
+                self.first_field_offset_byte..self.first_value_byte,
+            )?;
+            let num_offsets = field_offset_buffer.len() / self.header.field_offset_size();
+
+            let value_buffer = slice_from_slice(self.value, self.first_value_byte..)?;
+
+            map_bytes_to_offsets(field_offset_buffer, self.header.field_offset_size)
+                .take(num_offsets.saturating_sub(1))
+                .try_for_each(|offset| {
+                    let value_bytes = slice_from_slice(value_buffer, offset..)?;
+                    Variant::try_new_with_metadata(self.metadata, value_bytes)?;
+
+                    Ok::<_, ArrowError>(())

Review Comment:
does this not work for some reason?
```suggestion
                    Ok(())
```
(`try_for_each` _does_ seem to have a pretty weird `collect`-like signature, but I would have hoped that the previous `?` would have fixed the error type...)

########## parquet-variant/src/variant/metadata.rs: ##########
@@ -228,9 +225,47 @@ impl<'m> VariantMetadata<'m> {
     /// [validation]: Self#Validation
     pub fn with_full_validation(mut self) -> Result<Self, ArrowError> {
         if !self.validated {
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let offset_bytes = slice_from_slice(
+                self.bytes,
+                self.header.first_offset_byte()..self.first_value_byte,
+            )?;
+
+            let offsets =
+                map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();
+
+            // Validate offsets are in-bounds and monotonically increasing.
+            // Since shallow validation ensures the first and last offsets are in bounds, we can also verify all offsets
+            // are in-bounds by checking if offsets are monotonically increasing.
+            let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+            if !are_offsets_monotonic {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offsets not monotonically increasing".to_string(),
+                ));
+            }

Review Comment:
Again, I'm not sure it's worth paying an extra pass for a monotonicity check, when the slicing operations that follow will naturally fail in the presence of non-monotonic offsets?
```rust
let mut offsets = map_bytes_to_offsets(offset_bytes, self.header.offset_size);
if let Some(first_offset) = offsets.next() {
    // Create an iterator over the strings. We must consume the iterator to validate it.
    let mut strings = offsets.scan(first_offset, |start, end| {
        let s = slice_from_slice(value_buffer, *start..end);
        *start = end;
        Some(s)
    });
    if self.header.is_sorted {
        // verify the strings are sorted and unique, in addition to being individually valid
        if let Some(mut a) = strings.next().transpose()? {
            for b in strings {
                let b = b?;
                if a >= b {
                    return Err(... dictionary values are not unique and ordered ...);
                }
                a = b;
            }
        }
    } else {
        // Just verify the strings are all individually valid
        validate_fallible_iterator(strings)?;
    }
}
```
This time, an iterator scan works well because we naturally unpack the errors later.

Note: The above does require making `slice_from_slice` fully generic:
```rust
pub(crate) fn slice_from_slice<T, I: SliceIndex<[T]> + Clone + Debug>(
    bytes: &[T],
    index: I,
) -> Result<&I::Output, ArrowError> {
```

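If useful, a possible body behind that fully generic signature; this is an assumption sketched from the signature above, not the crate's current implementation, and the `arrow_schema::ArrowError` import path plus the error text are illustrative:

```rust
use std::fmt::Debug;
use std::slice::SliceIndex;

use arrow_schema::ArrowError;

/// Generic slicing helper: `get` accepts any `SliceIndex`, so the same function
/// can serve both single-element lookups and range extraction over `&[T]`.
pub(crate) fn slice_from_slice<T, I: SliceIndex<[T]> + Clone + Debug>(
    bytes: &[T],
    index: I,
) -> Result<&I::Output, ArrowError> {
    bytes.get(index.clone()).ok_or_else(|| {
        ArrowError::InvalidArgumentError(format!(
            "Tried to extract {index:?} from {}-element buffer",
            bytes.len()
        ))
    })
}
```
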
+ // As long as we guarantee the largest field id is < dictionary size, + // we can guarantee all field ids are valid metadata dictionaries + + // (2), (3) + let byte_range = self.header.field_ids_start_byte()..self.first_field_offset_byte; + let field_id_bytes = slice_from_slice(self.value, byte_range)?; + // let field_id = self.header.field_id_size.unpack_usize(field_id_bytes, i)?; + + let field_id_chunks = field_id_bytes.chunks_exact(self.header.field_id_size()); + assert!(field_id_chunks.remainder().is_empty()); // guaranteed to be none + + let field_ids = field_id_chunks + .map(|chunk| match self.header.field_id_size { + OffsetSizeBytes::One => chunk[0] as usize, + OffsetSizeBytes::Two => u16::from_le_bytes([chunk[0], chunk[1]]) as usize, + OffsetSizeBytes::Three => { + u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]) as usize + } + OffsetSizeBytes::Four => { + u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize + } + }) + .collect::<Vec<_>>(); Review Comment: > I could inline calls to map_bytes_to_offsets everywhere we loop over field_ids, but that doesn’t seem much better than doing a single allocation I only see one pass over `field_ids`? Two possible paths, but each is single-pass? (the not-sorted-metadata case makes a second pass, but I'm pretty sure it's redundant and will never find a failure -- see other comment) > I think in this case the extra allocation isn’t hurting perf very much Seems like something we could (should) quantify? For a not-small dictionary I would expect materialization cost to be quite noticeable. Anything big enough to fall out L1 cache would likely see a significant slowdown due to a wave of cache misses on every pass. And that includes not just the offset vec itself (easy enough to prefetch), but also the bytes of the strings and values we examine. > and it makes the code a lot simpler to reason about I don't think that's necessarily true for the metadata and list cases above (see other comments). For this case, it would look something like: ```rust let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size); if let Some(mut field_id) = field_ids.next() { if self.metadata.is_sorted() { // Since the metadata dictionary has unique and sorted field names... for next_field_id in field_ids { if next_field_id <= field_id { return Err(... field names not sorted ...); } field_id = next_field_id; } } else { // The metadata dictionary can't guarantee uniqueness... ... very similar to the metadata sortedness check: scan + validate_fallible_iterator ... ... probably could even factor out a helper function that both sites can call ... } if field_id >= self.metadata.dictionary_size() { return Err(... field id is not valid ...); } } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org