alamb commented on code in PR #7878:
URL: https://github.com/apache/arrow-rs/pull/7878#discussion_r2196081537
########## parquet-variant-json/benches/variant_validation.rs: ##########
@@ -0,0 +1,137 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+extern crate parquet_variant;
+extern crate parquet_variant_json;
+
+use criterion::*;
+
+use parquet_variant::{Variant, VariantBuilder};
+use parquet_variant_json::json_to_variant;
+
+fn generate_large_object() -> (Vec<u8>, Vec<u8>) {
+    // 256 elements (keys: 000-255) - each element is an object of 256 elements (240-495) - each
+    // element a list of numbers from 0-127
+    let keys: Vec<String> = (0..=255).map(|n| format!("{n:03}")).collect();

Review Comment:
   I think a non-trivial amount of the time being measured by this benchmark is the creation of the JSON string...



########## parquet-variant-json/benches/variant_validation.rs: ##########
@@ -0,0 +1,137 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+extern crate parquet_variant;

Review Comment:
   I found these benchmarks somewhat confusing -- they are mostly testing json --> Variant conversion I think. While those are useful benchmarks, they probably shouldn't be called `variant_validation`. So I suggest you split this benchmark up into two:
   1. parquet-variant/benches/variant_validation -- creates the `metadata` and `value` programmatically with `VariantBuilder` *once* and then benchmarks calling `Variant::try_new` on that (precreated) metadata/value
   2. parquet-variant-json/benches/variant_json.rs -- creates a json string once and then benchmarks how fast calling `json_to_variant` is
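   For example, option 1 could look roughly like the sketch below (untested; the exact `VariantBuilder` object-building calls and the payload shape are only illustrative -- the point is that construction happens once, outside the measured loop, so only `Variant::try_new` is timed):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use parquet_variant::{Variant, VariantBuilder};

/// Build the (metadata, value) buffers once, outside the measured loop.
/// Hypothetical shape: a flat object with 256 integer fields; the real
/// benchmark would mirror the nested structure from `generate_large_object`.
fn build_large_object() -> (Vec<u8>, Vec<u8>) {
    let mut builder = VariantBuilder::new();
    {
        let mut obj = builder.new_object();
        for i in 0..256i32 {
            obj.insert(&format!("{i:03}"), i);
        }
        obj.finish();
    }
    builder.finish()
}

fn bench_validate_large_object(c: &mut Criterion) {
    // construction is *not* measured -- only validation is
    let (metadata, value) = build_large_object();
    c.bench_function("variant_try_new_large_object", |b| {
        b.iter(|| Variant::try_new(&metadata, &value).unwrap())
    });
}

criterion_group!(benches, bench_validate_large_object);
criterion_main!(benches);
```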
########## parquet-variant/src/variant/list.rs: ##########
@@ -209,9 +208,35 @@ impl<'m, 'v> VariantList<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let offset_buffer = slice_from_slice(

Review Comment:
   it seems like there is a lot of duplicated code that does about the same thing: check that a slice of offsets are all
   1. sorted (potentially)
   2. less than some max offset
   3. point into a valid sub variant

   I wonder if it is possible to factor it all out into a function like

```rust
fn validate_offsets(offset_buffer: &[u8], num_offsets: usize, offset_size: OffsetSize, max_valid_offset: usize) {
...
}
```

   Or something 🤔
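   Something along these lines, reusing the `map_bytes_to_offsets` helper this PR already introduces (just a sketch, not compiled against the branch; whether the monotonicity check needs to be strict probably has to be a parameter or a per-call-site decision):

```rust
/// Sketch of a shared offset check. Names follow the helpers already in this
/// PR (`map_bytes_to_offsets`, `OffsetSizeBytes`, `ArrowError`).
fn validate_offsets(
    offset_buffer: &[u8],
    offset_size: OffsetSizeBytes,
    max_valid_offset: usize,
) -> Result<(), ArrowError> {
    let mut prev: Option<usize> = None;
    for offset in map_bytes_to_offsets(offset_buffer, offset_size) {
        // every offset must stay within the enclosing buffer
        if offset > max_valid_offset {
            return Err(ArrowError::InvalidArgumentError(format!(
                "offset {offset} out of bounds (max {max_valid_offset})"
            )));
        }
        // offsets must be monotonically increasing
        if prev.is_some_and(|p| offset <= p) {
            return Err(ArrowError::InvalidArgumentError(
                "offsets not monotonically increasing".to_string(),
            ));
        }
        prev = Some(offset);
    }
    Ok(())
}
```

   The list/object/metadata call sites would then only need to layer their own type-specific checks (field names, utf-8, recursing into child variants) on top of this.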
########## parquet-variant-json/benches/variant_validation.rs: ##########
@@ -0,0 +1,137 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+extern crate parquet_variant;
+extern crate parquet_variant_json;
+
+use criterion::*;
+
+use parquet_variant::{Variant, VariantBuilder};
+use parquet_variant_json::json_to_variant;
+
+fn generate_large_object() -> (Vec<u8>, Vec<u8>) {
+    // 256 elements (keys: 000-255) - each element is an object of 256 elements (240-495) - each
+    // element a list of numbers from 0-127
+    let keys: Vec<String> = (0..=255).map(|n| format!("{n:03}")).collect();
+    let innermost_list: String = format!(
+        "[{}]",
+        (0..=127)
+            .map(|n| format!("{n}"))
+            .collect::<Vec<_>>()
+            .join(",")
+    );
+    let inner_keys: Vec<String> = (240..=495).map(|n| format!("{n}")).collect();
+    let inner_object = format!(
+        "{{{}:{}}}",
+        inner_keys
+            .iter()
+            .map(|k| format!("\"{k}\""))
+            .collect::<Vec<String>>()
+            .join(format!(":{innermost_list},").as_str()),
+        innermost_list
+    );
+    let json = format!(
+        "{{{}:{}}}",
+        keys.iter()
+            .map(|k| format!("\"{k}\""))
+            .collect::<Vec<String>>()
+            .join(format!(":{inner_object},").as_str()),
+        inner_object
+    );
+    // Manually verify raw JSON value size

Review Comment:
   this comment seems incorrect



########## parquet-variant/src/variant/object.rs: ##########
@@ -210,9 +209,77 @@ impl<'m, 'v> VariantObject<'m, 'v> {
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let field_id_buffer = slice_from_slice(
+                self.value,
+                self.header.field_ids_start_byte()..self.first_field_offset_byte,
+            )?;
+
+            let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size)
+                .collect::<Vec<_>>();
+
+            // Validate all field ids exist in the metadata dictionary and the corresponding field names are lexicographically sorted
+            if self.metadata.is_sorted() {
+                // Since the metadata dictionary has unique and sorted field names, we can also guarantee this object's field names
+                // are lexicographically sorted by their field id ordering
+                if !field_ids.is_sorted() {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids are sorted, if the last field is smaller than the dictionary size,
+                // we also know all field ids are smaller than the dictionary size and in-bounds.
+                if let Some(&last_field_id) = field_ids.last() {
+                    if last_field_id >= self.metadata.dictionary_size() {
+                        return Err(ArrowError::InvalidArgumentError(
+                            "field id is not valid".to_string(),
+                        ));
+                    }
+                }
+            } else {
+                // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
+                // to check lexicographical order
+                let are_field_names_sorted = field_ids
+                    .iter()
+                    .map(|&i| self.metadata.get(i))
+                    .collect::<Result<Vec<_>, _>>()?
+                    .is_sorted();
+
+                if !are_field_names_sorted {
+                    return Err(ArrowError::InvalidArgumentError(
+                        "field names not sorted".to_string(),
+                    ));
+                }
+
+                // Since field ids aren't guaranteed to be sorted, scan through the field ids and pick the largest field id in this object.
+                // If the largest field id is smaller than the dictionary size, we also know all field ids are smaller than the dictionary size and in-bounds.

Review Comment:
   I don't understand why this is faster than simply checking that all the field_ids are less than `self.metadata.dictionary_size()` -- finding the `max` requires looking at all the items anyway
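   i.e. I would expect something like this illustrative helper (not in the PR) to do the same work in a single pass, with the added benefit of bailing out at the first bad id:

```rust
/// Illustrative only: one pass over the ids, no max-tracking needed.
fn all_field_ids_in_bounds(field_ids: &[usize], dictionary_size: usize) -> bool {
    field_ids.iter().all(|&id| id < dictionary_size)
}
```

   (`all` on an empty slice returns `true`, so it also covers the empty-object case that the `field_ids.last()` branch handles today.)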
########## parquet-variant/src/variant/metadata.rs: ##########
@@ -228,9 +225,48 @@ impl<'m> VariantMetadata<'m> {
     /// [validation]: Self#Validation
     pub fn with_full_validation(mut self) -> Result<Self, ArrowError> {
         if !self.validated {
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            let offset_bytes = slice_from_slice(
+                self.bytes,
+                self.header.first_offset_byte()..self.first_value_byte,
+            )?;
+
+            let offsets =
+                map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();
+
+            // Validate offsets are in-bounds and monotonically increasing.
+            // Since shallow validation ensures the first and last offsets are in bounds, we can also verify all offsets
+            // are in-bounds by checking if offsets are monotonically increasing.
+            let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+            if !are_offsets_monotonic {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offsets not monotonically increasing".to_string(),
+                ));
+            }
+
+            // Verify the string values in the dictionary are UTF-8 encoded strings.
+            let value_buffer = slice_from_slice(self.bytes, self.first_value_byte..)?;
+            let value_str = simdutf8::basic::from_utf8(value_buffer)

Review Comment:
   I wonder how much improvement the new crate adds?

   One thing I thought was that if we create a Variant from `JSON` we already know all the values were utf8, so it makes sense to skip validating it again before reading



########## parquet-variant/src/variant/object.rs: ##########
@@ -206,13 +205,118 @@ impl<'m, 'v> VariantObject<'m, 'v> {
     /// [validation]: Self#Validation
     pub fn with_full_validation(mut self) -> Result<Self, ArrowError> {
         if !self.validated {
+            /*
+            (1) the associated variant metadata is [valid] (*)
+            (2) all field ids are valid metadata dictionary entries (*)
+            (3) field ids are lexically ordered according by their corresponding string values (*)
+            (4) all field offsets are in bounds (*)
+            (5) all field values are (recursively) _valid_ variant values (*)
+
+            */
+
+            // (1)
             // Validate the metadata dictionary first, if not already validated, because we pass it
             // by value to all the children (who would otherwise re-validate it repeatedly).
             self.metadata = self.metadata.with_full_validation()?;
-            // Iterate over all string keys in this dictionary in order to prove that the offset
-            // array is valid, all offsets are in bounds, and all string bytes are valid utf-8.
-            validate_fallible_iterator(self.iter_try())?;
+            // (2)
+            // Field ids serve as indexes into the metadata buffer.
+            // As long as we guarantee the largest field id is < dictionary size,
+            // we can guarantee all field ids are valid metadata dictionaries
+
+            // (2), (3)
+            let byte_range = self.header.field_ids_start_byte()..self.first_field_offset_byte;
+            let field_id_bytes = slice_from_slice(self.value, byte_range)?;
+            // let field_id = self.header.field_id_size.unpack_usize(field_id_bytes, i)?;
+
+            let field_id_chunks = field_id_bytes.chunks_exact(self.header.field_id_size());
+            assert!(field_id_chunks.remainder().is_empty()); // guaranteed to be none
+
+            let field_ids = field_id_chunks
+                .map(|chunk| match self.header.field_id_size {
+                    OffsetSizeBytes::One => chunk[0] as usize,
+                    OffsetSizeBytes::Two => u16::from_le_bytes([chunk[0], chunk[1]]) as usize,
+                    OffsetSizeBytes::Three => {
+                        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], 0]) as usize
+                    }
+                    OffsetSizeBytes::Four => {
+                        u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as usize
+                    }
+                })
+                .collect::<Vec<_>>();

Review Comment:
   I agree -- it seems unnecessary to allocate and copy all the values into a Vec here



########## parquet-variant-json/benches/variant_validation.rs: ##########
@@ -0,0 +1,137 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review Comment:
   I ran these benchmarks locally
```shell
cargo bench --bench variant_validation
```
   Here is some output:
```
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 33.9s, or reduce sample count to 10.
Benchmarking bench_validate_large_object: Collecting 100 samples in estimated 33.859 s (100 iterations)
```
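   Given numbers like that, it may be worth configuring the heavier groups explicitly so criterion stops warning. A sketch using standard criterion configuration (nothing specific to this PR; it assumes the existing `bench_validate_large_object` function):

```rust
use std::time::Duration;

use criterion::{criterion_group, criterion_main, Criterion};

criterion_group! {
    name = benches;
    // fewer samples / a longer target time for the heavyweight cases
    config = Criterion::default()
        .sample_size(10)
        .measurement_time(Duration::from_secs(10));
    targets = bench_validate_large_object
}
criterion_main!(benches);
```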