alamb commented on code in PR #7807: URL: https://github.com/apache/arrow-rs/pull/7807#discussion_r2173717027
########## parquet-variant/src/variant/metadata.rs: ########## @@ -204,8 +292,8 @@ mod tests { let md = VariantMetadata::try_new(bytes).expect("should parse"); assert_eq!(md.dictionary_size(), 2); // Fields - assert_eq!(md.get(0).unwrap(), "cat"); - assert_eq!(md.get(1).unwrap(), "dog"); + assert_eq!(&md[0], "cat"); Review Comment: that looks much better ########## parquet-variant/src/variant/metadata.rs: ########## @@ -161,22 +230,41 @@ impl<'m> VariantMetadata<'m> { .unpack_usize(bytes, NUM_HEADER_BYTES, i + 1) } - /// Gets a dictionary entry by index + /// Attempts to retrieve a dictionary entry by index, failing if out of bounds or if the + /// underlying bytes are [invalid]. + /// + /// [invalid]: Self#Validation pub fn get(&self, i: usize) -> Result<&'m str, ArrowError> { let byte_range = self.get_offset(i)?..self.get_offset(i + 1)?; string_from_slice(self.bytes, self.dictionary_key_start_byte, byte_range) } - /// Get all dictionary entries as an Iterator of strings + /// Returns an iterator that attempts to visit all dictionary entries, producing `Err` if the + /// iterator encounters [invalid] data. + /// + /// [invalid]: Self#Validation + pub fn iter_try(&self) -> impl Iterator<Item = Result<&'m str, ArrowError>> + '_ { + (0..self.dictionary_size).map(move |i| self.get(i)) + } + + /// Iterates oer all dictionary entries. When working with [unvalidated] input, prefer + /// [`Self::iter_try`] to avoid panics due to invalid data. + /// + /// [unvalidated]: Self#Validation pub fn iter(&self) -> impl Iterator<Item = &'m str> + '_ { - // NOTE: It is safe to unwrap because the constructor already made a successful traversal. - self.iter_checked().map(Result::unwrap) + self.iter_try().map(Result::unwrap) } +} + +/// Retrieves the ith dictionary entry, panicking if the index is out of bounds. Accessing Review Comment: 👍 ########## parquet-variant/src/variant/metadata.rs: ########## @@ -78,35 +78,75 @@ impl VariantMetadataHeader { /// /// See the [Variant Spec] file for more information /// +/// # Validation +/// +/// Every instance of variant metadata is either _valid_ or _invalid_. depending on whether the +/// underlying bytes are a valid encoding of variant metadata (see below). +/// +/// Instances produced by [`Self::try_new`] or [`Self::validate`] are fully _validated_, and always +/// contain _valid_ data. +/// +/// Instances produced by [`Self::new`] are _unvalidated_ and may contain either _valid_ or +/// _invalid_ data. Infallible accesses such as iteration and indexing may panic if the underlying +/// bytes are _invalid_, and fallible alternatives such as [`Self::iter_try`] and [`Self::get`] are +/// strongly recommended. [`Self::validate`] can also be used to _validate_ an _unvalidated_ +/// instance, if desired. Review Comment: Rather than "strongly recommend" I think it would be better here to spell out the differences. Something like ``` For performance reasons, some users may skip validation even for untrusted user input. In these cases, care must be taken to use the fallible (error checking) APIs, otherwise your code may panic at runtime. When in doubt, use `Self::new` to ensure validation. ``` I also think it is important to point out that in no case will unvalidated data cause `unsafe` (in the rust sense of the word) behavior ``` # Safety All Variant APIs are memory safe. While accessing an unvalidated Variant may result in a panic, it will **not** result in undefined behavior (it is not `unsafe` in the Rust sense) ``` ########## parquet-variant/src/variant/metadata.rs: ########## @@ -78,35 +78,75 @@ impl VariantMetadataHeader { /// /// See the [Variant Spec] file for more information /// +/// # Validation +/// +/// Every instance of variant metadata is either _valid_ or _invalid_. depending on whether the +/// underlying bytes are a valid encoding of variant metadata (see below). +/// +/// Instances produced by [`Self::try_new`] or [`Self::validate`] are fully _validated_, and always +/// contain _valid_ data. +/// +/// Instances produced by [`Self::new`] are _unvalidated_ and may contain either _valid_ or +/// _invalid_ data. Infallible accesses such as iteration and indexing may panic if the underlying +/// bytes are _invalid_, and fallible alternatives such as [`Self::iter_try`] and [`Self::get`] are +/// strongly recommended. [`Self::validate`] can also be used to _validate_ an _unvalidated_ +/// instance, if desired. +/// +/// A _validated_ instance guarantees that: +/// +/// - header byte is valid +/// - dictionary size is in bounds +/// - offset array content is in-bounds +/// - first offset is zero +/// - last offset is in-bounds +/// - all other offsets are in-bounds (*) +/// - all offsets are monotonically increasing (*) +/// - all values are valid utf-8 (*) Review Comment: I agree -- the API sketched out in this PR looks great to me -- ergonomic to use as well as offering optimized path access when needed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org