alamb commented on code in PR #7807:
URL: https://github.com/apache/arrow-rs/pull/7807#discussion_r2173717027


##########
parquet-variant/src/variant/metadata.rs:
##########
@@ -204,8 +292,8 @@ mod tests {
         let md = VariantMetadata::try_new(bytes).expect("should parse");
         assert_eq!(md.dictionary_size(), 2);
         // Fields
-        assert_eq!(md.get(0).unwrap(), "cat");
-        assert_eq!(md.get(1).unwrap(), "dog");
+        assert_eq!(&md[0], "cat");

Review Comment:
   that looks much better



##########
parquet-variant/src/variant/metadata.rs:
##########
@@ -161,22 +230,41 @@ impl<'m> VariantMetadata<'m> {
             .unpack_usize(bytes, NUM_HEADER_BYTES, i + 1)
     }
 
-    /// Gets a dictionary entry by index
+    /// Attempts to retrieve a dictionary entry by index, failing if out of 
bounds or if the
+    /// underlying bytes are [invalid].
+    ///
+    /// [invalid]: Self#Validation
     pub fn get(&self, i: usize) -> Result<&'m str, ArrowError> {
         let byte_range = self.get_offset(i)?..self.get_offset(i + 1)?;
         string_from_slice(self.bytes, self.dictionary_key_start_byte, 
byte_range)
     }
 
-    /// Get all dictionary entries as an Iterator of strings
+    /// Returns an iterator that attempts to visit all dictionary entries, 
producing `Err` if the
+    /// iterator encounters [invalid] data.
+    ///
+    /// [invalid]: Self#Validation
+    pub fn iter_try(&self) -> impl Iterator<Item = Result<&'m str, 
ArrowError>> + '_ {
+        (0..self.dictionary_size).map(move |i| self.get(i))
+    }
+
+    /// Iterates oer all dictionary entries. When working with [unvalidated] 
input, prefer
+    /// [`Self::iter_try`] to avoid panics due to invalid data.
+    ///
+    /// [unvalidated]: Self#Validation
     pub fn iter(&self) -> impl Iterator<Item = &'m str> + '_ {
-        // NOTE: It is safe to unwrap because the constructor already made a 
successful traversal.
-        self.iter_checked().map(Result::unwrap)
+        self.iter_try().map(Result::unwrap)
     }
+}
+
+/// Retrieves the ith dictionary entry, panicking if the index is out of 
bounds. Accessing

Review Comment:
   👍 



##########
parquet-variant/src/variant/metadata.rs:
##########
@@ -78,35 +78,75 @@ impl VariantMetadataHeader {
 ///
 /// See the [Variant Spec] file for more information
 ///
+/// # Validation
+///
+/// Every instance of variant metadata is either _valid_ or _invalid_. 
depending on whether the
+/// underlying bytes are a valid encoding of variant metadata (see below).
+///
+/// Instances produced by [`Self::try_new`] or [`Self::validate`] are fully 
_validated_, and always
+/// contain _valid_ data.
+///
+/// Instances produced by [`Self::new`] are _unvalidated_ and may contain 
either _valid_ or
+/// _invalid_ data. Infallible accesses such as iteration and indexing may 
panic if the underlying
+/// bytes are _invalid_, and fallible alternatives such as [`Self::iter_try`] 
and [`Self::get`] are
+/// strongly recommended. [`Self::validate`] can also be used to _validate_ an 
_unvalidated_
+/// instance, if desired.

Review Comment:
   Rather than "strongly recommend" I think it would be better here to spell 
out the differences. Something like
   
   ```
   For performance reasons, some users may skip validation even for untrusted 
user input. In these cases, care must be taken to use the fallible (error 
checking) APIs, otherwise your code may panic at runtime. 
   
   When in doubt, use `Self::new`  to ensure validation. 
   ```
   
   I also think it is important to point out that in no case will unvalidated 
data cause `unsafe` (in the rust sense of the word) behavior
   
   ```
   
   # Safety
   All Variant APIs are memory safe. While accessing an unvalidated Variant may 
result in a panic,
   it will **not** result in undefined behavior (it is not `unsafe` in the Rust 
sense)
   ```



##########
parquet-variant/src/variant/metadata.rs:
##########
@@ -78,35 +78,75 @@ impl VariantMetadataHeader {
 ///
 /// See the [Variant Spec] file for more information
 ///
+/// # Validation
+///
+/// Every instance of variant metadata is either _valid_ or _invalid_. 
depending on whether the
+/// underlying bytes are a valid encoding of variant metadata (see below).
+///
+/// Instances produced by [`Self::try_new`] or [`Self::validate`] are fully 
_validated_, and always
+/// contain _valid_ data.
+///
+/// Instances produced by [`Self::new`] are _unvalidated_ and may contain 
either _valid_ or
+/// _invalid_ data. Infallible accesses such as iteration and indexing may 
panic if the underlying
+/// bytes are _invalid_, and fallible alternatives such as [`Self::iter_try`] 
and [`Self::get`] are
+/// strongly recommended. [`Self::validate`] can also be used to _validate_ an 
_unvalidated_
+/// instance, if desired.
+///
+/// A _validated_ instance guarantees that:
+///
+/// - header byte is valid
+/// - dictionary size is in bounds
+/// - offset array content is in-bounds
+/// - first offset is zero
+/// - last offset is in-bounds
+/// - all other offsets are in-bounds (*)
+/// - all offsets are monotonically increasing (*)
+/// - all values are valid utf-8 (*)

Review Comment:
   I agree -- the API sketched out in this PR looks great to me -- ergonomic to 
use as well as offering optimized path access when needed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to