friendlymatthew commented on code in PR #7908:
URL: https://github.com/apache/arrow-rs/pull/7908#discussion_r2202057389


##########
parquet-variant/Cargo.toml:
##########
@@ -37,6 +37,7 @@ arrow-schema = { workspace = true }
 chrono = { workspace = true }
 indexmap = "2.10.0"
 
+simdutf8 = { version = "0.1.5", default-features = false }

Review Comment:
   `simdutf8` is also used in other crates. I wonder if we should make it a
workspace dependency instead of pinning the version here?



##########
parquet-variant/src/utils.rs:
##########
@@ -84,6 +84,16 @@ pub(crate) fn string_from_slice(
         .map_err(|_| ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()))
 }
 
+/// Extracts a byte slice from the given range and validates it as UTF-8.
+pub(crate) fn extract_and_validate_utf8_slice(
+    bytes: &[u8],
+    range: Range<usize>,
+) -> Result<&str, ArrowError> {
+    let offset_buffer = slice_from_slice(bytes, range)?;
+    simdutf8::basic::from_utf8(offset_buffer)
+        .map_err(|_| ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()))
+}
+

Review Comment:
   This looks great. I think @viirya pointed out that if this function errors,
the error message will contain the entire byte slice.
   
   I wonder if we could use something like the existing helper in
`parquet/src/util/utf8.rs`:
   
   
https://github.com/apache/arrow-rs/blob/7b219f98c25fcd318a0c207f51a41398d1b23724/parquet/src/util/utf8.rs#L40-L57
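
To make the suggestion concrete, here is a rough, std-only sketch of the shape I have in mind: on failure, report where validation stopped instead of echoing the bytes. The assumptions are mine and not taken from the PR. `InvalidArgumentError` below is a hypothetical stand-in for `ArrowError::InvalidArgumentError`, `bytes.get(range)` stands in for `slice_from_slice`, and `std::str::from_utf8` stands in for `simdutf8`. With `simdutf8`, I believe the equivalent would be `simdutf8::basic::from_utf8` on the happy path, with `simdutf8::compat::from_utf8` called only on the error path to get the position details.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `ArrowError::InvalidArgumentError`,
/// used so that this sketch compiles without the arrow crates.
#[derive(Debug, PartialEq)]
pub struct InvalidArgumentError(pub String);

/// Extracts `range` from `bytes` and validates it as UTF-8.
///
/// On failure the message reports only the offset (relative to the start of
/// the extracted slice) of the first invalid byte, never the bytes themselves.
pub fn extract_and_validate_utf8_slice(
    bytes: &[u8],
    range: Range<usize>,
) -> Result<&str, InvalidArgumentError> {
    // Bounds-checked slicing; assumed to play the role of `slice_from_slice`.
    let slice = bytes.get(range.clone()).ok_or_else(|| {
        InvalidArgumentError(format!(
            "range {:?} is out of bounds for a buffer of length {}",
            range,
            bytes.len()
        ))
    })?;

    // std validation as a stand-in for simdutf8; `valid_up_to` gives the
    // length of the longest valid UTF-8 prefix of `slice`.
    std::str::from_utf8(slice).map_err(|e| {
        InvalidArgumentError(format!(
            "invalid UTF-8 string: first invalid byte at offset {}",
            e.valid_up_to()
        ))
    })
}
```

For example, `extract_and_validate_utf8_slice(b"hello world", 0..5)` yields `Ok("hello")`, while a slice containing a stray `0xFF` produces an error that names the offset only.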



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
