friendlymatthew commented on code in PR #7908:
URL: https://github.com/apache/arrow-rs/pull/7908#discussion_r2202057389

##########
parquet-variant/Cargo.toml:
##########
@@ -37,6 +37,7 @@ arrow-schema = { workspace = true }
 chrono = { workspace = true }
 indexmap = "2.10.0"
+simdutf8 = { version = "0.1.5", default-features = false }

Review Comment:
   `simdutf8` is used in other crates as well. I wonder if we should move this out to a workspace dependency?

##########
parquet-variant/src/utils.rs:
##########
@@ -84,6 +84,16 @@ pub(crate) fn string_from_slice(
         .map_err(|_| ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()))
 }
 
+/// Extracts a byte slice from the given range and validates it as UTF-8.
+pub(crate) fn extract_and_validate_utf8_slice(
+    bytes: &[u8],
+    range: Range<usize>,
+) -> Result<&str, ArrowError> {
+    let offset_buffer = slice_from_slice(bytes, range)?;
+    simdutf8::basic::from_utf8(offset_buffer)
+        .map_err(|_| ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()))
+}
+

Review Comment:
   This looks great. I think @viirya pointed out that if this function errs, the error message will contain the entire byte slice. I wonder if we could use something like:
   https://github.com/apache/arrow-rs/blob/7b219f98c25fcd318a0c207f51a41398d1b23724/parquet/src/util/utf8.rs#L40-L57
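
To make the second comment concrete, here is a minimal sketch of a validation helper whose error describes where validation failed without echoing the bytes. It is an illustration under stated assumptions, not the PR's code: `InvalidArgument` is a hypothetical stand-in for `ArrowError::InvalidArgumentError`, `bytes.get(range)` stands in for `slice_from_slice`, and `std::str::from_utf8` stands in for `simdutf8::basic::from_utf8` so that the block compiles with the standard library alone.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `ArrowError::InvalidArgumentError`, used so the
/// sketch compiles without the arrow crates.
#[derive(Debug, PartialEq)]
struct InvalidArgument(String);

/// Extracts `range` from `bytes` and validates it as UTF-8.
///
/// On failure the message carries only offsets and lengths, never the
/// bytes themselves, so a large or sensitive buffer is not copied into
/// the error.
fn extract_and_validate_utf8_slice(
    bytes: &[u8],
    range: Range<usize>,
) -> Result<&str, InvalidArgument> {
    // Bounds-checked slicing; `get` returns `None` instead of panicking.
    let slice = bytes.get(range.clone()).ok_or_else(|| {
        InvalidArgument(format!(
            "range {}..{} is out of bounds for buffer of length {}",
            range.start,
            range.end,
            bytes.len()
        ))
    })?;

    // Assumption: std's validator replaces the SIMD one for this sketch.
    std::str::from_utf8(slice).map_err(|e| {
        InvalidArgument(format!(
            "invalid UTF-8 string: valid up to byte offset {} of a {}-byte slice",
            e.valid_up_to(),
            slice.len()
        ))
    })
}
```

With this shape, a caller passing `&[b'a', b'b', 0xFF, b'c']` and `0..4` would get an error that names offset 2 and the slice length, which is usually enough to locate the bad value. Whether the extra detail is worth a second validation pass when using `simdutf8` is a judgment call for the PR author.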