friendlymatthew commented on code in PR #7908:
URL: https://github.com/apache/arrow-rs/pull/7908#discussion_r2202057389
##########
parquet-variant/Cargo.toml:
##########
@@ -37,6 +37,7 @@ arrow-schema = { workspace = true }
chrono = { workspace = true }
indexmap = "2.10.0"
+simdutf8 = { version = "0.1.5", default-features = false }
Review Comment:
`simdutf8` is used in other crates as well. I wonder if we should declare it
once as a workspace dependency instead?
##########
parquet-variant/src/utils.rs:
##########
@@ -84,6 +84,16 @@ pub(crate) fn string_from_slice(
         .map_err(|_| ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()))
}
+/// Extracts a byte slice from the given range and validates it as UTF-8.
+pub(crate) fn extract_and_validate_utf8_slice(
+    bytes: &[u8],
+    range: Range<usize>,
+) -> Result<&str, ArrowError> {
+    let offset_buffer = slice_from_slice(bytes, range)?;
+    simdutf8::basic::from_utf8(offset_buffer)
+        .map_err(|_| ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()))
+}
+
Review Comment:
This looks great. I think @viirya pointed out that if this function errs,
the error message will contain the entire byte slice.
I wonder if we could use something like:
https://github.com/apache/arrow-rs/blob/7b219f98c25fcd318a0c207f51a41398d1b23724/parquet/src/util/utf8.rs#L40-L57
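To illustrate the general idea, here is a minimal std-only sketch that reports where validation failed instead of echoing the bytes. `validate_utf8_range` and the `String` error type are hypothetical stand-ins (the PR itself uses `slice_from_slice`, `simdutf8`, and `ArrowError`), and this is not a copy of the code at the link above.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `extract_and_validate_utf8_slice`, std only.
/// On failure the message carries offsets and lengths, never the bytes.
fn validate_utf8_range(bytes: &[u8], range: Range<usize>) -> Result<&str, String> {
    // Bounds-checked slicing; `get` returns None for an out-of-range or inverted range.
    let slice = bytes.get(range.clone()).ok_or_else(|| {
        format!(
            "range {}..{} is out of bounds for a buffer of {} bytes",
            range.start,
            range.end,
            bytes.len()
        )
    })?;

    // `Utf8Error::valid_up_to` gives the length of the longest valid prefix,
    // i.e. the offset of the first invalid byte within `slice`.
    std::str::from_utf8(slice).map_err(|e| {
        format!(
            "invalid UTF-8 string: first invalid byte at offset {} of a {}-byte slice",
            e.valid_up_to(),
            slice.len()
        )
    })
}
```

With this shape the error stays a fixed, small size no matter how large the input slice is. The same approach could be layered behind a fast `simdutf8::basic` check so the detailed pass only runs on the failure path.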
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.