alamb commented on code in PR #9971:
URL: https://github.com/apache/arrow-datafusion/pull/9971#discussion_r1554556162
##########
datafusion/functions/src/string/lower.rs:
##########
@@ -62,6 +62,6 @@ impl ScalarUDFImpl for LowerFunc {
}
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
- handle(args, |string| string.to_lowercase(), "lower")
+ case_conversion(args, |string| string.to_lowercase(), "lower")
Review Comment:
It seems like to_lowercase actually works on unicode values, so it may
change the length of the strings:
https://doc.rust-lang.org/std/primitive.str.html#method.to_lowercase
##########
datafusion/functions/src/string/common.rs:
##########
@@ -103,6 +104,7 @@ pub(crate) fn general_trim<T: OffsetSizeTrait>(
/// This function errors when:
/// * the number of arguments is not 1
/// * the first argument is not castable to a `GenericStringArray`
+#[allow(dead_code)]
Review Comment:
Are these functions still needed? Can we instead simply delete them if they
are no longer used (since they are crate private)?
They don't appear to be used anywhere else:
https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20unary_string_function&type=code
##########
datafusion/functions/src/string/upper.rs:
##########
@@ -59,6 +59,49 @@ impl ScalarUDFImpl for UpperFunc {
}
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
- handle(args, |string| string.to_uppercase(), "upper")
+ case_conversion(args, |string| string.to_uppercase(), "upper")
+ }
+}
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+ use arrow::array::{ArrayRef, StringArray};
+ use std::sync::Arc;
+
+ #[test]
+ fn upper() -> Result<()> {
+ let string_array = Arc::new(StringArray::from(vec![
+ Some("arrow"),
+ None,
+ Some("datafusion"),
Review Comment:
Can you please also add tests for non ascii characters whose upper/lowercase
lengths are different?
It seems like this is possible
https://stackoverflow.com/questions/48351343/can-lowercasing-a-utf-8-string-cause-it-to-grow
##########
datafusion/functions/src/string/common.rs:
##########
@@ -174,3 +177,62 @@ where
},
}
}
+
+pub(crate) fn case_conversion<'a, F>(
+ args: &'a [ColumnarValue],
+ op: F,
+ name: &str,
+) -> Result<ColumnarValue>
+where
+ F: Fn(&'a str) -> String,
+{
+ match &args[0] {
+ ColumnarValue::Array(array) => match array.data_type() {
+ DataType::Utf8 => {
+ Ok(ColumnarValue::Array(convert_array::<i32, _>(array, op)?))
+ }
+ DataType::LargeUtf8 => {
+ Ok(ColumnarValue::Array(convert_array::<i64, _>(array, op)?))
+ }
+ other => exec_err!("Unsupported data type {other:?} for function
{name}"),
+ },
+ ColumnarValue::Scalar(scalar) => match scalar {
+ ScalarValue::Utf8(a) => {
+ let result = a.as_ref().map(|x| op(x));
+ Ok(ColumnarValue::Scalar(ScalarValue::Utf8(result)))
+ }
+ ScalarValue::LargeUtf8(a) => {
+ let result = a.as_ref().map(|x| op(x));
+ Ok(ColumnarValue::Scalar(ScalarValue::LargeUtf8(result)))
+ }
+ other => exec_err!("Unsupported data type {other:?} for function
{name}"),
+ },
+ }
+}
+
+fn convert_array<'a, O, F>(array: &'a ArrayRef, op: F) -> Result<ArrayRef>
+where
+ O: OffsetSizeTrait,
+ F: Fn(&'a str) -> String,
+{
+ let string_array = as_generic_string_array::<O>(array)?;
+ let value_data = string_array.value_data();
+
+ // SAFETY: all items stored in value_data satisfy UTF8.
+ // ref: impl ByteArrayNativeType for str {...}
+ let str_values = unsafe { std::str::from_utf8_unchecked(value_data) };
+
+ // conversion
+ let converted_values = op(str_values);
+ let bytes = converted_values.into_bytes();
+
+ // build result
+ let values = Buffer::from_vec(bytes);
+ let offsets = string_array.offsets().clone();
+ let nulls = string_array.nulls().cloned();
+
+ // SAFETY: offsets and nulls are consistent with the input array.
Review Comment:
I think the safety of this also relies on `op` not decreasing the length of
the characters, as in that case the offsets would be pointing into invalid
characters. For example if `op()` always returned `""` I think this would
result in an invalid array.
Can you please also add a sanity check that the the new string didn't get
smaller?
Something like (untested) this above:
```rust
let orig_len = str_values.len()
let converted_values = op(str_values);
// offsets may refer to anywhere in str_values so new string must be at
least as long
assert!(converted_values.len() >= orig_len);
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]