neilconway commented on code in PR #21366:
URL: https://github.com/apache/datafusion/pull/21366#discussion_r3046323361


##########
datafusion/functions/src/unicode/substr.rs:
##########
@@ -326,17 +325,111 @@ fn string_view_substr(
     }
 }
 
-fn string_substr<'a, V>(string_array: V, args: &[ArrayRef]) -> Result<ArrayRef>
-where
-    V: StringArrayType<'a> + Copy,
-{
+fn values_fit_in_u32<T: OffsetSizeTrait>(string_array: &GenericStringArray<T>) 
-> bool {
+    string_array
+        .offsets()
+        .last()
+        .map(|offset| offset.as_usize() <= u32::MAX as usize)
+        .unwrap_or(true)
+}
+
+#[inline]
+fn append_new_view(
+    views_buf: &mut Vec<u128>,
+    null_builder: &mut NullBufferBuilder,
+    substr: &str,
+    byte_offset: usize,
+) -> bool {
+    let is_out_of_line = substr.len() > 12;
+    let view = if is_out_of_line {
+        let byte_offset = u32::try_from(byte_offset)
+            .expect("validated string buffer offset fits in u32");
+        make_view(substr.as_bytes(), 0, byte_offset)
+    } else {
+        make_view(substr.as_bytes(), 0, 0)
+    };
+
+    views_buf.push(view);
+    null_builder.append_non_null();
+    is_out_of_line
+}
+
+#[expect(clippy::needless_range_loop)]
+fn generic_string_substr<T: OffsetSizeTrait>(
+    string_array: &GenericStringArray<T>,
+    args: &[ArrayRef],
+) -> Result<ArrayRef> {
+    // We return a StringViewArray that points into the input string array's
+    // values buffer, avoiding copies. This is only possible when the values
+    // buffer is <= 4GB, since StringView offsets are u32.
+    if !values_fit_in_u32(string_array) {

Review Comment:
   I think (1) would probably be useful, although we could implement it via 
`simplify` so there's no runtime overhead at all.
   
   For the other cases, they seem a bit marginal to me, personally -- the 
existing code should handle them pretty efficiently. In any case, can we defer 
this to a different PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to