alamb commented on code in PR #21980:
URL: https://github.com/apache/datafusion/pull/21980#discussion_r3178094871


##########
datafusion/functions/src/strings.rs:
##########
@@ -545,13 +545,11 @@ const MAX_BLOCK_SIZE: u32 = 2 * 1024 * 1024;
 /// Short strings (≤ 12 bytes) are inlined into the view itself; long strings
 /// are appended into an in-progress data block. When the in-progress block
 /// fills up it is flushed into `completed` and a new block — double the size
-/// of the last, capped at [`MAX_BLOCK_SIZE`] — is started.
+/// of the last, capped at [`STRING_VIEW_MAX_BLOCK_SIZE`] — is started.

Review Comment:
   seeing this more I wonder if we could potentially reduce the duplication 
here by wrapping a StringViewBuilder -- I'll maybe give it a try



##########
datafusion/functions/src/string/common.rs:
##########
@@ -423,43 +435,164 @@ where
             } else {
                 // SAFETY: `n.is_null(i)` was false in the branch above.
                 let s = unsafe { string_array.value_unchecked(i) };
-                builder.append_value(&op(s));
+                builder.append_value(&unicode_case(s, lower));
             }
         }
     } else {
         for i in 0..item_len {
             // SAFETY: no null buffer means every index is valid.
             let s = unsafe { string_array.value_unchecked(i) };
-            builder.append_value(&op(s));
+            builder.append_value(&unicode_case(s, lower));
         }
     }
     Ok(Arc::new(builder.finish(nulls)?))
 }
 
+/// Fast path for case conversion on an all-ASCII `StringViewArray`.
+fn case_conversion_utf8view_ascii(
+    array: &StringViewArray,
+    lower: bool,
+) -> StringViewArray {
+    // Specialize per conversion so the byte call inlines in the hot loops 
below.
+    if lower {
+        case_conversion_utf8view_ascii_inner(array, u8::to_ascii_lowercase)
+    } else {
+        case_conversion_utf8view_ascii_inner(array, u8::to_ascii_uppercase)
+    }
+}
+
+/// Walks the views once and produces a new `StringViewArray` with
+/// case-converted bytes. Inline strings (<= 12 bytes) are converted in-place;
+/// long strings copy-and-convert into output buffers and have their view 
fields
+/// rewritten to address the new bytes. ASCII case conversion preserves is byte
+/// length, so no row migrates between the inline and long layouts.
+fn case_conversion_utf8view_ascii_inner<F: Fn(&u8) -> u8>(
+    array: &StringViewArray,
+    convert: F,
+) -> StringViewArray {
+    let item_len = array.len();
+    let views = array.views();
+    let data_buffers = array.data_buffers();
+    let nulls = array.nulls();
+
+    let mut new_views: Vec<u128> = Vec::with_capacity(item_len);
+    // Long values are packed into `in_progress`; when full it is sealed into
+    // `completed` and a new, larger block is started — same block-doubling
+    // scheme as Arrow's `GenericByteViewBuilder`.
+    let mut in_progress: Vec<u8> = Vec::new();
+    let mut completed: Vec<Buffer> = Vec::new();
+    let mut block_size: u32 = STRING_VIEW_INIT_BLOCK_SIZE;
+
+    for i in 0..item_len {
+        if nulls.is_some_and(|n| n.is_null(i)) {
+            // Zero view = empty, no buffer reference; the null buffer is what
+            // marks the row null, so the view's value is irrelevant.
+            new_views.push(0);
+            continue;
+        }
+        let view = views[i];
+        // Length is the low 32 bits; `as u32` discards the rest of the view.
+        let len = view as u32 as usize;
+        if len == 0 {
+            new_views.push(0);
+            continue;
+        }
+        let mut bytes = view.to_le_bytes();
+        if len <= 12 {
+            // Inline: value is in bytes[4..4+len], no buffer reference. 
Convert

Review Comment:
   This patterns would be really nice to abstract -- namely a method that makes 
a new StringArray where it is guaranteed that each output value will be exactly 
as long ad the input value. 
   
   Upper/lower are good examples. Maybe also `translate` with single cahracnters



##########
datafusion/functions/src/string/common.rs:
##########
@@ -423,43 +435,164 @@ where
             } else {
                 // SAFETY: `n.is_null(i)` was false in the branch above.
                 let s = unsafe { string_array.value_unchecked(i) };
-                builder.append_value(&op(s));
+                builder.append_value(&unicode_case(s, lower));
             }
         }
     } else {
         for i in 0..item_len {
             // SAFETY: no null buffer means every index is valid.
             let s = unsafe { string_array.value_unchecked(i) };
-            builder.append_value(&op(s));
+            builder.append_value(&unicode_case(s, lower));
         }
     }
     Ok(Arc::new(builder.finish(nulls)?))
 }
 
+/// Fast path for case conversion on an all-ASCII `StringViewArray`.
+fn case_conversion_utf8view_ascii(
+    array: &StringViewArray,
+    lower: bool,
+) -> StringViewArray {
+    // Specialize per conversion so the byte call inlines in the hot loops 
below.
+    if lower {
+        case_conversion_utf8view_ascii_inner(array, u8::to_ascii_lowercase)
+    } else {
+        case_conversion_utf8view_ascii_inner(array, u8::to_ascii_uppercase)
+    }
+}
+
+/// Walks the views once and produces a new `StringViewArray` with
+/// case-converted bytes. Inline strings (<= 12 bytes) are converted in-place;
+/// long strings copy-and-convert into output buffers and have their view 
fields
+/// rewritten to address the new bytes. ASCII case conversion preserves is byte
+/// length, so no row migrates between the inline and long layouts.
+fn case_conversion_utf8view_ascii_inner<F: Fn(&u8) -> u8>(
+    array: &StringViewArray,
+    convert: F,
+) -> StringViewArray {
+    let item_len = array.len();
+    let views = array.views();
+    let data_buffers = array.data_buffers();
+    let nulls = array.nulls();
+
+    let mut new_views: Vec<u128> = Vec::with_capacity(item_len);
+    // Long values are packed into `in_progress`; when full it is sealed into
+    // `completed` and a new, larger block is started — same block-doubling
+    // scheme as Arrow's `GenericByteViewBuilder`.
+    let mut in_progress: Vec<u8> = Vec::new();
+    let mut completed: Vec<Buffer> = Vec::new();
+    let mut block_size: u32 = STRING_VIEW_INIT_BLOCK_SIZE;
+
+    for i in 0..item_len {

Review Comment:
   I think you could avoid a lot of this copy / paste  by using a  
`StringViewBuilder` with the `append_block` and `append_view_unchecked` methods
   
   
https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html#method.append_block
   
   
   That being said, I do think it would be slightly slower than this 
implementation because it would have to re-check the length
   
   It almost seems like what we want is some sort of API on `StringViewArray` 
itself, similar to 
https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.unary
   
   So this code could be written something like
   ```rust
   let new_array = orig_array.map_values(convert)
   ```
   
   That would also let us do potentially crazy things like reuse the buffer 
allocations and modify the values in place if they weren't shared 🤔 
   
   If that makes sense to you I can file a ticket in arrow-rs perhaps.
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to