Re: [PR] Improve StringArray(Utf8) sort performance (~2-4x faster) [arrow-rs]

via GitHub Sat, 05 Jul 2025 04:22:17 -0700


zhuqi-lucas commented on code in PR #7860:
URL: https://github.com/apache/arrow-rs/pull/7860#discussion_r2187109592



##########
arrow-ord/src/sort.rs:
##########
@@ -295,12 +295,81 @@ fn sort_bytes<T: ByteArrayType>(
     options: SortOptions,
     limit: Option<usize>,
 ) -> UInt32Array {
-    let mut valids = value_indices
+    // Note: Why do we use 4‑byte prefix?
+    // Compute the 4‑byte prefix in BE order, or left‑pad if shorter.
+    // Most byte‐sequences differ in their first few bytes, so by
+    // comparing up to 4 bytes as a single u32 we avoid the overhead
+    // of a full lexicographical compare for the vast majority of cases.
+
+    // 1. Build a vector of (index, 64‑bit prefix) pairs
+    //    The high 32 bits store the big‑endian u32 of the first up to 4 bytes,
+    //    the low 32 bits store the original slice length.
+    // This is similar to inline_key_fast function in GenericByteViewArray.
+    let mut valids: Vec<(u32, u64)> = value_indices
         .into_iter()
-        .map(|index| (index, values.value(index as usize).as_ref()))
-        .collect::<Vec<(u32, &[u8])>>();
+        .map(|idx| {
+            let slice: &[u8] = values.value(idx as usize).as_ref();
+            let len = slice.len() as u32;
+            // Compute the 4‑byte prefix in BE order, or left‑pad if shorter
+            let prefix_u32 = if len >= 4 {
+                let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as 
*const u32) };
+                u32::from_be(raw)
+            } else {
+                let mut v = 0u32;
+                for &b in slice {
+                    v = (v << 8) | (b as u32);
+                }
+                v << (8 * (4 - len))
+            };
+            // Pack into a single u64: (prefix << 32) | length
+            let prefix64 = ((prefix_u32 as u64) << 32) | (len as u64);

Review Comment:
   I found change to 8 bytes prefix, it will cause about 50% slow than 4 bytes 
prefix for most of the cases, it seems due to the 4 bytes prefix better L1 
cache matching and less compare overhead.
   
   ```rust
   pub fn sort_to_indices(
       array: &dyn Array,
       options: Option<SortOptions>,
       limit: Option<usize>,
   ) -> Result<UInt32Array, ArrowError> {
       let options = options.unwrap_or_default();
   
       let (v, n) = partition_validity(array);
   
       Ok(downcast_primitive_array! {
           array => sort_primitive(array, v, n, options, limit),
           DataType::Boolean => sort_boolean(array.as_boolean(), v, n, options, 
limit),
           DataType::Utf8 => sort_bytes(array.as_string::<i32>(), v, n, 
options, limit),
           DataType::LargeUtf8 => sort_bytes(array.as_string::<i64>(), v, n, 
options, limit),
           DataType::Utf8View => sort_byte_view(array.as_string_view(), v, n, 
options, limit),
           DataType::Binary => sort_bytes(array.as_binary::<i32>(), v, n, 
options, limit),
           DataType::LargeBinary => sort_bytes(array.as_binary::<i64>(), v, n, 
options, limit),
   ```
   
   
   I am wandering if we can have two specified sort_bytes:
   
   sort_bytes (4 bytes prefix, and u32 len) for DataType::Utf8 and 
DataType::Binary
   
   sort_big_bytes(8bytes prefix, and u64 len) for DataType::LargeUtf8 and 
DataType::LargeBinary 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] Improve StringArray(Utf8) sort performance (~2-4x faster) [arrow-rs]

Reply via email to