zhuqi-lucas commented on code in PR #7860:
URL: https://github.com/apache/arrow-rs/pull/7860#discussion_r2249171756


##########
arrow-ord/src/sort.rs:
##########
@@ -295,12 +295,85 @@ fn sort_bytes<T: ByteArrayType>(
     options: SortOptions,
     limit: Option<usize>,
 ) -> UInt32Array {
-    let mut valids = value_indices
+    // Note: why do we use a 4-byte prefix?
+    // Most byte sequences differ in their first few bytes, so by
+    // comparing up to 4 bytes as a single u32 we avoid the overhead
+    // of a full lexicographical compare for the vast majority of cases.
+
+    // 1. Build a vector of (index, prefix, length) tuples
+    let mut valids: Vec<(u32, u32, u64)> = value_indices
         .into_iter()
-        .map(|index| (index, values.value(index as usize).as_ref()))
-        .collect::<Vec<(u32, &[u8])>>();
+        .map(|idx| {
+            let slice: &[u8] = values.value(idx as usize).as_ref();
+            let len = slice.len() as u64;
+            // Compute the 4-byte prefix in BE order, or left-pad if shorter
+            let prefix = if slice.len() >= 4 {
+                let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) };

Review Comment:
   Thank you @alamb, this is a good question. The two safe alternatives would look like:
   ```rust
   let mut buf = [0u8; 4];
   buf.copy_from_slice(&slice[..4]);
   let prefix = u32::from_be_bytes(buf);
   ```
   
   ```rust
   let prefix = u32::from_be_bytes([
       slice[0], slice[1], slice[2], slice[3],
   ]);
   ```
   
   Both `u32::from_be_bytes()` variants have extra cost compared to the current implementation:
   
   1. Four separate bounds checks: in the array-literal version, each `slice[0]`, `slice[1]`, etc. must check `index < len`, so you pay four panic guards instead of zero.
   2. A small stack copy: `buf.copy_from_slice(&slice[..4])` copies 4 bytes into the local `[u8; 4]`, and even the array-literal version writes each byte into the stack frame.
   3. Extra register/memory moves: the compiler has to marshal those four bytes into registers (or spill them to the stack) before calling `u32::from_be_bytes`, versus the single unaligned load in the unsafe version.
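   For context, here is a minimal self-contained sketch of the pattern under discussion (not the PR's exact code; `be_prefix` is an illustrative name), including the left-padding branch for slices shorter than 4 bytes:
   
   ```rust
   /// Compute a big-endian 4-byte prefix with a single unaligned load,
   /// left-padding short inputs with zeros so the u32 values compare in
   /// the same order as a byte-wise memcmp would.
   fn be_prefix(slice: &[u8]) -> u32 {
       if slice.len() >= 4 {
           // SAFETY: we just checked that at least 4 bytes are readable.
           let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) };
           // The load is native-endian; convert so byte order matches memcmp.
           u32::from_be(raw)
       } else {
           // Fewer than 4 bytes: fold what we have into the high bytes,
           // zero-padding the rest (handles the empty slice as well).
           let mut prefix = 0u32;
           for i in 0..4 {
               prefix = (prefix << 8) | slice.get(i).copied().unwrap_or(0) as u32;
           }
           prefix
       }
   }
   
   fn main() {
       assert_eq!(be_prefix(b"abcd"), u32::from_be_bytes(*b"abcd"));
       assert_eq!(be_prefix(b"ab"), u32::from_be_bytes([b'a', b'b', 0, 0]));
       assert!(be_prefix(b"ab") < be_prefix(b"abc"));
       // Equal prefixes (e.g. "abcd" vs "abcde") still need the full compare,
       // which is why the length is also stored in the tuple.
       assert_eq!(be_prefix(b"abcd"), be_prefix(b"abcde"));
   }
   ```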
   
   
   I am wondering if I can also test other use cases with similar code, though it may make little difference. Thanks.
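   
   For a quick isolated comparison, a rough harness like the one below could work (the helper names are illustrative, and a Criterion benchmark would give more trustworthy numbers than naive `Instant` timing):
   
   ```rust
   use std::hint::black_box;
   use std::time::Instant;
   
   // Illustrative stand-ins for the two variants discussed above.
   fn prefix_unaligned(slice: &[u8]) -> u32 {
       debug_assert!(slice.len() >= 4);
       // SAFETY: callers only pass slices with at least 4 bytes.
       u32::from_be(unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) })
   }
   
   fn prefix_safe(slice: &[u8]) -> u32 {
       u32::from_be_bytes([slice[0], slice[1], slice[2], slice[3]])
   }
   
   fn main() {
       // One million 4-byte strings.
       let data: Vec<[u8; 4]> = (0..1_000_000u32).map(|i| i.to_le_bytes()).collect();
   
       for (name, f) in [
           ("unaligned", prefix_unaligned as fn(&[u8]) -> u32),
           ("from_be_bytes", prefix_safe as fn(&[u8]) -> u32),
       ] {
           let start = Instant::now();
           let mut acc = 0u64;
           for s in &data {
               // black_box keeps the compiler from folding the loop away.
               acc = acc.wrapping_add(f(black_box(s.as_slice())) as u64);
           }
           println!("{name}: {:?} (checksum {acc})", start.elapsed());
       }
   }
   ```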
   


