zhuqi-lucas commented on code in PR #7860: URL: https://github.com/apache/arrow-rs/pull/7860#discussion_r2187109592
########## arrow-ord/src/sort.rs: ########## @@ -295,12 +295,81 @@ fn sort_bytes<T: ByteArrayType>( options: SortOptions, limit: Option<usize>, ) -> UInt32Array { - let mut valids = value_indices + // Note: Why do we use 4‑byte prefix? + // Compute the 4‑byte prefix in BE order, or left‑pad if shorter. + // Most byte‐sequences differ in their first few bytes, so by + // comparing up to 4 bytes as a single u32 we avoid the overhead + // of a full lexicographical compare for the vast majority of cases. + + // 1. Build a vector of (index, 64‑bit prefix) pairs + // The high 32 bits store the big‑endian u32 of the first up to 4 bytes, + // the low 32 bits store the original slice length. + // This is similar to inline_key_fast function in GenericByteViewArray. + let mut valids: Vec<(u32, u64)> = value_indices .into_iter() - .map(|index| (index, values.value(index as usize).as_ref())) - .collect::<Vec<(u32, &[u8])>>(); + .map(|idx| { + let slice: &[u8] = values.value(idx as usize).as_ref(); + let len = slice.len() as u32; + // Compute the 4‑byte prefix in BE order, or left‑pad if shorter + let prefix_u32 = if len >= 4 { + let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) }; + u32::from_be(raw) + } else { + let mut v = 0u32; + for &b in slice { + v = (v << 8) | (b as u32); + } + v << (8 * (4 - len)) + }; + // Pack into a single u64: (prefix << 32) | length + let prefix64 = ((prefix_u32 as u64) << 32) | (len as u64); Review Comment: I found change to 8 bytes prefix, it will cause about 50% slow than 4 bytes prefix for most of the cases, it seems due to the 4 bytes prefix better L1 cache matching and less compare overhead. ```rust pub fn sort_to_indices( array: &dyn Array, options: Option<SortOptions>, limit: Option<usize>, ) -> Result<UInt32Array, ArrowError> { let options = options.unwrap_or_default(); let (v, n) = partition_validity(array); Ok(downcast_primitive_array! { array => sort_primitive(array, v, n, options, limit), DataType::Boolean => sort_boolean(array.as_boolean(), v, n, options, limit), DataType::Utf8 => sort_bytes(array.as_string::<i32>(), v, n, options, limit), DataType::LargeUtf8 => sort_bytes(array.as_string::<i64>(), v, n, options, limit), DataType::Utf8View => sort_byte_view(array.as_string_view(), v, n, options, limit), DataType::Binary => sort_bytes(array.as_binary::<i32>(), v, n, options, limit), DataType::LargeBinary => sort_bytes(array.as_binary::<i64>(), v, n, options, limit), ``` I am wandering if we can have two specified sort_bytes: sort_bytes (4 bytes prefix, and u32 len) for DataType::Utf8 and DataType::Binary sort_big_bytes(8bytes prefix, and u64 len) for DataType::LargeUtf8 and DataType::LargeBinary -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org