zhuqi-lucas commented on issue #7744:
URL: https://github.com/apache/arrow-rs/issues/7744#issuecomment-3002839353
@Dandandan @alamb
I tried the following patch for this ticket:
```diff
diff --git a/arrow-array/src/array/byte_view_array.rs b/arrow-array/src/array/byte_view_array.rs
index 713e275d18..20962abc9b 100644
--- a/arrow-array/src/array/byte_view_array.rs
+++ b/arrow-array/src/array/byte_view_array.rs
@@ -338,6 +338,15 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
         std::slice::from_raw_parts((view as *const u128 as *const u8).wrapping_add(4), len)
     }

+    /// Returns the first 4-byte prefix of the inline value of the view.
+    /// # Safety
+    /// - The `view` must be a valid element from `Self::views()` that adheres to the view layout.
+    #[inline(always)]
+    pub unsafe fn inline_prefix(view: &u128) -> &[u32] {
+        std::slice::from_raw_parts((view as *const u128 as *const u32).wrapping_add(1), 1)
+    }
+
     /// Constructs a new iterator for iterating over the values of this array
     pub fn iter(&self) -> ArrayIter<&Self> {
         ArrayIter::new(self)
@@ -552,8 +561,8 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
             // one of the string is larger than 12 bytes,
             // we then try to compare the inlined data first
-            let l_inlined_data = unsafe { GenericByteViewArray::<T>::inline_value(l_view, 4) };
-            let r_inlined_data = unsafe { GenericByteViewArray::<T>::inline_value(r_view, 4) };
+            let l_inlined_data = unsafe { GenericByteViewArray::<T>::inline_prefix(l_view) };
+            let r_inlined_data = unsafe { GenericByteViewArray::<T>::inline_prefix(r_view) };
             if r_inlined_data != l_inlined_data {
                 return l_inlined_data.cmp(r_inlined_data);
             }
```
But the benchmark showed no performance improvement; our current code already checks the 4 inline prefix bytes:
```rust
// one of the string is larger than 12 bytes,
// we then try to compare the inlined data first
let l_inlined_data = unsafe { GenericByteViewArray::<T>::inline_value(l_view, 4) };
let r_inlined_data = unsafe { GenericByteViewArray::<T>::inline_value(r_view, 4) };
if r_inlined_data != l_inlined_data {
    return l_inlined_data.cmp(r_inlined_data);
}
```
So switching the 4-byte prefix comparison from `u8` slices to a single `u32` did not yield a measurable improvement in the benchmark.
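For reference, here is a standalone sketch of the layout assumption behind both variants (bytes 0..4 of the `u128` view hold the length, bytes 4..8 the prefix, matching the ByteView layout; `prefix_bytes` and `mk` are illustrative names, not arrow-rs APIs). It also demonstrates a caveat worth double-checking in the `u32` variant: comparing the prefix as a native-endian `u32` on little-endian hardware does not preserve lexicographic byte order, so the byte-wise and `u32` comparisons are not interchangeable in general:

```rust
/// Extract the 4-byte prefix from a ByteView-style `u128` view
/// (bytes 0..4 = length, bytes 4..8 = prefix). Illustrative helper only.
fn prefix_bytes(view: u128) -> [u8; 4] {
    view.to_le_bytes()[4..8].try_into().unwrap()
}

fn main() {
    // Build a hypothetical "long string" view (length > 12) with a given prefix.
    let mk = |prefix: &[u8; 4]| -> u128 {
        let mut bytes = [0u8; 16];
        bytes[0..4].copy_from_slice(&20u32.to_le_bytes()); // length = 20
        bytes[4..8].copy_from_slice(prefix);
        u128::from_le_bytes(bytes)
    };

    // Byte-wise comparison of the prefixes preserves lexicographic order.
    assert!(prefix_bytes(mk(b"abcd")) < prefix_bytes(mk(b"abce")));

    // Caveat: interpreting the same 4 bytes as a native-endian u32 on a
    // little-endian machine does NOT preserve lexicographic order:
    let a = *b"ba\x00\x00";
    let b = *b"ab\xff\x00";
    assert!(a > b); // lexicographic: 'b' (0x62) > 'a' (0x61)
    assert!(u32::from_le_bytes(a) < u32::from_le_bytes(b)); // u32 order flips
    println!("ok");
}
```

This may also explain why the `u32` version cannot simply replace the byte-wise compare on the unequal path without an endianness conversion (e.g. `u32::from_be_bytes`), even if equality checks alone are endianness-agnostic.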
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]