HyukjinKwon opened a new pull request, #48734:
URL: https://github.com/apache/arrow/pull/48734

   ### Rationale for this change
   
   Five TODO comments were left in `scalar_string_ascii.cc` when ArraySpan was 
introduced (ARROW-16756). This boxing creates unnecessary temporary Array 
allocations and reduces performance.
   
   
https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L398
   
   
https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L450
   
   
https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2887
   
   
https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2916
   
   
https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2952
   
   
   ### What changes are included in this PR?
   
   This PR eliminates Array object construction by accessing ArraySpan buffers 
directly.
   
   - **String Transform Operations 
(`StringBinaryTransformExecBase::ExecScalarArray`, `ExecArrayArray`)**: Access 
ArraySpan buffers directly instead of creating temporary Array objects
   
   - **Type Dispatch for String vs Numeric Types 
(`StringBinaryTransformExecBase::GetViewFromSpan`)**: Added SFINAE helper to 
handle both string types (construct string_view) and numeric types (return raw 
values)
   
   - **Binary Join Operations (`BinaryJoin::ExecScalarArray`, 
`ExecArrayScalar`, `ExecArrayArray`)**: Replaced Array-based lookup helpers 
with ArraySpan-based helpers that store buffer pointers directly
   
   (Generated by ChatGPT)
   
   **String Transform Operations  
(`StringBinaryTransformExecBase::ExecScalarArray`, `ExecArrayArray`)**
   
   Before:
   ```cpp
   const ArrayType2 array2(data2.ToArrayData());  // Creates temporary Array 
object
   
   RETURN_NOT_OK(arrow::internal::VisitBitBlocks(
       data2.buffers[0].data, data2.offset, data2.length,
       [&](int64_t i) {
         ViewType2 value2 = array2.GetView(i);  // Access through Array API
         // ... transform logic
       }));
   ```
   
   After:
   ```cpp
   RETURN_NOT_OK(arrow::internal::VisitBitBlocks(
       data2.buffers[0].data, data2.offset, data2.length,
       [&](int64_t i) {
         ViewType2 value2 = GetViewFromSpan(data2, i);  // Direct buffer access
         // ... transform logic
       }));
   ```
   
   Data Flow:
   ```
   BEFORE: ArraySpan → ToArrayData() → Array object → GetView() → string_view
                       [allocation]    [vtable call]
   
   AFTER:  ArraySpan → GetViewFromSpan() → string_view
                       [direct buffer access]
   ```
   
   **Type Dispatch for String vs Numeric Types 
(`StringBinaryTransformExecBase::GetViewFromSpan`)**
   
   Added SFINAE-based helper to handle both string types (binary, utf8) and 
numeric types (int64 for `binary_repeat`):
   
   ```cpp
   // For string types: construct string_view from offset buffer
   template <typename T = Type2>
   static typename std::enable_if<is_base_binary_type<T>::value, 
ViewType2>::type
   GetViewFromSpan(const ArraySpan& span, int64_t i) {
     const auto* offsets = span.GetValues<offset_type_2>(1);
     const uint8_t* data = span.GetValues<uint8_t>(2, 0);
     const auto start = offsets[i];
     const auto end = offsets[i + 1];
     return ViewType2(reinterpret_cast<const char*>(data + start), end - start);
   }
   
   // For numeric types: return value directly
   template <typename T = Type2>
   static typename std::enable_if<!is_base_binary_type<T>::value, 
ViewType2>::type
   GetViewFromSpan(const ArraySpan& span, int64_t i) {
     return span.GetValues<ViewType2>(1)[i];
   }
   ```
   
   **Binary Join Operations (`BinaryJoin::ExecScalarArray`, `ExecArrayScalar`, 
`ExecArrayArray`)**
   
   Before:
   ```cpp
   const ArrayType separators(right.ToArrayData());  // Creates Array object
   
   struct SeparatorArrayLookup {
     const ArrayType& separators;
     std::string_view GetView(int64_t i) { return separators.GetView(i); }
   };
   
   JoinStrings(..., SeparatorArrayLookup{separators}, ...);
   ```
   
   After:
   ```cpp
   struct SeparatorArraySpanLookup {
     explicit SeparatorArraySpanLookup(const ArraySpan& separators)
         : validity_(separators.buffers[0].data),
           offset_(separators.offset),
           offsets_(separators.GetValues<typename ArrayType::offset_type>(1)),
           data_(separators.GetValues<uint8_t>(2, 0)) {}
   
     std::string_view GetView(int64_t i) {
       auto start = offsets_[i];
       auto end = offsets_[i + 1];
       return std::string_view(reinterpret_cast<const char*>(data_ + start), 
end - start);
     }
     // ... IsNull() implementation
   
   private:
     const uint8_t* validity_;
     int64_t offset_;
     const typename ArrayType::offset_type* offsets_;
     const uint8_t* data_;
   };
   
   JoinStrings(..., SeparatorArraySpanLookup{right}, ...);  // No Array 
construction
   ```
   
   Memory Layout:
   ```
   ArraySpan Structure:
   ┌────────────────────────────────────────────┐
   │ buffers[0]: validity bitmap                │
   │ buffers[1]: offsets [0, 3, 7, 12, ...]     │  ← Direct access
   │ buffers[2]: data ['foo', 'bar', 'baz'...]  │  ← Direct access
   │ offset: logical start position             │
   │ length: number of elements                 │
   └────────────────────────────────────────────┘
                       ↓
       SeparatorArraySpanLookup stores pointers to buffers
                       ↓
       GetView(i) directly constructs string_view from buffers
   ```
   
   ### Are these changes tested?
   
   Yes. All existing tests pass:
   
   
https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L1293
   
   
https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L1335
   
   
https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L621
   
   
https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L2169
   
   ### Are there any user-facing changes?
   
   No, this is an internal optimization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to