HyukjinKwon opened a new pull request, #48734: URL: https://github.com/apache/arrow/pull/48734
### Rationale for this change Five TODO comments were left in `scalar_string_ascii.cc` when ArraySpan was introduced (ARROW-16756). This boxing creates unnecessary temporary Array allocations and reduces performance. https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L398 https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L450 https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2887 https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2916 https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2952 ### What changes are included in this PR? This PR eliminates Array object construction by accessing ArraySpan buffers directly. - **String Transform Operations (`StringBinaryTransformExecBase::ExecScalarArray`, `ExecArrayArray`)**: Access ArraySpan buffers directly instead of creating temporary Array objects - **Type Dispatch for String vs Numeric Types (`StringBinaryTransformExecBase::GetViewFromSpan`)**: Added SFINAE helper to handle both string types (construct string_view) and numeric types (return raw values) - **Binary Join Operations (`BinaryJoin::ExecScalarArray`, `ExecArrayScalar`, `ExecArrayArray`)**: Replaced Array-based lookup helpers with ArraySpan-based helpers that store buffer pointers directly (Generated by ChatGPT) **String Transform Operations (`StringBinaryTransformExecBase::ExecScalarArray`, `ExecArrayArray`)** Before: ```cpp const ArrayType2 array2(data2.ToArrayData()); // Creates temporary Array object RETURN_NOT_OK(arrow::internal::VisitBitBlocks( data2.buffers[0].data, data2.offset, data2.length, [&](int64_t i) { ViewType2 value2 = array2.GetView(i); // Access through Array API // ... transform logic })); ``` After: ```cpp RETURN_NOT_OK(arrow::internal::VisitBitBlocks( data2.buffers[0].data, data2.offset, data2.length, [&](int64_t i) { ViewType2 value2 = GetViewFromSpan(data2, i); // Direct buffer access // ... transform logic })); ``` Data Flow: ``` BEFORE: ArraySpan → ToArrayData() → Array object → GetView() → string_view [allocation] [vtable call] AFTER: ArraySpan → GetViewFromSpan() → string_view [direct buffer access] ``` **Type Dispatch for String vs Numeric Types (`StringBinaryTransformExecBase::GetViewFromSpan`)** Added SFINAE-based helper to handle both string types (binary, utf8) and numeric types (int64 for `binary_repeat`): ```cpp // For string types: construct string_view from offset buffer template <typename T = Type2> static typename std::enable_if<is_base_binary_type<T>::value, ViewType2>::type GetViewFromSpan(const ArraySpan& span, int64_t i) { const auto* offsets = span.GetValues<offset_type_2>(1); const uint8_t* data = span.GetValues<uint8_t>(2, 0); const auto start = offsets[i]; const auto end = offsets[i + 1]; return ViewType2(reinterpret_cast<const char*>(data + start), end - start); } // For numeric types: return value directly template <typename T = Type2> static typename std::enable_if<!is_base_binary_type<T>::value, ViewType2>::type GetViewFromSpan(const ArraySpan& span, int64_t i) { return span.GetValues<ViewType2>(1)[i]; } ``` **Binary Join Operations (`BinaryJoin::ExecScalarArray`, `ExecArrayScalar`, `ExecArrayArray`)** Before: ```cpp const ArrayType separators(right.ToArrayData()); // Creates Array object struct SeparatorArrayLookup { const ArrayType& separators; std::string_view GetView(int64_t i) { return separators.GetView(i); } }; JoinStrings(..., SeparatorArrayLookup{separators}, ...); ``` After: ```cpp struct SeparatorArraySpanLookup { explicit SeparatorArraySpanLookup(const ArraySpan& separators) : validity_(separators.buffers[0].data), offset_(separators.offset), offsets_(separators.GetValues<typename ArrayType::offset_type>(1)), data_(separators.GetValues<uint8_t>(2, 0)) {} std::string_view GetView(int64_t i) { auto start = offsets_[i]; auto end = offsets_[i + 1]; return std::string_view(reinterpret_cast<const char*>(data_ + start), end - start); } // ... IsNull() implementation private: const uint8_t* validity_; int64_t offset_; const typename ArrayType::offset_type* offsets_; const uint8_t* data_; }; JoinStrings(..., SeparatorArraySpanLookup{right}, ...); // No Array construction ``` Memory Layout: ``` ArraySpan Structure: ┌────────────────────────────────────────────┐ │ buffers[0]: validity bitmap │ │ buffers[1]: offsets [0, 3, 7, 12, ...] │ ← Direct access │ buffers[2]: data ['foo', 'bar', 'baz'...] │ ← Direct access │ offset: logical start position │ │ length: number of elements │ └────────────────────────────────────────────┘ ↓ SeparatorArraySpanLookup stores pointers to buffers ↓ GetView(i) directly constructs string_view from buffers ``` ### Are these changes tested? Yes. All existing tests pass: https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L1293 https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L1335 https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L621 https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_test.cc#L2169 ### Are there any user-facing changes? No, this is an internal optimization. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
