HyukjinKwon opened a new pull request, #48729:
URL: https://github.com/apache/arrow/pull/48729

   ### Rationale for this change
   
   String operations with regex patterns (match, replace, extract) were 
recompiling regex patterns on every invocation. This PR implements caching to 
compile once and reuse.
   
   A benchmark shows a roughly 36% reduction in wall-clock time (2.52s → 1.61s 
for 200 repetitions of the `MatchSubstringRegex` test).
   
   
https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L1371
   
   
https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L1381
   
   
https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L1965
   
   
https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2218
   
   ### What changes are included in this PR?
   
   - Added `CachedOptionsWrapper<T>` template for kernel state with caching 
support
   - Updated `MatchSubstringState`, `ReplaceState`, and `ExtractRegexState` to 
use caching
   - Modified `Exec()` methods to call `GetOrCreate<Matcher>()` instead of 
direct `Matcher::Make()`
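
To illustrate the idea behind the changes listed above, here is a minimal, self-contained sketch of a compile-once cache. It is not the code in this PR: `MatchOptions`, `RegexMatcher`, `compile_count`, and `Exec` are hypothetical stand-ins, `std::regex` is used in place of the regex engine Arrow links against, and the real `CachedOptionsWrapper<T>` and kernel state types may be shaped differently.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical options type; a stand-in for the kernel's real options struct.
struct MatchOptions {
  std::string pattern;
};

// Hypothetical matcher. std::regex is an assumption made for self-containment.
struct RegexMatcher {
  static int compile_count;  // demo-only counter of pattern compilations
  std::regex re;

  explicit RegexMatcher(const std::string& pattern) : re(pattern) {}

  // Compiles the pattern; this is the expensive step worth caching.
  static std::unique_ptr<RegexMatcher> Make(const MatchOptions& options) {
    ++compile_count;
    return std::unique_ptr<RegexMatcher>(new RegexMatcher(options.pattern));
  }

  bool Match(const std::string& input) const {
    return std::regex_search(input, re);
  }
};
int RegexMatcher::compile_count = 0;

// Sketch of a kernel-state wrapper that builds the matcher lazily and reuses it.
// Assumption: each wrapper instance is only ever queried with one Matcher type.
template <typename Options>
class CachedOptionsWrapper {
 public:
  explicit CachedOptionsWrapper(Options options) : options_(std::move(options)) {}

  template <typename Matcher>
  const Matcher& GetOrCreate() {
    std::lock_guard<std::mutex> lock(mutex_);  // guard against concurrent Exec calls
    if (!cached_) {
      cached_ = Matcher::Make(options_);  // compile on first use only
    }
    assert(cached_ != nullptr);
    return *static_cast<const Matcher*>(cached_.get());
  }

 private:
  Options options_;
  std::mutex mutex_;
  std::shared_ptr<void> cached_;  // type-erased so the wrapper is matcher-agnostic
};

// Hypothetical Exec: counts the values in a batch that match the cached pattern.
int Exec(CachedOptionsWrapper<MatchOptions>& state,
         const std::vector<std::string>& batch) {
  const RegexMatcher& matcher = state.GetOrCreate<RegexMatcher>();
  int matches = 0;
  for (const auto& value : batch) {
    if (matcher.Match(value)) ++matches;
  }
  return matches;
}
```

In this sketch, calling `Exec` repeatedly on the same state leaves `RegexMatcher::compile_count` at 1, whereas calling `RegexMatcher::Make()` inside `Exec` would compile the pattern once per call.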
   
   ### Are these changes tested?
   
   Yes. All existing tests pass. The benchmark demonstrates a measurable 
performance improvement when the same pattern is used across multiple operations.
   
   (Generated by ChatGPT)
   
   Benchmark:
   
   ```bash
   # Benchmark script: Compare WITH vs WITHOUT caching
   
   # Step 1: Measure WITH caching (current implementation)
   cd /.../arrow/cpp/build
   /usr/bin/time -p ./debug/arrow-compute-scalar-type-test \
     --gtest_filter="TestStringKernels/0.MatchSubstringRegex" \
     --gtest_repeat=200 \
     --gtest_brief=1
   
   # Step 2: Temporarily remove caching
   cd /.../arrow/cpp
   git stash push -m "Temp for benchmark" -- \
     src/arrow/compute/kernels/scalar_string_ascii.cc
   
   # Step 3: Rebuild WITHOUT caching
   cd build
   touch ../src/arrow/compute/kernels/scalar_string_ascii.cc
   cmake --build .
   
   # Step 4: Measure WITHOUT caching (reverted to old TODO code)
   /usr/bin/time -p ./debug/arrow-compute-scalar-type-test \
     --gtest_filter="TestStringKernels/0.MatchSubstringRegex" \
     --gtest_repeat=200 \
     --gtest_brief=1
   
   # Step 5: Restore caching
   cd ..
   git stash pop
   cd build && touch ../src/arrow/compute/kernels/scalar_string_ascii.cc
   cmake --build .
   ```
   
   Results:
   
   ```
   ╔════════════════════════════════════════════════════════╗
   ║              BENCHMARK RESULTS                         ║
   ╠════════════════════════════════════════════════════════╣
   ║  WITHOUT Caching:      2.52 seconds                    ║
   ║  WITH Caching:         1.61 seconds                    ║
   ║  ─────────────────────────────────────                 ║
   ║  Time Saved:           0.91 seconds                    ║
   ║  Improvement:          36.1% FASTER                    ║
   ╚════════════════════════════════════════════════════════╝
   
   Test Configuration:
     • Test: TestStringKernels/0.MatchSubstringRegex
     • Iterations: 200 repetitions
     • Pattern: Complex regex with groups/alternation
     • Per-repetition: 12.6ms → 8.05ms (4.55ms saved)
   ```
   
   ### Are there any user-facing changes?
   
   No, this is an optimization.


