HyukjinKwon opened a new pull request, #48729: URL: https://github.com/apache/arrow/pull/48729
### Rationale for this change

String operations that take regex patterns (match, replace, extract) were recompiling the pattern on every invocation. This PR adds caching so that a pattern is compiled once and reused. A benchmark shows roughly a 36% reduction in run time (2.52s → 1.61s for 200 repetitions).

https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L1371

https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L1381

https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L1965

https://github.com/apache/arrow/blob/727106f7ff65065298e1e79071fed2a408b4b4d6/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc#L2218

### What changes are included in this PR?

- Added a `CachedOptionsWrapper<T>` template for kernel state with caching support
- Updated `MatchSubstringState`, `ReplaceState`, and `ExtractRegexState` to use caching
- Modified the `Exec()` methods to call `GetOrCreate<Matcher>()` instead of calling `Matcher::Make()` directly

### Are these changes tested?

Yes. All existing tests pass. The benchmark shows a measurable performance improvement when the same pattern is used across multiple operations.

(Generated by ChatGPT)

Benchmark:

```bash
# Benchmark script: Compare WITH vs WITHOUT caching

# Step 1: Measure WITH caching (current implementation)
cd /.../arrow/cpp/build
/usr/bin/time -p ./debug/arrow-compute-scalar-type-test \
  --gtest_filter="TestStringKernels/0.MatchSubstringRegex" \
  --gtest_repeat=200 \
  --gtest_brief=1

# Step 2: Temporarily remove caching
cd /.../arrow/cpp
git stash push -m "Temp for benchmark" src/arrow/compute/kernels/scalar_string_ascii.cc

# Step 3: Rebuild WITHOUT caching
cd build
touch ../src/arrow/compute/kernels/scalar_string_ascii.cc
cmake --build .

# Step 4: Measure WITHOUT caching (reverted to old TODO code)
/usr/bin/time -p ./debug/arrow-compute-scalar-type-test \
  --gtest_filter="TestStringKernels/0.MatchSubstringRegex" \
  --gtest_repeat=200 \
  --gtest_brief=1

# Step 5: Restore caching
cd ..
git stash pop
cd build && touch ../src/arrow/compute/kernels/scalar_string_ascii.cc
cmake --build .
```

Results:

```
╔════════════════════════════════════════════════════════╗
║                   BENCHMARK RESULTS                    ║
╠════════════════════════════════════════════════════════╣
║  WITHOUT Caching:  2.52 seconds                        ║
║  WITH Caching:     1.61 seconds                        ║
║  ─────────────────────────────────────                 ║
║  Time Saved:       0.91 seconds                        ║
║  Improvement:      36.1% FASTER                        ║
╚════════════════════════════════════════════════════════╝

Test Configuration:
  • Test: TestStringKernels/0.MatchSubstringRegex
  • Iterations: 200 repetitions
  • Pattern: Complex regex with groups/alternation
  • Per-operation: 12.6ms → 8.05ms (4.5ms saved)
```

### Are there any user-facing changes?

No, this is an optimization.
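### Illustrative sketch of the caching idea

The compile-once pattern described above can be sketched roughly as follows. This is not the code in this PR: `PatternOptions`, `RegexMatcher`, and `CachedMatcherState` are hypothetical names, `std::regex` is used only as a self-contained stand-in for the regex engine the kernels actually use, and the mutex is an assumption about how the state might be shared. The sketch only shows that repeated `GetOrCreate()` calls on the same state trigger a single `Make()`.

```cpp
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <utility>

// Hypothetical options struct; stands in for the real kernel options.
struct PatternOptions {
  std::string pattern;
  bool ignore_case = false;
};

// Hypothetical matcher: the expensive regex compilation happens in Make().
class RegexMatcher {
 public:
  static std::unique_ptr<RegexMatcher> Make(const PatternOptions& options) {
    ++CompileCount();  // instrumentation for the sketch only
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (options.ignore_case) flags |= std::regex::icase;
    return std::unique_ptr<RegexMatcher>(
        new RegexMatcher(std::regex(options.pattern, flags)));
  }

  bool Match(const std::string& input) const {
    return std::regex_search(input, regex_);
  }

  // Number of times a pattern has been compiled so far.
  static int& CompileCount() {
    static int count = 0;
    return count;
  }

 private:
  explicit RegexMatcher(std::regex re) : regex_(std::move(re)) {}
  std::regex regex_;
};

// Hypothetical kernel state: owns the options and lazily builds the matcher
// on first use, then hands back the same instance on every later call.
template <typename Options, typename Matcher>
class CachedMatcherState {
 public:
  explicit CachedMatcherState(Options options) : options_(std::move(options)) {}

  const Matcher& GetOrCreate() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!cached_) cached_ = Matcher::Make(options_);
    return *cached_;
  }

 private:
  Options options_;
  std::mutex mutex_;
  std::unique_ptr<Matcher> cached_;
};
```

In a sketch like this, an `Exec()`-style function would call `state.GetOrCreate().Match(value)` for each batch rather than calling `RegexMatcher::Make(options)` itself, so the compilation cost is paid once per state object instead of once per invocation.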
