lidavidm commented on a change in pull request #10448:
URL: https://github.com/apache/arrow/pull/10448#discussion_r646827535



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -636,13 +728,15 @@ std::string MakeLikeRegex(const MatchSubstringOptions& options) {
 }
 
 // A LIKE pattern matching this regex can be translated into a substring search.
-static RE2 kLikePatternIsSubstringMatch("%+([^%_]*)%+");
+static RE2 kLikePatternIsSubstringMatch(R"(%+([^%_]*[^\\%_])?%+)");
+// A LIKE pattern matching this regex can be translated into a prefix search.
+static RE2 kLikePatternIsStartsWith(R"(([^%_]*[^\\%_])?%+)");
+// A LIKE pattern matching this regex can be translated into a suffix search.
+static RE2 kLikePatternIsEndsWith(R"(%+([^%_]*))");

Review comment:
       I added a benchmark (the latest commit).
   
   With RE2:
   
   ```
   -----------------------------------------------------------------------------
   Benchmark                   Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------
   MatchLike            24478428 ns     24474766 ns           29 bytes_per_second=647.127M/s items_per_second=42.8431M/s
   MatchLikeSubstring  105684357 ns    105673666 ns            7 bytes_per_second=149.879M/s items_per_second=9.92277M/s
   MatchLikePrefix     105720204 ns    105698786 ns            7 bytes_per_second=149.844M/s items_per_second=9.92042M/s
   MatchLikeSuffix     105730852 ns    105712489 ns            7 bytes_per_second=149.824M/s items_per_second=9.91913M/s
   ```
   
   With the optimization:
   
   ```
   -----------------------------------------------------------------------------
   Benchmark                   Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------
   MatchLike            24035525 ns     24035373 ns           31 bytes_per_second=658.958M/s items_per_second=43.6264M/s
   MatchLikeSubstring   44747614 ns     44747029 ns           16 bytes_per_second=353.952M/s items_per_second=23.4334M/s
   MatchLikePrefix       5927800 ns      5927691 ns          116 bytes_per_second=2.60929G/s items_per_second=176.895M/s
   MatchLikeSuffix       5988512 ns      5988423 ns          118 bytes_per_second=2.58283G/s items_per_second=175.101M/s
   ```
   
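   In short, `MatchLike` is essentially unchanged, `MatchLikeSubstring` is about 2.4x faster, and `MatchLikePrefix` and `MatchLikeSuffix` are each roughly 18x faster. This is actually a little surprising.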
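
For readers following the diff, here is a minimal sketch of how the three expressions could route a LIKE pattern to a plain string operation instead of a regex match. It is an illustration under stated assumptions, not the code in this PR: `std::regex` stands in for RE2 so the example needs only the standard library, and `LikeKind`, `LikePlan`, `ClassifyLike`, and `MatchFast` are hypothetical names. The sketch also does not unescape backslash sequences inside the captured literal.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical names throughout; std::regex is an assumption standing in for RE2.
enum class LikeKind { kSubstring, kPrefix, kSuffix, kGeneral };

struct LikePlan {
  LikeKind kind;
  std::string literal;  // text to look for when kind != kGeneral
};

// Decide whether a LIKE pattern can be served by a plain string search.
LikePlan ClassifyLike(const std::string& pattern) {
  // Same expressions as in the diff. std::regex_match requires the whole
  // pattern to match, which plays the role of RE2::FullMatch here.
  static const std::regex kSubstring(R"(%+([^%_]*[^\\%_])?%+)");
  static const std::regex kStartsWith(R"(([^%_]*[^\\%_])?%+)");
  static const std::regex kEndsWith(R"(%+([^%_]*))");
  std::smatch m;
  if (std::regex_match(pattern, m, kSubstring)) {
    return {LikeKind::kSubstring, m[1].str()};
  }
  if (std::regex_match(pattern, m, kStartsWith)) {
    return {LikeKind::kPrefix, m[1].str()};
  }
  if (std::regex_match(pattern, m, kEndsWith)) {
    return {LikeKind::kSuffix, m[1].str()};
  }
  // Patterns containing '_' or an escaped '%' fall through to the regex path.
  return {LikeKind::kGeneral, ""};
}

// Apply the fast path chosen by ClassifyLike to a single input string.
bool MatchFast(const LikePlan& plan, const std::string& s) {
  const std::string& lit = plan.literal;
  switch (plan.kind) {
    case LikeKind::kSubstring:
      return s.find(lit) != std::string::npos;
    case LikeKind::kPrefix:
      // Compares at most lit.size() bytes at the start of s.
      return s.size() >= lit.size() && s.compare(0, lit.size(), lit) == 0;
    case LikeKind::kSuffix:
      // Compares at most lit.size() bytes at the end of s.
      return s.size() >= lit.size() &&
             s.compare(s.size() - lit.size(), lit.size(), lit) == 0;
    case LikeKind::kGeneral:
      break;
  }
  throw std::logic_error("general LIKE patterns need a regex engine");
}
```

With this sketch, `ClassifyLike("abc%")` yields a prefix plan with the literal `abc`, while `a_c%` and `abc\%` are classified as general because of the `_` wildcard and the escaped `%`. The prefix and suffix paths inspect only a bounded number of bytes per string, which is consistent with them being the cheapest cases in the numbers above.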



