lidavidm commented on a change in pull request #10448:
URL: https://github.com/apache/arrow/pull/10448#discussion_r646827535
##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -636,13 +728,15 @@ std::string MakeLikeRegex(const MatchSubstringOptions& options) {
}
// A LIKE pattern matching this regex can be translated into a substring search.
-static RE2 kLikePatternIsSubstringMatch("%+([^%_]*)%+");
+static RE2 kLikePatternIsSubstringMatch(R"(%+([^%_]*[^\\%_])?%+)");
+// A LIKE pattern matching this regex can be translated into a prefix search.
+static RE2 kLikePatternIsStartsWith(R"(([^%_]*[^\\%_])?%+)");
+// A LIKE pattern matching this regex can be translated into a suffix search.
+static RE2 kLikePatternIsEndsWith(R"(%+([^%_]*))");
Review comment:
I added a benchmark (the latest commit).
With RE2:
```
---------------------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------
MatchLike             24478428 ns     24474766 ns           29 bytes_per_second=647.127M/s items_per_second=42.8431M/s
MatchLikeSubstring   105684357 ns    105673666 ns            7 bytes_per_second=149.879M/s items_per_second=9.92277M/s
MatchLikePrefix      105720204 ns    105698786 ns            7 bytes_per_second=149.844M/s items_per_second=9.92042M/s
MatchLikeSuffix      105730852 ns    105712489 ns            7 bytes_per_second=149.824M/s items_per_second=9.91913M/s
```
With the optimization:
```
---------------------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------
MatchLike             24035525 ns     24035373 ns           31 bytes_per_second=658.958M/s items_per_second=43.6264M/s
MatchLikeSubstring    44747614 ns     44747029 ns           16 bytes_per_second=353.952M/s items_per_second=23.4334M/s
MatchLikePrefix        5927800 ns      5927691 ns          116 bytes_per_second=2.60929G/s items_per_second=176.895M/s
MatchLikeSuffix        5988512 ns      5988423 ns          118 bytes_per_second=2.58283G/s items_per_second=175.101M/s
```
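In short, the general MatchLike case is unchanged (about 24 ms either way), the
substring case is about 2.4x faster, and the prefix and suffix cases are roughly
18x faster. This is actually a little surprising.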
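
For readers following along, here is a minimal sketch of the kind of dispatch the three regexes in the diff make possible. It is not the code from the PR: `LikeKind`, `LikePlan`, `ClassifyLike` and `MatchSimple` are hypothetical names, and `std::regex` stands in for RE2 so the example is self-contained. It also does not unescape backslash sequences inside the captured literal, which a real implementation would likely need to do.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical names throughout; this is an illustration, not the PR's code.
enum class LikeKind { kSubstring, kPrefix, kSuffix, kGeneral };

struct LikePlan {
  LikeKind kind;
  std::string literal;  // text to look for when kind != kGeneral
};

// Classify a LIKE pattern using the three regexes from the diff.
// Assumption: std::regex (ECMAScript) replaces RE2, and std::regex_match
// plays the role of a full match against the whole pattern.
LikePlan ClassifyLike(const std::string& pattern) {
  static const std::regex kIsSubstring(R"(%+([^%_]*[^\\%_])?%+)");
  static const std::regex kIsStartsWith(R"(([^%_]*[^\\%_])?%+)");
  static const std::regex kIsEndsWith(R"(%+([^%_]*))");
  std::smatch m;
  // Order matters: a pattern such as "%%" matches more than one regex.
  if (std::regex_match(pattern, m, kIsSubstring)) {
    return {LikeKind::kSubstring, m[1].str()};
  }
  if (std::regex_match(pattern, m, kIsStartsWith)) {
    return {LikeKind::kPrefix, m[1].str()};
  }
  if (std::regex_match(pattern, m, kIsEndsWith)) {
    return {LikeKind::kSuffix, m[1].str()};
  }
  // Anything else (for example a pattern containing '_') needs the regex path.
  return {LikeKind::kGeneral, ""};
}

// Evaluate a non-general plan with plain string operations.
bool MatchSimple(const LikePlan& plan, const std::string& value) {
  assert(plan.kind != LikeKind::kGeneral);  // caller falls back to regex
  const std::string& lit = plan.literal;
  switch (plan.kind) {
    case LikeKind::kSubstring:
      return value.find(lit) != std::string::npos;
    case LikeKind::kPrefix:
      return value.size() >= lit.size() &&
             value.compare(0, lit.size(), lit) == 0;
    case LikeKind::kSuffix:
      return value.size() >= lit.size() &&
             value.compare(value.size() - lit.size(), lit.size(), lit) == 0;
    default:
      return false;
  }
}
```

With this sketch, `ClassifyLike("abc%")` yields a prefix plan with literal `abc`, while `a_c%` or a pattern ending in an escaped `\%` falls through to the general case, because the `[^\\%_]` term requires the literal to end in something other than a backslash.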