GitHub user sachouche opened a pull request:
https://github.com/apache/drill/pull/1072
DRILL-5879: Improved SQL Pattern Contains Performance
**BACKGROUND**
- JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) goal
is to improve the "Contains" pattern performance
- [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task)
was created subsequently to avoid the ASCII / Unicode pre-processing overhead.
- This pull-request addresses the algorithmic part of this functionality
**ANALYSIS**
- Contains has O(n*m) complexity
- There are two ways to optimize the associated runtime
1) Minimize the number of instructions, pipelining, and CPU stalls
2) Use smarter algorithms (compared to the naive one); for example, the
Boyer-Moore algorithm (which is implemented by several popular Open Source
tools such as grep)
**IMPLEMENTATION**
Our approach contains both suggestions (listed in the analysis)
- Created five matchers that are chosen based on the pattern length
- The first three, are based on pattern 1) and target patterns with length
[1..4[
- The forth one, has a similar runtime then the current implementation and
targets patterns with length [4..10[
- We use the the Boyer-Moore algorithm for patterns with a length larger
than 9 bytes
NOTE - the JDK doesn't use this algorithm because of two main reasons
- Two extra arrays are necessary with a size relative to the supported
character-set and the pattern length (this would be particularly costly for
unicode as this would require 64k entries)
- Each Contains (or indexOf) invocation would require memory and
initialization overhead
- Drill doesn't have this issue as a) initialization overhead is amortized
since the pattern will be matched against many input values and b) our Contains
logic is centered around bytes so the memory overhead is around 256 integers
per fragment
**PERFORMANCE IMPROVEMENTS**
- We observe at least 25% improvement per Contains operation for matchers
with pattern length lower than 4; 100% for negative cases (as the code never
accesses a secondary for-loop)
- It was noticed the Boyer-Moor algorithm performs poorly for small
patterns as lookup accesses erase the associated optimizations; this algorithm
performs extremely well when the pattern length increases as unlike the naive
implementation it is able to use the lookup table to jump beyond the next
character (since the matching phase has already gained so insight).
- One popular introduction to this algorithm is the following use-case
- Input: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
- Pattern: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaax
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachouche/drill DRILL-5879
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1072.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1072
----
commit 50ac5140420057692cac8a33f181eb16580a28e6
Author: Salim Achouche <[email protected]>
Date: 2017-12-13T22:24:45Z
DRILL-5879: Improved SQL Pattern Contains Performance
----
---