GitHub user sachouche opened a pull request:

    https://github.com/apache/drill/pull/1072

    DRILL-5879: Improved SQL Pattern Contains Performance

    **BACKGROUND**
    - JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) goal 
is to improve the "Contains" pattern performance
    - [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task) 
was created subsequently to avoid the ASCII / Unicode pre-processing overhead.
    - This pull-request addresses the algorithmic part of this functionality
    
    **ANALYSIS**
    - Contains has O(n*m) complexity
    - There are two ways to optimize the associated runtime
    1) Minimize the number of instructions, pipelining, and CPU stalls
    2) Use smarter algorithms (compared to the naive one); for example, the 
Boyer-Moore algorithm (which is implemented by several popular Open Source 
tools such as grep)
    
    **IMPLEMENTATION**
    Our approach contains both suggestions (listed in the analysis)
    - Created five matchers that are chosen based on the pattern length
    - The first three, are based on pattern 1) and target patterns with length 
[1..4[ 
    - The forth one, has a similar runtime then the current implementation and 
targets patterns with length [4..10[
    - We use the the Boyer-Moore algorithm for patterns with a length larger 
than 9 bytes
    NOTE - the JDK doesn't use this algorithm because of two main  reasons
    - Two extra arrays are necessary with a size relative to the supported 
character-set and the pattern length (this would be particularly costly for 
unicode as this would require 64k entries)
    - Each Contains (or indexOf) invocation would require memory and 
initialization overhead
    - Drill doesn't have this issue as a) initialization overhead is amortized 
since the pattern will be matched against many input values and b) our Contains 
logic is centered around bytes so the memory overhead is around 256 integers 
per fragment
    
    **PERFORMANCE IMPROVEMENTS**
    - We observe at least 25% improvement per Contains operation for matchers 
with pattern length lower than 4; 100% for negative cases (as the code never 
accesses a secondary for-loop)
    - It was noticed the Boyer-Moor algorithm performs poorly for small 
patterns as lookup accesses erase the associated optimizations; this algorithm 
performs extremely well when the pattern length increases as unlike the naive 
implementation it is able to use the lookup table to jump beyond the next 
character (since the matching phase has already gained so insight).
    - One popular introduction to this algorithm is the following use-case
    - Input: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    - Pattern: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaax
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-5879

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1072
    
----
commit 50ac5140420057692cac8a33f181eb16580a28e6
Author: Salim Achouche <sachouc...@gmail.com>
Date:   2017-12-13T22:24:45Z

    DRILL-5879: Improved SQL Pattern Contains Performance

----


---

Reply via email to