GitHub user sachouche opened a pull request:
https://github.com/apache/drill/pull/1001
JIRA DRILL-5879: Like operator performance improvements
- Custom code was recently added to handle common search patterns for the
LIKE operator: Contains, Starts With, and Ends With
- The custom code is considerably faster than the generic regex-based
implementation
- This pull request is another attempt to improve the Contains pattern,
since it is more CPU intensive than the others
Query: select <column-list> from <table> where colA like '%a%' or
colA like '%xyz%';
Improvement Opportunities
- Avoid repeating the isAscii computation (a full scan of the input
string), since the query evaluates the same column twice
- Optimize the "contains" for-loop
Implementation Details
1) Added a new integer variable "asciiMode" to the VarCharHolder class
- The default value is -1, which indicates that this information is not
yet known
- Otherwise the value is set to either 1 or 0, depending on whether the
string is ASCII or Unicode
- The execution plan already shares the same VarCharHolder instance for all
evaluations of the same column value
- asciiMode is set during the first LIKE evaluation and is reused by the
other LIKE evaluations
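The caching idea in item 1 can be sketched roughly as follows. This is a
minimal illustration, not the code in the pull request: HolderSketch is a
hypothetical stand-in for VarCharHolder, the mapping of 1 to ASCII and 0 to
Unicode is an assumption, and the scans counter exists only to show that the
full scan happens once.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a VarChar holder; not Drill's VarCharHolder. */
final class HolderSketch {
    // Assumed encoding of the flag: -1 unknown, 0 non-ASCII, 1 ASCII.
    static final int UNKNOWN = -1, NON_ASCII = 0, ASCII = 1;

    final byte[] bytes;      // UTF-8 encoded column value
    int asciiMode = UNKNOWN; // stays -1 until the first evaluation
    int scans = 0;           // illustration only: counts full scans

    HolderSketch(String value) {
        this.bytes = value.getBytes(StandardCharsets.UTF_8);
    }

    /** Scans the bytes at most once; later callers reuse the cached flag. */
    boolean isAscii() {
        if (asciiMode == UNKNOWN) {
            scans++;
            asciiMode = ASCII;
            for (byte b : bytes) {
                if (b < 0) { // high bit set: byte of a multi-byte sequence
                    asciiMode = NON_ASCII;
                    break;
                }
            }
        }
        return asciiMode == ASCII;
    }
}
```

With two LIKE predicates on the same column value, both would call isAscii()
on the shared holder, and only the first call would pay for the scan.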
2) The "Contains" LIKE operation is quite expensive, as the code needs to
access the input string to perform character-based comparisons
- Created 4 versions of the same for-loop to a) make the loop simpler to
optimize (vectorization) and b) minimize comparisons
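One way such loop specialization might look is sketched below. The message
does not say how the four versions differ, so splitting by pattern length
(1, 2, 3, and N bytes) is an assumption made for illustration, and
ContainsSketch and its methods are hypothetical names. Each short-pattern
loop has a fixed body with no inner loop, which keeps the comparison count
low and the loop shape simple.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a "contains" matcher specialized by pattern length. */
final class ContainsSketch {
    /** Dispatches once per pattern, outside the hot loop. */
    static boolean contains(byte[] in, byte[] pat) {
        switch (pat.length) {
            case 0:  return true; // empty pattern matches everything
            case 1:  return contains1(in, pat[0]);
            case 2:  return contains2(in, pat[0], pat[1]);
            case 3:  return contains3(in, pat[0], pat[1], pat[2]);
            default: return containsN(in, pat);
        }
    }

    static boolean contains1(byte[] in, byte p0) {
        for (int i = 0; i < in.length; i++) {
            if (in[i] == p0) return true;
        }
        return false;
    }

    static boolean contains2(byte[] in, byte p0, byte p1) {
        for (int i = 0; i + 1 < in.length; i++) {
            if (in[i] == p0 && in[i + 1] == p1) return true;
        }
        return false;
    }

    static boolean contains3(byte[] in, byte p0, byte p1, byte p2) {
        for (int i = 0; i + 2 < in.length; i++) {
            if (in[i] == p0 && in[i + 1] == p1 && in[i + 2] == p2) return true;
        }
        return false;
    }

    /** General case: check the first byte, then compare the remainder. */
    static boolean containsN(byte[] in, byte[] pat) {
        final int last = in.length - pat.length;
        final byte first = pat[0];
        for (int i = 0; i <= last; i++) {
            if (in[i] != first) continue;
            int j = 1;
            while (j < pat.length && in[i + j] == pat[j]) j++;
            if (j == pat.length) return true;
        }
        return false;
    }

    /** Helper for the examples: encodes a string as UTF-8 bytes. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Matching on raw UTF-8 bytes is sufficient for a substring test, because a
valid UTF-8 pattern can only match at character boundaries of a valid UTF-8
input.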
Benchmarks
Lineitem table 100GB
Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not
like '%a%' or l_comment like '%the%' group by l_returnflag
Before changes: 33 sec
After changes: 27 sec
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachouche/drill yodlee-cherry-pick
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1001.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1001
----
commit c2b05b2e8665daf3f7b43d49c428539b3753595f
Author: Salim Achouche <[email protected]>
Date: 2017-10-18T18:40:18Z
JIRA 5879: Like operator performance improvements
----
---