[ https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209835#comment-16209835 ]
ASF GitHub Bot commented on DRILL-5879:
---------------------------------------

GitHub user sachouche opened a pull request:

    https://github.com/apache/drill/pull/1001

    JIRA DRILL-5879: Like operator performance improvements

    - Recently, custom code has been added to handle common search patterns (Like operator): Contains, Starts With, and Ends With
    - The custom code is much faster than the generic RegEx-based implementation
    - This pull request is another attempt to improve the Contains pattern, since it is the most CPU intensive

    Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';

    Improvement Opportunities
    - Avoid the isAscii computation (a full scan of the input string), since we're dealing with the same column twice
    - Optimize the "contains" for-loop

    Implementation Details
    1)
    - Added a new integer variable "asciiMode" to the VarCharHolder class
    - The default value is -1, which indicates this info is not known
    - Otherwise this value will be set to either 1 or 0, depending on whether the string is ASCII or Unicode
    - The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
    - The asciiMode will be set during the first LIKE evaluation and reused across the other LIKE evaluations
    2)
    - The "Contains" LIKE operation is quite expensive, as the code needs to access the input string to perform character-based comparisons
    - Created 4 versions of the same for-loop to a) make the loop simpler to optimize (vectorization) and b) minimize comparisons

    Benchmarks
    - Lineitem table 100GB
    - Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
    - Before changes: 33 sec
    - After changes: 27 sec

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill yodlee-cherry-pick

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1001.patch

To close this pull
request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1001

----
commit c2b05b2e8665daf3f7b43d49c428539b3753595f
Author: Salim Achouche <sachouc...@gmail.com>
Date:   2017-10-18T18:40:18Z

    JIRA 5879: Like operator performance improvements

----

> Optimize "Like" operator
> ------------------------
>
>                 Key: DRILL-5879
>                 URL: https://issues.apache.org/jira/browse/DRILL-5879
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>         Environment: *
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Minor
>             Fix For: 1.12.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
> Improvement Opportunities
> # Avoid isAscii computation (full access of the input string) since we're dealing with the same column twice
> # Optimize the "contains" for-loop
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1 which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
> 2)
> * The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33sec
> * After changes : 27sec

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
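
For readers who want a concrete picture of the two ideas in the description (caching the ASCII check on the shared holder, and splitting the "contains" loop into simpler specialized loops), here is a minimal sketch in Java. It is not the code from the pull request. The Holder class, the scans counter, the meaning assigned to 1 and 0, and the choice to specialize on pattern length are all assumptions made for illustration, and only two loop variants are shown rather than four.

```java
import java.nio.charset.StandardCharsets;

final class LikeContainsSketch {

    /** Hypothetical stand-in for a VarChar holder; not Drill's actual class. */
    static final class Holder {
        final byte[] data;
        // -1 = not yet known; 1 = ASCII, 0 = non-ASCII (value meanings assumed here).
        int asciiMode = -1;
        Holder(byte[] data) { this.data = data; }
    }

    /** Counts full scans of the input, only to make the caching visible. */
    static int scans = 0;

    /** Scans the input at most once per holder, then reuses the cached answer. */
    public static boolean isAscii(Holder h) {
        if (h.asciiMode == -1) {
            scans++;
            int mode = 1;
            for (byte b : h.data) {
                if (b < 0) { mode = 0; break; } // high bit set: not ASCII
            }
            h.asciiMode = mode;
        }
        return h.asciiMode == 1;
    }

    /** Specialized loop for a one-byte pattern: one comparison per input byte. */
    static boolean containsOne(byte[] in, byte p) {
        for (int i = 0; i < in.length; i++) {
            if (in[i] == p) return true;
        }
        return false;
    }

    /** General loop for longer patterns (naive substring search on bytes). */
    static boolean containsN(byte[] in, byte[] pat) {
        final int last = in.length - pat.length;
        outer:
        for (int i = 0; i <= last; i++) {
            for (int j = 0; j < pat.length; j++) {
                if (in[i + j] != pat[j]) continue outer;
            }
            return true;
        }
        return false;
    }

    /** Evaluates "input LIKE '%pattern%'" for one holder. */
    public static boolean contains(Holder h, byte[] pat) {
        if (isAscii(h)) {
            // ASCII input: compare raw bytes, choosing the loop by pattern length.
            return pat.length == 1 ? containsOne(h.data, pat[0]) : containsN(h.data, pat);
        }
        // Non-ASCII input: fall back to a character-based comparison.
        String s = new String(h.data, StandardCharsets.UTF_8);
        return s.contains(new String(pat, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Holder row = new Holder("carefully final deposits".getBytes(StandardCharsets.UTF_8));
        // Mirrors: l_comment not like '%a%' or l_comment like '%the%'
        boolean keep = !contains(row, "a".getBytes(StandardCharsets.UTF_8))
                || contains(row, "the".getBytes(StandardCharsets.UTF_8));
        System.out.println("keep=" + keep + ", full scans=" + scans);
    }
}
```

Both LIKE predicates go through the same Holder instance, so the second call finds asciiMode already set and skips the full scan. This relies on the same property the description mentions: the execution plan shares one holder across all evaluations of a column value.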