[ https://issues.apache.org/jira/browse/DRILL-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217807#comment-16217807 ]
ASF GitHub Bot commented on DRILL-5879:
---------------------------------------

Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1001#discussion_r146708658

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -17,37 +17,166 @@
      */
     package org.apache.drill.exec.expr.fn.impl;

    -public class SqlPatternContainsMatcher implements SqlPatternMatcher {
    +public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
       final String patternString;
       CharSequence charSequenceWrapper;
       final int patternLength;
    +  final MatcherFcn matcherFcn;

       public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) {
    -    this.patternString = patternString;
    +    this.patternString = patternString;
         this.charSequenceWrapper = charSequenceWrapper;
    -    patternLength = patternString.length();
    +    patternLength = patternString.length();
    +
    +    // The idea is to write loops with simple condition checks to allow the Java Hotspot achieve
    +    // better optimizations (especially vectorization)
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    --- End diff --

    Padma, I have two reasons for keeping the added complexity:
    1) The new code is encapsulated within the Contains matching logic, so it doesn't increase overall code complexity
    2) Performance:
    o I created a test with the original match logic; the pattern and input were Strings, though passed as CharSequence
    o Ran the test with the new and old methods (1 billion iterations) on MacOS
    o pattern length
    o The old match method ran in 43sec, whereas the new one ran in 15sec
    o The reason for the speedup is that the custom matcher functions execute fewer instructions (loads and comparisons)


> Optimize "Like" operator
> ------------------------
>
>                 Key: DRILL-5879
>                 URL: https://issues.apache.org/jira/browse/DRILL-5879
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>         Environment: *
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Minor
>             Fix For: 1.12.0
>
>
> Query: select <column-list> from <table> where colA like '%a%' or colA like '%xyz%';
> Improvement Opportunities
> # Avoid the isAscii computation (a full scan of the input string) since we're dealing with the same column twice
> # Optimize the "contains" for-loop
> Implementation Details
> 1)
> * Added a new integer variable "asciiMode" to the VarCharHolder class
> * The default value is -1, which indicates this info is not known
> * Otherwise this value will be set to either 1 or 0 based on the string being in ASCII mode or Unicode
> * The execution plan already shares the same VarCharHolder instance for all evaluations of the same column value
> * The asciiMode will be correctly set during the first LIKE evaluation and will be reused across other LIKE evaluations
> 2)
> * The "Contains" LIKE operation is quite expensive as the code needs to access the input string to perform character based comparisons
> * Created 4 versions of the same for-loop to a) make the loop simpler to optimize (Vectorization) and b) minimize comparisons
> Benchmarks
> * Lineitem table 100GB
> * Query: select l_returnflag, count(*) from dfs.`<source>` where l_comment not like '%a%' or l_comment like '%the%' group by l_returnflag
> * Before changes: 33sec
> * After changes: 27sec



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
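

For readers following the thread, here is a minimal sketch of the idea discussed above: pick a matcher specialized for the pattern length once, in the constructor, so that each loop body has only simple condition checks. It is not the code from the pull request. The class name ContainsMatcherSketch, the lambda-based strategies, and the particular split into lengths 0, 1, 2 and "anything longer" are assumptions made for illustration; only the MatcherFcn name and the length-1 special case appear in the diff.

```java
// Hypothetical sketch; not the actual Drill implementation.
final class ContainsMatcherSketch {

  /** One strategy per pattern length, chosen once at construction time. */
  interface MatcherFcn {
    boolean match(CharSequence input);
  }

  private final MatcherFcn matcherFcn;

  ContainsMatcherSketch(final String pattern) {
    final int len = pattern.length();

    if (len == 0) {
      // An empty pattern is contained in every input.
      matcherFcn = input -> true;
    } else if (len == 1) {
      // Single character: one load and one comparison per position.
      final char c0 = pattern.charAt(0);
      matcherFcn = input -> {
        final int n = input.length();
        for (int i = 0; i < n; i++) {
          if (input.charAt(i) == c0) {
            return true;
          }
        }
        return false;
      };
    } else if (len == 2) {
      // Two characters: no inner loop, both pattern chars held in locals.
      final char c0 = pattern.charAt(0);
      final char c1 = pattern.charAt(1);
      matcherFcn = input -> {
        final int last = input.length() - 1;
        for (int i = 0; i < last; i++) {
          if (input.charAt(i) == c0 && input.charAt(i + 1) == c1) {
            return true;
          }
        }
        return false;
      };
    } else {
      // General case: naive scan with an inner comparison loop.
      matcherFcn = input -> {
        final int last = input.length() - len;
        for (int i = 0; i <= last; i++) {
          int j = 0;
          while (j < len && input.charAt(i + j) == pattern.charAt(j)) {
            j++;
          }
          if (j == len) {
            return true;
          }
        }
        return false;
      };
    }
  }

  boolean match(CharSequence input) {
    return matcherFcn.match(input);
  }
}
```

Usage would look like new ContainsMatcherSketch("the").match(comment). The length check happens once per pattern rather than once per row, which is where the per-row savings described in the comment would be expected to come from; whether HotSpot actually vectorizes any of these loops depends on the JVM and is not something this sketch demonstrates.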