[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

Padma Penumarthy (JIRA) Thu, 10 Aug 2017 15:03:39 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ]


Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:02 PM:
-------------------------------------------------------------------

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec


For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘%a%’;
32 sec (vs 1min 6s baseline)


Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘%a%’
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘a%’
46  (vs 9.794s baseline)


I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec




For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘%a%’;
32 sec (vs 1min 6s baseline)


Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘%a%’
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘a%’
46  (vs 9.794s baseline)


I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)













> Improve performance of filter operator for pattern matching
> -----------------------------------------------------------
>
>                 Key: DRILL-5697
>                 URL: https://issues.apache.org/jira/browse/DRILL-5697
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Flow
>    Affects Versions: 1.11.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>
> Queries using filter with sql like operator use Java regex library for 
> pattern matching. However, for cases like %abc (ends with abc), abc% (starts 
> with abc), %abc% (contains abc), it is observed that implementing these cases 
> with simple code instead of using regex library provides good performance 
> boost (4-6x). Idea is to use special case code for simple, common cases and 
> fall back to Java regex library for complicated ones. That will provide good 
> performance benefit for most common cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

Reply via email to