[jira] [Commented] (PHOENIX-1287) Use the joni byte[] regex engine in place of j.u.regex

Shuxiong Ye (JIRA) Tue, 07 Apr 2015 07:44:49 -0700

    [ https://issues.apache.org/jira/browse/PHOENIX-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483280#comment-14483280 ]


Shuxiong Ye commented on PHOENIX-1287:
--------------------------------------

Thanks, [~mujtabachohan] [~jamestaylor].

I added a VARCHAR column, "statements", which is not part of the primary key
and takes one of the values {"ONE:TWO:THREE", "ABC:DEF", "PKU:THU:FDU"}.

The performance test results are shown below. Scale: 10M rows; each query was run 5 times.

{code}
Query # 6 - Like + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE STATS.DESCRIPTION LIKE '%U%U%U%';
Query # 7 - Replace + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE REGEXP_REPLACE(STATS.DESCRIPTION, '[A-Z]+')='::';
Query # 8 - Substr + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE REGEXP_SUBSTR(STATS.DESCRIPTION, '[A-Z]+')='ONE';
Query # 9 - Split + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE ARRAY_ELEM(REGEXP_SPLIT(STATS.DESCRIPTION, '\\:'), 1)='ONE';
{code}

|| ||String Based||Byte Based||Speedup (String/Byte)||
|Like|14.784 / 16.436 / 15.706 / 16.247 / 15.148|15.157 / 15.749 / 15.403 / 15.129 / 16.306|1.007|
|Replace|15.630 / 15.762 / 16.503 / 16.507 / 17.094|15.892 / 15.712 / 15.808 / 15.607 / 16.643|1.023|
|Substr|13.891 / 14.537 / 14.575 / 15.553 / 15.862|13.648 / 14.431 / 15.719 / 14.654 / 13.994|1.027|
|Split|17.442 / 17.767 / 17.070 / 17.089 / 17.462|16.453 / 16.033 / 15.954 / 14.737 / 15.713|1.101|
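
The Speedup column is consistent with dividing the mean of the five String Based timings by the mean of the five Byte Based timings. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and are not part of the performance script.

```java
final class SpeedupSketch {
    /** Arithmetic mean of the timings from repeated runs of one query. */
    static double mean(double[] timings) {
        double sum = 0.0;
        for (double t : timings) {
            sum += t;
        }
        return sum / timings.length;
    }

    /** Speedup(String/Byte): mean string-based time over mean byte-based time. */
    static double speedup(double[] stringBased, double[] byteBased) {
        return mean(stringBased) / mean(byteBased);
    }

    public static void main(String[] args) {
        // Timings copied from the Split row of the table.
        double[] stringBased = {17.442, 17.767, 17.070, 17.089, 17.462};
        double[] byteBased = {16.453, 16.033, 15.954, 14.737, 15.713};
        System.out.printf("%.3f%n", speedup(stringBased, byteBased));
    }
}
```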

[~jamestaylor] It is ready for review.

I want to discuss some details of the REGEXP_SUBSTR implementation.

For REGEXP_SUBSTR, when given a non-zero offset, it is difficult to derive the
right byte-based offset from the given string-based offset.

Currently, for a given non-zero string-based offset, the implementation:
1) converts the bytes to a string;
2) checks whether the given offset is legal (note that the offset may be negative);
3) computes the corresponding byte-based offset.
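
The three steps above can be sketched roughly as follows. This is an illustration, not the code in the patch: the class and method names are hypothetical, and it assumes 1-based offsets, negative offsets counting back from the end of the string, and UTF-8 encoded bytes. Offset zero is rejected here only because the discussion covers non-zero offsets.

```java
import java.nio.charset.StandardCharsets;

final class OffsetViaStringSketch {
    /**
     * Maps a string-based offset to a byte-based offset, or returns -1 if the
     * offset is out of range.
     * Assumptions (not taken from the patch): offsets are 1-based, a negative
     * offset counts back from the end of the string, and the bytes are UTF-8.
     */
    static int toByteOffset(byte[] value, int stringOffset) {
        // 1) Turn the bytes into a string.
        String s = new String(value, StandardCharsets.UTF_8);
        int len = s.length();

        // 2) Check that the offset is legal; negative offsets count from the end.
        int index = stringOffset > 0 ? stringOffset - 1 : len + stringOffset;
        if (stringOffset == 0 || index < 0 || index >= len) {
            return -1;
        }

        // 3) Re-encode the prefix to find where that character starts in the bytes.
        return s.substring(0, index).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        byte[] value = "PKU:THU:FDU".getBytes(StandardCharsets.UTF_8);
        System.out.println(toByteOffset(value, 5));   // character 5 is 'T'
        System.out.println(toByteOffset(value, -3));  // third character from the end
    }
}
```

The cost of this approach is that every row pays for a full decode and a partial re-encode, which works against the goal of staying on byte[].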

We could also compute the byte-based offset directly, but that depends on how
HBase encodes a string to bytes, which is defined in
org.apache.hadoop.hbase.util.Bytes#toBytes(String s).
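
As a rough sketch of the direct computation, assuming the stored bytes are well-formed UTF-8 (which is what Bytes#toBytes(String) is understood to produce, but worth confirming), the byte offset can be found by walking the lead bytes without building a String. The names below are hypothetical, and the index is zero-based and non-negative; a negative offset would still need the total character count, which means one full pass over the bytes.

```java
import java.nio.charset.StandardCharsets;

final class OffsetDirectSketch {
    /** Length in bytes of the UTF-8 sequence that starts with this lead byte. */
    static int sequenceLength(byte lead) {
        int v = lead & 0xFF;
        if (v < 0x80) return 1; // 0xxxxxxx: ASCII
        if (v < 0xE0) return 2; // 110xxxxx
        if (v < 0xF0) return 3; // 1110xxxx
        return 4;               // 11110xxx
    }

    /**
     * Returns the byte offset of the zero-based Java char index, or -1 if the
     * index is negative, past the end, or falls inside a surrogate pair.
     * Assumption: the input is well-formed UTF-8.
     */
    static int byteOffsetOf(byte[] utf8, int charIndex) {
        int pos = 0;
        int chars = 0;
        while (pos < utf8.length && chars < charIndex) {
            int n = sequenceLength(utf8[pos]);
            // A 4-byte sequence decodes to a surrogate pair, i.e. two Java chars.
            chars += (n == 4) ? 2 : 1;
            pos += n;
        }
        return chars == charIndex ? pos : -1;
    }

    public static void main(String[] args) {
        byte[] value = "PKU:THU:FDU".getBytes(StandardCharsets.UTF_8);
        System.out.println(byteOffsetOf(value, 4)); // byte offset of the first 'T'
    }
}
```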


> Use the joni byte[] regex engine in place of j.u.regex
> ------------------------------------------------------
>
>                 Key: PHOENIX-1287
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1287
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: Shuxiong Ye
>              Labels: gsoc2015
>         Attachments: add_varchar_to_performance_script.patch
>
>
> See HBASE-11907. We'd get a 2x perf benefit plus it's driven off of byte[] 
> instead of strings. Thanks for the pointer, [~apurtell].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
