[ https://issues.apache.org/jira/browse/PHOENIX-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483280#comment-14483280 ]
Shuxiong Ye commented on PHOENIX-1287:
--------------------------------------

Thanks, [~mujtabachohan] [~jamestaylor].

I added a VARCHAR column, "statements", which is not part of the primary key and takes one of the values {"ONE:TWO:THREE", "ABC:DEF", "PKU:THU:FDU"}.

The performance test results are shown below.

Scale: 10M rows, 5 runs per query.
{code}
Query # 6 - Like + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE STATS.DESCRIPTION LIKE '%U%U%U%';
Query # 7 - Replace + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE REGEXP_REPLACE(STATS.DESCRIPTION, '[A-Z]+')='::';
Query # 8 - Substr + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE REGEXP_SUBSTR(STATS.DESCRIPTION, '[A-Z]+')='ONE';
Query # 9 - Split + Count - SELECT COUNT(1) FROM PERFORMANCE_10000000 WHERE ARRAY_ELEM(REGEXP_SPLIT(STATS.DESCRIPTION, '\\:'), 1)='ONE';
{code}

|| ||String Based||Byte Based||Speedup (String/Byte)||
|Like|14.784 / 16.436 / 15.706 / 16.247 / 15.148|15.157 / 15.749 / 15.403 / 15.129 / 16.306|1.007|
|Replace|15.630 / 15.762 / 16.503 / 16.507 / 17.094|15.892 / 15.712 / 15.808 / 15.607 / 16.643|1.023|
|Substr|13.891 / 14.537 / 14.575 / 15.553 / 15.862|13.648 / 14.431 / 15.719 / 14.654 / 13.994|1.027|
|Split|17.442 / 17.767 / 17.070 / 17.089 / 17.462|16.453 / 16.033 / 15.954 / 14.737 / 15.713|1.101|

[~jamestaylor] It is ready for review.

I want to discuss some details of the REGEXP_SUBSTR implementation.

For REGEXP_SUBSTR, when given a non-zero offset, it is difficult to derive the correct byte-based offset from the given string-based offset. Currently, for a given non-zero string-based offset, the code:
1) converts the bytes to a string;
2) checks whether the given offset is legal (note that the offset may be negative);
3) computes the corresponding byte-based offset.

We could also compute the byte-based offset directly, but that depends on how HBase encodes a string to bytes, which is defined in org.apache.hadoop.hbase.util.Bytes#toBytes(String s).
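
To make the two options above concrete, here is a minimal Java sketch, not the code in the patch. It assumes the bytes are standard UTF-8 (which is what Bytes#toBytes(String) is understood to produce), that offsets are 1-based and counted in Unicode code points, and that a negative offset counts back from the end of the string. The class name OffsetSketch and its methods are hypothetical, and a zero offset is simply rejected because the discussion only covers non-zero offsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Phoenix or HBase.
final class OffsetSketch {
    private OffsetSketch() {}

    // Assumed convention: 1-based offsets, negative offsets count from the end.
    // Returns a 0-based code point index, or -1 if the offset is not legal.
    static int normalize(int offset, int codePoints) {
        if (offset == 0) {
            return -1; // zero offsets are outside the scope of this sketch
        }
        int idx = offset > 0 ? offset - 1 : codePoints + offset;
        return (idx >= 0 && idx < codePoints) ? idx : -1;
    }

    // Option 1: decode to a String, validate, then measure the encoded prefix.
    static int byteOffsetViaString(byte[] utf8, int offset) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        int idx = normalize(offset, s.codePointCount(0, s.length()));
        if (idx < 0) {
            return -1;
        }
        int charIdx = s.offsetByCodePoints(0, idx);
        return s.substring(0, charIdx).getBytes(StandardCharsets.UTF_8).length;
    }

    // Option 2: scan the bytes directly. In valid UTF-8 every byte that is
    // not a continuation byte (10xxxxxx) starts a new code point.
    static int byteOffsetDirect(byte[] utf8, int offset) {
        int codePoints = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                codePoints++;
            }
        }
        int idx = normalize(offset, codePoints);
        if (idx < 0) {
            return -1;
        }
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                if (seen == idx) {
                    return i;
                }
                seen++;
            }
        }
        return -1; // not reached for well-formed UTF-8 input
    }
}
```

Both methods should agree on well-formed input. The direct scan avoids allocating a String, but it is only valid as long as the stored bytes really are UTF-8, which is the dependency on the HBase encoding mentioned above.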
> Use the joni byte[] regex engine in place of j.u.regex
> ------------------------------------------------------
>
>                 Key: PHOENIX-1287
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1287
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: Shuxiong Ye
>              Labels: gsoc2015
>         Attachments: add_varchar_to_performance_script.patch
>
>
> See HBASE-11907. We'd get a 2x perf benefit plus it's driven off of byte[]
> instead of strings. Thanks for the pointer, [~apurtell].