[ https://issues.apache.org/jira/browse/PHOENIX-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491783#comment-14491783 ]
James Taylor commented on PHOENIX-1287:
---------------------------------------

[~shuxi0ng] - I was hoping the perf difference would be bigger. [~apurtell] - did you guys measure the perf diff when you implemented the HBase JIRA for using the JONI byte[] based regex engine?

{quote}
We can also compute the byte based offset directly, but it depends on how Hbase encodes string to bytes. How Hbase encodes string to bytes is in org.apache.hadoop.hbase.util.Bytes#toBytes(String s).
{quote}

Yep, that's exactly the part I was going to comment on. Everything else looks good, but this part can be improved. Strings are stored in UTF8 format. Rather than turn them into Strings, you can calculate the byte offset given a character offset using StringUtil.calculateUTF8Length(). That starts from the beginning. Doing it in reverse, you'd need to start from the last byte and walk backwards counting one each time you do get a match on one of the BYTES_1_MASK, BYTES_2_MASK, BYTES_3_MASK, BYTES_4_MASK. Not sure if Guava or Apache Commons has utils for this that are better than ours.

> Use the joni byte[] regex engine in place of j.u.regex
> ------------------------------------------------------
>
>                 Key: PHOENIX-1287
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1287
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: Shuxiong Ye
>              Labels: gsoc2015
>         Attachments: add_varchar_to_performance_script.patch
>
>
> See HBASE-11907. We'd get a 2x perf benefit plus it's driven off of byte[]
> instead of strings.
> Thanks for the pointer, [~apurtell].

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
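
For reference, here is a minimal sketch of the byte-offset walk described in the comment above, in both directions. It assumes well-formed UTF-8 input and treats one code point as one character. The class and method names are hypothetical, and the bit tests are the standard UTF-8 lead-byte patterns, used here in place of the BYTES_n_MASK constants (whose values are not shown in this thread). It is not Phoenix's StringUtil code.

```java
// Hypothetical helper, for illustration only.
final class Utf8Offsets {

    // Length in bytes of the UTF-8 sequence that starts with lead byte b.
    static int sequenceLength(byte b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        throw new IllegalArgumentException("not a UTF-8 lead byte");
    }

    // Continuation bytes have the form 10xxxxxx; every other byte starts a character.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Forward walk: byte index at which the character with index charIndex starts.
    // charIndex equal to the character count yields offset + length.
    static int byteOffsetFromStart(byte[] utf8, int offset, int length, int charIndex) {
        int pos = offset;
        int end = offset + length;
        for (int i = 0; i < charIndex; i++) {
            if (pos >= end) {
                throw new IndexOutOfBoundsException("charIndex past end of value");
            }
            pos += sequenceLength(utf8[pos]);
        }
        return pos;
    }

    // Reverse walk: start at the last byte and step backwards, counting one
    // character each time a lead byte is found. Returns the byte index at which
    // the charsFromEnd-th character from the end starts (1 = last character).
    static int byteOffsetFromEnd(byte[] utf8, int offset, int length, int charsFromEnd) {
        int pos = offset + length;
        int seen = 0;
        while (seen < charsFromEnd) {
            if (pos <= offset) {
                throw new IndexOutOfBoundsException("charsFromEnd past start of value");
            }
            pos--;
            if (!isContinuation(utf8[pos])) {
                seen++;
            }
        }
        return pos;
    }
}
```

Both methods work directly on the stored byte[] and never build a String, which is the point of the suggestion; the reverse walk costs time proportional to the bytes it skips, not to the length of the whole value.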