[ 
https://issues.apache.org/jira/browse/PHOENIX-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491783#comment-14491783
 ] 

James Taylor commented on PHOENIX-1287:
---------------------------------------

[~shuxi0ng] - I was hoping the perf difference would be bigger. [~apurtell] - 
did you guys measure the perf diff when you implemented the HBase JIRA for 
using the JONI byte[] based regex engine?

{quote}
We can also compute the byte-based offset directly, but it depends on how HBase 
encodes strings to bytes.
The string-to-bytes encoding is done in 
org.apache.hadoop.hbase.util.Bytes#toBytes(String s).
{quote}
Yep, that's exactly the part I was going to comment on. Everything else looks 
good, but this part can be improved.
Strings are stored in UTF-8 format. Rather than turning the bytes into Strings, 
you can calculate the byte offset for a given character offset using 
StringUtil.calculateUTF8Length(), which walks forward from the beginning. To do 
it in reverse, you'd need to start from the last byte and walk backwards, 
counting one each time you get a match on one of BYTES_1_MASK, BYTES_2_MASK, 
BYTES_3_MASK, or BYTES_4_MASK. Not sure if Guava or Apache Commons has utils 
for this that are better than ours.

> Use the joni byte[] regex engine in place of j.u.regex
> ------------------------------------------------------
>
>                 Key: PHOENIX-1287
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1287
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: Shuxiong Ye
>              Labels: gsoc2015
>         Attachments: add_varchar_to_performance_script.patch
>
>
> See HBASE-11907. We'd get a 2x perf benefit plus it's driven off of byte[] 
> instead of strings. Thanks for the pointer, [~apurtell].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
