[GitHub] drill pull request: DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SU...

jinfengni Tue, 17 May 2016 14:25:52 -0700

Github user jinfengni commented on the pull request:

    https://github.com/apache/drill/pull/458#issuecomment-219858967
  
    @jcmcote,
    
    (copied from my comment in DRILL-4573)
    
    I revisited your first patch. It looks like the change would produce
    incorrect results when the input string is UTF-8 and contains characters
    that span multiple bytes. The original implementation decodes the byte
    array as UTF-8 (the default encoding in Drill), whereas your
    CharSequenceWrapper treats each byte as a character. That gives an
    incorrect result whenever a character is represented by more than one
    byte.
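
To make the failure mode concrete, here is a minimal Java sketch. It is not the code from the patch: `ByteAsCharSequence` is a hypothetical stand-in that exposes each byte of a plain `byte[]` as one char, which is the behaviour described above, and the helper names `matchesBytewise` and `matchesDecoded` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class ByteAsCharDemo {

    /** Hypothetical stand-in for the patch's wrapper: one byte is exposed as one char. */
    static final class ByteAsCharSequence implements CharSequence {
        private final byte[] bytes;
        private final int start;
        private final int end;

        ByteAsCharSequence(byte[] bytes, int start, int end) {
            this.bytes = bytes;
            this.start = start;
            this.end = end;
        }

        @Override public int length() { return end - start; }

        // Each byte becomes its own char, so a 2-byte UTF-8 sequence yields 2 chars.
        @Override public char charAt(int index) { return (char) (bytes[start + index] & 0xFF); }

        @Override public CharSequence subSequence(int from, int to) {
            return new ByteAsCharSequence(bytes, start + from, start + to);
        }

        @Override public String toString() {
            StringBuilder sb = new StringBuilder(length());
            for (int i = 0; i < length(); i++) {
                sb.append(charAt(i));
            }
            return sb.toString();
        }
    }

    /** Matches against the raw UTF-8 bytes, one byte per char. */
    static boolean matchesBytewise(String pattern, String input) {
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        return Pattern.compile(pattern)
                .matcher(new ByteAsCharSequence(utf8, 0, utf8.length))
                .matches();
    }

    /** Matches after decoding the bytes as UTF-8. */
    static boolean matchesDecoded(String pattern, String input) {
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return Pattern.compile(pattern).matcher(decoded).matches();
    }

    public static void main(String[] args) {
        // \u00fc is the u-umlaut, encoded as two bytes (0xC3 0xBC) in UTF-8.
        System.out.println(matchesBytewise("abc", "abc"));
        System.out.println(matchesBytewise("M\u00fcnchen", "M\u00fcnchen"));
        System.out.println(matchesDecoded("M\u00fcnchen", "M\u00fcnchen"));
    }
}
```

Pure ASCII input is unaffected because each character is a single byte; the mismatch only shows up once the byte count and the character count differ.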
    
    For instance, in the following example, the first regexp_matches query
    produces the wrong result.
     
    {code:sql}
    select regexp_matches('München', 'München') res3 from (values(1));
    +--------+
    |  res3  |
    +--------+
    | false  |
    +--------+
    1 row selected (0.148 seconds)
    0: jdbc:drill:zk=local> select regexp_matches('abc', 'abc') from (values(1));
    +---------+
    | EXPR$0  |
    +---------+
    | true    |
    +---------+
    1 row selected (0.189 seconds)
    0: jdbc:drill:zk=local> select 'München' = 'München' res1 from (values(1));
    +-------+
    | res1  |
    +-------+
    | true  |
    +-------+
    1 row selected (0.186 seconds)
    {code}
    
    Here is the result of the first query without your patch:
    
    {code:sql}
    select regexp_matches('München', 'München') res3 from (values(1));
    +-------+
    | res3  |
    +-------+
    | true  |
    +-------+
    {code}
    
    I think you should modify CharSequenceWrapper so that the encoding is
    honored.
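
One possible shape for such a wrapper is sketched below, purely as an illustration and not as the fix that belongs in the patch. The class name `Utf8DecodingWrapper` and its `setBuffer` method are hypothetical, and a `java.nio.ByteBuffer` stands in for Drill's own buffer type. The sketch decodes UTF-8 into a reusable `CharBuffer`, so it is not strictly zero-copy; it only avoids allocating a new String per value.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical wrapper that honors UTF-8 by decoding into a reusable buffer. */
final class Utf8DecodingWrapper implements CharSequence {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private CharBuffer chars = CharBuffer.allocate(64);

    /** Decodes the remaining bytes of the given buffer; the caller's buffer is not modified. */
    void setBuffer(ByteBuffer utf8) {
        // UTF-8 never decodes to more UTF-16 chars than it has input bytes.
        int needed = utf8.remaining();
        if (chars.capacity() < needed) {
            chars = CharBuffer.allocate(needed);
        }
        chars.clear();
        decoder.reset();
        decoder.decode(utf8.duplicate(), chars, true);
        decoder.flush(chars);
        chars.flip();
    }

    @Override public int length() { return chars.remaining(); }

    @Override public char charAt(int index) { return chars.charAt(index); }

    // The returned view shares storage with this wrapper and is only
    // valid until the next call to setBuffer.
    @Override public CharSequence subSequence(int start, int end) {
        return chars.subSequence(start, end);
    }

    @Override public String toString() { return chars.toString(); }

    public static void main(String[] args) {
        Utf8DecodingWrapper wrapper = new Utf8DecodingWrapper();
        byte[] utf8 = "M\u00fcnchen".getBytes(StandardCharsets.UTF_8);
        wrapper.setBuffer(ByteBuffer.wrap(utf8));
        System.out.println(Pattern.compile("M\u00fcnchen").matcher(wrapper).matches());
    }
}
```

A wrapper that wants to stay copy-free for the common case could keep the byte-per-char path for values that contain only ASCII bytes and fall back to decoding otherwise; whether that is worth the extra scan is a design choice for the patch author.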




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
