Github user jinfengni commented on the pull request: https://github.com/apache/drill/pull/458#issuecomment-219858967 @jcmcote , (copy from my comment in DRILL-4573) I re-visited your first patch. Looks like that the change you made would cause incorrect result when the input string is a udf-8 with each character consisting of multiple bytes. In particular, the original implementation would encode the byte array with udf-8 (which is the default encoding in drill). However, in your CharSequenceWrapper, you will treat each byte as a character. This will cause incorrect result for case when a character is represented by >1 bytes. For instance, look at the following example, the first query of regexp_matches will produce wrong result. {code:sql} select regexp_matches('München', 'München') res3 from (values(1)); +--------+ | res3 | +--------+ | false | +--------+ 1 row selected (0.148 seconds) 0: jdbc:drill:zk=local> select regexp_matches('abc', 'abc') from (values(1)); +---------+ | EXPR$0 | +---------+ | true | +---------+ 1 row selected (0.189 seconds) 0: jdbc:drill:zk=local> select 'München' = 'München' res1 from (values(1)); +-------+ | res1 | +-------+ | true | +-------+ 1 row selected (0.186 seconds) {code:sql} Here is the result for 1st query, without your patch {code:sql} select regexp_matches('München', 'München') res3 from (values(1)); +-------+ | res3 | +-------+ | true | +-------+ {code:sql} I think you should modify CharSequenceWrapper, so that the encoding method is honored.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---