Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1001

@sachouche, thanks for the first PR to Drill! Thanks for the detailed explanation!

Before reviewing the code, a comment on the design:

> Added a new integer variable "asciiMode" ... this value will be set ... during the first LIKE evaluation and will be reused across other LIKE evaluations

The problem with this design is that there is no guarantee that the first value is representative of the other values. Maybe my list looks like this:

```
Hello
你好
```

The first value is ASCII. The second is not. So, we must treat each value as independent of the others.

On the other hand, we *can* exploit the nature of UTF-8. The encoding is such that no valid UTF-8 character is a prefix of any other valid character. Thus, if a character is 0xXX 0xYY 0xZZ, then there can *never* be a valid character which is 0xXX 0xYY. As a result, starts-with, ends-with, equals and contains can be done without either converting to UTF-16 or even caring if the data is ASCII or not.

What does this mean? It means that, for the simple operations:

1. Convert the Java UTF-16 pattern string to UTF-8.
2. Do the classic byte comparison methods for starts with, ends with or contains. No special processing is needed for multi-byte characters.

Unlike other multi-byte encodings, UTF-8 was designed to make this possible.

If we go this route, we would not need the ASCII mode flag.

Note: all of this applies only to the "basic four" operations: if we do a real regex, then we must decode the Varchar into a Java UTF-16 string.
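To make the two steps above concrete, here is a minimal sketch of what byte-wise matching could look like. It is only an illustration, not the code in the PR and not an existing Drill API: the class `Utf8Match` and all of its method names are hypothetical, and it assumes the column value is available as a plain `byte[]` of valid UTF-8 (in Drill it would more likely be read from a buffer with start and end offsets).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Drill.
final class Utf8Match {

    // Step 1: encode the Java (UTF-16) pattern to UTF-8 once, up front.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True if `pattern` occurs in `value` starting at byte offset `at`.
    static boolean regionMatches(byte[] value, int at, byte[] pattern) {
        if (at < 0 || at + pattern.length > value.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (value[at + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Step 2: the "basic four" as plain byte comparisons.
    static boolean startsWith(byte[] value, byte[] pattern) {
        return regionMatches(value, 0, pattern);
    }

    static boolean endsWith(byte[] value, byte[] pattern) {
        return regionMatches(value, value.length - pattern.length, pattern);
    }

    static boolean equalsBytes(byte[] value, byte[] pattern) {
        return value.length == pattern.length && regionMatches(value, 0, pattern);
    }

    // Naive scan; a real implementation might use a faster search algorithm.
    static boolean contains(byte[] value, byte[] pattern) {
        for (int at = 0; at + pattern.length <= value.length; at++) {
            if (regionMatches(value, at, pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "\u4f60\u597d" is the two-character non-ASCII value from the list above.
        byte[] value = utf8("Hello \u4f60\u597d");
        System.out.println(startsWith(value, utf8("Hello")));      // true
        System.out.println(endsWith(value, utf8("\u597d")));       // true
        System.out.println(contains(value, utf8("o \u4f60")));     // true
        System.out.println(contains(value, utf8("\u597d\u4f60"))); // false
    }
}
```

The same code path handles the ASCII value and the multi-byte value, so no per-value or per-batch ASCII flag is involved. A pattern always begins with a lead byte (or an ASCII byte), which can never equal a continuation byte, so a match cannot start in the middle of a character as long as both sides are valid UTF-8.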
---