[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

paul-rogers Wed, 18 Oct 2017 17:08:53 -0700

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1001
  
    @sachouche, thanks for the first PR to Drill! Thanks for the detailed 
explanation!
    
    Before reviewing the code, a comment on the design:
    
    > Added a new integer variable "asciiMode" ... this value will be set ... 
during the first LIKE evaluation and will be reused across other LIKE 
evaluations
    
    The problem with this design is that there is no guarantee that the first 
value is representative of the other values in the column. Maybe my list looks 
like this:
    
    ```
    Hello
    你好
    ```
    
    The first value is ASCII. The second is not. So, we must treat each value 
as independent of the others.
    
    On the other hand, we *can* exploit the nature of UTF-8. The encoding is 
such that no valid UTF-8 character is a prefix of any other valid character. 
Thus, if a character is 0xXX 0xYY 0xZZ, then there can *never* be a valid 
character which is 0xXX 0xYY. Lead bytes and continuation bytes also use 
disjoint ranges, so a byte-level match of a valid pattern always lines up with 
character boundaries. As a result, starts-with, ends-with, equals and contains 
can be done without either converting to UTF-16 or even caring if the data is 
ASCII or not.
    
    What does this mean? It means that, for the simple operations:
    
    1. Convert the Java UTF-16 pattern string to UTF-8.
    2. Do the classic byte comparison methods for starts with, ends with or 
contains. No special processing is needed for multi-byte characters.
    
    Unlike other multi-byte encodings, UTF-8 was designed to make this possible.
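    
    A minimal sketch of the two steps above, to make the idea concrete. The 
class `Utf8Match` and its methods are hypothetical names for illustration, not 
Drill's actual code, and the sketch works on plain `byte[]` arrays rather than 
Drill's buffers:
    
    ```java
    import java.nio.charset.StandardCharsets;

    // Hypothetical helper for illustration only; not part of Drill.
    final class Utf8Match {
      private Utf8Match() {}

      // Step 1: encode the Java (UTF-16) pattern as UTF-8 once, up front.
      static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
      }

      // True if all pattern bytes appear in value starting at offset.
      static boolean matchesAt(byte[] value, int offset, byte[] pattern) {
        if (offset < 0 || offset + pattern.length > value.length) {
          return false;
        }
        for (int i = 0; i < pattern.length; i++) {
          if (value[offset + i] != pattern[i]) {
            return false;
          }
        }
        return true;
      }

      // Step 2: plain byte comparisons, with no multi-byte special cases.
      static boolean startsWith(byte[] value, byte[] pattern) {
        return matchesAt(value, 0, pattern);
      }

      static boolean endsWith(byte[] value, byte[] pattern) {
        return matchesAt(value, value.length - pattern.length, pattern);
      }

      // Naive scan; a faster search algorithm could be swapped in.
      static boolean contains(byte[] value, byte[] pattern) {
        for (int i = 0; i + pattern.length <= value.length; i++) {
          if (matchesAt(value, i, pattern)) {
            return true;
          }
        }
        return false;
      }

      public static void main(String[] args) {
        // "Hello " followed by the two-character non-ASCII value from above.
        byte[] value = utf8("Hello \u4f60\u597d");
        System.out.println(startsWith(value, utf8("Hello")));
        System.out.println(endsWith(value, utf8("\u597d")));
        System.out.println(contains(value, utf8("\u4f60")));
      }
    }
    ```
    
    The same code path handles the ASCII and the non-ASCII value, which is why 
no per-value or per-column mode is needed. This assumes both the value and the 
pattern are valid UTF-8.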
    
    If we go this route, we would not need the ASCII mode flag.
    
    Note: all of this applies only to the "basic four" operations: if we do a 
real regex, then we must decode the Varchar into a Java UTF-16 string.
