Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20796#discussion_r175577812
  
    --- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -57,12 +57,39 @@
       public Object getBaseObject() { return base; }
       public long getBaseOffset() { return offset; }
     
    -  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
    -    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    -    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    -    4, 4, 4, 4, 4, 4, 4, 4,
    -    5, 5, 5, 5,
    -    6, 6};
    +  /**
    +   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
    +   * indicates the size of the char. See Unicode standard in page 126:
    +   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
    +   *
    +   * Binary    Hex          Comments
    +   * 0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
    +   * 10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
    +   * 110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
    +   * 1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
    +   * 11110xxx  0xF0..0xF4   First byte of a 4-byte character encoding
    --- End diff --
    
    I will add additional comment about which bytes are not allowed according 
to the table:
    <img width="540" alt="screen shot 2018-03-19 at 9 39 17 pm" 
src="https://user-images.githubusercontent.com/1580697/37621033-21ffbae6-2bbe-11e8-9a8f-ec05263ef7f7.png";>



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to