[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

cloud-fan Mon, 19 Mar 2018 11:36:31 -0700

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20796#discussion_r175542711
  
    --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -57,12 +57,39 @@
       public Object getBaseObject() { return base; }
       public long getBaseOffset() { return offset; }
     
    -  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    -    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    -    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    -    4, 4, 4, 4, 4, 4, 4, 4,
    -    5, 5, 5, 5,
    -    6, 6};
    +  /**
    +   * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which
    +   * indicates the size of the char. See Unicode standard in page 126:
    +   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
    +   *
    +   * Binary    Hex          Comments
    +   * 0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
    +   * 10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
    +   * 110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
    --- End diff --
    
    Yeah, it seems we need to list `0xC0, 0xC1` here.
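
To illustrate the point about `0xC0` and `0xC1`: these two values match the `110xxxxx` bit pattern but could only start overlong encodings of code points below U+0080, so well-formed UTF-8 never uses them as a first byte. The sketch below is a hypothetical helper, not the code from this pull request. The class name `Utf8FirstByte`, the method `numBytes`, and the choice of `0` as a marker for "cannot start a character" are all assumptions made for the example.

```java
// Hypothetical helper for illustration only; not the UTF8String implementation.
final class Utf8FirstByte {
    private Utf8FirstByte() {}

    /**
     * Returns the number of bytes announced by the first byte of a UTF-8
     * sequence, or 0 (an assumed sentinel) if the byte cannot start a
     * well-formed sequence.
     */
    static int numBytes(byte b) {
        int v = b & 0xFF;          // read the byte as an unsigned value
        if (v <= 0x7F) return 1;   // 0xxxxxxx: single-byte character
        if (v <= 0xBF) return 0;   // 10xxxxxx: continuation byte, not a first byte
        if (v <= 0xC1) return 0;   // 0xC0, 0xC1: would only encode overlong forms
        if (v <= 0xDF) return 2;   // 0xC2..0xDF: first byte of a 2-byte character
        if (v <= 0xEF) return 3;   // 0xE0..0xEF: first byte of a 3-byte character
        if (v <= 0xF4) return 4;   // 0xF0..0xF4: first byte of a 4-byte character
        return 0;                  // 0xF5..0xFF: never appear in well-formed UTF-8
    }
}
```

With this mapping, `Utf8FirstByte.numBytes((byte) 0xC2)` yields 2, while `0xC0` and `0xC1` fall into the disallowed group. How a caller should react to the sentinel (skip one byte, raise an error, and so on) is a separate design decision that this sketch does not settle.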



