Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20796#discussion_r175626064
  
    --- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -187,8 +218,9 @@ public void writeTo(OutputStream out) throws 
IOException {
        * @param b The first byte of a code point
        */
       private static int numBytesForFirstByte(final byte b) {
    -    final int offset = (b & 0xFF) - 192;
    -    return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1;
    +    final int offset = b & 0xFF;
    +    byte numBytes = bytesOfCodePointInUTF8[offset];
    +    return (numBytes == 0) ? 1: numBytes; // Skip the first byte 
disallowed in UTF-8
    --- End diff --
    
    Is the comment valid? Do we skip it? Don't we still count the disallowed 
byte as one code point in `numChars`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to