[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20796 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175679506 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -187,8 +218,9 @@ public void writeTo(OutputStream out) throws IOException { * @param b The first byte of a code point */ private static int numBytesForFirstByte(final byte b) { -final int offset = (b & 0xFF) - 192; -return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1; +final int offset = b & 0xFF; +byte numBytes = bytesOfCodePointInUTF8[offset]; +return (numBytes == 0) ? 1: numBytes; // Skip the first byte disallowed in UTF-8 --- End diff -- I think so. We jump over (skip, by definition) such bytes and count each of them as one entity. If we didn't count those bytes, we would break `substring`, `toUpperCase`, `toLowerCase`, `trimRight`/`trimLeft`, etc. The reason for the changes is to avoid crashing on bad input: previously we threw an IndexOutOfBoundsException on some invalid chars but let other invalid chars pass (counted as 1). This PR covers the whole byte range. I believe ignoring/removing of invalid chars should be addressed in the changes for https://issues.apache.org/jira/browse/SPARK-23741
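The lookup the patch describes can be sketched as follows. This is a minimal reconstruction, not the exact table from UTF8String.java; the range boundaries are taken from the comment table discussed in this thread (disallowed first bytes map to 0 and are then counted as a one-byte entity):

```java
public class Utf8FirstByte {
    // 256-entry table: value = total bytes of the code point whose encoding
    // starts with this byte, or 0 for bytes that cannot start a well-formed
    // UTF-8 sequence (continuation bytes 0x80..0xBF, overlong starters
    // 0xC0..0xC1, and 0xF5..0xFF).
    private static final byte[] BYTES_OF_CODE_POINT = new byte[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) BYTES_OF_CODE_POINT[b] = 1;
        for (int b = 0xC2; b <= 0xDF; b++) BYTES_OF_CODE_POINT[b] = 2;
        for (int b = 0xE0; b <= 0xEF; b++) BYTES_OF_CODE_POINT[b] = 3;
        for (int b = 0xF0; b <= 0xF4; b++) BYTES_OF_CODE_POINT[b] = 4;
    }

    /** Invalid first bytes are "skipped": counted as a single one-byte entity. */
    public static int numBytesForFirstByte(final byte b) {
        final byte numBytes = BYTES_OF_CODE_POINT[b & 0xFF];
        return (numBytes == 0) ? 1 : numBytes;
    }
}
```

With this shape, every value of the first byte maps to some positive length, so no input can index outside the table.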
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175676498 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding --- End diff -- I added a comment about the first bytes disallowed by UTF-8. The comment describes where the byte ranges and restrictions come from; otherwise the comments would just duplicate the implementation.
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175626064 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -187,8 +218,9 @@ public void writeTo(OutputStream out) throws IOException { * @param b The first byte of a code point */ private static int numBytesForFirstByte(final byte b) { -final int offset = (b & 0xFF) - 192; -return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1; +final int offset = b & 0xFF; +byte numBytes = bytesOfCodePointInUTF8[offset]; +return (numBytes == 0) ? 1: numBytes; // Skip the first byte disallowed in UTF-8 --- End diff -- Is the comment valid? Do we skip it? Don't we still count the disallowed byte as one code point in `numChars`?
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175624764 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding --- End diff -- Yes, `0xC2..0xDF` should be the first-byte range for a 2-byte character encoding. But here you said `0xC0..0xDF`?
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175590297 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding --- End diff -- Here is the table of allowed first bytes: https://user-images.githubusercontent.com/1580697/37623134-eca5756a-2bc3-11e8-99a0-3b538697d15c.png
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175589638 --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java --- @@ -791,4 +795,21 @@ public void trimRightWithTrimString() { assertEquals(fromString("头"), fromString("头a???/").trimRight(fromString("数?/*&^%a"))); assertEquals(fromString("头"), fromString("头数b数数 [").trimRight(fromString(" []数b"))); } + + @Test + public void skipWrongFirstByte() { +int[] wrongFirstBytes = { --- End diff -- The bytes are not filtered by UTF8String methods. For instance, in the case of the csv datasource the invalid bytes are just passed through to the final result. See https://issues.apache.org/jira/browse/SPARK-23649 I have created a separate ticket to fix the issue: https://issues.apache.org/jira/browse/SPARK-23741 . I am not sure that the issue of outputting wrong UTF-8 chars should be addressed by this PR (this PR just fixes crashes on wrong input) because it could impact users and other Spark components. We need to discuss it and test it carefully.
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175577812 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding + * 1110xxxx 0xE0..0xEF First byte of a 3-byte character encoding + * 11110xxx 0xF0..0xF4 First byte of a 4-byte character encoding --- End diff -- I will add an additional comment about which bytes are not allowed according to the table: https://user-images.githubusercontent.com/1580697/37621033-21ffbae6-2bbe-11e8-9a8f-ec05263ef7f7.png
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175573949 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding --- End diff -- Actually this table is from the Unicode standard (10.0, Table 3-6, page 126): https://user-images.githubusercontent.com/1580697/37620318-2c173f7e-2bbc-11e8-8fa5-5f04d11925de.png 0xC0 and 0xC1 are first bytes of 2-byte chars that are disallowed by UTF-8 (for now)
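Why 0xC0 and 0xC1 can never start a well-formed sequence: any 2-byte sequence beginning with them decodes to a code point below 0x80, i.e. an overlong encoding of a value that must be written as a single byte. A small sketch (hypothetical helper, decoding the bit pattern with no validity checks, just to show which code point it would produce):

```java
public class Overlong {
    // Decode a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) without validity
    // checks: 5 payload bits from the first byte, 6 from the second.
    public static int decode2(int b1, int b2) {
        return ((b1 & 0x1F) << 6) | (b2 & 0x3F);
    }
}
```

`0xC0 0x80` would be an overlong encoding of U+0000, and even `0xC1 0xBF` only reaches U+007F; the smallest legitimate 2-byte sequence therefore starts at 0xC2.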
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175543480 --- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java --- @@ -791,4 +795,21 @@ public void trimRightWithTrimString() { assertEquals(fromString("头"), fromString("头a???/").trimRight(fromString("数?/*&^%a"))); assertEquals(fromString("头"), fromString("头数b数数 [").trimRight(fromString(" []数b"))); } + + @Test + public void skipWrongFirstByte() { +int[] wrongFirstBytes = { --- End diff -- what will happen if we print UTF8String with invalid bytes?
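One data point for the question above: when the bytes go through the JDK's decoder (e.g. `new String(bytes, UTF_8)`), malformed input is substituted with U+FFFD rather than throwing. Note this shows JDK `String` behavior only, not what UTF8String itself does; as noted elsewhere in the thread, Spark may pass the raw bytes through to the output.

```java
import java.nio.charset.StandardCharsets;

public class InvalidBytesDemo {
    // The String(byte[], Charset) constructor substitutes U+FFFD (the
    // replacement character) for malformed byte sequences instead of throwing.
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```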
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175542911 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding + * 1110xxxx 0xE0..0xEF First byte of a 3-byte character encoding + * 11110xxx 0xF0..0xF4 First byte of a 4-byte character encoding --- End diff -- and also `0xF5..0xFF`
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175542711 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding --- End diff -- yea, seems we need to list `0xC0, 0xC1` here.
[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20796#discussion_r175281945 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -57,12 +57,39 @@ public Object getBaseObject() { return base; } public long getBaseOffset() { return offset; } - private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, -4, 4, 4, 4, 4, 4, 4, 4, -5, 5, 5, 5, -6, 6}; + /** + * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which + * indicates the size of the char. See Unicode standard in page 126: + * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf + * + * Binary Hex Comments + * 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding + * 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes) + * 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding --- End diff -- hmm, is this `0xC2..0xDF`?
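The first-byte ranges debated in this thread can be captured in a small classifier. A sketch with a hypothetical helper; the ranges follow the comment table plus the `0xC0/0xC1` and `0xF5..0xFF` exclusions raised elsewhere in the thread:

```java
public class FirstByteClassifier {
    /** Classify a UTF-8 first byte (ranges per Unicode 10.0, well-formed sequences). */
    public static String classify(int b) {
        if (b >= 0x00 && b <= 0x7F) return "1-byte";
        if (b <= 0xBF) return "continuation";
        if (b <= 0xC1) return "disallowed (overlong)";
        if (b <= 0xDF) return "2-byte";
        if (b <= 0xEF) return "3-byte";
        if (b <= 0xF4) return "4-byte";
        return "disallowed";  // 0xF5..0xFF can never appear in UTF-8
    }
}
```

So the valid 2-byte starters are exactly `0xC2..0xDF`, which is viirya's point.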