spark git commit: [SPARK-23649][SQL] Skipping chars disallowed in UTF-8

2018-03-20 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 175b221bc -> 367a16118


[SPARK-23649][SQL] Skipping chars disallowed in UTF-8

The mapping from a UTF-8 char's first byte to the char's size does not cover the 
whole range 0-255. It is defined only for 0-253:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190

If the first byte of a char is 254-255, an IndexOutOfBoundsException is thrown. 
Besides that, the values for 244-252 are not correct according to the current 
Unicode standard for UTF-8: 
http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

As a consequence of the exception above, the length of an input string in UTF-8 
encoding cannot be calculated if the string contains chars whose first byte is 
one of those codes. On the user's side this shows up as, for example, a crash 
during schema inference for a CSV file that contains such chars, even though the 
file can be read if the schema is specified explicitly or the mode is set to 
multiline.
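The failure mode can be reproduced outside Spark with a minimal sketch of the 
pre-fix lookup (the class name here is illustrative; only the table contents and 
the offset arithmetic mirror the old code):

```java
public class OldUtf8ByteLookup {
    // The old 62-entry table covers first bytes 192..253 only:
    // 32 two-byte entries, 16 three-byte, 8 four-byte, 4 five-byte, 2 six-byte.
    private static final int[] BYTES_OF_CODE_POINT_IN_UTF8 = {
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        4, 4, 4, 4, 4, 4, 4, 4,
        5, 5, 5, 5,
        6, 6};

    // Mirrors the old accessor: bytes below 192 count as one octet,
    // everything else is looked up at offset (b & 0xFF) - 192.
    static int numBytesForFirstByte(final byte b) {
        final int offset = (b & 0xFF) - 192;
        return (offset >= 0) ? BYTES_OF_CODE_POINT_IN_UTF8[offset] : 1;
    }

    public static void main(String[] args) {
        System.out.println(numBytesForFirstByte((byte) 'a')); // ASCII: 1 octet
        // First bytes 254 and 255 map to offsets 62 and 63, past the end
        // of the 62-entry table, so the lookup itself blows up:
        try {
            numBytesForFirstByte((byte) 0xFF);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException for first byte 0xFF");
        }
    }
}
```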

The proposed changes build a correct mapping from the first byte of a UTF-8 char 
to its size (it now covers all cases) and skip disallowed chars (counting each 
as one octet).
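A minimal sketch of the fixed behavior (illustrative class name; the table is 
built in a static initializer here rather than written out, but the values match 
the patch: disallowed first bytes get 0 in the table and are then counted as one 
octet):

```java
public class FixedUtf8ByteLookup {
    // 256-entry table: one entry per possible first-byte value.
    // Entries left at 0 mark continuation bytes (0x80..0xBF) and the
    // byte values disallowed in UTF-8 (0xC0..0xC1, 0xF5..0xFF).
    private static final byte[] BYTES_OF_CODE_POINT_IN_UTF8 = new byte[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) BYTES_OF_CODE_POINT_IN_UTF8[b] = 1;
        for (int b = 0xC2; b <= 0xDF; b++) BYTES_OF_CODE_POINT_IN_UTF8[b] = 2;
        for (int b = 0xE0; b <= 0xEF; b++) BYTES_OF_CODE_POINT_IN_UTF8[b] = 3;
        for (int b = 0xF0; b <= 0xF4; b++) BYTES_OF_CODE_POINT_IN_UTF8[b] = 4;
    }

    // The table now covers all 256 first-byte values, so no index can be
    // out of bounds; a disallowed first byte (table value 0) is skipped
    // by counting it as a single octet.
    static int numBytesForFirstByte(final byte b) {
        final byte numBytes = BYTES_OF_CODE_POINT_IN_UTF8[b & 0xFF];
        return (numBytes == 0) ? 1 : numBytes;
    }

    public static void main(String[] args) {
        System.out.println(numBytesForFirstByte((byte) 'a'));  // 1
        System.out.println(numBytesForFirstByte((byte) 0xE2)); // 3
        System.out.println(numBytesForFirstByte((byte) 0xFF)); // 1, no exception
    }
}
```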

Added a test and a file containing a char that is disallowed in UTF-8: 0xFF.

Author: Maxim Gekk 

Closes #20796 from MaxGekk/skip-wrong-utf8-chars.

(cherry picked from commit 5e7bc2acef4a1e11d0d8056ef5c12cd5c8f220da)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/367a1611
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/367a1611
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/367a1611

Branch: refs/heads/branch-2.2
Commit: 367a16118289e1c03507c14f966e8b1ebd688489
Parents: 175b221
Author: Maxim Gekk 
Authored: Tue Mar 20 10:34:56 2018 -0700
Committer: Wenchen Fan 
Committed: Tue Mar 20 10:37:29 2018 -0700

--
 .../apache/spark/unsafe/types/UTF8String.java   | 48 
 .../spark/unsafe/types/UTF8StringSuite.java | 23 +-
 2 files changed, 62 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/367a1611/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
--
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
index 23636ca..a11e63c 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
@@ -57,12 +57,43 @@ public final class UTF8String implements Comparable<UTF8String>, Externalizable,
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-    4, 4, 4, 4, 4, 4, 4, 4,
-    5, 5, 5, 5,
-    6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which
+   * indicates the size of the char. See Unicode standard in page 126, Table 3-6:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * Binary    Hex          Comments
+   * 0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
+   * 1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
+   * 11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding
+   *
+   * As a consequence of the well-formedness conditions specified in
+   * Table 3-7 (page 126), the following byte values are disallowed in UTF-8:
+   *   C0–C1, F5–FF.
+   */
+  private static byte[] bytesOfCodePointInUTF8 = {
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x00..0x0F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x10..0x1F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x20..0x2F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x30..0x3F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x40..0x4F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x50..0x5F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x60..0x6F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x70..0x7F
+// Continuation bytes cannot appear as the first byte
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x80..0x8F
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x90..0x9F
+0

spark git commit: [SPARK-23649][SQL] Skipping chars disallowed in UTF-8

2018-03-20 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.3 c854b6ca7 -> 0b880db65


[SPARK-23649][SQL] Skipping chars disallowed in UTF-8

## What changes were proposed in this pull request?

The mapping from a UTF-8 char's first byte to the char's size does not cover the 
whole range 0-255. It is defined only for 0-253:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190

If the first byte of a char is 254-255, an IndexOutOfBoundsException is thrown. 
Besides that, the values for 244-252 are not correct according to the current 
Unicode standard for UTF-8: 
http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

As a consequence of the exception above, the length of an input string in UTF-8 
encoding cannot be calculated if the string contains chars whose first byte is 
one of those codes. On the user's side this shows up as, for example, a crash 
during schema inference for a CSV file that contains such chars, even though the 
file can be read if the schema is specified explicitly or the mode is set to 
multiline.

The proposed changes build a correct mapping from the first byte of a UTF-8 char 
to its size (it now covers all cases) and skip disallowed chars (counting each 
as one octet).

## How was this patch tested?

Added a test and a file containing a char that is disallowed in UTF-8: 0xFF.

Author: Maxim Gekk 

Closes #20796 from MaxGekk/skip-wrong-utf8-chars.

(cherry picked from commit 5e7bc2acef4a1e11d0d8056ef5c12cd5c8f220da)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b880db6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b880db6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b880db6

Branch: refs/heads/branch-2.3
Commit: 0b880db65b647e549b78721859b1712dff733ec9
Parents: c854b6c
Author: Maxim Gekk 
Authored: Tue Mar 20 10:34:56 2018 -0700
Committer: Wenchen Fan 
Committed: Tue Mar 20 10:35:14 2018 -0700

--
 .../apache/spark/unsafe/types/UTF8String.java   | 48 
 .../spark/unsafe/types/UTF8StringSuite.java | 23 +-
 2 files changed, 62 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0b880db6/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
--
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
index b0d0c44..5d468ae 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
@@ -57,12 +57,43 @@ public final class UTF8String implements Comparable<UTF8String>, Externalizable,
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-    4, 4, 4, 4, 4, 4, 4, 4,
-    5, 5, 5, 5,
-    6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which
+   * indicates the size of the char. See Unicode standard in page 126, Table 3-6:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * Binary    Hex          Comments
+   * 0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
+   * 1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
+   * 11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding
+   *
+   * As a consequence of the well-formedness conditions specified in
+   * Table 3-7 (page 126), the following byte values are disallowed in UTF-8:
+   *   C0–C1, F5–FF.
+   */
+  private static byte[] bytesOfCodePointInUTF8 = {
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x00..0x0F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x10..0x1F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x20..0x2F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x30..0x3F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x40..0x4F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x50..0x5F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x60..0x6F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x70..0x7F
+// Continuation bytes cannot appear as the first byte
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 

spark git commit: [SPARK-23649][SQL] Skipping chars disallowed in UTF-8

2018-03-20 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 566321852 -> 5e7bc2ace


[SPARK-23649][SQL] Skipping chars disallowed in UTF-8

## What changes were proposed in this pull request?

The mapping from a UTF-8 char's first byte to the char's size does not cover the 
whole range 0-255. It is defined only for 0-253:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190

If the first byte of a char is 254-255, an IndexOutOfBoundsException is thrown. 
Besides that, the values for 244-252 are not correct according to the current 
Unicode standard for UTF-8: 
http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

As a consequence of the exception above, the length of an input string in UTF-8 
encoding cannot be calculated if the string contains chars whose first byte is 
one of those codes. On the user's side this shows up as, for example, a crash 
during schema inference for a CSV file that contains such chars, even though the 
file can be read if the schema is specified explicitly or the mode is set to 
multiline.

The proposed changes build a correct mapping from the first byte of a UTF-8 char 
to its size (it now covers all cases) and skip disallowed chars (counting each 
as one octet).

## How was this patch tested?

Added a test and a file containing a char that is disallowed in UTF-8: 0xFF.

Author: Maxim Gekk 

Closes #20796 from MaxGekk/skip-wrong-utf8-chars.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5e7bc2ac
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5e7bc2ac
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5e7bc2ac

Branch: refs/heads/master
Commit: 5e7bc2acef4a1e11d0d8056ef5c12cd5c8f220da
Parents: 5663218
Author: Maxim Gekk 
Authored: Tue Mar 20 10:34:56 2018 -0700
Committer: Wenchen Fan 
Committed: Tue Mar 20 10:34:56 2018 -0700

--
 .../apache/spark/unsafe/types/UTF8String.java   | 48 
 .../spark/unsafe/types/UTF8StringSuite.java | 23 +-
 2 files changed, 62 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5e7bc2ac/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
--
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java 
b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
index b0d0c44..5d468ae 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
@@ -57,12 +57,43 @@ public final class UTF8String implements Comparable<UTF8String>, Externalizable,
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-    4, 4, 4, 4, 4, 4, 4, 4,
-    5, 5, 5, 5,
-    6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which
+   * indicates the size of the char. See Unicode standard in page 126, Table 3-6:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * Binary    Hex          Comments
+   * 0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
+   * 1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
+   * 11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding
+   *
+   * As a consequence of the well-formedness conditions specified in
+   * Table 3-7 (page 126), the following byte values are disallowed in UTF-8:
+   *   C0–C1, F5–FF.
+   */
+  private static byte[] bytesOfCodePointInUTF8 = {
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x00..0x0F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x10..0x1F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x20..0x2F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x30..0x3F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x40..0x4F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x50..0x5F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x60..0x6F
+1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x70..0x7F
+// Continuation bytes cannot appear as the first byte
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x80..0x8F
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x90..0x9F
+0, 0, 0, 0, 0, 0, 0, 0,