[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20796


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-20 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175679506
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -187,8 +218,9 @@ public void writeTo(OutputStream out) throws 
IOException {
* @param b The first byte of a code point
*/
   private static int numBytesForFirstByte(final byte b) {
-final int offset = (b & 0xFF) - 192;
-return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1;
+final int offset = b & 0xFF;
+byte numBytes = bytesOfCodePointInUTF8[offset];
+return (numBytes == 0) ? 1: numBytes; // Skip the first byte 
disallowed in UTF-8
--- End diff --

I think so. We jump over (skip by definition) such bytes and count it as 
one entity. If we don't count the bytes, we break `substring`, `toUpperCase`, 
`toLowerCase`, `trimRight/trimLeft` and etc. The reason of the changes is to 
not crash on bad input as previously we threw IndexOutOfBoundsexception on some 
wrong chars but could pass (count as 1) another wrong chars. This PR allows to 
cover whole range. I believe ignoring/removing of wrong chars should be 
addressed in changes for  https://issues.apache.org/jira/browse/SPARK-23741 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-20 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175676498
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-4, 4, 4, 4, 4, 4, 4, 4,
-5, 5, 5, 5,
-6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
+   * indicates the size of the char. See Unicode standard in page 126:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * BinaryHex  Comments
+   * 0xxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110x  0xC0..0xDF   First byte of a 2-byte character encoding
--- End diff --

I added a comment about the first bytes disallowed by UTF-8. The comment 
describes from where the byte ranges and restrictions come otherwise the 
comments just duplicate the implementation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175626064
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -187,8 +218,9 @@ public void writeTo(OutputStream out) throws 
IOException {
* @param b The first byte of a code point
*/
   private static int numBytesForFirstByte(final byte b) {
-final int offset = (b & 0xFF) - 192;
-return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1;
+final int offset = b & 0xFF;
+byte numBytes = bytesOfCodePointInUTF8[offset];
+return (numBytes == 0) ? 1: numBytes; // Skip the first byte 
disallowed in UTF-8
--- End diff --

Is the comment valid? Do we skip it? Don't we still count the disallowed 
byte as one code point in `numChars`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175624764
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-4, 4, 4, 4, 4, 4, 4, 4,
-5, 5, 5, 5,
-6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
+   * indicates the size of the char. See Unicode standard in page 126:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * BinaryHex  Comments
+   * 0xxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110x  0xC0..0xDF   First byte of a 2-byte character encoding
--- End diff --

Yes, `0xC2..0xDF` should be the first byte for 2-byte character encoding. 
But here you said `0xC0..0xDF`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175590297
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-4, 4, 4, 4, 4, 4, 4, 4,
-5, 5, 5, 5,
-6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
+   * indicates the size of the char. See Unicode standard in page 126:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * BinaryHex  Comments
+   * 0xxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110x  0xC0..0xDF   First byte of a 2-byte character encoding
--- End diff --

Here is the table of allowed first bytes:
https://user-images.githubusercontent.com/1580697/37623134-eca5756a-2bc3-11e8-99a0-3b538697d15c.png";>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175589638
  
--- Diff: 
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java 
---
@@ -791,4 +795,21 @@ public void trimRightWithTrimString() {
 assertEquals(fromString("头"), 
fromString("头a???/").trimRight(fromString("数?/*&^%a")));
 assertEquals(fromString("头"), fromString("头数b数数 
[").trimRight(fromString(" []数b")));
   }
+
+  @Test
+  public void skipWrongFirstByte() {
+int[] wrongFirstBytes = {
--- End diff --

The bytes are not filtered by UTF8String methods. For instance, in the case 
of csv datasource  the invalid bytes are just passed to the final result. See 
https://issues.apache.org/jira/browse/SPARK-23649

I have created a separate ticket to fix the issue: 
https://issues.apache.org/jira/browse/SPARK-23741 . 

I am not sure that the issue of output of wrong UTF-8 chars should be 
addressed by this PR (this pr just fixes crashes on wrong input) because it 
could impact on users and other Spark components. Need to discuss it and test 
it carefully.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175577812
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-4, 4, 4, 4, 4, 4, 4, 4,
-5, 5, 5, 5,
-6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
+   * indicates the size of the char. See Unicode standard in page 126:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * BinaryHex  Comments
+   * 0xxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110x  0xC0..0xDF   First byte of a 2-byte character encoding
+   * 1110  0xE0..0xEF   First byte of a 3-byte character encoding
+   * 0xxx  0xF0..0xF4   First byte of a 4-byte character encoding
--- End diff --

I will add additional comment about which bytes are not allowed according 
to the table:
https://user-images.githubusercontent.com/1580697/37621033-21ffbae6-2bbe-11e8-9a8f-ec05263ef7f7.png";>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175573949
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-4, 4, 4, 4, 4, 4, 4, 4,
-5, 5, 5, 5,
-6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
+   * indicates the size of the char. See Unicode standard in page 126:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * BinaryHex  Comments
+   * 0xxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110x  0xC0..0xDF   First byte of a 2-byte character encoding
--- End diff --

Actually this table is from the unicode standard (10.0, Table 3-6, page 
126):
https://user-images.githubusercontent.com/1580697/37620318-2c173f7e-2bbc-11e8-8fa5-5f04d11925de.png";>

0xC0, 0xC1 are first bytes of 2 bytes chars disallowed by UTF-8 (for now)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175543480
  
--- Diff: 
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java 
---
@@ -791,4 +795,21 @@ public void trimRightWithTrimString() {
 assertEquals(fromString("头"), 
fromString("头a???/").trimRight(fromString("数?/*&^%a")));
 assertEquals(fromString("头"), fromString("头数b数数 
[").trimRight(fromString(" []数b")));
   }
+
+  @Test
+  public void skipWrongFirstByte() {
+int[] wrongFirstBytes = {
--- End diff --

what will happen if we print UTF8String with invalid bytes?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175542911
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-4, 4, 4, 4, 4, 4, 4, 4,
-5, 5, 5, 5,
-6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
+   * indicates the size of the char. See Unicode standard in page 126:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * BinaryHex  Comments
+   * 0xxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110x  0xC0..0xDF   First byte of a 2-byte character encoding
+   * 1110  0xE0..0xEF   First byte of a 3-byte character encoding
+   * 0xxx  0xF0..0xF4   First byte of a 4-byte character encoding
--- End diff --

and also `0xF5..0xFF`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-19 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175542711
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-4, 4, 4, 4, 4, 4, 4, 4,
-5, 5, 5, 5,
-6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
+   * indicates the size of the char. See Unicode standard in page 126:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * BinaryHex  Comments
+   * 0xxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110x  0xC0..0xDF   First byte of a 2-byte character encoding
--- End diff --

yea, seems we should need to list `0xC0, 0xC1` here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20796: [SPARK-23649][SQL] Skipping chars disallowed in U...

2018-03-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20796#discussion_r175281945
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -57,12 +57,39 @@
   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
 
-  private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2,
-2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
-3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
-4, 4, 4, 4, 4, 4, 4, 4,
-5, 5, 5, 5,
-6, 6};
+  /**
+   * A char in UTF-8 encoding can take 1-4 bytes depending on the first 
byte which
+   * indicates the size of the char. See Unicode standard in page 126:
+   * http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
+   *
+   * BinaryHex  Comments
+   * 0xxx  0x00..0x7F   Only byte of a 1-byte character encoding
+   * 10xx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
+   * 110x  0xC0..0xDF   First byte of a 2-byte character encoding
--- End diff --

hmm, is this `0xC2..0xDF`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org