Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/20849
  
    When I was trying to remove the flexible format for lineSep 
(recordDelimiter), I ran into a problem. I cannot fix this test: 
https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2071-L2081
    
    No combination of charset and lineSep allows me to read the file. Here is 
the structure of the file:
    ```
    BOM json_record1 delimiter json_record2 delimiter
    ```
    The delimiter in hex: **x0d 00 0a 00** . Basically it is `\r\n` in 
UTF-16LE. If I set:
    ```
    .option("charset", "UTF-16LE").option("lineSep", "\r\n")
    ```
the first record is ignored because it contains a BOM, which `UTF-16LE` input 
must not contain. As a result, I get only the second record. If I set 
`UTF-16`, I get the first record (its BOM is valid for `UTF-16`), but the 
second record is rejected because, without a BOM, the `UTF-16` decoder 
assumes big-endian byte order.
    
    How does it work when `.option("recordDelimiter", "x0d 00 0a 
00")` is set and the charset is not specified? The answer is Jackson's charset 
auto-detection. Hadoop's LineRecordReader just splits the input by the 
delimiter, and we get:
    ```
    Seq("BOM json_record1", "json_record2")
    ```
    The charset of the first string is detected from its BOM, and the BOM is 
stripped from the result by Jackson. The charset of the second string is 
detected from its byte pattern as `UTF-16LE`. So we get the correct result.
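    To illustrate, here is a rough sketch (not Hadoop's actual 
LineRecordReader) of how splitting at the byte level keeps the BOM attached 
to the first record only, so each chunk can then be charset-detected 
independently:
    ```scala
    // Split a byte array on an explicit multi-byte delimiter, as per-line
    // mode does (simplified: scans the whole array in memory)
    def splitOnDelimiter(data: Array[Byte], delim: Array[Byte]): Seq[Array[Byte]] = {
      val out = scala.collection.mutable.ArrayBuffer[Array[Byte]]()
      var start = 0
      var i = 0
      while (i <= data.length - delim.length) {
        if (data.slice(i, i + delim.length).sameElements(delim)) {
          out += data.slice(start, i)
          i += delim.length
          start = i
        } else i += 1
      }
      if (start < data.length) out += data.slice(start, data.length)
      out.toSeq
    }

    val delim = Array(0x0D, 0x00, 0x0A, 0x00).map(_.toByte) // "\r\n" in UTF-16LE
    val bom   = Array(0xFF, 0xFE).map(_.toByte)             // UTF-16LE BOM
    val rec1  = """{"a":1}""".getBytes("UTF-16LE")
    val rec2  = """{"b":2}""".getBytes("UTF-16LE")

    // BOM json_record1 delimiter json_record2 delimiter
    val file  = bom ++ rec1 ++ delim ++ rec2 ++ delim

    val parts = splitOnDelimiter(file, delim)
    assert(parts.length == 2)
    assert(parts(0).sameElements(bom ++ rec1)) // BOM stays with the first record
    assert(parts(1).sameElements(rec2))        // no BOM: detected by byte pattern
    ```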
    
    So, if we don't support a lineSep format in which the delimiter's byte 
sequence can be given explicitly, we cannot read Unicode JSON files with a 
BOM in per-line mode.


