Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/20849
  
    When I was trying to remove the flexible format for lineSep 
(recordDelimiter), I ran into a problem. I cannot fix this test: 
https://github.com/MaxGekk/spark-1/blob/54fd42b64e0715540010c4d59b8b4f7a4a1b0876/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2071-L2081
    
    No combination of charset and lineSep allows me to read the file. Here is 
the structure of the file:
    ```
    BOM json_record1 delimiter json_record2 delimiter
    ```
    The delimiter in hex: **x0d 00 0a 00** . Basically it is `\r\n` in 
UTF-16LE. If I set:
    ```
    .option("charset", "UTF-16LE").option("lineSep", "\r\n")
    ```
the first record is ignored because it contains a BOM, which `UTF-16LE` input 
must not contain. As a result, I get only the second record. If I set 
`UTF-16`, I get the first record (its BOM is valid for `UTF-16`), but the 
second record is rejected because, without a BOM, the `UTF-16` decoder 
assumes big-endian byte order.
    
    How does it work when `.option("recordDelimiter", "x0d 00 0a 
00")` is set and the charset is not specified? The answer is Jackson's charset 
auto-detection. Hadoop's LineRecordReader just splits the input by the 
delimiter, and we get:
    ```
    Seq("BOM json_record1", "json_record2")
    ```
    The charset of the first string is detected from its BOM, and the BOM is 
stripped from the result by Jackson. The charset of the second string is 
detected from its byte pattern as `UTF-16LE`. So we get the correct result.
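    To illustrate, here is a rough sketch (not Hadoop's actual 
LineRecordReader) of how splitting at the byte level keeps the BOM attached 
to the first record only, so each chunk can then be charset-detected 
independently:
    ```scala
    // Split a byte array on an explicit multi-byte delimiter, as per-line
    // mode does (simplified: scans the whole array in memory)
    def splitOnDelimiter(data: Array[Byte], delim: Array[Byte]): Seq[Array[Byte]] = {
      val out = scala.collection.mutable.ArrayBuffer[Array[Byte]]()
      var start = 0
      var i = 0
      while (i <= data.length - delim.length) {
        if (data.slice(i, i + delim.length).sameElements(delim)) {
          out += data.slice(start, i)
          i += delim.length
          start = i
        } else i += 1
      }
      if (start < data.length) out += data.slice(start, data.length)
      out.toSeq
    }

    val delim = Array(0x0D, 0x00, 0x0A, 0x00).map(_.toByte) // "\r\n" in UTF-16LE
    val bom   = Array(0xFF, 0xFE).map(_.toByte)             // UTF-16LE BOM
    val rec1  = """{"a":1}""".getBytes("UTF-16LE")
    val rec2  = """{"b":2}""".getBytes("UTF-16LE")

    // BOM json_record1 delimiter json_record2 delimiter
    val file  = bom ++ rec1 ++ delim ++ rec2 ++ delim

    val parts = splitOnDelimiter(file, delim)
    assert(parts.length == 2)
    assert(parts(0).sameElements(bom ++ rec1)) // BOM stays with the first record
    assert(parts(1).sameElements(rec2))        // no BOM: detected by byte pattern
    ```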
    
    So, if we don't support a lineSep format in which the delimiter's byte 
sequence can be given explicitly, we cannot read Unicode JSON files with a 
BOM in per-line mode.


