[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

xuqianjin (JIRA) Mon, 26 Nov 2018 17:52:36 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699803#comment-16699803
 ]


xuqianjin commented on SPARK-23410:
-----------------------------------

Hi [~maxgekk] [~hyukjin.kwon], I think there are two things to consider:
1. Even if lineSep is set, it is still necessary to detect the file's charset 
from its BOM. The charset of lineSep may be inconsistent with the encoding of 
the file, resulting in parsing errors.
2. For example, a comma is encoded as different bytes in UTF-8, UTF-16LE, 
UTF-16BE, UTF-32LE and UTF-32BE. lineSep should be supported in all of these 
encodings as well.
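
To illustrate the two points, here is a minimal sketch in plain Python. It is 
not Spark code, and the helper names separator_bytes and detect_bom are 
hypothetical, chosen only for this example. It shows that the same separator 
becomes different bytes in each encoding, and how a BOM could be checked 
before deciding how to interpret the separator.

```python
import codecs

# Encodings mentioned in the comment above.
ENCODINGS = ["utf-8", "utf-16le", "utf-16be", "utf-32le", "utf-32be"]

# BOMs, with the 4-byte UTF-32 marks first: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE), so order matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def separator_bytes(sep):
    """Return the byte sequence of `sep` in each encoding (hypothetical helper)."""
    return {enc: sep.encode(enc) for enc in ENCODINGS}


def detect_bom(data):
    """Return the encoding named by a leading BOM, or None if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    for enc, raw in separator_bytes(",").items():
        print(enc, raw.hex())
    sample = codecs.BOM_UTF16_LE + '{"a": 1}'.encode("utf-16le")
    print(detect_bom(sample))
```

A separator given as UTF-8 bytes would therefore never match inside a UTF-16 
or UTF-32 file, which is why the file charset has to be known first.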

> Unable to read jsons in charset different from UTF-8
> ----------------------------------------------------
>
>                 Key: SPARK-23410
>                 URL: https://issues.apache.org/jira/browse/SPARK-23410
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This 
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, 
> which can read JSON files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the Jackson library. Users need to be given back 
> the ability to read JSON files in a specified charset and/or to detect the 
> charset automatically, as before.
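
For reference, charset auto-detection for JSON without a BOM can be sketched 
as below. This is an illustrative Python version of the null-byte pattern 
heuristic described in RFC 4627, section 3. It is not Jackson's or Spark's 
actual code, and the function name guess_json_encoding is hypothetical. It 
assumes the text starts with two ASCII characters and carries no BOM.

```python
import json


def guess_json_encoding(head):
    """Guess the encoding of BOM-less JSON from the zero bytes in its first 4 octets."""
    if len(head) < 4:
        return "utf-8"
    zeros = tuple(b == 0 for b in head[:4])
    patterns = {
        (True, True, True, False): "utf-32be",   # 00 00 00 xx
        (False, True, True, True): "utf-32le",   # xx 00 00 00
        (True, False, True, False): "utf-16be",  # 00 xx 00 xx
        (False, True, False, True): "utf-16le",  # xx 00 xx 00
    }
    # Anything else is treated as UTF-8.
    return patterns.get(zeros, "utf-8")


def load_json(data):
    """Decode `data` with the guessed encoding, then parse it as JSON."""
    return json.loads(data.decode(guess_json_encoding(data)))


if __name__ == "__main__":
    for enc in ["utf-8", "utf-16le", "utf-16be", "utf-32le", "utf-32be"]:
        raw = '{"a": 1}'.encode(enc)
        print(enc, guess_json_encoding(raw), load_json(raw))
```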



