[ https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403448#comment-16403448 ]
Apache Spark commented on SPARK-23723:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20849

> New charset option for json datasource
> --------------------------------------
>
>                 Key: SPARK-23723
>                 URL: https://issues.apache.org/jira/browse/SPARK-23723
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maxim Gekk
>            Priority: Major
>
> Currently, the JSON reader can read JSON files in different
> charsets/encodings. It relies on the Jackson JSON library (jackson-core) to
> detect the charset of the input text/stream automatically. The method that
> detects the encoding is here:
> [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174]
>
> The detectEncoding method checks for a BOM
> ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of the
> text. A BOM may be present in the file, but it is not mandatory. If it is
> absent, the auto-detection mechanism can select the wrong charset, and as a
> consequence the user cannot read the JSON file. *The proposed option will
> allow users to bypass the auto-detection mechanism and set the charset
> explicitly.*
>
> The charset option is already exposed as a CSV option:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88].
> I propose adding the same option for JSON.
>
> As for the JSON writer, *the charset option will give users the ability* to
> read JSON files in a charset other than UTF-8, modify the dataset, and
> *write the results back to JSON files in the original encoding.* At the
> moment this is not possible because the result can be saved in UTF-8 only.
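
To illustrate the failure mode described in the issue, here is a minimal, self-contained Python sketch. It is not the Jackson or Spark implementation: `sniff_bom`, `read_json`, and `write_json` are hypothetical helpers written for this example, and the UTF-8 fallback when no BOM is found is a simplifying assumption. It only shows why a missing BOM can break automatic detection, and how an explicit charset sidesteps the problem for both reading and writing.

```python
import codecs
import json

# Known BOMs. The UTF-32 marks are tested first because the UTF-32LE BOM
# begins with the same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (charset, bom_length) if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


def read_json(data, charset=None):
    """Parse JSON bytes; an explicit charset bypasses BOM sniffing entirely."""
    if charset is not None:
        return json.loads(data.decode(charset))
    found = sniff_bom(data)
    if found is None:
        # Assumption of this sketch: with no BOM, fall back to UTF-8.
        return json.loads(data.decode("utf-8"))
    name, bom_length = found
    return json.loads(data[bom_length:].decode(name))


def write_json(obj, charset="utf-8"):
    """Serialize obj to JSON bytes in the requested charset (no BOM written)."""
    return json.dumps(obj, ensure_ascii=False).encode(charset)


# A UTF-16LE document without a BOM, as in the scenario the issue describes.
record = {"name": "Zoë"}
payload = write_json(record, "utf-16-le")

try:
    read_json(payload)  # auto-detection guesses UTF-8 and fails
except ValueError as err:
    print("auto-detection failed:", type(err).__name__)

print(read_json(payload, charset="utf-16-le"))  # explicit charset succeeds
```

With the charset supplied by the caller, the same bytes round-trip: the data can be read, modified, and written back with `write_json(obj, "utf-16-le")` in the original encoding.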