[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699803#comment-16699803 ]
xuqianjin commented on SPARK-23410:
-----------------------------------

Hi [~maxgekk] [~hyukjin.kwon], I think there are two things to consider:
1. Even if lineSep is set, it is still necessary to check the file's BOM to identify its charset. The charset used to encode lineSep may not match the encoding of the file, which results in parsing errors.
2. For example, a comma is encoded as different bytes in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. These encodings are also supported for lineSep.

> Unable to read jsons in charset different from UTF-8
> ----------------------------------------------------
>
>                 Key: SPARK-23410
>                 URL: https://issues.apache.org/jira/browse/SPARK-23410
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: utf16WithBOM.json
>
>
> Currently the Json Parser is forced to read json files in UTF-8. Such
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions
> that can read json files in UTF-16, UTF-32 and other encodings due to using
> of the auto detection mechanism of the jackson library. Need to give back to
> users possibility to read json files in specified charset and/or detect
> charset automatically as it was before.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
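
To illustrate the two points in the comment, here is a minimal Python sketch. It is not Spark's implementation. The helper names `detect_bom` and `read_records` are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption made only for this example. It shows that the same separator character becomes different bytes in each encoding, so the separator has to be interpreted in the file's charset, which the BOM can reveal.

```python
import codecs

# Longest BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so checking UTF-16 first would misdetect UTF-32.
# Caveat: a UTF-16LE file whose first character is U+0000 is ambiguous here.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length); `default` is an assumption of this sketch."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def read_records(data: bytes, line_sep: str = "\n"):
    """Decode with the BOM-detected charset, then split on the separator.

    Splitting after decoding avoids matching separator bytes that straddle
    two code units in UTF-16 or UTF-32 input.
    """
    encoding, skip = detect_bom(data)
    return data[skip:].decode(encoding).split(line_sep)


if __name__ == "__main__":
    # The same comma is a different byte sequence in each encoding.
    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        print(enc, ",".encode(enc).hex(" "))

    sample = codecs.BOM_UTF16_LE + '{"a": 1}\n{"a": 2}'.encode("utf-16-le")
    print(detect_bom(sample))
    print(read_records(sample))
```

Searching UTF-16 data for the UTF-8 bytes of the separator would either miss the separator or split in the middle of a code unit, which is the kind of parsing error the comment describes.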