[ https://issues.apache.org/jira/browse/SPARK-38801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Himanshu Arora updated SPARK-38801:
-----------------------------------
    Attachment: Screenshot 2022-04-06 at 09.29.24.png

> ISO-8859-1 encoding doesn't work for text format
> ------------------------------------------------
>
> Key: SPARK-38801
> URL: https://issues.apache.org/jira/browse/SPARK-38801
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.1
> Environment: I tested this issue on Databricks runtime 10.3 (Spark 3.2.1, Scala 2.12)
> Reporter: Himanshu Arora
> Priority: Major
> Attachments: Screenshot 2022-04-06 at 09.29.24.png, Screenshot 2022-04-06 at 09.30.02.png
>
>
> When Spark reads text files that are not UTF-8 encoded, non-ASCII
> characters (for example French characters such as è and é) are not decoded
> correctly. They are all replaced by �. In my case the text files were in
> ISO-8859-1 encoding.
> After digging into the docs, it seems that Spark still uses Hadoop's
> LineRecordReader class for the text format, which only supports UTF-8.
> Here's the source code of that class:
> [LineRecordReader.java|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L154]
>
> You can see this issue in the screenshot below:
> !image-2022-04-06-09-30-21-751.png!
> As you can see, the French word *données* is read as *donn�es*. The word
> *Clôturé* is read as *Cl�tur�*.
>
> I also read the same text file in CSV format while providing the correct
> charset value, and it works fine in that case, as you can see in the
> screenshot below:
> !image-2022-04-06-09-31-45-062.png!
>
> So this issue is specific to the text format, which is why I am reporting
> it.
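The symptom described above can be reproduced outside Spark. The following is a minimal Python sketch, not Spark or Hadoop code: it assumes the reader decodes each line's raw bytes as UTF-8 and substitutes U+FFFD for malformed sequences. The helper name `decode_line` is hypothetical and exists only for this illustration.

```python
def decode_line(raw: bytes, charset: str) -> str:
    """Decode one line of raw bytes, replacing malformed sequences with U+FFFD."""
    return raw.decode(charset, errors="replace")


# Bytes as they would be stored in an ISO-8859-1 encoded file:
# é is the single byte 0xE9 and ô is the single byte 0xF4.
raw = "données Clôturé".encode("iso-8859-1")

# Treating those bytes as UTF-8 fails on 0xE9 and 0xF4, because each looks
# like the start of a multi-byte sequence that is never completed.
as_utf8 = decode_line(raw, "utf-8")

# Decoding with the charset the file was actually written in round-trips cleanly.
as_latin1 = decode_line(raw, "iso-8859-1")

# ascii() keeps the output printable regardless of the terminal encoding.
print(ascii(as_utf8))
print(ascii(as_latin1))
```

Under that assumption, the UTF-8 result contains the same replacement characters as the screenshots, while decoding with the file's real charset recovers the original words. This matches the observation that the CSV reader works once the correct charset is supplied.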