Hyukjin Kwon created SPARK-13108:
------------------------------------

             Summary: Encoding not working with non-ascii compatible encodings 
(UTF-16/32 etc.)
                 Key: SPARK-13108
                 URL: https://issues.apache.org/jira/browse/SPARK-13108
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon
            Priority: Minor


This library uses Hadoop's 
[{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java],
 which uses 
[{{LineRecordReader}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java].

According to 
[MAPREDUCE-232|https://issues.apache.org/jira/browse/MAPREDUCE-232], it appears that 
[{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java]
 does not guarantee support for all encodings; officially it supports only UTF-8 (as commented 
in 
[{{LineRecordReader#L147}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L147]).

According to 
[MAPREDUCE-232#comment-13183601|https://issues.apache.org/jira/browse/MAPREDUCE-232?focusedCommentId=13183601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13183601],
 it still appears to work with most encodings, but not with UTF-16/32.
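
The ticket does not spell out the mechanism, but a plausible one is that a byte-oriented line reader splits on the single byte 0x0A, which in UTF-16 is only half of the two-byte newline code unit. The sketch below illustrates that effect in plain Scala. {{splitOnNewlineByte}} is a hypothetical helper written for this illustration, not Hadoop's actual code.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ListBuffer

// Hypothetical helper: split on the raw byte 0x0A, the way a
// byte-oriented line reader would (an assumption, not Hadoop's code).
def splitOnNewlineByte(bytes: Array[Byte]): List[Array[Byte]] = {
  val chunks = ListBuffer.empty[Array[Byte]]
  var start = 0
  for (i <- bytes.indices if bytes(i) == 0x0A.toByte) {
    chunks += bytes.slice(start, i)
    start = i + 1
  }
  if (start < bytes.length) chunks += bytes.slice(start, bytes.length)
  chunks.toList
}

val text = "a,b\nc,d\n"

// UTF-8: '\n' is exactly one byte, so each chunk decodes cleanly.
val utf8Lines = splitOnNewlineByte(text.getBytes(StandardCharsets.UTF_8))
  .map(chunk => new String(chunk, StandardCharsets.UTF_8))

// UTF-16BE: '\n' is the byte pair 00 0A, so splitting on 0A leaves a
// dangling 00 byte at the end of each chunk (odd byte count).
val utf16Chunks = splitOnNewlineByte(text.getBytes(StandardCharsets.UTF_16BE))
val utf16Lines = utf16Chunks.map(chunk => new String(chunk, StandardCharsets.UTF_16BE))

utf8Lines.foreach(println)
utf16Lines.foreach(println)
```

Each UTF-16 chunk has an odd number of bytes, so the decoder replaces the dangling byte with U+FFFD, which resembles the stray characters in the output reported below.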

In more detail:

I tested this on Mac OS. I converted `cars_iso-8859-1.csv` into 
`cars_utf-16.csv` as below:

{code}
iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
{code}

and ran the code below:

{code}
val cars = "src/test/resources/cars_utf-16.csv"
sqlContext.csvFile(cars, parserLib = parserLib, charset = "utf-16", delimiter = 'þ').show()
{code}

This produces the incorrect results below:
{code}
+----+-----+-----+--------------------+------+
|year| make|model|             comment|blank�|
+----+-----+-----+--------------------+------+
|2012|Tesla|    S|          No comment|     �|
|   �| null| null|                null|  null|
|1997| Ford| E350|Go get one now th...|     �|
|2015|Chevy|Volt�|                null|  null|
|   �| null| null|                null|  null|
+----+-----+-----+--------------------+------+
{code}

Instead of the correct results below:
{code}
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
{code}
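
For contrast, decoding the bytes with the declared charset before splitting into lines yields clean rows. The following is a minimal sketch under that assumption. {{decodeThenSplit}} is a hypothetical helper, not part of Spark or spark-csv, and it only suits data small enough to hold in memory.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper: decode the whole byte array first, then split
// on line terminators at the character level rather than the byte level.
def decodeThenSplit(bytes: Array[Byte], charset: Charset): List[String] =
  new String(bytes, charset).split("\r?\n").toList

val sample = "year,make\n2012,Tesla\n"

// Java's UTF_16 encoder writes a byte-order mark; the matching decoder consumes it.
val rows = decodeThenSplit(sample.getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16)

rows.foreach(println)
```

This does not address splittable input on a cluster; it only shows that the data itself decodes correctly once line boundaries are found after decoding.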



