[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201363#comment-17201363
 ] 

Hyukjin Kwon commented on SPARK-32961:
--

For the issue itself, I am almost 100% sure we can't fix this with {{multiLine}} 
disabled (otherwise we would end up mimicking {{multiLine}} behaviour). I will 
leave this JIRA resolved.

> PySpark CSV read with UTF-16 encoding is not working correctly
> --
>
> Key: SPARK-32961
> URL: https://issues.apache.org/jira/browse/SPARK-32961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1
> Environment: both spark local and cluster mode
>Reporter: Bui Bao Anh
>Priority: Major
>  Labels: Correctness
> Attachments: pandas df.png, pyspark df.png, pyspark utf-16 with 
> multiline csv.png, pyspark utf-16le.png, sendo_sample.csv
>
>
> There are weird characters in the output when printing to the console or 
> writing to files.
> See the attached files for how it looks in a Spark DataFrame and a pandas 
> DataFrame.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-24 Thread Bui Bao Anh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201362#comment-17201362
 ] 

Bui Bao Anh commented on SPARK-32961:
-

Hi [~hyukjin.kwon], got it.

For now we don't want to change the encoding of the file, so we will go with 
the {{multiLine}} option.

Thanks! :)




[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201356#comment-17201356
 ] 

Hyukjin Kwon commented on SPARK-32961:
--

For UTF-16LE or UTF-16BE, the file {{sendo.csv}} has to be encoded 
correspondingly. That is different from plain UTF-16.
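
If re-encoding the file is acceptable, the point above can be illustrated with a minimal sketch in plain Python. It assumes the input is a BOM-prefixed UTF-16 file; the function name and file names are hypothetical and not part of Spark or of this issue.

```python
import os
import tempfile


def transcode_utf16_to_utf16le(src_path, dst_path):
    """Rewrite a BOM-prefixed UTF-16 file as UTF-16LE without a BOM."""
    # The "utf-16" codec reads the BOM to pick the byte order, then drops it.
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    # "utf-16-le" fixes the byte order and does not write a BOM.
    with open(dst_path, "w", encoding="utf-16-le", newline="") as dst:
        dst.write(text)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "sample_utf16.csv")    # hypothetical file name
    dst = os.path.join(workdir, "sample_utf16le.csv")  # hypothetical file name
    with open(src, "w", encoding="utf-16", newline="") as f:
        f.write("id,name\n1,Anh\n")
    transcode_utf16_to_utf16le(src, dst)
    with open(dst, "rb") as f:
        print(f.read()[:4])  # first two code units, no BOM in front
```

The rewritten file can then be read with the encoding set to UTF-16LE, since its bytes now match that encoding exactly.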

 




[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-24 Thread Bui Bao Anh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201351#comment-17201351
 ] 

Bui Bao Anh commented on SPARK-32961:
-

Thanks a lot [~hyukjin.kwon], 

I tried it, and it works with the multiline option enabled. However, it does 
not work correctly with UTF-16LE or UTF-16BE encoding.

DF with *multiline* enabled:

!pyspark utf-16 with multiline csv.png!

DF with *UTF-16LE* encoding:

!pyspark utf-16le.png!




[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201338#comment-17201338
 ] 

Hyukjin Kwon commented on SPARK-32961:
--

UTF-16 doesn't work correctly with CSV when {{multiLine}} is disabled. The 
encoding should be given explicitly as either UTF-16LE or UTF-16BE.

This is because the BOM exists only at the very beginning of the CSV file (as 
written by UTF-16), and the CSV parsing happens in partitions of the file that 
do not contain the BOM.
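
The effect can be reproduced outside Spark with a few lines of plain Python. This is only an illustration under stated assumptions: the sample text and the split point are made up, and the big-endian fallback imitates what Java's UTF-16 charset does when no BOM is present. It is not Spark's actual reader code.

```python
import codecs

text = "id,name\n1,Anh\n2,Kwon\n"  # made-up sample content

# A "UTF-16" file as many tools write it: one BOM, then little-endian code units.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Hypothetical split point right after the header line:
# 2 BOM bytes plus 2 bytes for each of the 8 header characters.
split_at = 2 + 2 * len("id,name\n")
first_chunk, later_chunk = data[:split_at], data[split_at:]

# Only the first chunk carries the BOM.
has_bom = [c.startswith(codecs.BOM_UTF16_LE) for c in (first_chunk, later_chunk)]

# With the BOM, generic UTF-16 decoding finds the byte order by itself.
header = first_chunk.decode("utf-16")

# Without the BOM the decoder has to guess; big-endian is the assumed guess here.
garbled = later_chunk.decode("utf-16-be")

# Naming the byte order explicitly removes the guess.
rows = later_chunk.decode("utf-16-le")

print(has_bom)
print(repr(header))
print(repr(rows))
print(garbled == rows)
```

The later chunk decodes cleanly only when the byte order is named, which is why an explicit UTF-16LE or UTF-16BE setting does not depend on seeing the BOM.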

As a workaround, you can enable {{multiLine}}, or use UTF-16LE or UTF-16BE.
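
A minimal PySpark sketch of the two workarounds follows. The helper names {{csv_read_options}} and {{read_csv}} are hypothetical, and {{spark}} is assumed to be an existing SparkSession; only the {{encoding}} and {{multiLine}} CSV options come from Spark itself.

```python
def csv_read_options(encoding="UTF-16", multi_line=False):
    """Build CSV reader options for the workarounds (hypothetical helper)."""
    # Plain UTF-16 relies on the BOM, which only the start of the file has,
    # so this sketch refuses that combination unless multiLine is enabled.
    if encoding.upper() == "UTF-16" and not multi_line:
        raise ValueError("use multiLine, or name UTF-16LE / UTF-16BE explicitly")
    return {"encoding": encoding, "multiLine": "true" if multi_line else "false"}


def read_csv(spark, path, **kwargs):
    """Read a CSV file through an existing SparkSession with those options."""
    return spark.read.options(**csv_read_options(**kwargs)).csv(path)


if __name__ == "__main__":
    # Workaround 1: keep plain UTF-16 and enable multiLine.
    print(csv_read_options("UTF-16", multi_line=True))
    # Workaround 2: name the byte order explicitly.
    print(csv_read_options("UTF-16LE"))
```

With a running session, a call such as {{read_csv(spark, path, encoding="UTF-16", multi_line=True)}} would apply the first workaround, where {{path}} stands for the location of the CSV file.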




[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly

2020-09-22 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200551#comment-17200551
 ] 

Takeshi Yamamuro commented on SPARK-32961:
--

cc: [~yumwang]
