[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201363#comment-17201363 ]

Hyukjin Kwon commented on SPARK-32961:
--------------------------------------

For the issue itself, I am almost 100% sure we can't fix it with {{multiLine}} disabled (or we would end up mimicking the {{multiLine}} behaviour). I will leave this JIRA resolved.

> PySpark CSV read with UTF-16 encoding is not working correctly
> --------------------------------------------------------------
>
> Key: SPARK-32961
> URL: https://issues.apache.org/jira/browse/SPARK-32961
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4, 3.0.1
> Environment: both spark local and cluster mode
> Reporter: Bui Bao Anh
> Priority: Major
> Labels: Correctness
> Attachments: pandas df.png, pyspark df.png, pyspark utf-16 with multiline csv.png, pyspark utf-16le.png, sendo_sample.csv
>
> There are weird characters in the output when printing out to console or writing to files.
> Find attached files to see how it looks in Spark Dataframe and Pandas Dataframe.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201362#comment-17201362 ]

Bui Bao Anh commented on SPARK-32961:
-------------------------------------

Hi [~hyukjin.kwon], got it. For now we don't want to change the encoding of the file, so we can go with the multiline option. Thanks! :)
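For readers following the thread, the two workarounds discussed here can be written down as reader options. This is only a sketch: the helper names `utf16_csv_options` and `read_utf16_csv` are hypothetical, `spark` is assumed to be an existing SparkSession, and the only Spark calls used are `DataFrameReader.options(...)` and `.csv(path)` with the `encoding` and `multiLine` CSV options named in the comments.

```python
def utf16_csv_options(workaround="multiline"):
    """Return CSV reader options for one of the workarounds from this thread.

    Hypothetical helper, not part of Spark.
    """
    if workaround == "multiline":
        # Keep the generic UTF-16 charset and enable multiLine, so the
        # file is not parsed in independent pieces that lack the BOM.
        return {"encoding": "UTF-16", "multiLine": "true"}
    if workaround in ("UTF-16LE", "UTF-16BE"):
        # Name the byte order explicitly; the file must really be
        # encoded that way for this to give correct results.
        return {"encoding": workaround}
    raise ValueError("unknown workaround: %r" % (workaround,))


def read_utf16_csv(spark, path, workaround="multiline"):
    """Read a CSV file through `spark`, an existing SparkSession (assumed)."""
    return spark.read.options(**utf16_csv_options(workaround)).csv(path)
```

With a running session this would be called as `read_utf16_csv(spark, "sendo_sample.csv")`. Enabling `multiLine` generally means a file is not split across tasks, so it may be slower on large inputs.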
[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201356#comment-17201356 ]

Hyukjin Kwon commented on SPARK-32961:
--------------------------------------

For UTF-16LE or UTF-16BE, the file {{sendo.csv}} has to be encoded correspondingly. It's different from plain UTF-16.
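Reading the comment above as "the file's bytes must match the charset you name", one way to prepare such a file outside Spark is to re-encode it. The sketch below uses only the Python standard library; the function name `reencode_utf16_to_le` and the temporary file names are hypothetical, and the sample rows are made up for the demonstration.

```python
import codecs
import os
import tempfile


def reencode_utf16_to_le(src_path, dst_path):
    """Copy a BOM-prefixed UTF-16 text file to UTF-16LE without a BOM."""
    # Python's "utf-16" codec consumes the BOM when reading;
    # "utf-16-le" writes little-endian code units and no BOM.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-16-le", newline="") as dst:
        for line in src:
            dst.write(line)


sample_text = "id,name\n1,Anh\n"  # made-up sample rows

with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "sample_utf16.csv")
    dst_file = os.path.join(tmp, "sample_utf16le.csv")
    # Write a UTF-16 file; Python prepends a BOM for this codec.
    with open(src_file, "w", encoding="utf-16", newline="") as f:
        f.write(sample_text)
    reencode_utf16_to_le(src_file, dst_file)
    with open(dst_file, "rb") as f:
        converted = f.read()

# The converted bytes carry no BOM and decode cleanly as UTF-16LE.
has_bom = converted.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

A file converted this way would then be read with the encoding set to UTF-16LE rather than UTF-16.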
[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201351#comment-17201351 ]

Bui Bao Anh commented on SPARK-32961:
-------------------------------------

Thanks a lot [~hyukjin.kwon]. I tried it and it works with the multiline option enabled; however, it does not work correctly with the UTF-16LE or UTF-16BE encoding.

DF with *multiline* enabled:

!pyspark utf-16 with multiline csv.png!

DF with *UTF-16LE* encoding:

!pyspark utf-16le.png!
[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201338#comment-17201338 ]

Hyukjin Kwon commented on SPARK-32961:
--------------------------------------

UTF-16 doesn't work correctly with CSV when {{multiLine}} is disabled. It should be either UTF-16LE or UTF-16BE explicitly. This is because the BOM exists only at the beginning of the CSV file (as written by UTF-16), and CSV parsing happens on partitions of the file that do not contain the BOM.

As a workaround, you can enable {{multiLine}}, or use UTF-16LE or UTF-16BE.
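The BOM explanation above can be illustrated without Spark. The sketch below is a simplified model, not Spark's code: `decode_like_jvm_utf16` is a hypothetical helper that honors a leading BOM and otherwise falls back to big-endian, which is how the JVM's generic UTF-16 charset is documented to behave. The sample rows and the split point are made up.

```python
import codecs


def decode_like_jvm_utf16(chunk):
    """Decode bytes the way a BOM-sensitive "UTF-16" decoder would.

    Assumption of this sketch: no BOM means big-endian.
    """
    if chunk.startswith(codecs.BOM_UTF16_LE):
        return chunk[2:].decode("utf-16-le")
    if chunk.startswith(codecs.BOM_UTF16_BE):
        return chunk[2:].decode("utf-16-be")
    return chunk.decode("utf-16-be")


header = "id,name\n"
rows = "1,Anh\n2,Kwon\n"  # made-up sample rows

# A little-endian UTF-16 file: the BOM appears once, at the very start.
data = codecs.BOM_UTF16_LE + (header + rows).encode("utf-16-le")

# Pretend the file is processed in two pieces, split after the header line.
cut = len(codecs.BOM_UTF16_LE) + len(header.encode("utf-16-le"))
piece_a, piece_b = data[:cut], data[cut:]

whole = decode_like_jvm_utf16(data)        # BOM seen: decodes correctly
part_a = decode_like_jvm_utf16(piece_a)    # BOM seen: decodes correctly
part_b = decode_like_jvm_utf16(piece_b)    # no BOM: wrong byte order assumed
part_b_explicit = piece_b.decode("utf-16-le")  # byte order named explicitly

print(repr(part_b))           # unrelated characters instead of the rows
print(repr(part_b_explicit))  # the original rows
```

Decoding the whole input at once (as in the multiLine case) or naming the byte order explicitly both avoid the guess that goes wrong for the BOM-less piece.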
[jira] [Commented] (SPARK-32961) PySpark CSV read with UTF-16 encoding is not working correctly
[ https://issues.apache.org/jira/browse/SPARK-32961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200551#comment-17200551 ]

Takeshi Yamamuro commented on SPARK-32961:
------------------------------------------

cc: [~yumwang]