[ https://issues.apache.org/jira/browse/SPARK-18571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrian Bridgett updated SPARK-18571:
------------------------------------
Description:
Sample code attached; code run with hadoop 2.7.3, python3.5.

If I run this with --master='local[*]' and LANG=en_US.UTF-8, then in _another_ terminal (which has LANG=en_US.UTF-8 set) cat the file, I see the Pi character I expect.

Back to the first terminal: set LANG=C (or unset it), rerun, then check the output in the other terminal (still set to en_US.UTF-8) and it's corrupted.

I actually noticed this because when I run it with our normal Mesos scheduler the data is corrupted (those boxes do have LANG=en_US.UTF-8 set, but perhaps it's not being picked up). I don't remember needing to do this on Spark-1.6.1 (hadoop-2.7.1).

Expected characters: 0x80cf
Received: 0xbfef efbd bdbf

was:
Sample code attached; code run with hadoop 2.7.3.

If I run this with --master='local[*]' and LANG=en_US.UTF-8, then in _another_ terminal (which has LANG=en_US.UTF-8 set) cat the file, I see the Pi character I expect.

Back to the first terminal: set LANG=C (or unset it), rerun, then check the output in the other terminal (still set to en_US.UTF-8) and it's corrupted.

I actually noticed this because when I run it with our normal Mesos scheduler the data is corrupted (those boxes do have LANG=en_US.UTF-8 set, but perhaps it's not being picked up). I don't remember needing to do this on Spark-1.6.1 (hadoop-2.7.1).
Expected characters: 0x80cf
Received: 0xbfef efbd bdbf


> pyspark: UTF-8 not written correctly (as CSV) when locale is not UTF-8
> ----------------------------------------------------------------------
>
>                 Key: SPARK-18571
>                 URL: https://issues.apache.org/jira/browse/SPARK-18571
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.2
>            Reporter: Adrian Bridgett
>         Attachments: unicode.py
>
> Sample code attached; code run with hadoop 2.7.3, python3.5.
> If I run this with --master='local[*]' and LANG=en_US.UTF-8, then in
> _another_ terminal (which has LANG=en_US.UTF-8 set) cat the file, I see the
> Pi character I expect.
> Back to the first terminal: set LANG=C (or unset it), rerun, then check the
> output in the other terminal (still set to en_US.UTF-8) and it's corrupted.
> I actually noticed this because when I run it with our normal Mesos
> scheduler the data is corrupted (those boxes do have LANG=en_US.UTF-8 set,
> but perhaps it's not being picked up).
> I don't remember needing to do this on Spark-1.6.1 (hadoop-2.7.1).
> Expected characters: 0x80cf
> Received: 0xbfef efbd bdbf

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
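The reported byte values are consistent with locale-driven mojibake: read as little-endian 16-bit words, 0x80cf is the two-byte UTF-8 sequence cf 80 for the Pi character (U+03C0), and 0xbfef efbd bdbf is ef bf bd twice, i.e. two copies of the UTF-8 encoding of U+FFFD, the Unicode replacement character — plausibly one substitution per original byte when cf and 80 were each decoded under a non-UTF-8 codec. A minimal illustrative sketch of that pattern (this is not the attached unicode.py):

```python
# Illustrative sketch of the corruption pattern; not the attached unicode.py.

PI = "\u03c0"  # the Pi character

# Under a UTF-8 locale, Pi is written as the two bytes cf 80
# (reported above as the little-endian word 0x80cf).
assert PI.encode("utf-8") == b"\xcf\x80"

# If an intermediate step cannot decode those bytes and substitutes the
# Unicode replacement character U+FFFD, each substitution is then written
# out as the three bytes ef bf bd under UTF-8:
replacement = "\ufffd".encode("utf-8")
assert replacement == b"\xef\xbf\xbd"

# Two replacements in a row match the corrupted six bytes reported above
# (0xbfef efbd bdbf, read as little-endian 16-bit words):
assert replacement * 2 == b"\xef\xbf\xbd\xef\xbf\xbd"
```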
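If the corruption stems from the executors not inheriting the submitting shell's locale (as the Mesos observation suggests), a hedged workaround sketch is to force a UTF-8 locale explicitly on both driver and executors; spark.executorEnv.<VAR> is Spark's documented mechanism for setting executor environment variables, and the script name here assumes the attached unicode.py:

```shell
# Assumed workaround sketch: force a UTF-8 locale on driver and executors.
# LANG on the submitting shell covers the driver process;
# spark.executorEnv.LANG propagates the same setting to executors.
LANG=en_US.UTF-8 spark-submit \
  --master='local[*]' \
  --conf spark.executorEnv.LANG=en_US.UTF-8 \
  unicode.py
```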