[ https://issues.apache.org/jira/browse/SPARK-18571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrian Bridgett updated SPARK-18571:
------------------------------------
Description:
Sample code attached; code run with hadoop 2.7.3, python3.5.

If I run this with --master='local[*]' and LANG=en_US.UTF-8, then in _another_ terminal (which has LANG=en_US.UTF-8 set) cat the file, I see the Pi character I expect.

Back to the first terminal: set LANG=C (or unset it), rerun, then check the output in the other terminal (still set to en_US.UTF-8) and it's corrupted.

I actually noticed this because when I run it with our normal Mesos scheduler the data is corrupted (those boxes do have LANG=en_US.UTF-8 set, but perhaps it's not being picked up). I don't remember needing to do this on Spark-1.6.1 (hadoop-2.7.1).

Expected characters: 0x80cf
Received: 0xbfef efbd bdbf

was:
Sample code attached; code run with hadoop 2.7.3.

If I run this with --master='local[*]' and LANG=en_US.UTF-8, then in _another_ terminal (which has LANG=en_US.UTF-8 set) cat the file, I see the Pi character I expect.

Back to the first terminal: set LANG=C (or unset it), rerun, then check the output in the other terminal (still set to en_US.UTF-8) and it's corrupted.

I actually noticed this because when I run it with our normal Mesos scheduler the data is corrupted (those boxes do have LANG=en_US.UTF-8 set, but perhaps it's not being picked up). I don't remember needing to do this on Spark-1.6.1 (hadoop-2.7.1).
Expected characters: 0x80cf
Received: 0xbfef efbd bdbf


> pyspark: UTF-8 not written correctly (as CSV) when locale is not UTF-8
> ----------------------------------------------------------------------
>
>                 Key: SPARK-18571
>                 URL: https://issues.apache.org/jira/browse/SPARK-18571
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.2
>            Reporter: Adrian Bridgett
>         Attachments: unicode.py
>
> Sample code attached; code run with hadoop 2.7.3, python3.5.
> If I run this with --master='local[*]' and LANG=en_US.UTF-8, then in
> _another_ terminal (which has LANG=en_US.UTF-8 set) cat the file, I see the
> Pi character I expect.
> Back to the first terminal: set LANG=C (or unset it), rerun, then check the
> output in the other terminal (still set to en_US.UTF-8) and it's corrupted.
> I actually noticed this because when I run it with our normal Mesos
> scheduler the data is corrupted (those boxes do have LANG=en_US.UTF-8 set,
> but perhaps it's not being picked up).
> I don't remember needing to do this on Spark-1.6.1 (hadoop-2.7.1).
> Expected characters: 0x80cf
> Received: 0xbfef efbd bdbf

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
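The reported byte values are consistent with locale-driven mojibake: read as little-endian 16-bit words, 0x80cf is the two-byte UTF-8 sequence cf 80 for the Pi character (U+03C0), and 0xbfef efbd bdbf is ef bf bd twice, i.e. two copies of the UTF-8 encoding of U+FFFD, the Unicode replacement character — plausibly one substitution per original byte when cf and 80 were each decoded under a non-UTF-8 codec. A minimal illustrative sketch of that pattern (this is not the attached unicode.py):

```python
# Illustrative sketch of the corruption pattern; not the attached unicode.py.

PI = "\u03c0"  # the Pi character

# Under a UTF-8 locale, Pi is written as the two bytes cf 80
# (reported above as the little-endian word 0x80cf).
assert PI.encode("utf-8") == b"\xcf\x80"

# If an intermediate step cannot decode those bytes and substitutes the
# Unicode replacement character U+FFFD, each substitution is then written
# out as the three bytes ef bf bd under UTF-8:
replacement = "\ufffd".encode("utf-8")
assert replacement == b"\xef\xbf\xbd"

# Two replacements in a row match the corrupted six bytes reported above
# (0xbfef efbd bdbf, read as little-endian 16-bit words):
assert replacement * 2 == b"\xef\xbf\xbd\xef\xbf\xbd"
```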
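If the corruption stems from the executors not inheriting the submitting shell's locale (as the Mesos observation suggests), a hedged workaround sketch is to force a UTF-8 locale explicitly on both driver and executors; spark.executorEnv.<VAR> is Spark's documented mechanism for setting executor environment variables, and the script name here assumes the attached unicode.py:

```shell
# Assumed workaround sketch: force a UTF-8 locale on driver and executors.
# LANG on the submitting shell covers the driver process;
# spark.executorEnv.LANG propagates the same setting to executors.
LANG=en_US.UTF-8 spark-submit \
  --master='local[*]' \
  --conf spark.executorEnv.LANG=en_US.UTF-8 \
  unicode.py
```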