I opened a pull request containing a fix and regression test:
https://github.com/apache/incubator-spark/pull/218
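
For anyone finding this thread later, here is a minimal sketch of the failure mode and of the str() -> unicode() change discussed below. It is not the code from the pull request: the helper names to_utf8 and implicit_ascii and the sample string are made up for illustration, and text_type is only there so the snippet runs on both Python 2 and Python 3.

```python
import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
# (Assumption for illustration: this alias lets one snippet run on both.)
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_utf8(obj):
    """Hypothetical helper: stringify like Java's toString(), then encode as UTF-8."""
    return text_type(obj).encode("utf-8")


def implicit_ascii(obj):
    """Mimic what Python 2's str() does to a unicode object:
    it encodes with the default (ASCII) codec."""
    return text_type(obj).encode("ascii")


line = u"ma\xf1ana"  # contains u'\xf1', the character from the traceback

try:
    implicit_ascii(line)
    ascii_failed = False
except UnicodeEncodeError:
    # Same class of error as in the report: 'ascii' codec can't encode u'\xf1'
    ascii_failed = True

encoded = to_utf8(line)  # succeeds, because no ASCII step is involved
number = to_utf8(42)     # non-string objects are still stringified first
```

The point of going through the text type is that the object is converted to text without ever passing through the ASCII codec, so the only encoding step is the explicit UTF-8 one.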


On Fri, Nov 29, 2013 at 5:18 AM, Andrei <faithlessfri...@gmail.com> wrote:

> Thanks, Josh! Looking forward to your patch! Meanwhile, I've tried to
> change it manually and can confirm that it works fine.
>
>
> On Thu, Nov 28, 2013 at 8:11 PM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
>> This is a bug.  The str() is there because I want to convert objects to
>> strings like Java's toString(), but I should have used unicode() instead.
>> I'll submit a patch to fix this (I think it should be as simple as
>> replacing str() with unicode()).
>>
>>
>> On Thu, Nov 28, 2013 at 12:14 AM, Andrei <faithlessfri...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a very simple script that just reads a file from HDFS and
>>> immediately saves it back:
>>>
>>> from pyspark import SparkContext
>>> if __name__ == '__main__':
>>>     sc = SparkContext('spark://master:7077', 'UnicodeTest')
>>>     data = sc.textFile('hdfs://master/path/to/file.txt')
>>>     data.saveAsTextFile('hdfs://master/path/to/copy')
>>>
>>> If the contents of the file are ASCII-compatible, it works fine. But if
>>> there are Unicode characters in the file, I get a *UnicodeEncodeError*:
>>>
>>>   File "/usr/local/spark/python/pyspark/worker.py", line 82, in main
>>>     for obj in func(split_index, iterator):
>>>   File "/usr/local/spark/python/pyspark/rdd.py", line 555, in <genexpr>
>>>     *return (str(x).encode("utf-8") for x in iterator)*
>>> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in
>>> position 56: ordinal not in range(128)
>>>
>>> As far as I understand, PySpark works with *unicode* objects
>>> internally, and to save them to a file it tries to encode each object
>>> as UTF-8. But why does it try to encode to 'ascii' first? How can I fix
>>> it to handle non-ASCII characters?
>>>
>>
>>
>
