[jira] [Updated] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition

Brian Bruggeman (JIRA) Wed, 08 Mar 2017 12:53:10 -0800

     [ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Brian Bruggeman updated SPARK-19872:
------------------------------------
    Description: 
I'm receiving the following traceback:

{{
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
    yield self.loads(stream)
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
}}

I created a text file (test.txt) with standard Linux newlines:
{{
a
b
c
d
e
f
g
h
i
j
k
l

}}
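For anyone reproducing this, a minimal sketch for recreating the input file follows. It assumes the file holds the twelve letters a through l, one per line, which is what the plain `collect()` output in the session below returns. The helper name `write_test_file` and the temporary directory are illustrative only and are not part of the original report.

```python
import io
import os
import tempfile


def write_test_file(path):
    """Write the letters a..l, one per line, with Unix (LF) newlines."""
    lines = [chr(code) for code in range(ord("a"), ord("l") + 1)]
    # newline="\n" turns off newline translation, so the file gets plain LF
    # line endings regardless of the platform the script runs on.
    with io.open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(u"\n".join(lines) + u"\n")
    return lines


# Hypothetical location; point sc.textFile at whatever path is used here.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
letters = write_test_file(path)
```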

I then ran pyspark:
{{
$ pyspark
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
SparkSession available as 'spark'.
>>> sc.textFile('test.txt').collect()
[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
>>> sc.textFile('test.txt', use_unicode=False).collect()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
>>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', 
'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
    yield self.loads(stream)
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
}}

This really looks like a bug in the `serializers.py` code.
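
To make the symptom easier to read, here is a small standalone sketch (plain Python, no Spark needed) that inspects the first byte string returned by the `use_unicode=False` plus `repartition(10)` call above. It suggests that the bytes are a pickled list of records rather than a line of text, which would explain why decoding them as UTF-8 fails on the leading 0x80 byte. Where that mismatch arises inside PySpark is not established here; the variable names are illustrative only.

```python
import pickle

# First element returned by the use_unicode=False + repartition(10) call,
# copied from the session output in this report.
batch = b"\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge."

# The bytes load as a pickle (protocol 2) holding a list of records.
records = pickle.loads(batch)

# Decoding the same bytes as UTF-8 fails at position 0, because 0x80 is a
# continuation byte and cannot start a UTF-8 sequence.
try:
    batch.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
```

Here `records` holds the first seven lines of the file, and `decode_error` carries the same "invalid start byte" complaint as the traceback.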


> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> ------------------------------------------------------------------
>
>                 Key: SPARK-19872
>                 URL: https://issues.apache.org/jira/browse/SPARK-19872
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>         Environment: Mac and EC2
>            Reporter: Brian Bruggeman


