[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brian Bruggeman updated SPARK-19872:
------------------------------------
    Description:
I'm receiving the following traceback:
{{
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
    yield self.loads(stream)
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
}}
I created a text file (test.txt) with standard Linux newlines:
{{
a
b
c
d
e
f
g
h
i
j
k
l
}}
I then ran pyspark:
{{
$ pyspark
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
SparkSession available as 'spark'.
>>> sc.textFile('test.txt').collect()
[u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
>>> sc.textFile('test.txt', use_unicode=False).collect()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
>>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
>>> sc.textFile('test.txt').repartition(10).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
    for item in serializer.load_stream(rf):
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
    yield self.loads(stream)
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
}}
This really looks like a bug in the `serializers.py` code.
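One possible reading of the `use_unicode=False` output above, offered as an assumption rather than a confirmed diagnosis: the two strings returned after `repartition(10)` look like pickle protocol 2 payloads (they start with `\x80\x02` and end with `.`), each holding a batch of lines. If that is right, the `UnicodeDecodeError` is what you would expect when such a payload is decoded as UTF-8 text. The sketch below uses only the standard library and the bytes copied from the report; the helper names `try_utf8` and `unpickle_as_text` are hypothetical and are not part of PySpark.

```python
import pickle

# The two raw strings returned by the use_unicode=False call in the report,
# written here as bytes literals.
raw_items = [
    b'\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.',
    b'\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.',
]


def try_utf8(data):
    """Return the decoded text, or the UnicodeDecodeError raised while decoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


def unpickle_as_text(data):
    """Treat the payload as a pickle instead of as UTF-8 text (hypothetical helper)."""
    # encoding="latin-1" lets Python 3 read the Python 2 str objects in the payload.
    return pickle.loads(data, encoding="latin-1")


# Decoding the payload as UTF-8 fails on the leading 0x80 byte,
# which matches the error message in the traceback.
utf8_result = try_utf8(raw_items[0])

# Unpickling each payload and flattening the batches recovers the file's lines.
lines = [line for item in raw_items for line in unpickle_as_text(item)]

print(type(utf8_result).__name__)
print(lines)
```

If this reading holds, the data coming back from `repartition` is pickled in batches while `collect()` still reads it with the UTF-8 text deserializer seen in the traceback; where exactly that mismatch is introduced is not established by this report.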
> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> ------------------------------------------------------------------
>
>                 Key: SPARK-19872
>                 URL: https://issues.apache.org/jira/browse/SPARK-19872
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>         Environment: Mac and EC2
>            Reporter: Brian Bruggeman