[ https://issues.apache.org/jira/browse/SPARK-11406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-11406: ------------------------------------ Assignee: Apache Spark > utf-8 decode issue w/ kinesis > ----------------------------- > > Key: SPARK-11406 > URL: https://issues.apache.org/jira/browse/SPARK-11406 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 1.5.1 > Environment: osx, spark:master > Reporter: Brian ONeill > Assignee: Apache Spark > Priority: Minor > > If we get a bad message over Kinesis, the spark job blows up with: > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 111, in main > process() > File > "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 106, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", > line 2347, in pipeline_func > File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", > line 2347, in pipeline_func > File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", > line 317, in func > File "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/rdd.py", > line 1777, in combineLocally > File > "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/shuffle.py", > line 236, in mergeValues > for k, v in iterator: > File > "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py", > line 88, in <lambda> > File > "/Users/brianoneill/git/spark/python/lib/pyspark.zip/pyspark/streaming/kinesis.py", > line 31, in utf8_decoder > return s.decode('utf-8') > File "/Users/brianoneill/venv/monetate/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid > continuation byte > This kills the job. Should there be a more elegant way to handle this case? -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org