[jira] [Assigned] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2018-01-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-20947:


Assignee: Xiaozhe Wang

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
>Assignee: Xiaozhe Wang
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: fix-pipe-encoding-error.patch
>
>
> Pipe action convert objects into strings using a way that was affected by the 
> default encoding setting of Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly convert `obj` to an unicode string, then 
> encode it into a byte string using default encoding; On the other hand, the 
> `s.encode('utf-8')` part implicitly decode `s` into an unicode string using 
> default encoding and then encode it (AGAIN!) into a UTF-8 encoded byte string.
> Typically the default encoding of Python environment would be 'ascii', which 
> means passing  an unicode string containing characters beyond 'ascii' charset 
> will raise UnicodeEncodeError exception at `str(obj)` and passing a byte 
> string containing bytes greater than 128 will again raise UnicodeEncodeError 
> exception at 's.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20947:


Assignee: Apache Spark

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
>Assignee: Apache Spark
> Attachments: fix-pipe-encoding-error.patch
>
>
> Pipe action convert objects into strings using a way that was affected by the 
> default encoding setting of Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly convert `obj` to an unicode string, then 
> encode it into a byte string using default encoding; On the other hand, the 
> `s.encode('utf-8')` part implicitly decode `s` into an unicode string using 
> default encoding and then encode it (AGAIN!) into a UTF-8 encoded byte string.
> Typically the default encoding of Python environment would be 'ascii', which 
> means passing  an unicode string containing characters beyond 'ascii' charset 
> will raise UnicodeEncodeError exception at `str(obj)` and passing a byte 
> string containing bytes greater than 128 will again raise UnicodeEncodeError 
> exception at 's.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20947:


Assignee: (was: Apache Spark)

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
> Attachments: fix-pipe-encoding-error.patch
>
>
> Pipe action convert objects into strings using a way that was affected by the 
> default encoding setting of Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly convert `obj` to an unicode string, then 
> encode it into a byte string using default encoding; On the other hand, the 
> `s.encode('utf-8')` part implicitly decode `s` into an unicode string using 
> default encoding and then encode it (AGAIN!) into a UTF-8 encoded byte string.
> Typically the default encoding of Python environment would be 'ascii', which 
> means passing  an unicode string containing characters beyond 'ascii' charset 
> will raise UnicodeEncodeError exception at `str(obj)` and passing a byte 
> string containing bytes greater than 128 will again raise UnicodeEncodeError 
> exception at 's.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org