Repository: spark
Updated Branches:
  refs/heads/master 12faae295 -> 602c6d82d

[SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action

## What changes were proposed in this pull request?

The pipe action converts objects into strings in a way that is affected by the default encoding setting of the Python environment. This patch fixes the problem.

A detailed description is available here:
https://issues.apache.org/jira/browse/SPARK-20947

## How was this patch tested?

Run the following statement in pyspark-shell; it will NOT raise an exception if this patch is applied:

```python
sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
```

Author: 王晓哲 <w...@linkdoc.com>

Closes #18277 from chaoslawful/fix_pipe_encoding_error.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/602c6d82
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/602c6d82
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/602c6d82

Branch: refs/heads/master
Commit: 602c6d82d893a7f34b37d674642669048eb59b03
Parents: 12faae2
Author: 王晓哲 <w...@linkdoc.com>
Authored: Mon Jan 22 10:43:12 2018 +0900
Committer: hyukjinkwon <gurwls...@gmail.com>
Committed: Mon Jan 22 10:43:12 2018 +0900

----------------------------------------------------------------------
 python/pyspark/rdd.py   | 2 +-
 python/pyspark/tests.py | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/602c6d82/python/pyspark/rdd.py
----------------------------------------------------------------------
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 340bc3a..1b39155 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -766,7 +766,7 @@ class RDD(object):
 
             def pipe_objs(out):
                 for obj in iterator:
-                    s = str(obj).rstrip('\n') + '\n'
+                    s = unicode(obj).rstrip('\n') + '\n'
                     out.write(s.encode('utf-8'))
                 out.close()
             Thread(target=pipe_objs, args=[pipe.stdin]).start()

http://git-wip-us.apache.org/repos/asf/spark/blob/602c6d82/python/pyspark/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index da99872..5115857 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1239,6 +1239,13 @@ class RDDTests(ReusedPySparkTestCase):
         self.assertRaises(Py4JJavaError, rdd.pipe('grep 4', checkCode=True).collect)
         self.assertEqual([], rdd.pipe('grep 4').collect())
 
+    def test_pipe_unicode(self):
+        # Regression test for SPARK-20947
+        data = [u'\u6d4b\u8bd5', '1']
+        rdd = self.sc.parallelize(data)
+        result = rdd.pipe('cat').collect()
+        self.assertEqual(data, result)
+
 
 class ProfilerTests(PySparkTestCase):
 
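
For readers who want to try the fixed write path outside Spark, the following is a minimal standalone sketch, not the PySpark implementation. The names `text_type`, `encode_line`, and `pipe_through` are hypothetical and introduced only for illustration. It assumes the same approach as the patch: convert each element to text first, then encode explicitly as UTF-8. On Python 2, `str(u'\u6d4b\u8bd5')` encodes implicitly with the default codec (usually ASCII) and raises `UnicodeEncodeError`, which is the failure the patch avoids.

```python
import subprocess
import sys
from threading import Thread

# Hypothetical helper name: pick the text type for the running interpreter.
# On Python 3, str is already the text type; on Python 2, unicode is needed.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def encode_line(obj):
    """Convert one element to a newline-terminated UTF-8 byte string."""
    s = text_type(obj).rstrip('\n') + '\n'
    return s.encode('utf-8')


def pipe_through(command, items):
    """Feed items line by line to an external command; return its output lines."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def pipe_objs(out):
        # Write from a separate thread so a full pipe buffer cannot block
        # the reader below.
        for obj in items:
            out.write(encode_line(obj))
        out.close()

    Thread(target=pipe_objs, args=[proc.stdin]).start()
    lines = [line.rstrip(b'\n').decode('utf-8') for line in proc.stdout]
    proc.stdout.close()
    proc.wait()
    return lines
```

Assuming a `cat` executable is on the path, `pipe_through(['cat'], [u'\u6d4b\u8bd5', '1'])` is expected to return the input items unchanged, mirroring the regression test in the diff.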