[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86450/ Test PASSed.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18277 **[Test build #86450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86450/testReport)** for PR 18277 at commit [`8c88595`](https://github.com/apache/spark/commit/8c88595125fbd328a3ed2383a9e96db7ad96f0e9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18277 **[Test build #86450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86450/testReport)** for PR 18277 at commit [`8c88595`](https://github.com/apache/spark/commit/8c88595125fbd328a3ed2383a9e96db7ad96f0e9).
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277 retest this please
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277 Let me merge this one into master only, considering the concerns raised in https://github.com/apache/spark/pull/18277#pullrequestreview-90007120 and https://github.com/apache/spark/pull/18277#issuecomment-358876719. Adding a note would be fine; I don't feel strongly about it.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277 Let me merge this one in a few days if there are no more comments.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277 cc @ueshin too. I think we have both been involved in several PRs related to encoding / decoding.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277 So, after this change, we get rid of the system-default round trip in the **When `obj` is `unicode`** and **When `obj` is other types** cases. In the **When `obj` is other types** case, we _might_ have a behaviour change if `__unicode__()` is defined differently from `__str__()`, but I believe that's quite rare. So, LGTM, but I'd like a double check from @holdenk and @viirya in case I missed anything.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277

I wanted to clarify for myself what we will change here, because it's quite confusing to me.

In Python 3, `basestring = unicode = str` is declared above, so nothing changes. I think this is not our concern.

In Python 2,

### Before:

```
str(obj).encode("utf8")
```

**When `obj` is `unicode`**:

1. `str(obj)`: encoded to bytes by the system default (`ascii`)
2. `.encode("utf-8")`: decoded to unicode by the system default (`ascii`) and then encoded to bytes as UTF-8

**When `obj` is `str`**:

1. `str(obj)`: bytes as they are
2. `.encode("utf-8")`: decoded to unicode by the system default (`ascii`) and then encoded to bytes as UTF-8

**When `obj` is other types**:

1. `str(obj)`: calls `__str__()`
2. `.encode("utf-8")`: decoded to unicode by the system default (`ascii`) and then encoded to bytes as UTF-8

### After:

```
unicode(obj).encode("utf8")
```

**When `obj` is `unicode`**:

1. `unicode(obj)`: unicode as it is
2. `.encode("utf-8")`: encoded to bytes as UTF-8

**When `obj` is `str`**:

1. `unicode(obj)`: decoded to unicode by the system default (`ascii`)
2. `.encode("utf-8")`: encoded to bytes as UTF-8

**When `obj` is other types**:

1. `unicode(obj)`: calls `__unicode__()`. It falls back to `__str__()` if `__unicode__()` is not defined.
2. `.encode("utf-8")`: encoded to bytes as UTF-8
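The **When `obj` is `unicode`** case above can be reproduced without a Python 2 interpreter. The following is only a rough sketch in Python 3 that spells out the implicit ASCII steps by hand; `before`, `after`, and `try_before` are hypothetical helper names for illustration and are not part of Spark or of this patch.

```python
# Python 3 sketch that makes Python 2's implicit ASCII steps explicit.
# `before`, `after` and `try_before` are hypothetical names, not Spark functions.

DEFAULT = "ascii"  # Python 2's system default encoding, as stated in the comment


def before(text):
    """Emulate Python 2 `str(obj).encode("utf-8")` for a unicode `obj`."""
    as_bytes = text.encode(DEFAULT)      # str(obj): implicit encode to bytes
    as_text = as_bytes.decode(DEFAULT)   # .encode(): implicit decode comes first
    return as_text.encode("utf-8")       # ...then the requested UTF-8 encode


def after(text):
    """Emulate Python 2 `unicode(obj).encode("utf-8")` for a unicode `obj`."""
    return text.encode("utf-8")          # no round trip through ASCII


def try_before(text):
    """Return the encoded bytes, or the name of the error that was raised."""
    try:
        return before(text)
    except UnicodeError as exc:
        return type(exc).__name__


print(try_before("abc"))           # ASCII-only input survives the round trip
print(try_before("\u6d4b\u8bd5"))  # non-ASCII input fails at the first step
print(after("\u6d4b\u8bd5"))       # the same input encodes cleanly to UTF-8
```

ASCII-only data behaves the same on both paths, which is presumably why the problem only shows up once non-ASCII text is piped.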
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Merged build finished. Test PASSed.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86375/ Test PASSed.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18277 **[Test build #86375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86375/testReport)** for PR 18277 at commit [`8c88595`](https://github.com/apache/spark/commit/8c88595125fbd328a3ed2383a9e96db7ad96f0e9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18277 This change looks reasonable to me for now, but I'm also concerned about the behavior change. A note in the release notes should be good, or maybe we need a note in the migration guide in the `RDD Programming Guide`.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18277 **[Test build #86375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86375/testReport)** for PR 18277 at commit [`8c88595`](https://github.com/apache/spark/commit/8c88595125fbd328a3ed2383a9e96db7ad96f0e9).
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18277 retest this please.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/18277 Jenkins OK to test.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Can one of the admins verify this patch?
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85574/ Test PASSed.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18277 **[Test build #85574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85574/testReport)** for PR 18277 at commit [`8c88595`](https://github.com/apache/spark/commit/8c88595125fbd328a3ed2383a9e96db7ad96f0e9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Merged build finished. Test PASSed.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18277 **[Test build #85574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85574/testReport)** for PR 18277 at commit [`8c88595`](https://github.com/apache/spark/commit/8c88595125fbd328a3ed2383a9e96db7ad96f0e9).
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277 ok to test
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Can one of the admins verify this patch?
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18277 It seems okay at first glance, without a close look. Let me take a closer look soon, if I can get to it first.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/18277 What do you think, @HyukjinKwon? I think this is probably a reasonable fix, but we might break the code of some people who have been depending on the bug.
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user sasameti commented on the issue: https://github.com/apache/spark/pull/18277 How do I apply the patch?
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user chaoslawful commented on the issue: https://github.com/apache/spark/pull/18277 Well, the difference comes from the divergent default behavior of repr() between Python 2 and Python 3. The previous code does no better than the patched one in this respect, and it also causes trouble when processing unicode strings. On the other hand, the pipe() action by definition involves an implicit serialization from any type to bytes, so IMHO the application itself should take care of serializing/deserializing data consistently before/after the pipe() action, IF it wants to always get the same behavior in different environments.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
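One way an application could follow the suggestion above is to choose its own line format instead of relying on the implicit string conversion. The sketch below uses JSON purely as an assumed example; the thread does not propose a specific format, and `to_line` / `from_line` are hypothetical helper names.

```python
import json


def to_line(record):
    """Serialize one record to a single line of text (non-ASCII kept as-is)."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return json.dumps(record, ensure_ascii=False)


def from_line(line):
    """Parse a line produced by `to_line` back into a Python object."""
    return json.loads(line)


record = ["\u6d4b\u8bd5", "1"]
line = to_line(record)
print(line)
print(from_line(line) == record)

# With PySpark, and assuming the external command passes each line through
# unchanged (as `cat` does), the helpers would wrap the pipe step like this:
#   rdd.map(to_line).pipe("cat").map(from_line)
```

Because the application controls both ends of the conversion, the piped text should no longer depend on how each Python version renders a list.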
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18277

When you try this on an RDD of arrays of unicode strings, the result in Python 2 looks a bit weird.

    Using Python version 2.7.12 (default, Jul 1 2016 15:12:24)
    SparkSession available as 'spark'.
    >>> data = [u'\u6d4b\u8bd5', '1']
    >>> rdd = sc.parallelize(data)
    >>> result = rdd.pipe('cat').collect()
    >>> result
    [u'\u6d4b\u8bd5', u'1']
    >>> data = [[u'\u6d4b\u8bd5', '1'], ['1', '2']]
    >>> rdd = sc.parallelize(data)
    >>> rdd.collect()
    [[u'\u6d4b\u8bd5', '1'], ['1', '2']]
    >>> result = rdd.pipe('cat').collect()
    >>> result
    [u"[u'\\u6d4b\\u8bd5', '1']", u"['1', '2']"] # looks weird and different from Python 3.
    >>>

    Using Python version 3.5.2 (default, Nov 17 2016 17:05:23)
    SparkSession available as 'spark'.
    >>> data = [u'\u6d4b\u8bd5', '1']
    >>> rdd = sc.parallelize(data)
    >>> result = rdd.pipe('cat').collect()
    >>> result
    ['\u6d4b\u8bd5', '1']
    >>> data = [[u'\u6d4b\u8bd5', '1'], ['1', '2']]
    >>> rdd = sc.parallelize(data)
    >>> rdd.collect()
    [['\u6d4b\u8bd5', '1'], ['1', '2']]
    >>> result = rdd.pipe('cat').collect()
    >>> result
    ["['\u6d4b\u8bd5', '1']", "['1', '2']"]
    >>>
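The nested-list result above comes from how each Python version renders a list as text. As a rough illustration only, the Python 3 built-in `ascii()` escapes non-ASCII characters in a way similar to Python 2's `repr()` (Python 2 additionally prints the `u` prefix, which `ascii()` does not), so both renderings can be compared in one interpreter:

```python
record = ["\u6d4b\u8bd5", "1"]

py3_style = str(record)  # Python 3 keeps printable non-ASCII characters
escaped = ascii(record)  # escapes non-ASCII, similar to Python 2's repr()

print(py3_style)
print(escaped)
print(py3_style == escaped)  # the two text forms of the same list differ
```

The list itself is identical in both cases; only its text form differs, and that text form is what gets written to the piped command.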
[GitHub] spark issue #18277: [SPARK-20947][PYTHON] Fix encoding/decoding error in pip...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18277 Can one of the admins verify this patch?