[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Closing this PR to continue the discussion in the new one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 @HyukjinKwon done in #23008. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 I meant to use https://github.com/apache/spark/blob/a97001d21757ae214c86371141bd78a376200f66/python/pyspark/serializers.py#L583 Instead of https://github.com/apache/spark/blob/a97001d21757ae214c86371141bd78a376200f66/python/pyspark/serializers.py#L561 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > I think people do defined NamedTuples in Notebooks, so I'm going to stick with -1. @holdenk I understand your point, but there is still something we can do without breaking existing code relying on namedtuple serialization. Option 1: switch to cloudpickle as suggested by @HyukjinKwon. Option 2: #21180. What would be your choice between the two? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 @HyukjinKwon do you mean change the default serializer to cloudpickle and remove _hack_namedtuple? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Adding @gatorsmile and @cloud-fan as well since this might be potentially breaking changes for 3.0 release (it affects RDD operation only with namedtuple in certain case tho) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 And you can also run profiler to show the performance effect. See https://github.com/apache/spark/pull/19246#discussion_r139874732 to run the profile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 You can just replace it to CloudPickler, remove changes at tests, and push that commit here to show no case is broken --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Yea, so to avoid to break, we could change the default pickler to CloudPickler or document this workaround. @superbobry, can you check if the case can be preserved if we use CloudPickler instead? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 `cloudpickle` does indeed support pickling namedtuples. Maybe the way to go is to remove the patch advertise `cloudpickle` serializer for projects relying on the old behaviour. Wdyt @holdenk? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 The workaround is to use CloudPickler btw. Technically we many cases that normal pickler does not support. This one specific case (namedtuple) was allowed by this weird hack. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > If removing the hack entirely is going to brake named tuples defined in the repl I'm a -1 on that change. Yes, but it might be OK for two reasons: people rarely define namedtuples in the REPL (hypothesis); and non-namedtuple classes do not work in the REPL even with the hack. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/21157 If removing the hack entirely is going to brake named tuples defined in the repl I'm a -1 on that change. While we certainly are more free to make breaking API changes in a majour version release we still have to think through the scope of the change we're going to be pushing onto users and that's pretty large. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 To keep the current behaviour without the workaround above (using CloudPickler), the weird fix is required (https://github.com/apache/spark/pull/21180) where some private methods should be used. I also gave few quick tries but looks not quite easy to fix. It is a hack to remove and looks difficult to remove without any behaviour change but still now a rough workaround looked possible (CloudPickler) so I inclined to get rid of it at Spark 3.0 for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 To keep the current behaviour without the workaround above (using CloudPickler), the weird fix is required (https://github.com/apache/spark/pull/21180) where some private methods should be used. I also gave few quick tries but looks not quite easy to fix. It is a hack to remove and looks difficult to remove. Now a rough workaround was possible (CloudPickler) so I inclined to get rid of it at Spark 3.0 for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 > Ok it looks like it was @HyukjinKwon who suggested that we remove this hack in general rather than the partial work around can I get your thoughts on why? It seems like the partial work around would give us the best of both worlds (e.g. we don't break peoples existing Spark code and we handle Python tuples better). Sorry for the late response. Yes, I spent some time to take a look for this named tuple hack, and my impression was that we should have not added such fixes to only allow named tuple pickling. The named tuple hack was introduced for both cloudpickle (SQL path) and normal pickle path, if I am not mistaken. Cloudpickle at PySpark side now supports it so workaround to allow the cases above should be to use CloudPickler when it's possible. I think PySpark API exposes this pickler (see the `SparkContext`'s `__init__`) (@superbobry, mind if I ask to document this workaround and add a test (and see if it really works?) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Yes, it will break IPython notebooks as well. I wonder how often people actually defined namedtuples in a notebook? Emitting a warning is a less extreme option, yes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/21157 I mean, we could warn if we are doing the hijacking and not break peoples pipelines? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21157 But that would break both ipython notebooks and repl right? Pretty significant breaking change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Yes, that is correct. That is why I think hijacking behaviour should be removed. It silently slows down the job and does not notify the user that a trivial change such as making the namedtuple importable could result in a speedup. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Nope, the job I was referring to is not open source; but I guess the speedup is easy to justify: much less payload and faster deserialization: ``` >>> from collections import namedtuple >>> Stats = namedtuple("Stats", ["sample_mean", "sample_variance"]) >>> import pickle >>> len(pickle.dumps(Stats(42, 42))) 31 >>> len(pickle.dumps(("Stats", Stats._fields, (42, 42 68 ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/21157 Do you have the code for demonstrating the 2x speed up @superbobry ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/21157 Ok it looks like it was @HyukjinKwon who suggested that we remove this hack in general rather than the partial work around can I get your thoughts on why? It seems like the partial work around would give us the best of both worlds (e.g. we don't break peoples existing Spark code and we handle Python tuples better). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > Is it possible to keep the current hack for things which can't be pickled, but remove the hack in the situation where the namedtuple is well behaved and it could be pickled directly by cloudpickle? @holdenk yes, this has been proposed in #21180, and later rejected in favour of this one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/21157 Is it possible to keep the current hack for things which can't be pickled, but remove the hack in the situation where the namedtuple is well behaved and it could be pickled directly by cloudpickle? That way we don't have a functionality regression but we also improve handling of named tuples more generally. Even if so, it would probably be best to wait for 3.0 since this is a pretty core change in terms of PySpark. Before you put in the work though let's see if that the consensus approach (if possible). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96815/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96815 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96815/testReport)** for PR 21157 at commit [`2addefb`](https://github.com/apache/spark/commit/2addefb9ae2ec12f30e0e939f7782b521efccdd0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96815 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96815/testReport)** for PR 21157 at commit [`2addefb`](https://github.com/apache/spark/commit/2addefb9ae2ec12f30e0e939f7782b521efccdd0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96813/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96813/testReport)** for PR 21157 at commit [`9c69047`](https://github.com/apache/spark/commit/9c690474986be63e41eed7aba59399c5a154b72e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96813/testReport)** for PR 21157 at commit [`9c69047`](https://github.com/apache/spark/commit/9c690474986be63e41eed7aba59399c5a154b72e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 @rxin the ones mentioned as 1 and 2 in the PR description: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21157 @superbobry which blog were you referring to? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96760/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96760/testReport)** for PR 21157 at commit [`765362e`](https://github.com/apache/spark/commit/765362efb1f40fde3820704fee1628c90d6c0566). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96760/testReport)** for PR 21157 at commit [`765362e`](https://github.com/apache/spark/commit/765362efb1f40fde3820704fee1628c90d6c0566). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96758/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96758 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96758/testReport)** for PR 21157 at commit [`65dc266`](https://github.com/apache/spark/commit/65dc2662287c64fdd26e7d46f5da5354e4e9eff6). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96758/testReport)** for PR 21157 at commit [`65dc266`](https://github.com/apache/spark/commit/65dc2662287c64fdd26e7d46f5da5354e4e9eff6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96754/testReport)** for PR 21157 at commit [`ac6c7bc`](https://github.com/apache/spark/commit/ac6c7bce70b1929d54162090a0cb5df0cf53041a). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96754/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96754/testReport)** for PR 21157 at commit [`ac6c7bc`](https://github.com/apache/spark/commit/ac6c7bce70b1929d54162090a0cb5df0cf53041a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > so this change would introduce a pretty big regression? The change does introduce a regression as some namedtuples will become unpicklable. However, it makes pickling in PySpark more predictable and robust (see the linked blog posts for details). Sidenote: `mllib.fpm` contains another case of non-picklable namedtuple -- "nested": ```python >>> class A: ... class B(namedtuple("B", [])): pass ... >>> import pickle >>> pickle.loads(pickle.dumps(A.B)) Traceback (most recent call last): [...] pickle.PicklingError: Can't pickle : it's not found as __main__.B ``` It is pickleable with `_hijack_namedtuple` enabled, because the namedtuple class is recreated during unpickling. However, I think that "nested" classes are an anti-pattern in Python, because the outer class is not a proper namespace (like locals/globals/...): ```python >>> A.B ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/21157 so this change would introduce a pretty big regression? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96706/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96706/testReport)** for PR 21157 at commit [`7f2ad87`](https://github.com/apache/spark/commit/7f2ad870dc82e058a76403099e2fc98d9f62e899). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > So, real downside of removing this now is we disallow global scope namedtuple. Importable namedtuples and their subclasses could still be used inside an RDD. Only the namedtuples defined in the REPL would fail to pickle once this PR is merged. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Woah. Okay. Let me add some guys interested in this again (@felixcheung looks already here) - @ueshin, @BryanCutler, @holdenk amd @JoshRosen Additionally @rxin too. Here's my understanding: Reynold, here's what's going on: this is about the namedtuple hack removal we added a long long while ago. This hack isn't now super crucial since cloudpickle can handle this by its own without this hack. If we remove this, in case of normal RDD operations, that named tuple should be defined in local scope. If they are defined in global scope, it fails to pickle in the normal pickle (not cloudpickle which SQL code path uses). 1. So, real downside of removing this now is we disallow global scope namedtuple. 2. actual advantage of this is, that we can get rid of weird behaviours by this hack. For instance, see the PR description (both links https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html and https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html). @superbobry, wanna add some more words? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #96706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96706/testReport)** for PR 21157 at commit [`7f2ad87`](https://github.com/apache/spark/commit/7f2ad870dc82e058a76403099e2fc98d9f62e899). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Reopened and rebased to be merged into the 3.X branch. See discussion in #21180. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Closing in favour of #21180. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21157 agree we should avoid removing test code --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Please go ahead if there's another approach to avoid to remove but fix it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 One improvement we can make is change the patch to bypass namedtuples which are importable. This would resolve the issues with namedtuples coming from third-party libraries. I can open a new PR doing this, wdyt? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Yea, my point is that it breaks other codes without a warning at all. We already have the copy of cloudpickle. The best should be a deduplicated fix for it, shouldn't it? I am still solid -1 on the complete removal for Spark 2.x. We should find another way first for now. Removing out is the last resort. I would consider the complete removal in Spark 3.x after having sufficient discussions. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89883/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #89883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89883/testReport)** for PR 21157 at commit [`c67ce29`](https://github.com/apache/spark/commit/c67ce29a3279073812070e6ff4bb2e2624961b36). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #89883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89883/testReport)** for PR 21157 at commit [`c67ce29`](https://github.com/apache/spark/commit/c67ce29a3279073812070e6ff4bb2e2624961b36). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 Yes, we can backport some of the cloudpickle code to make the patch less fragile. This would be a nontrivial change in an already complex code, but I'd be happy to sketch this if there's a consensus on the ML Also, note that even without the patch it is possible to have an RDD of namedtuples as long as the namedtuple classes are defined inside an importable module, i.e. NOT inside a function/REPL. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89878/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #89878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89878/testReport)** for PR 21157 at commit [`67c4f67`](https://github.com/apache/spark/commit/67c4f6707aab670b9cde5a3afa34fda3abbbf46d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 I don't like the hack too but the complete removal just basically means we are going to drop namedtuple supports in RDD without, for example, any deprecation warnings. Spark is being super conservative and this's going to break compatibility. So, I was thinking we could do this for Spark 3.0. We already start to talk about this. This should probably be something we should discuss in the mailing list since it's a breaking change. One thing clear is that the complete removal should target 3.0.0. For now, yea, `cloudpickle` solves it but probably it's less performant. Logically, `cloudpickle` [fixed it](https://github.com/cloudpipe/cloudpickle/pull/113) (our cloudpickle copy is matched to [0.4.3](https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.3)) and we can take after the fix with the normal pickle side, aren't we? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #89878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89878/testReport)** for PR 21157 at commit [`67c4f67`](https://github.com/apache/spark/commit/67c4f6707aab670b9cde5a3afa34fda3abbbf46d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user superbobry commented on the issue: https://github.com/apache/spark/pull/21157 > Does the test even pass? The tests should pass module the tests specifically checking the behaviour being removed. I think the failing RDD test is in this group as well. > Why don't we try to fix it rather than removing out? I might be overly pessimistic but I don't see how we can make the patch work in all cases without making the implementation more magical, and as a result, producing even more confusing error messages when things go wrong. Consider, for instance, a widespread pattern ```python class Foo(namedtuple("Foo", [])): def foo(self): return 42 ``` If the outer `Foo` class does not explicitly customize pickling, it would use the "fallback" implementation added by `_hijack_namedtuple`, which only knows about the inner namedtuple class. Therefore, confusingly enough `issubclass(pickle.loads(foo), Foo)` is False (as detailed in [2]). What can we do about this? We somehow need to serialize the **full definition** of the outer `Foo` class alongside every instance. Maybe this can be done by recursively pickling the class `__name__`, `__bases__` and `__dict__`, but `__dict__` could have some other hard-to-pickle objects like user-defined methods. Should we serialize these in the deconstructed form as well? These are tough questions, and I think they are better left outside the scope of PySpark. That said, I think an alternative to completely removing the patch might be deprecating it, and advertizing `cloudpickle` for workloads using namedtuples (or even making it the default?). I've played with `cloudpickle` a little bit, and it seems to solve the aforementioned issues in a consistent manner. The price, however, is the added overhead: ```python >>> len(pickle.dumps(Foo())) 23 >>> len(cloudpickle.dumps(Foo())) 3538 ``` or, even more extreme, ```python >>> class A: pass ... >>> len(cloudpickle.dumps(A())) 177 ``` What do you think? [2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Let's think about other ways to fix them until 3.0.0. I think the complete removal is the last resort we could consider for 3.0.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Solid -1 if it breaks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89865/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #89865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89865/testReport)** for PR 21157 at commit [`eadc0c8`](https://github.com/apache/spark/commit/eadc0c8af853a57ee80f5e80fb708451931eedc0). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21157 **[Test build #89865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89865/testReport)** for PR 21157 at commit [`eadc0c8`](https://github.com/apache/spark/commit/eadc0c8af853a57ee80f5e80fb708451931eedc0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 Why don't we try to fix it rather than removing out? Does the test even pass? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21157 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21157 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org