Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Closing this PR to continue the discussion in the new one.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
@HyukjinKwon done in #23008.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands,
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
I meant to use
https://github.com/apache/spark/blob/a97001d21757ae214c86371141bd78a376200f66/python/pyspark/serializers.py#L583
Instead of
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> I think people do defined NamedTuples in Notebooks, so I'm going to stick
with -1.
@holdenk I understand your point, but there is still something we can do
without breaking existing
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
@HyukjinKwon do you mean change the default serializer to cloudpickle and
remove _hack_namedtuple?
---
-
To unsubscribe,
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
Adding @gatorsmile and @cloud-fan as well since this might be potentially
breaking changes for 3.0 release (it affects RDD operation only with namedtuple
in certain case tho)
---
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
And you can also run profiler to show the performance effect. See
https://github.com/apache/spark/pull/19246#discussion_r139874732 to run the
profile
---
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
You can just replace it to CloudPickler, remove changes at tests, and push
that commit here to show no case is broken
---
-
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
Yea, so to avoid to break, we could change the default pickler to
CloudPickler or document this workaround. @superbobry, can you check if the
case can be preserved if we use CloudPickler
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
`cloudpickle` does indeed support pickling namedtuples. Maybe the way to go
is to remove the patch advertise `cloudpickle` serializer for projects relying
on the old behaviour. Wdyt @holdenk?
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
The workaround is to use CloudPickler btw. Technically we many cases that
normal pickler does not support. This one specific case (namedtuple) was
allowed by this weird hack.
---
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> If removing the hack entirely is going to brake named tuples defined in
the repl I'm a -1 on that change.
Yes, but it might be OK for two reasons: people rarely define namedtuples
in
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21157
If removing the hack entirely is going to brake named tuples defined in the
repl I'm a -1 on that change. While we certainly are more free to make breaking
API changes in a majour version release
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
To keep the current behaviour without the workaround above (using
CloudPickler), the weird fix is required
(https://github.com/apache/spark/pull/21180) where some private methods should
be
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
To keep the current behaviour without the workaround above (using
CloudPickler), the weird fix is required
(https://github.com/apache/spark/pull/21180) where some private methods should
be
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
> Ok it looks like it was @HyukjinKwon who suggested that we remove this
hack in general rather than the partial work around can I get your thoughts on
why? It seems like the partial work
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Yes, it will break IPython notebooks as well. I wonder how often people
actually defined namedtuples in a notebook?
Emitting a warning is a less extreme option, yes.
---
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21157
I mean, we could warn if we are doing the hijacking and not break peoples
pipelines?
---
-
To unsubscribe, e-mail:
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/21157
But that would break both ipython notebooks and repl right? Pretty
significant breaking change.
---
-
To unsubscribe, e-mail:
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Yes, that is correct. That is why I think hijacking behaviour should be
removed. It silently slows down the job and does not notify the user that a
trivial change such as making the namedtuple
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Nope, the job I was referring to is not open source; but I guess the
speedup is easy to justify: much less payload and faster deserialization:
```
>>> from collections import
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21157
Do you have the code for demonstrating the 2x speed up @superbobry ?
---
-
To unsubscribe, e-mail:
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21157
Ok it looks like it was @HyukjinKwon who suggested that we remove this hack
in general rather than the partial work around can I get your thoughts on why?
It seems like the partial work around
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> Is it possible to keep the current hack for things which can't be
pickled, but remove the hack in the situation where the namedtuple is well
behaved and it could be pickled directly by
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21157
Is it possible to keep the current hack for things which can't be pickled,
but remove the hack in the situation where the namedtuple is well behaved and
it could be pickled directly by cloudpickle?
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test PASSed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96815/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96815 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96815/testReport)**
for PR 21157 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96815 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96815/testReport)**
for PR 21157 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96813/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96813 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96813/testReport)**
for PR 21157 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96813 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96813/testReport)**
for PR 21157 at commit
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
@rxin the ones mentioned as 1 and 2 in the PR description:
https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/21157
@superbobry which blog were you referring to?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96760/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96760 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96760/testReport)**
for PR 21157 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96760 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96760/testReport)**
for PR 21157 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96758/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96758 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96758/testReport)**
for PR 21157 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96758 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96758/testReport)**
for PR 21157 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96754/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96754 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96754/testReport)**
for PR 21157 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96754 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96754/testReport)**
for PR 21157 at commit
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> so this change would introduce a pretty big regression?
The change does introduce a regression as some namedtuples will become
unpicklable. However, it makes pickling in PySpark more
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/21157
so this change would introduce a pretty big regression?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96706/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96706 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96706/testReport)**
for PR 21157 at commit
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> So, real downside of removing this now is we disallow global scope
namedtuple.
Importable namedtuples and their subclasses could still be used inside an
RDD. Only the namedtuples
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
Woah. Okay. Let me add some guys interested in this again (@felixcheung
looks already here) - @ueshin, @BryanCutler, @holdenk amd @JoshRosen
Additionally @rxin too. Here's my
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #96706 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96706/testReport)**
for PR 21157 at commit
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
ok to test
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Reopened and rebased to be merged into the 3.X branch. See discussion in
#21180.
---
-
To unsubscribe, e-mail:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Can one of the admins verify this patch?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Can one of the admins verify this patch?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Can one of the admins verify this patch?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Closing in favour of #21180.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands,
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/21157
agree we should avoid removing test code
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
Please go ahead if there's another approach to avoid to remove but fix it.
---
-
To unsubscribe, e-mail:
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
One improvement we can make is change the patch to bypass namedtuples which
are importable. This would resolve the issues with namedtuples coming from
third-party libraries. I can open a new PR
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
Yea, my point is that it breaks other codes without a warning at all. We
already have the copy of cloudpickle. The best should be a deduplicated fix for
it, shouldn't it?
I am still
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89883/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #89883 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89883/testReport)**
for PR 21157 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #89883 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89883/testReport)**
for PR 21157 at commit
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
Yes, we can backport some of the cloudpickle code to make the patch less
fragile. This would be a nontrivial change in an already complex code, but I'd
be happy to sketch this if there's a
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89878/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #89878 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89878/testReport)**
for PR 21157 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
I don't like the hack too but the complete removal just basically means we
are going to drop namedtuple supports in RDD without, for example, any
deprecation warnings. Spark is being super
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #89878 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89878/testReport)**
for PR 21157 at commit
Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/21157
> Does the test even pass?
The tests should pass module the tests specifically checking the behaviour
being removed. I think the failing RDD test is in this group as well.
>
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
Let's think about other ways to fix them until 3.0.0. I think the complete
removal is the last resort we could consider for 3.0.0.
---
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
Solid -1 if it breaks.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89865/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Merged build finished. Test FAILed.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #89865 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89865/testReport)**
for PR 21157 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21157
**[Test build #89865 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89865/testReport)**
for PR 21157 at commit
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
Why don't we try to fix it rather than removing out? Does the test even
pass?
---
-
To unsubscribe, e-mail:
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21157
ok to test
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Can one of the admins verify this patch?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21157
Can one of the admins verify this patch?
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
86 matches
Mail list logo