Github user superbobry commented on the issue:

    https://github.com/apache/spark/pull/21157

> Does the test even pass?

The tests should pass modulo the ones specifically checking the behaviour being removed. I think the failing RDD test is in this group as well.

> Why don't we try to fix it rather than removing it?

I might be overly pessimistic, but I don't see how we can make the patch work in all cases without making the implementation more magical and, as a result, producing even more confusing error messages when things go wrong. Consider, for instance, a widespread pattern:

```python
class Foo(namedtuple("Foo", [])):
    def foo(self):
        return 42
```

If the outer `Foo` class does not explicitly customize pickling, it uses the "fallback" implementation added by `_hijack_namedtuple`, which only knows about the inner namedtuple class. Therefore, confusingly enough, `issubclass(pickle.loads(foo), Foo)` is False (as detailed in [2]).

What can we do about this? We would somehow need to serialize the **full definition** of the outer `Foo` class alongside every instance. Maybe this could be done by recursively pickling the class `__name__`, `__bases__` and `__dict__`, but `__dict__` could contain other hard-to-pickle objects, such as user-defined methods. Should we serialize those in deconstructed form as well? These are tough questions, and I think they are better left outside the scope of PySpark.

That said, an alternative to completely removing the patch might be deprecating it and advertising `cloudpickle` for workloads using namedtuples (or even making it the default?). I've played with `cloudpickle` a little, and it seems to solve the aforementioned issues in a consistent manner. The price, however, is the added overhead:

```python
>>> len(pickle.dumps(Foo()))
23
>>> len(cloudpickle.dumps(Foo()))
3538
```

or, even more extreme,

```python
>>> class A: pass
...
>>> len(cloudpickle.dumps(A()))
177
```

What do you think?
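To make the subclass problem above concrete, here is a minimal sketch that emulates (outside PySpark) what a fallback like `_hijack_namedtuple`'s effectively does: serialize only the inner namedtuple's name, fields and values. The names here (`_restore`, a `Foo` with an `x` field) are hypothetical, not Spark's actual internals:

```python
import pickle
from collections import namedtuple


def _restore(name, fields, values):
    # Rebuilds a *fresh* namedtuple class -- the user's subclass,
    # with its methods, is lost.
    return namedtuple(name, fields)(*values)


class Foo(namedtuple("FooBase", ["x"])):
    def foo(self):
        return 42

    def __reduce__(self):  # mimics the hijacked pickling path
        return (_restore, ("Foo", list(self._fields), list(self)))


obj = pickle.loads(pickle.dumps(Foo(1)))
print(isinstance(obj, Foo))  # False: the subclass (and foo()) is gone
print(obj.x)                 # 1: the field values do survive
```

The roundtrip preserves the data but silently drops the subclass, which is exactly the confusing failure mode described above.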
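And to illustrate why the "recursively pickle `__name__`, `__bases__` and `__dict__`" idea is not straightforward either: stdlib pickle serializes functions by reference (module plus qualified name), not by value, so a method whose defining class is not importable on the other side cannot be shipped at all. A hypothetical example:

```python
import pickle


def make_class():
    # A class (and hence its method) that exists only locally,
    # standing in for a class the remote worker cannot import.
    class Hidden:
        def method(self):
            return 42
    return Hidden


Hidden = make_class()
try:
    pickle.dumps(vars(Hidden)["method"])
except Exception as exc:
    # Raises PicklingError or AttributeError, depending on the
    # Python version: "Can't pickle local object ..."
    print("cannot pickle:", exc)
```

Serializing such methods "in deconstructed form" is precisely what `cloudpickle` does, at the size cost shown above.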
[2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html