GitHub user superbobry opened a pull request: https://github.com/apache/spark/pull/23008
[SPARK-22674][PYTHON] Removed the namedtuple pickling patch

## What changes were proposed in this pull request?

Prior to this PR PySpark patched ``collections.namedtuple`` to make namedtuple instances serializable even if the namedtuple class has been defined outside of ``globals()``, e.g.

    def do_something():
        Foo = namedtuple("Foo", ["foo"])
        sc.parallelize(range(1)).map(lambda _: Foo(42))

The patch changed the pickled representation of the namedtuple instance to include the structure of namedtuple class, and recreate the class on each unpickling. This behaviour causes hard to diagnose failures both in the user code with namedtuples, as well as third-party libraries relying on them. See [1] and [2] for details.

[1]: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html
[2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html

The PR changes the default serializer to `CloudPickleSerializer` which natively supports pickling namedtuples and does not require the aforementioned patch. To the best of my knowledge, this is **not** a breaking change.

## How was this patch tested?

PySpark test suite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/superbobry/spark no-hijack-namedtuple

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23008.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23008

----
commit 36ff69717c411e72f60c3006eb6a491c3d88862d
Author: Sergei Lebedev <superbobry@...>
Date:   2018-11-11T21:05:07Z

    Removed namedtuple hack and made cloudpickle the default serializer

    This is a followup of the discussion in #21157. See the PR and the linked JIRA ticket for context and motivation.

commit 9a818797603f5804b32202d28474493c80966f58
Author: Sergei Lebedev <superbobry@...>
Date:   2018-11-11T22:11:02Z

    Changed SerializationTestCase to use cloudpickle

----
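
For readers unfamiliar with the failure mode described in the PR body, the following is a minimal standard-library sketch of the idea, not PySpark's actual patch. The helpers `dumps_with_structure` and `loads_with_structure` are hypothetical names introduced only for illustration. They mimic "include the structure of the namedtuple class and recreate the class on each unpickling", and show why the restored instance no longer belongs to the original class.

```python
import pickle
from collections import namedtuple


def dumps_with_structure(obj):
    """Hypothetical helper: pickle (class name, field names, values)."""
    cls = type(obj)
    return pickle.dumps((cls.__name__, cls._fields, tuple(obj)))


def loads_with_structure(data):
    """Hypothetical helper: recreate the class, then the instance."""
    name, fields, values = pickle.loads(data)
    cls = namedtuple(name, fields)  # a brand-new class on every call
    return cls(*values)


def make_foo():
    # A namedtuple class defined inside a function, as in the PR description.
    Foo = namedtuple("Foo", ["foo"])
    return Foo, Foo(42)


foo_cls, original = make_foo()

# Plain pickle stores classes by name and cannot find a function-local class.
try:
    pickle.dumps(original)
    plain_pickle_ok = True
except (pickle.PicklingError, AttributeError):
    plain_pickle_ok = False

restored = loads_with_structure(dumps_with_structure(original))

print(plain_pickle_ok)                # plain pickle result for the local class
print(restored == original)           # values survive the round trip
print(isinstance(restored, foo_cls))  # class identity does not
```

Under this sketch the values round-trip, but `isinstance` checks against the original class fail, and anything defined on the original class beyond its fields is not carried over. This is presumably the kind of silent breakage the linked posts [1] and [2] describe.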