[jira] [Commented] (SPARK-22674) PySpark breaks serialization of namedtuple subclasses

Sergei Lebedev (JIRA) Tue, 13 Mar 2018 06:43:25 -0700

    [ https://issues.apache.org/jira/browse/SPARK-22674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396948#comment-16396948 ]


Sergei Lebedev commented on SPARK-22674:
----------------------------------------

Just out of curiosity: what is the original motivation for hacking the 
namedtuple serialization?

I've been exploring the quirks caused by {{_hijack_namedtuple}}, and I feel 
PySpark might be better off without it. Some of the reasons are:
 * it silently incurs a performance cost for namedtuples defined outside of 
{{__main__}} (both user-defined and third-party);
 * it leads to confusing and hard-to-debug error messages in the presence of 
inheritance;
 * it makes namedtuple classes behave differently from ordinary classes, which 
cannot be serialized if they are defined in the REPL.

What do you think?

> PySpark breaks serialization of namedtuple subclasses
> -----------------------------------------------------
>
>                 Key: SPARK-22674
>                 URL: https://issues.apache.org/jira/browse/SPARK-22674
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: Jonas Amrich
>            Priority: Major
>
> PySpark monkey-patches the namedtuple class to make it serializable; however, 
> this breaks serialization of its subclasses. With the current implementation, 
> any subclass will be serialized (and deserialized) as its parent namedtuple. 
> Consider this code, which will fail with {{AttributeError: 'Point' object has 
> no attribute 'sum'}}:
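> {code}
> from collections import namedtuple
> # Assumes `spark` is an active SparkSession, e.g. in the pyspark shell.
> Point = namedtuple("Point", "x y")
> class PointSubclass(Point):
>     def sum(self):
>         return self.x + self.y
> rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
> # The collected element comes back as a plain Point, so sum() is missing.
> rdd.collect()[0][0].sum()
> {code}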
> Moreover, as PySpark hijacks all namedtuples in the main module, importing 
> pyspark breaks serialization of namedtuple subclasses even in code which is 
> not related to Spark or distributed execution. I don't see any clean solution 
> to this; a possible workaround may be to limit the serialization hack to 
> direct namedtuple subclasses, as in 
> https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204
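
The failure described above can be reproduced without Spark. What follows is a minimal sketch, assuming the hack works by attaching a {{__reduce__}} method to the generated class that rebuilds instances from the class name and field names. The helpers {{restore_sketch}} and {{hijack_sketch}} are hypothetical stand-ins written for illustration, not PySpark's actual code. Because a subclass inherits the patched {{__reduce__}}, it is rebuilt as the parent namedtuple and loses its methods.

```python
from collections import namedtuple

# Cache of classes rebuilt on the "receiving" side (hypothetical helper state).
_restored_classes = {}


def restore_sketch(name, fields, values):
    """Rebuild an instance from the class name, field names and values only."""
    key = (name, fields)
    cls = _restored_classes.get(key)
    if cls is None:
        cls = namedtuple(name, fields)
        _restored_classes[key] = cls
    return cls(*values)


def hijack_sketch(cls):
    """Attach a __reduce__ that captures the name and fields of `cls` itself."""
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        # `name` and `fields` belong to the patched class, not to type(self),
        # so subclasses inherit a recipe that describes their parent.
        return (restore_sketch, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls


Point = hijack_sketch(namedtuple("Point", "x y"))


class PointSubclass(Point):
    def sum(self):
        return self.x + self.y


original = PointSubclass(1, 1)

# Apply the reduce recipe by hand, as pickle would when loading the object.
func, args = original.__reduce__()
restored = func(*args)

print(type(original).__name__, type(restored).__name__)  # PointSubclass Point
print(hasattr(restored, "sum"))  # False
```

The values survive the round trip, but the restored object is a plain {{Point}}, which matches the {{AttributeError}} reported in the description.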



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
