Jonas Amrich created SPARK-22674:
------------------------------------

             Summary: PySpark breaks serialization of namedtuple subclasses
                 Key: SPARK-22674
                 URL: https://issues.apache.org/jira/browse/SPARK-22674
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.0
            Reporter: Jonas Amrich


PySpark monkey-patches the namedtuple class to make it serializable; however, 
this breaks serialization of its subclasses. With the current implementation, any 
subclass will be serialized (and deserialized) as its parent namedtuple. 
Consider this code, which will fail with {{AttributeError: 'Point' object has 
no attribute 'sum'}}:

{code}
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
rdd.collect()[0][0].sum()
{code}
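The root cause can be illustrated without Spark. The sketch below is a simplified re-creation of the hack in PySpark's {{serializers.py}} (the {{_restore}} / {{_hack_namedtuple}} names mirror the real ones, but this is an illustrative approximation, not the actual implementation): the patched {{__reduce__}} closes over the *parent's* name and fields, so a subclass instance round-trips through pickle as the plain namedtuple and loses its type:

```python
import pickle
from collections import namedtuple

def _restore(name, fields, value):
    # Rebuild a plain namedtuple from its name, fields and values.
    return namedtuple(name, fields)(*value)

def _hack_namedtuple(cls):
    name = cls.__name__
    fields = cls._fields
    # __reduce__ captures only the parent's name and fields, so any
    # subclass instance is pickled as the plain namedtuple.
    cls.__reduce__ = lambda self: (_restore, (name, fields, tuple(self)))
    return cls

Point = _hack_namedtuple(namedtuple("Point", "x y"))

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

# The subclass inherits the patched __reduce__, so the round-trip
# yields a Point, not a PointSubclass.
p = pickle.loads(pickle.dumps(PointSubclass(1, 2)))
print(type(p).__name__)  # Point
```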

Moreover, as PySpark hijacks all namedtuples in the main module, importing 
pyspark breaks serialization of namedtuple subclasses even in code unrelated 
to Spark or distributed execution. I don't see any clean solution to this; a 
possible workaround may be to limit the serialization hack to direct 
namedtuple subclasses, as in 
https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
