Jonas Amrich created SPARK-22674:
------------------------------------

             Summary: PySpark breaks serialization of namedtuple subclasses
                 Key: SPARK-22674
                 URL: https://issues.apache.org/jira/browse/SPARK-22674
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.0
            Reporter: Jonas Amrich
PySpark monkey patches the namedtuple class to make it serializable; however, this breaks serialization of its subclasses. With the current implementation, any subclass will be serialized (and deserialized) as its parent namedtuple, so methods defined on the subclass are lost after a round trip. Consider this code, which will fail with {{AttributeError: 'Point' object has no attribute 'sum'}}:

{code}
from collections import namedtuple

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

rdd = spark.sparkContext.parallelize([[PointSubclass(1, 1)]])
rdd.collect()[0][0].sum()
{code}

Moreover, since PySpark hijacks all namedtuples in the main module, merely importing pyspark breaks serialization of namedtuple subclasses even in code unrelated to Spark or distributed execution. I don't see a clean solution to this; a possible workaround may be to limit the serialization hack to direct namedtuple subclasses, as in https://github.com/JonasAmrich/spark/commit/f3efecee28243380ecf6657fe54e1a165c1b7204

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
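To illustrate the workaround idea, here is a minimal sketch (not the actual Spark code; {{is_direct_namedtuple}} is a hypothetical helper) of how a serializer hook could distinguish a class produced directly by {{namedtuple()}} from a user-defined subclass of one, and so leave subclasses alone:

{code}
from collections import namedtuple

def is_direct_namedtuple(cls):
    # Classes created by namedtuple() inherit straight from tuple and
    # expose a _fields attribute; a user-defined subclass has the
    # generated namedtuple class (not tuple) as its direct base.
    return cls.__bases__ == (tuple,) and hasattr(cls, "_fields")

Point = namedtuple("Point", "x y")

class PointSubclass(Point):
    def sum(self):
        return self.x + self.y

print(is_direct_namedtuple(Point))          # True
print(is_direct_namedtuple(PointSubclass))  # False
{code}

A check like this, applied before patching {{__reduce__}}, would restrict the hijack to classes generated by {{namedtuple()}} itself and let subclass instances fall through to ordinary pickling.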