[ https://issues.apache.org/jira/browse/SPARK-27810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-27810. ---------------------------------- Resolution: Duplicate > PySpark breaks Cloudpickle serialization of collections.namedtuple objects > -------------------------------------------------------------------------- > > Key: SPARK-27810 > URL: https://issues.apache.org/jira/browse/SPARK-27810 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.4.3 > Reporter: Travis Addair > Priority: Major > > After importing pyspark, cloudpickle is no longer able to properly serialize > objects inheriting from collections.namedtuple, and drops all other class > data such that calls to isinstance will fail. > Here's a minimal reproduction of the issue: > {{import collections}} > {{import cloudpickle}} > {{import pyspark}} > {{class A(object):}} > {{ pass}} > {{class B(object):}} > {{ pass}} > {{class C(A, B, collections.namedtuple('C', ['field'])):}} > {{ pass}} > {{c = C(1)}} > {{def print_bases(obj):}} > {{ bases = obj.__class__.__bases__}} > {{ for base in bases:}} > {{ print(base)}} > {{print('original objects:')}} > {{print_bases(c)}} > {{print('\ncloudpickled objects:')}} > {{c2 = cloudpickle.loads(cloudpickle.dumps(c))}} > {{print_bases(c2)}} > This prints: > {{original objects:}} > {{<class '__main__.A'>}} > {{<class '__main__.B'>}} > {{<class 'collections.C'>}} > {{cloudpickled objects:}} > {{<class 'tuple'>}} > Effectively dropping all other types. It appears this issue is being caused > by the > [_hijack_namedtuple|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L600] > function, which replaces the namedtuple class with another one. > Note that I can workaround this issue by setting > {{collections.namedtuple.__hijack = 1}} before importing pyspark, so I feel > pretty confident this is what's causing the issue. > This issue comes up when working with [TensorFlow feature > columns|https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/feature_column.py], > which derive from collections.namedtuple among other classes. > Cloudpickle also > [supports|https://github.com/cloudpipe/cloudpickle/blob/3f4d9da8c567c8e0363880b760b789b40563f5c3/cloudpickle/cloudpickle.py#L900] > collections.namedtuple serialization, but doesn't appear to need to replace > the class. Possibly PySpark can do something similar? > > -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org