Travis Addair created SPARK-27810:
-------------------------------------

             Summary: PySpark breaks Cloudpickle serialization of collections.namedtuple objects
                 Key: SPARK-27810
                 URL: https://issues.apache.org/jira/browse/SPARK-27810
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.3
            Reporter: Travis Addair
After importing pyspark, cloudpickle is no longer able to correctly serialize objects whose classes inherit from collections.namedtuple: all of the other base classes are dropped on deserialization, so calls to isinstance against them fail. Here's a minimal reproduction of the issue:

{code:python}
import collections

import cloudpickle
import pyspark


class A(object):
    pass


class B(object):
    pass


class C(A, B, collections.namedtuple('C', ['field'])):
    pass


c = C(1)


def print_bases(obj):
    bases = obj.__class__.__bases__
    for base in bases:
        print(base)


print('original objects:')
print_bases(c)

print('\ncloudpickled objects:')
c2 = cloudpickle.loads(cloudpickle.dumps(c))
print_bases(c2)
{code}

This prints:

{noformat}
original objects:
<class '__main__.A'>
<class '__main__.B'>
<class 'collections.C'>

cloudpickled objects:
<class 'tuple'>
{noformat}

effectively dropping all of the other types.

It appears this issue is caused by the [_hijack_namedtuple|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L600] function, which replaces the namedtuple class with another one. The issue comes up when working with [TensorFlow feature columns|https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/feature_column.py], which derive from collections.namedtuple among other classes. Cloudpickle also [supports|https://github.com/cloudpipe/cloudpickle/blob/3f4d9da8c567c8e0363880b760b789b40563f5c3/cloudpickle/cloudpickle.py#L900] collections.namedtuple serialization, but doesn't appear to need to replace the class. Possibly PySpark could do something similar?
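For anyone hitting this before a fix lands, one possible user-side workaround is sketched below. It assumes the root cause is a {{__reduce__}} method that the hijacked namedtuple class installs, which subclasses then inherit: overriding {{__reduce__}} on the subclass itself makes pickling go back through the subclass, so the full set of base classes survives a cloudpickle round trip. This is only a sketch of a workaround under that assumption, not a proposed fix for PySpark.

{code:python}
import collections

import cloudpickle
import pyspark  # importing pyspark triggers the namedtuple hijack described above


class A(object):
    pass


class B(object):
    pass


class C(A, B, collections.namedtuple('C', ['field'])):
    # Override the __reduce__ inherited from the (hijacked) namedtuple base so
    # that unpickling reconstructs C itself instead of a bare namedtuple.
    def __reduce__(self):
        return (self.__class__, tuple(self))


c2 = cloudpickle.loads(cloudpickle.dumps(C(1)))
print(c2.__class__.__bases__)
# Expected to show A, B and the namedtuple base again, rather than just tuple.
{code}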