Travis Addair created SPARK-27810:
-------------------------------------

             Summary: PySpark breaks Cloudpickle serialization of 
collections.namedtuple objects
                 Key: SPARK-27810
                 URL: https://issues.apache.org/jira/browse/SPARK-27810
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.3
            Reporter: Travis Addair


After importing pyspark, cloudpickle is no longer able to properly serialize 
objects inheriting from collections.namedtuple, and drops all other class data 
such that calls to isinstance will fail.

Here's a minimal reproduction of the issue:

{{import collections}}
 {{import cloudpickle}}
 {{import pyspark}}{\{class }}

{{A(object):}}
 {{    pass}}

{{class B(object):}}
 {{    pass}}

{{class C(A, B, collections.namedtuple('C', ['field'])):}}
 {{    pass}}

{{c = C(1)}}

{{def print_bases(obj):}}
 {{    bases = obj.__class__.__bases__}}
 {{    for base in bases:}}
 {{        print(base)}}

{{print('original objects:')}}
 {{print_bases(c)}}

{{print('\ncloudpickled objects:')}}
 {{c2 = cloudpickle.loads(cloudpickle.dumps(c))}}
 {{print_bases(c2)}}

This prints:

{{original objects:}}
{{ <class '__main__.A'>}}
{{ <class '__main__.B'>}}
{{ <class 'collections.C'>}}

{{cloudpickled objects:}}
{{ <class 'tuple'>}}

Effectively dropping all other types.  It appears this issue is being caused by 
the 
[_hijack_namedtuple|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L600]
 function, which replaces the namedtuple class with another one.

This issue comes up when working with [TensorFlow feature 
columns|https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/feature_column/feature_column.py],
 which derive from collections.namedtuple among other classes.

Cloudpickle also 
[supports|https://github.com/cloudpipe/cloudpickle/blob/3f4d9da8c567c8e0363880b760b789b40563f5c3/cloudpickle/cloudpickle.py#L900]
 collections.namedtuple serialization, but doesn't appear to need to replace 
the class.  Possibly PySpark can do something similar?

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to