Github user superbobry commented on the issue:

    https://github.com/apache/spark/pull/21157
  
    > Does the test even pass?
    
    The tests should pass modulo the ones specifically checking the behaviour being removed. I think the failing RDD test is in this group as well.
    
    > Why don't we try to fix it rather than removing out? 
    
    I might be overly pessimistic, but I don't see how we can make the patch work in all cases without making the implementation more magical and, as a result, producing even more confusing error messages when things go wrong. Consider, for instance, a widespread pattern:
    
    ```python
    from collections import namedtuple

    class Foo(namedtuple("Foo", [])):
        def foo(self):
            return 42
    ```
    
    If the outer `Foo` class does not explicitly customize pickling, it uses the "fallback" implementation added by `_hijack_namedtuple`, which only knows about the inner namedtuple class. Confusingly enough, this means that for `foo = Foo()`, `isinstance(pickle.loads(pickle.dumps(foo)), Foo)` is False (as detailed in [2]).
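    
    To make this concrete, here is a minimal, self-contained sketch of the failure mode; `_restore` and `_hack_namedtuple` below are simplified stand-ins for the helpers `_hijack_namedtuple` installs, not the real implementation:
    
    ```python
    import pickle
    from collections import namedtuple

    # Simplified stand-ins: the hijacked __reduce__ rebuilds the *inner*
    # namedtuple class from its name and fields, so any outer subclass is
    # lost on the way back.
    def _restore(name, fields, values):
        cls = namedtuple(name, fields)
        return cls(*values)

    def _hack_namedtuple(cls):
        def __reduce__(self):
            return (_restore, (cls.__name__, cls._fields, tuple(self)))
        cls.__reduce__ = __reduce__
        return cls

    class Foo(_hack_namedtuple(namedtuple("Foo", []))):
        def foo(self):
            return 42

    restored = pickle.loads(pickle.dumps(Foo()))
    print(type(restored).__name__)    # 'Foo', but it is the rebuilt base class
    print(isinstance(restored, Foo))  # False
    print(hasattr(restored, "foo"))   # False -- the method silently disappears
    ```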
    
    What can we do about this? We would somehow need to serialize the **full definition** of the outer `Foo` class alongside every instance. Maybe this could be done by recursively pickling the class's `__name__`, `__bases__` and `__dict__`, but `__dict__` can contain other hard-to-pickle objects, such as user-defined methods. Should we serialize those in deconstructed form as well? These are tough questions, and I think they are better left outside the scope of PySpark.
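    
    Roughly, such a scheme would look like the sketch below (`deconstruct` and `reconstruct` are hypothetical names, nothing PySpark provides). The immediate catch is that plain functions in `__dict__` are pickled only by reference, so it breaks as soon as the class is defined somewhere the workers cannot import from, e.g. a REPL or a closure, and the bases would need the same treatment recursively:
    
    ```python
    import pickle

    def deconstruct(cls):
        # Drop the descriptors that type() re-creates on its own.
        ns = {k: v for k, v in vars(cls).items()
              if k not in ("__dict__", "__weakref__")}
        return cls.__name__, cls.__bases__, ns

    def reconstruct(name, bases, ns):
        return type(name, bases, ns)

    class Foo:
        def foo(self):
            return 42

    # Round-trip the class *definition* rather than an instance.  This only
    # works because Foo.foo can be pickled by reference (module + qualname);
    # the same method defined in a REPL or inside a function could not be.
    # A real implementation would also have to recurse into __bases__ that
    # are themselves dynamically created, e.g. the inner namedtuple class.
    Foo2 = reconstruct(*pickle.loads(pickle.dumps(deconstruct(Foo))))
    print(Foo2().foo())  # 42
    ```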
    
    That said, I think an alternative to removing the patch completely might be deprecating it and advertising `cloudpickle` for workloads using namedtuples (or even making it the default?). I've played with `cloudpickle` a little bit, and it seems to solve the aforementioned issues in a consistent manner. The price, however, is the added overhead:
    
    ```python
    >>> import pickle
    >>> import cloudpickle
    >>> len(pickle.dumps(Foo()))
    23
    >>> len(cloudpickle.dumps(Foo()))
    3538
    ```
    
    or, even more extreme,
    
    ```python
    >>> class A: pass
    ...
    >>> len(cloudpickle.dumps(A()))
    177
    ```
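    
    For what it's worth, those extra bytes are the class definition itself: `cloudpickle` serializes classes defined in `__main__` by value, so the outer subclass and its methods survive the round trip. Roughly (a sketch rather than a transcript of an actual session):
    
    ```python
    import pickle
    import cloudpickle
    from collections import namedtuple

    class Foo(namedtuple("Foo", [])):
        def foo(self):
            return 42

    # The class definition travels with the instance, so the method defined
    # on the outer subclass is still there after unpickling.
    restored = pickle.loads(cloudpickle.dumps(Foo()))
    print(restored.foo())  # 42
    ```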
    
    What do you think?
    
    [2]: https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html

