Thanks for the link! I wasn't aware of Dill, but it looks like a nice library. I like that it's being actively developed: https://github.com/uqfoundation/dill
It also seems to work correctly for a few edge-cases that cloudpickle didn't handle properly, such as serializing operator.itemgetter instances (see https://spark-project.atlassian.net/browse/SPARK-791). I'll put together a pull request to replace CloudPickle with Dill. Dill uses a 3-clause BSD license, so we should be able to package it into an .egg in the python/lib/ folder like we did for Py4J. It will be interesting to see whether the change has any performance impact, although the recent custom serializers pull request should help with that since it would let us use Dill for serializing functions and a faster serializer for serializing data. - Josh On Thu, Dec 5, 2013 at 4:49 AM, Nick Pentreath <nick.pentre...@gmail.com>wrote: > Hi devs > > I came across Dill ( > http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for Python > serialization. Was wondering if it may be a replacement to the cloudpickle > stuff (and remove that piece of code that needs to be maintained within > PySpark)? > > Josh have you looked into Dill? Any thoughts? > > N >