Looks cool! Josh, if you replace CloudPickle with this, make sure to also update the LICENSE file, which is supposed to contain third-party licenses.
Matei On Dec 5, 2013, at 8:02 PM, Josh Rosen <rosenvi...@gmail.com> wrote: > Thanks for the link! I wasn't aware of Dill, but it looks like a nice > library. I like that it's being actively developed: > https://github.com/uqfoundation/dill > > It also seems to work correctly for a few edge-cases that cloudpickle > didn't handle properly, such as serializing operator.itemgetter instances > (see https://spark-project.atlassian.net/browse/SPARK-791). > > I'll put together a pull request to replace CloudPickle with Dill. Dill > uses a 3-clause BSD license, so we should be able to package it into an > .egg in the python/lib/ folder like we did for Py4J. It will be > interesting to see whether the change has any performance impact, although > the recent custom serializers pull request should help with that since it > would let us use Dill for serializing functions and a faster serializer for > serializing data. > > - Josh > > > > > On Thu, Dec 5, 2013 at 4:49 AM, Nick Pentreath > <nick.pentre...@gmail.com>wrote: > >> Hi devs >> >> I came across Dill ( >> http://trac.mystic.cacr.caltech.edu/project/pathos/wiki/dill) for Python >> serialization. Was wondering if it may be a replacement to the cloudpickle >> stuff (and remove that piece of code that needs to be maintained within >> PySpark)? >> >> Josh have you looked into Dill? Any thoughts? >> >> N >>