It’s true that it can’t. You can try to use the CloudPickle library instead, which is what we use within PySpark to serialize functions (see python/pyspark/cloudpickle.py). However, I’m also curious: why do you need an RDD of functions?
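
As a rough sketch of the kind of workaround I mean (untested, and the names regs, n, and some_input below are just placeholders based on your snippet), you could cloudpickle each function to a byte string yourself, parallelize the byte strings, and unpickle them on the workers:

    from pyspark import SparkContext
    from pyspark import cloudpickle   # the copy bundled with PySpark (python/pyspark/cloudpickle.py)
    import pickle

    from myfile import regs           # your function from myfile.py

    sc = SparkContext("local", "myapp", pyFiles=["myfile.py"])

    n = 10            # placeholder for self.n
    some_input = 1.0  # placeholder argument for regs

    # cloudpickle each function to a plain byte string; byte strings go
    # through the default cPickle-based serializer without any trouble.
    pickled_regs = [cloudpickle.dumps(f) for f in [regs] * n]
    regsRDD = sc.parallelize(pickled_regs)

    # Unpickle on the workers right before calling each function.
    # cloudpickle output is ordinary pickle data, so pickle.loads works here.
    results = regsRDD.map(lambda p: pickle.loads(p)(some_input)).collect()

Depending on the version you're running, it may also be possible to pass a CloudPickle-based serializer to SparkContext directly, but I'd have to check whether that's exposed in 0.9.1.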
Matei

On Jun 15, 2014, at 4:49 PM, madeleine <madeleine.ud...@gmail.com> wrote:

> It seems that the default serializer used by pyspark can't serialize a list of functions.
> I've seen some posts about trying to fix this by using dill to serialize rather than pickle.
> Does anyone know what the status of that project is, or whether there's another easy workaround?
>
> I've pasted a sample error message below. Here, regs is a function defined in another file myfile.py that has been included on all workers via the pyFiles argument to SparkContext: sc = SparkContext("local", "myapp",pyFiles=["myfile.py"]).
>
>   File "runfile.py", line 45, in __init__
>     regsRDD = sc.parallelize([regs]*self.n)
>   File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/context.py", line 223, in parallelize
>     serializer.dump_stream(c, tempFile)
>   File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py", line 182, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py", line 118, in dump_stream
>     self._write_with_length(obj, stream)
>   File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py", line 128, in _write_with_length
>     serialized = self.dumps(obj)
>   File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py", line 270, in dumps
>     def dumps(self, obj): return cPickle.dumps(obj, 2)
> cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.