I ran the code below in Spark standalone mode, with Python 2.7.6, spaCy 1.7+, and Spark 2.0.1.
I'm new to PySpark. Please help me understand the two versions of the code below. Why does the first version run fine, whereas the second throws pickle.PicklingError: Can't pickle <cyfunction load.<locals>.<lambda> at 0x107e39110>? (I suspect the second approach fails because Spark could not serialize the object to send it to the workers.)

*1) Run-Success:*

*(SpacyExample-Module)*

    import spacy

    # The model is loaded when this module is imported.
    nlp = spacy.load('en_default')

    def spacyChunks(content):
        # Return the text of every noun chunk found in content.
        doc = nlp(content)
        mp = []
        for chunk in doc.noun_chunks:
            phrase = content[chunk.start_char: chunk.end_char]
            mp.append(phrase)
        #print(mp)
        return mp

    if __name__ == '__main__':
        pass

*Main-Module:*

    from pyspark.sql import SparkSession
    import SpacyExample

    spark = SparkSession.builder.appName("readgzip").config(conf=conf).getOrCreate()
    gzfile = spark.read.schema(schema).json("")
    ...
    ...
    textresult.rdd.map(lambda x: x[0]).\
        flatMap(lambda data: SpacyExample.spacyChunks(data)).saveAsTextFile("")

*2) Run-Failure:*

*MainModule:*

    import spacy

    # The model is loaded in the main (driver) script itself.
    nlp = spacy.load('en_default')

    def spacyChunks(content):
        doc = nlp(content)
        mp = []
        for chunk in doc.noun_chunks:
            phrase = content[chunk.start_char: chunk.end_char]
            mp.append(phrase)
        #print(mp)
        return mp

    if __name__ == '__main__':
        # create SparkSession, read file
        file.rdd.map(..).flatMap(lambda data: spacyChunks(data)).saveAsTextFile()

Stack Trace:

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Users/rs/Downloads/spark-2.0.1-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 535, in save_reduce
    save(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 317, in save
    self.save_global(obj, rv)
  File "/Users/rs/Downloads/spark-2.0.1-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 390, in save_global
    raise pickle.PicklingError("Can't pickle %r" % obj)
pickle.PicklingError: Can't pickle <cyfunction load.<locals>.<lambda> at 0x107e39110>

--
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
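
P.S. To make my suspicion about serialization concrete, here is a minimal sketch that uses only the standard pickle module, not Spark or spaCy. Everything in it is a hypothetical stand-in: FakeModel and load() imitate an object that holds a lambda created inside a function (like the load.<locals>.<lambda> in the error), and get_nlp() and chunks() show a lazy-loading pattern that I assume would keep the loaded object out of whatever gets shipped to the workers. I have not verified this against Spark itself.

```python
import pickle


class FakeModel(object):
    """Hypothetical stand-in for the object returned by spacy.load()."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


def load():
    # The lambda is created inside a function, so pickle cannot look it up
    # by name; this imitates `load.<locals>.<lambda>` from the traceback.
    return FakeModel(lambda text: text.split())


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


_nlp = None  # stays None until some process actually needs the model


def get_nlp():
    """Load the model on first use, in whichever process makes the call."""
    global _nlp
    if _nlp is None:
        _nlp = load()
    return _nlp


def chunks(content):
    # The model is fetched inside the function instead of being read from a
    # module-level variable that was filled in before the job was submitted.
    return get_nlp().tokenizer(content)


print(can_pickle(load()))   # False: the model holds an unpicklable lambda
print(chunks("a b c"))      # ['a', 'b', 'c']
```

If this picture is right, the pattern only helps as long as the driver does not call get_nlp() before the job is submitted, since otherwise the loaded object would again sit in a global that has to be serialized.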