I ran the code below in standalone mode. Python 2.7.6, spaCy 1.7+, Spark 2.0.1.

I'm new to PySpark. Please help me understand the two versions of code
below.

Why does the first version run fine, whereas the second throws
pickle.PicklingError: Can't pickle <cyfunction load.<locals>.<lambda> at 0x107e39110>?

(I suspect the second approach fails because Spark could not serialize
the nlp object and ship it to the workers.)
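
To check my suspicion outside Spark, I put together the rough sketch below
(untested as written; it assumes the cloudpickle package is installed, the
spaCy 'en_default' model is available, and SpacyExample.py is on the
PYTHONPATH). My understanding is that cloudpickle serializes a function
defined in __main__ by value, together with the globals it references,
whereas a function imported from a module is pickled only as a reference:

import pickle

import cloudpickle
import spacy

nlp = spacy.load('en_default')

def spacyChunks(content):
    # same logic as the spacyChunks below: collect the noun-chunk phrases
    return [content[c.start_char:c.end_char] for c in nlp(content).noun_chunks]

# Function defined in __main__: cloudpickle tries to pickle the global nlp
# it references, and the spaCy pipeline contains Cython lambdas that the
# pickler rejects.
try:
    cloudpickle.dumps(spacyChunks)
except pickle.PicklingError as e:
    print(e)   # Can't pickle <cyfunction load.<locals>.<lambda> ...>

# Same function imported from a module: only the reference
# ("SpacyExample", "spacyChunks") is pickled, so nlp is never serialized
# and each worker re-imports the module and loads spaCy itself.
import SpacyExample
cloudpickle.dumps(SpacyExample.spacyChunks)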

*1) Run-Success:*

*(SpacyExample-Module)*

import spacy

# loaded once, at module import time
nlp = spacy.load('en_default')

def spacyChunks(content):
    # return the noun-chunk phrases found in content
    doc = nlp(content)
    mp = []
    for chunk in doc.noun_chunks:
        phrase = content[chunk.start_char: chunk.end_char]
        mp.append(phrase)
    #print(mp)
    return mp

if __name__ == '__main__':
    pass
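
For what it's worth, calling spacyChunks locally (outside Spark) on a
made-up sentence works fine:

import SpacyExample
print(SpacyExample.spacyChunks(u"The quick brown fox jumps over the lazy dog"))
# prints something like [u'The quick brown fox', u'the lazy dog']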


*Main-Module:*

spark = SparkSession.builder.appName("readgzip").config(conf=conf).getOrCreate()

gzfile = spark.read.schema(schema).json("")

...

...

textresult.rdd.map(lambda x: x[0]) \
    .flatMap(lambda data: SpacyExample.spacyChunks(data)) \
    .saveAsTextFile("")




*2) Run-Failure:*

*Main-Module:*

nlp = spacy.load('en_default')

def spacyChunks(content):
    doc = nlp(content)
    mp = []
    for chunk in doc.noun_chunks:
        phrase = content[chunk.start_char: chunk.end_char]
        mp.append(phrase)
    #print(mp)
    return mp

if __name__ == '__main__':
    # create SparkSession, read the file (same as above, details omitted)
    ...
    file.rdd.map(..).flatMap(lambda data: spacyChunks(data)).saveAsTextFile("")


Stack Trace:

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Users/rs/Downloads/spark-2.0.1-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 535, in save_reduce
    save(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 317, in save
    self.save_global(obj, rv)
  File "/Users/rs/Downloads/spark-2.0.1-bin-hadoop2.7/python/pyspark/cloudpickle.py", line 390, in save_global
    raise pickle.PicklingError("Can't pickle %r" % obj)
pickle.PicklingError: Can't pickle <cyfunction load.<locals>.<lambda> at 0x107e39110>

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
