Currently, PySpark cannot pickle instances of a class defined in the main script ('__main__'). The workaround is to put the implementation of the class into a separate module, then ship that module with "bin/spark-submit --py-files xxx.py" so the workers can import it.
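The underlying issue is how pickle handles classes: an instance is serialized as a reference to its defining module and class name, not as the class definition itself, so a worker process whose '__main__' is not your driver script cannot resolve '__main__.test'. A minimal local sketch of this, no Spark involved (pickletools is the standard-library pickle disassembler; the class mirrors the one below):

    # sketch.py - shows that pickle stores a class by reference, not by value
    import pickle
    import pickletools

    class test(object):
        def __init__(self, a, b):
            self.total = a + b

    data = pickle.dumps(test(True, False))

    # The disassembly contains a GLOBAL (or STACK_GLOBAL) opcode naming
    # "__main__ test": unpickling anywhere requires resolving exactly that
    # attribute, which a Spark worker's __main__ does not provide.
    pickletools.dis(data)

The workaround in full: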
In xxx.py:

    class test(object):
        def __init__(self, a, b):
            self.total = a + b

In job.py:

    from xxx import test

    a = sc.parallelize([(True, False), (False, False)])
    a.map(lambda (x, y): test(x, y))

Run it with:

    bin/spark-submit --py-files xxx.py job.py

On Wed, Feb 18, 2015 at 1:48 PM, Guillaume Guy <guillaume.c....@gmail.com> wrote:
> Hi,
>
> This is a duplicate of the Stack Overflow question here. I hope to generate
> more interest on this mailing list.
>
> The problem:
>
> I am running into attribute lookup problems when trying to instantiate a
> class within my RDD.
>
> My workflow is quite standard:
>
> 1- Start with an RDD
>
> 2- Take each element of the RDD and instantiate an object from it
>
> 3- Reduce (I will write a method that defines the reduce operation later on)
>
> Here is #2:
>
>     class test(object):
>         def __init__(self, a, b):
>             self.total = a + b
>
>     a = sc.parallelize([(True, False), (False, False)])
>     a.map(lambda (x, y): test(x, y))
>
> Here is the error I get:
>
>     PicklingError: Can't pickle <class '__main__.test'>: attribute lookup
>     __main__.test failed
>
> I'd like to know if there is any way around it. Please answer with a
> working example that achieves the intended result (i.e., creating an RDD of
> objects of class "test").
>
> Thanks in advance!
>
> Related question:
>
> https://groups.google.com/forum/#!topic/edx-code/9xzRJFyQwn
>
> GG
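To answer the "working example" request with one more variant: the module can also be shipped at runtime with SparkContext.addPyFile instead of the --py-files flag. A minimal sketch, assuming xxx.py (containing the test class above) sits next to job.py; the app name and local master are placeholders:

    # job.py - run with: bin/spark-submit job.py
    from pyspark import SparkContext

    sc = SparkContext("local", "pickle-class-example")

    # Ship xxx.py to the executors so "xxx.test" is importable there;
    # this is the runtime equivalent of --py-files.
    sc.addPyFile("xxx.py")

    from xxx import test

    a = sc.parallelize([(True, False), (False, False)])
    # The collect() round-trips test instances through pickle: they are
    # referenced as xxx.test, not __main__.test, so the lookup succeeds.
    objs = a.map(lambda p: test(p[0], p[1])).collect()
    print([o.total for o in objs])  # [1, 0] - booleans add as ints

Either way, the key point is the same: the class must live in a module the workers can import by name.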