Hi,

I think the problem is caused by data serialization. Have a look at
https://spark.apache.org/docs/latest/tuning.html to see how to register your
class testing with the serializer.
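In PySpark the class also has to be importable on the workers, so one common
workaround (just a sketch, assuming you move the class into its own file, here a
hypothetical vectors.py) is to ship that module with sc.addPyFile instead of
defining the class inside the driver script:

# vectors.py -- hypothetical separate module holding the custom class
class testing(object):
    def __init__(self, txt=""):
        values = [float(x) for x in txt.split()[:128]]
        self.vector = values if values else [0.0] * 128

# driver script
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("My app")
sc = SparkContext(conf=conf)
sc.addPyFile("vectors.py")   # make the module available on every worker
from vectors import testing

pairs = sc.textFile("input.txt").map(lambda line: (1, testing(line)))
print pairs.count()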

For PySpark 1.1.0 there is also a known issue with the default serializer:
https://issues.apache.org/jira/browse/SPARK-2652?filter=-2
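If the default serializer is what trips it up, you can also pass a serializer
explicitly when creating the context (a minimal sketch using PickleSerializer
from pyspark.serializers; MarshalSerializer is the other built-in option):

from pyspark import SparkConf, SparkContext
from pyspark.serializers import PickleSerializer

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app"))
# override the default serializer used for RDD elements
sc = SparkContext(conf=conf, serializer=PickleSerializer())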

Cheers.
Gen

 
sid wrote
> Hi, I am new to Spark and I am trying to use PySpark.
> 
> I am trying to find the mean of the 128-dimensional vectors present in a file.
> 
> Below is the code
> ----------------------------------------------------------------
> from cStringIO import StringIO
> 
> class testing:
>       def __str__(self):
>               file_str = StringIO()
>               for n in self.vector:
>                       file_str.write(str(n)) 
>                       file_str.write(" ")
>               return file_str.getvalue()
>       def __init__(self,txt="",initial=False):
>               self.vector = [0.0]*128
>               if len(txt)==0:
>                       return
>               i=0
>               for n in txt.split():
>                       if i<128:
>                               self.vector[i]=float(n)
>                               i = i+1
>                               continue
>                       self.filename=n
>                       break
> 	def addVec(self,r):
> 		a = testing()
> 		for n in xrange(0,128):
> 			a.vector[n] = self.vector[n] + r.vector[n]
> 		return a
> 
> def InitializeAndReturnPair(string,first=False):
>       vec = testing(string,first)
>       return 1,vec
> 
> 
> from pyspark import SparkConf, SparkContext
> conf = (SparkConf()
>          .setMaster("local")
>          .setAppName("My app")
>          .set("spark.executor.memory", "1g"))
> sc = SparkContext(conf = conf)
> 
> inp = sc.textFile("input.txt")
> output = inp.map(lambda s: InitializeAndReturnPair(s,True)).cache()
> output.saveAsTextFile("output")
> print output.reduceByKey(lambda a,b : a).collect()
> ---------------------------------------------
> 
> Example line in input.txt 
> 6.0 156.0 26.0 3.0 1.0 0.0 2.0 1.0 15.0 113.0 53.0 139.0 156.0 0.0 0.0 0.0
> 156.0 29.0 1.0 38.0 59.0 0.0 0.0 0.0 28.0 4.0 2.0 9.0 1.0 0.0 0.0 0.0 9.0
> 83.0 13.0 1.0 0.0 9.0 42.0 7.0 41.0 71.0 74.0 123.0 35.0 17.0 7.0 2.0
> 156.0 27.0 6.0 33.0 11.0 2.0 0.0 11.0 35.0 4.0 2.0 4.0 1.0 3.0 2.0 4.0 0.0
> 0.0 0.0 0.0 2.0 19.0 45.0 17.0 47.0 2.0 2.0 7.0 59.0 90.0 15.0 11.0 156.0
> 14.0 1.0 4.0 9.0 11.0 2.0 29.0 35.0 6.0 5.0 9.0 4.0 2.0 1.0 3.0 1.0 0.0
> 0.0 0.0 1.0 5.0 25.0 14.0 27.0 2.0 0.0 2.0 86.0 48.0 10.0 6.0 156.0 23.0
> 1.0 2.0 21.0 6.0 0.0 3.0 31.0 10.0 4.0 3.0 0.0 0.0 1.0 2.0
> 
> I am not able to figure out what I am missing. I tried changing the
> serializer, but I still get a similar error.
> 
> I have put the error here: http://pastebin.com/0tqiiJQm





