Hi, I think the problem is caused by data serialization. You can follow the tuning guide at https://spark.apache.org/docs/latest/tuning.html to register your class `testing`.
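One way to narrow this down locally, with no Spark needed: check that an instance of the class round-trips through pickle on its own, since PySpark ships objects between driver and workers in pickled form. If the round trip works, the class itself is picklable and the problem is more likely the serializer configuration. This is a minimal sketch, not code from the thread; the `Testing` class below is a hypothetical reduced stand-in (4-dimension vector instead of 128) written in Python 3 syntax.

```python
import pickle

# Hypothetical reduced stand-in for the `testing` class from the thread.
class Testing:
    def __init__(self, txt=""):
        self.vector = [float(n) for n in txt.split()] if txt else [0.0] * 4

    def add_vec(self, r):
        # Element-wise sum of two vectors, as in the original addVec.
        out = Testing()
        out.vector = [a + b for a, b in zip(self.vector, r.vector)]
        return out

# Round-trip an instance through pickle, as PySpark does between workers.
v = Testing("6.0 156.0 26.0 3.0")
restored = pickle.loads(pickle.dumps(v))
print(restored.vector)
```

If the round trip succeeds but the Spark job still crashes, one workaround discussed around SPARK-2652 is to pass an explicit serializer when creating the context, e.g. `SparkContext(conf=conf, serializer=PickleSerializer())` with `PickleSerializer` imported from `pyspark.serializers` (treat the exact constructor arguments as an assumption to check against your Spark version).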
For PySpark 1.1.0, there is a known problem with the default serializer: https://issues.apache.org/jira/browse/SPARK-2652?filter=-2

Cheers
Gen

sid wrote
> Hi, I am new to Spark and I am trying to use PySpark.
>
> I am trying to find the mean of 128-dimension vectors present in a file.
>
> Below is the code:
> ----------------------------------------------------------------
> from cStringIO import StringIO
>
> class testing:
>     def __str__(self):
>         file_str = StringIO()
>         for n in self.vector:
>             file_str.write(str(n))
>             file_str.write(" ")
>         return file_str.getvalue()
>
>     def __init__(self, txt="", initial=False):
>         self.vector = [0.0] * 128
>         if len(txt) == 0:
>             return
>         i = 0
>         for n in txt.split():
>             if i < 128:
>                 self.vector[i] = float(n)
>                 i = i + 1
>                 continue
>             self.filename = n
>             break
>
>     def addVec(self, r):
>         a = testing()
>         for n in xrange(0, 128):
>             a.vector[n] = self.vector[n] + r.vector[n]
>         return a
>
> def InitializeAndReturnPair(string, first=False):
>     vec = testing(string, first)
>     return 1, vec
>
> from pyspark import SparkConf, SparkContext
> conf = (SparkConf()
>         .setMaster("local")
>         .setAppName("My app")
>         .set("spark.executor.memory", "1g"))
> sc = SparkContext(conf=conf)
>
> inp = sc.textFile("input.txt")
> output = inp.map(lambda s: InitializeAndReturnPair(s, True)).cache()
> output.saveAsTextFile("output")
> print output.reduceByKey(lambda a, b: a).collect()
> ---------------------------------------------
>
> Example line in input.txt:
> 6.0 156.0 26.0 3.0 1.0 0.0 2.0 1.0 15.0 113.0 53.0 139.0 156.0 0.0 0.0 0.0
> 156.0 29.0 1.0 38.0 59.0 0.0 0.0 0.0 28.0 4.0 2.0 9.0 1.0 0.0 0.0 0.0 9.0
> 83.0 13.0 1.0 0.0 9.0 42.0 7.0 41.0 71.0 74.0 123.0 35.0 17.0 7.0 2.0
> 156.0 27.0 6.0 33.0 11.0 2.0 0.0 11.0 35.0 4.0 2.0 4.0 1.0 3.0 2.0 4.0 0.0
> 0.0 0.0 0.0 2.0 19.0 45.0 17.0 47.0 2.0 2.0 7.0 59.0 90.0 15.0 11.0 156.0
> 14.0 1.0 4.0 9.0 11.0 2.0 29.0 35.0 6.0 5.0 9.0 4.0 2.0 1.0 3.0 1.0 0.0
> 0.0 0.0 1.0 5.0 25.0 14.0
> 27.0 2.0 0.0 2.0 86.0 48.0 10.0 6.0 156.0 23.0
> 1.0 2.0 21.0 6.0 0.0 3.0 31.0 10.0 4.0 3.0 0.0 0.0 1.0 2.0
>
> I am not able to figure out what I am missing. I tried changing the
> serializer, but I still get a similar error.
>
> I have posted the error here: http://pastebin.com/0tqiiJQm

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-code-crashing-on-ReduceByKey-if-I-return-custom-class-object-tp17317p17345.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.