Hi, I'll write the code in Python:
{code:title=test.py}
from operator import add

# Make sure each record parses to [id, c1, c2, c3],
# i.e. the RDD looks like [[id, c1, c2, c3], [id, c1, c2, c3], ...]
data = sc.textFile(...).map(...)

# Key by (id, c1, c2) and sum the c3 values
keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3])))
keypair = keypair.reduceByKey(add)

# Flatten back to [id, c1, c2, sum] rows
out = keypair.map(lambda l: list(l[0]) + [l[1]])
{code}

Kalyan wrote
> I have a distributed system with 3 nodes and my dataset is distributed among
> those nodes. For example, I have a test.csv file which exists on all 3
> nodes and contains 4 columns:
>
> row   | id, C1 , C2 , C3
> ------------------------
> row1  | A1 , c1 , c2 , 2
> row2  | A1 , c1 , c2 , 1
> row3  | A1 , c11, c2 , 1
> row4  | A2 , c1 , c2 , 1
> row5  | A2 , c1 , c2 , 1
> row6  | A2 , c11, c2 , 1
> row7  | A2 , c11, c21, 1
> row8  | A3 , c1 , c2 , 1
> row9  | A3 , c1 , c2 , 2
> row10 | A4 , c1 , c2 , 1
>
> I need help: how do I aggregate the data set by the id, C1, C2, C3 columns
> so the output looks like this?
>
> row   | id, C1 , C2 , C3
> ------------------------
> row1  | A1 , c1 , c2 , 3
> row2  | A1 , c11, c2 , 1
> row3  | A2 , c1 , c2 , 2
> row4  | A2 , c11, c2 , 1
> row5  | A2 , c11, c21, 1
> row6  | A3 , c1 , c2 , 3
> row7  | A4 , c1 , c2 , 1
>
> Thanks
> Kalyan

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-aggregate-data-in-Apach-Spark-tp16764p16803.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
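For reference, the map/reduceByKey logic above can be checked locally without a Spark cluster. The sketch below replays the same group-and-sum aggregation on plain Python lists, using the sample rows from Kalyan's question (the `rows` variable is just hand-typed test data, not part of the original code):

```python
# Sample rows mirroring test.csv: [id, C1, C2, C3]
rows = [
    ["A1", "c1", "c2", "2"],
    ["A1", "c1", "c2", "1"],
    ["A1", "c11", "c2", "1"],
    ["A2", "c1", "c2", "1"],
    ["A2", "c1", "c2", "1"],
    ["A2", "c11", "c2", "1"],
    ["A2", "c11", "c21", "1"],
    ["A3", "c1", "c2", "1"],
    ["A3", "c1", "c2", "2"],
    ["A4", "c1", "c2", "1"],
]

# Key by (id, C1, C2) and sum C3 -- the same thing map + reduceByKey does
totals = {}
for l in rows:
    key = (l[0], l[1], l[2])
    totals[key] = totals.get(key, 0.0) + float(l[3])

# Flatten back to [id, C1, C2, sum] rows, like the final map step
out = [list(k) + [v] for k, v in sorted(totals.items())]
for row in out:
    print(row)
```

This prints the seven aggregated rows expected in the question (e.g. `['A1', 'c1', 'c2', 3.0]`), which is a quick way to sanity-check the key choice before running the Spark job.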