You could also use Spark SQL:

from pyspark.sql import Row, SQLContext

sqlContext = SQLContext(sc)
row = Row('id', 'C1', 'C2', 'C3')

# convert each line into a Row, casting C3 to float so sum() works
data = sc.textFile("test.csv").map(lambda line: line.split(','))
rows = data.map(lambda r: row(r[0], r[1], r[2], float(r[3])))
sqlContext.inferSchema(rows).registerTempTable("data")

# result is a SchemaRDD
result = sqlContext.sql("select id, C1, C2, sum(C3) from data group by id, C1, C2")
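If you want to sanity-check the expected result without a cluster, the same group-and-sum can be reproduced in plain Python on the sample rows from the original question — a minimal local sketch, not Spark code:

```python
from collections import defaultdict

# Sample rows in the same (id, C1, C2, C3) layout as test.csv
rows = [
    ("A1", "c1", "c2", 2), ("A1", "c1", "c2", 1),
    ("A1", "c11", "c2", 1), ("A2", "c1", "c2", 1),
    ("A2", "c1", "c2", 1), ("A2", "c11", "c2", 1),
    ("A2", "c11", "c21", 1), ("A3", "c1", "c2", 1),
    ("A3", "c1", "c2", 2), ("A4", "c1", "c2", 1),
]

# Group by (id, C1, C2) and sum C3 -- the same logic the
# reduceByKey and GROUP BY versions express in Spark
totals = defaultdict(float)
for id_, c1, c2, c3 in rows:
    totals[(id_, c1, c2)] += c3

# e.g. totals[("A1", "c1", "c2")] is 3.0, matching row1 of the
# desired output below
result = sorted(list(k) + [v] for k, v in totals.items())
for line in result:
    print(line)
```

This prints the seven aggregated rows (A1/c1/c2 -> 3.0, A1/c11/c2 -> 1.0, and so on), which you can compare against the output of either Spark approach.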
On Mon, Oct 20, 2014 at 2:52 AM, Gen <gen.tan...@gmail.com> wrote:
> Hi,
>
> I will write the code in Python:
>
> {code:title=test.py}
> from operator import add
>
> data = sc.textFile(...).map(...)  ## Please make sure that the rdd is
>                                   ## like [[id, c1, c2, c3], [id, c1, c2, c3], ...]
> keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3])))
> keypair = keypair.reduceByKey(add)
> out = keypair.map(lambda l: list(l[0]) + [l[1]])
> {code}
>
>
> Kalyan wrote
>> I have a distributed system of 3 nodes and my dataset is distributed among
>> those nodes. For example, I have a test.csv file which exists on all 3
>> nodes and contains 4 columns:
>>
>> **row | id, C1, C2, C3
>> ----------------------
>> row1  | A1 , c1 , c2 ,2
>> row2  | A1 , c1 , c2 ,1
>> row3  | A1 , c11, c2 ,1
>> row4  | A2 , c1 , c2 ,1
>> row5  | A2 , c1 , c2 ,1
>> row6  | A2 , c11, c2 ,1
>> row7  | A2 , c11, c21,1
>> row8  | A3 , c1 , c2 ,1
>> row9  | A3 , c1 , c2 ,2
>> row10 | A4 , c1 , c2 ,1
>>
>> I need help: how do I aggregate the data set by the id, C1, C2 columns
>> (summing C3) so that the output looks like this?
>>
>> **row | id, C1, C2, C3
>> ----------------------
>> row1  | A1 , c1 , c2 ,3
>> row2  | A1 , c11, c2 ,1
>> row3  | A2 , c1 , c2 ,2
>> row4  | A2 , c11, c2 ,1
>> row5  | A2 , c11, c21,1
>> row6  | A3 , c1 , c2 ,3
>> row7  | A4 , c1 , c2 ,1
>>
>> Thanks
>> Kalyan
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-aggregate-data-in-Apach-Spark-tp16764p16803.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------