Here you go:

    from pyspark.sql import Row

    # Sample data: (name, age, other)
    t = [["A", 10, "A10"], ["A", 20, "A20"], ["A", 30, "A30"],
         ["B", 15, "B15"], ["C", 10, "C10"], ["C", 20, "C20"]]

    # Build an RDD of Rows and convert it to a DataFrame
    TRDD = sc.parallelize(t).map(
        lambda r: Row(name=str(r[0]), age=int(r[1]), other=str(r[2])))
    TDF = ssc.createDataFrame(TRDD)
    TDF.printSchema()

    # Self-join: keep only the rows whose age equals the per-name max
    TDF.registerTempTable("tab")
    JN = ssc.sql("select t.name, t.age, t.other from tab t "
                 "inner join (select name, max(age) age from tab group by name) t1 "
                 "on t.name = t1.name and t.age = t1.age")
    for i in JN.collect():
        print i

Result:

    Row(name=u'A', age=30, other=u'A30')
    Row(name=u'B', age=15, other=u'B15')
    Row(name=u'C', age=20, other=u'C20')
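The same query can also be expressed with the DataFrame API instead of raw SQL. A minimal sketch, assuming the TDF DataFrame built above; the max_name/max_age column names are mine, introduced only to keep the join condition unambiguous:

    from pyspark.sql import functions as F

    # Per-name maximum age, with renamed columns so the join is unambiguous
    mx = (TDF.groupBy("name")
             .agg(F.max("age").alias("max_age"))
             .withColumnRenamed("name", "max_name"))

    # Join back to recover the full row carrying each per-name maximum age
    JN2 = (TDF.join(mx, (TDF.name == mx.max_name) & (TDF.age == mx.max_age))
              .select("name", "age", "other"))

    for i in JN2.collect():
        print i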
On Sat, Apr 25, 2015 at 2:48 PM, Wenlei Xie <wenlei....@gmail.com> wrote:

> Sure. A simple example of the data would be (there might be many other
> columns):
>
>     Name  Age  Other
>     A     10   A10
>     A     20   A20
>     A     30   A30
>     B     15   B15
>     C     10   C10
>     C     20   C20
>
> The desired output would be:
>
>     Name  Age  Other
>     A     30   A30
>     B     15   B15
>     C     20   C20
>
> Thank you so much for the help!
>
> On Sat, Apr 25, 2015 at 12:41 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Can you give an example set of data and the desired output?
>>
>> On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie <wenlei....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I would like to run the following customized aggregation query on
>>> Spark SQL:
>>> 1. Group the table by the value of Name.
>>> 2. For each group, choose the tuple with the max value of Age (the
>>> ages are distinct for every name).
>>>
>>> I am wondering what's the best way to do this on Spark SQL. Should I
>>> use a UDAF? Previously I was doing something like the following on
>>> Spark:
>>>
>>>     personRDD.map(t => (t.name, t))
>>>       .reduceByKey((a, b) => if (a.age > b.age) a else b)
>>>
>>> Thank you!
>>>
>>> Best,
>>> Wenlei
>
> --
> Wenlei Xie (谢文磊)
>
> Ph.D. Candidate
> Department of Computer Science
> 456 Gates Hall, Cornell University
> Ithaca, NY 14853, USA
> Email: wenlei....@gmail.com

--
Best Regards,
Ayan Guha
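For completeness, the reduceByKey approach mentioned in the original question ports directly to PySpark. A minimal sketch against the same sample data (the variable names and tuple layout are illustrative, not from the thread):

    # Key each record by name, then keep the record with the larger age per key
    t = [("A", 10, "A10"), ("A", 20, "A20"), ("A", 30, "A30"),
         ("B", 15, "B15"), ("C", 10, "C10"), ("C", 20, "C20")]

    best = (sc.parallelize(t)
              .map(lambda r: (r[0], r))
              .reduceByKey(lambda a, b: a if a[1] > b[1] else b)
              .values())

    for i in best.collect():
        print i

Unlike the self-join, this makes a single pass over the data with one shuffle (with map-side combining), which is typically cheaper than aggregating and then joining back.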