As I understand from SQL, group by allow you to do sum(), average(), max(), 
mn(). But how do I select the entire row in the group with maximum column 
timeStamp? For example

id1,  value1, 2015-01-01
id1, value2,  2015-01-02
id2,  value3, 2015-01-01
id2, value4,  2015-01-02

I want to return
id1, value2,  2015-01-02
id2, value4,  2015-01-02

I can use reduceByKey() in RDD but how to do it using DataFrame? Can you give 
an example code snipet?

Thanks
Ningjun

From: ayan guha [mailto:[email protected]]
Sent: Wednesday, April 29, 2015 5:54 PM
To: Wang, Ningjun (LNG-NPV)
Cc: [email protected]
Subject: Re: HOw can I merge multiple DataFrame and remove duplicated key


Its no different, you would use group by and aggregate function to do so.
On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" 
<[email protected]<mailto:[email protected]>> wrote:
I have multiple DataFrame objects each stored in a parquet file.  The DataFrame 
just contains 3 columns (id,  value,  timeStamp). I need to union all the 
DataFrame objects together but for duplicated id only keep the record with the 
latest timestamp. How can I  do that?

I can do this for RDDs by sc.union() to union all the RDDs and then do a 
reduceByKey() to remove duplicated id by keeping only the one with latest 
timeStamp field. But how do I do it for DataFrame?


Ningjun

Reply via email to