I think this will be slow because you have to do a group by then do a join (my table has 7 million records). I am looking for something like reduceByKey(), e.g.
rdd.reduceByKey( (a, b) => if (a.timeStamp > b.timeStamp) a else b ) Does it have similar thing in DataFrame? Of course I can convert a DataFrame to RDD and then invoke the recudeByKey Ningjun From: ayan guha [mailto:[email protected]] Sent: Thursday, April 30, 2015 3:41 AM To: Wang, Ningjun (LNG-NPV) Cc: [email protected] Subject: RE: HOw can I merge multiple DataFrame and remove duplicated key 1. Do a group by and get Max. In your example select id, Max(DT) from t group by id. Name this j. 2. Join t,j on id and DT=mxdt This is how we used to query RDBMS before window functions show up. As I understand from SQL, group by allow you to do sum(), average(), max(), mn(). But how do I select the entire row in the group with maximum column timeStamp? For example id1, value1, 2015-01-01 id1, value2, 2015-01-02 id2, value3, 2015-01-01 id2, value4, 2015-01-02 I want to return id1, value2, 2015-01-02 id2, value4, 2015-01-02 I can use reduceByKey() in RDD but how to do it using DataFrame? Can you give an example code snipet? Thanks Ningjun From: ayan guha [mailto:[email protected]<mailto:[email protected]>] Sent: Wednesday, April 29, 2015 5:54 PM To: Wang, Ningjun (LNG-NPV) Cc: [email protected]<mailto:[email protected]> Subject: Re: HOw can I merge multiple DataFrame and remove duplicated key Its no different, you would use group by and aggregate function to do so. On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" <[email protected]<mailto:[email protected]>> wrote: I have multiple DataFrame objects each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to union all the DataFrame objects together but for duplicated id only keep the record with the latest timestamp. How can I do that? I can do this for RDDs by sc.union() to union all the RDDs and then do a reduceByKey() to remove duplicated id by keeping only the one with latest timeStamp field. But how do I do it for DataFrame? Ningjun
