It's no different for DataFrames: union them, then use a group by with an aggregate function to keep the latest record for each id.

On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" <[email protected]> wrote:
> I have multiple DataFrame objects each stored in a parquet file. The
> DataFrame just contains 3 columns (id, value, timeStamp). I need to union
> all the DataFrame objects together but for duplicated id only keep the
> record with the latest timestamp. How can I do that?
>
> I can do this for RDDs by sc.union() to union all the RDDs and then do a
> reduceByKey() to remove duplicated id by keeping only the one with latest
> timeStamp field. But how do I do it for DataFrame?
>
> Ningjun
