I have multiple DataFrame objects, each stored in a parquet file. Each DataFrame
contains just 3 columns (id, value, timeStamp). I need to union all the
DataFrame objects together, but for duplicated ids keep only the record with the
latest timeStamp. How can I do that?

With RDDs I can do this by calling sc.union() to combine all the RDDs and then
reduceByKey() to remove duplicated ids, keeping only the record with the latest
timeStamp field. But how do I do it for DataFrames?
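
To make the intended result concrete, here is a minimal plain-Python sketch of
the union-then-reduce-by-key logic described above. It does not use Spark at
all; the sample rows and the names part_a, part_b, keep_latest and
union_keep_latest are made up for illustration, and timeStamp is assumed to be
directly comparable (here, an integer).

```python
from itertools import chain

# Hypothetical sample data: each list stands in for the rows of one parquet
# file, as (id, value, timeStamp) tuples.
part_a = [(1, "a", 100), (2, "b", 105)]
part_b = [(1, "c", 110), (3, "d", 90)]


def keep_latest(left, right):
    """Return whichever record has the later timeStamp (index 2).

    On a tie, the record seen first (left) is kept.
    """
    return left if left[2] >= right[2] else right


def union_keep_latest(*parts):
    """Union all parts, then keep one record per id: the latest one."""
    latest = {}
    # chain.from_iterable plays the role of the union step.
    for record in chain.from_iterable(parts):
        key = record[0]
        # This per-key merge plays the role of the reduceByKey step.
        if key in latest:
            latest[key] = keep_latest(latest[key], record)
        else:
            latest[key] = record
    # Sort by id only to make the result order deterministic.
    return sorted(latest.values())


merged = union_keep_latest(part_a, part_b)
```

Here id 1 appears in both inputs, and only the row with timeStamp 110 should
survive in merged. What I am looking for is the DataFrame equivalent of this.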


Ningjun
