I am not sure I understood your logic, but it seems to me that you could take a look at Hive's LEAD/LAG functions.
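To make the lead/lag suggestion concrete, here is a hedged sketch using the standard LAG window function (the same one Hive provides) on Milin's sample data. SQLite is used only so the example runs locally; the table and column names just mirror the example below and are otherwise illustrative:

```python
import sqlite3

# Sketch only: SQLite (>= 3.25) implements the same standard LAG window
# function that Hive offers, so we can illustrate the idea without a cluster.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id TEXT, flag INT, price INT, date INT);
INSERT INTO t VALUES ('a',0,100,2015),('a',0,50,2015),
                     ('a',1,200,2014),('a',1,300,2013),('a',0,400,2012);
""")

# LAG(price) returns the previous row's price within each id, ordered by date.
rows = conn.execute("""
SELECT id, flag, price, date,
       LAG(price) OVER (PARTITION BY id ORDER BY date) AS prev_price
FROM t
""").fetchall()
for r in rows:
    print(r)
```

Note that LAG only looks back a fixed number of rows, so by itself it does not handle runs of consecutive flag=0 rows; something like a last-non-null over an unbounded preceding frame is closer to what the problem below needs.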
On Monday, December 19, 2016 1:41 AM, Milin korath <milin.kor...@impelsys.com> wrote:

Thanks. I tried with a left outer join, but my dataset has around 400M records and a lot of shuffling happens. Is there any other workaround apart from a join? I tried using a window function but could not get a proper solution.

Thanks

On Sat, Dec 17, 2016 at 4:55 AM, Michael Armbrust <mich...@databricks.com> wrote:

Oh, and to get the null for missing years, you'd need to do an outer join with a table containing all of the years you are interested in.

On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust <mich...@databricks.com> wrote:

Are you looking for argmax? Here is an example.

On Wed, Dec 14, 2016 at 8:49 PM, Milin korath <milin.kor...@impelsys.com> wrote:

Hi,

I have a Spark data frame with the following structure:

id  flag  price  date
a   0     100    2015
a   0     50     2015
a   1     200    2014
a   1     300    2013
a   0     400    2012

I need to create a data frame in which each flag=0 row carries the most recent flag=1 price in a new column:

id  flag  price  date  new_column
a   0     100    2015  200
a   0     50     2015  200
a   1     200    2014  null
a   1     300    2013  null
a   0     400    2012  null

Consider the two flag=0 rows for 2015: there are two earlier flag=1 values (200 from 2014 and 300 from 2013), and I take the most recent one, 200 (from 2014). The last row (2012) has no earlier flag=1 value, so it is updated with null. Looking for a solution using Scala. Any help would be appreciated.

Thanks
Milin
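Since the thread never converged on code, here is a minimal sketch of the transformation in plain Python (not Spark) just to pin down the logic. The data and the rule (for each flag=0 row, take the price of the most recent earlier flag=1 row within the same id) come from Milin's example; the function name is hypothetical:

```python
# Sample rows from the question: (id, flag, price, date).
rows = [
    ("a", 0, 100, 2015),
    ("a", 0, 50, 2015),
    ("a", 1, 200, 2014),
    ("a", 1, 300, 2013),
    ("a", 0, 400, 2012),
]

def new_column(row, all_rows):
    """For a flag=0 row, return the price of the latest (by date) flag=1 row
    with the same id and a strictly earlier date; otherwise None (null)."""
    rid, flag, price, date = row
    if flag != 0:
        return None
    candidates = [(d, p) for (i, f, p, d) in all_rows
                  if i == rid and f == 1 and d < date]
    # max() on (date, price) tuples picks the most recent candidate.
    return max(candidates)[1] if candidates else None

result = [row + (new_column(row, rows),) for row in rows]
# result matches the expected table: 200, 200, null, null, null
```

Hedged mapping to Spark/Scala (untested, treat as a direction rather than a recipe): this should correspond to a Window partitioned by id and ordered by date with a frame ending before the current row, combined with last(when($"flag" === 1, $"price"), ignoreNulls = true) over that window, which would avoid the shuffle-heavy self-join on 400M rows.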