I am not sure I understood your logic, but it seems to me that you could take a 
look at Hive's lead/lag window functions. 
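
For illustration, a rough sketch of how lag/lead look in Spark's DataFrame API 
(a sketch only, assuming a DataFrame df with columns id, flag, price, date; 
restricting the lookup to the flag=1 rows would still have to be layered on top):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, lead}

    // per-id window ordered by date
    val byDate = Window.partitionBy(col("id")).orderBy(col("date"))

    // lag/lead pull the price from the previous/next row in the window;
    // rows without such a neighbour get null
    val withNeighbours = df
      .withColumn("prev_price", lag(col("price"), 1).over(byDate))
      .withColumn("next_price", lead(col("price"), 1).over(byDate))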

    On Monday, December 19, 2016 1:41 AM, Milin korath 
<milin.kor...@impelsys.com> wrote:
 

Thanks, I tried a left outer join. My dataset has around 400M records and a lot 
of shuffling is happening. Is there any other workaround apart from a join? I 
tried using a window function but I am not getting a proper solution.

Thanks
On Sat, Dec 17, 2016 at 4:55 AM, Michael Armbrust <mich...@databricks.com> 
wrote:

Oh and to get the null for missing years, you'd need to do an outer join with a 
table containing all of the years you are interested in.
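
Roughly something like this (a sketch; allYears is a made-up one-column table 
of the years of interest, spark is the SparkSession, and df is the original data):

    import org.apache.spark.sql.functions.col

    // hypothetical table of all years of interest, 2010..2016
    val allYears = spark.range(2010, 2017)
      .select(col("id").cast("int").as("date"))

    // years missing from df come back with nulls in df's columns
    val withAllYears = allYears.join(df, Seq("date"), "left_outer")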
On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust <mich...@databricks.com> 
wrote:

Are you looking for argmax? Here is an example.
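
(The linked example isn't reproduced here; the usual argmax idiom in the 
DataFrame API looks roughly like the sketch below: max over a struct orders by 
the struct's first field, so per id it picks the price from the row with the 
greatest date.)

    import org.apache.spark.sql.functions.{col, max, struct}

    val latestPerId = df
      .groupBy(col("id"))
      .agg(max(struct(col("date"), col("price"))).as("latest"))
      .select(col("id"), col("latest.price").as("latest_price"))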
On Wed, Dec 14, 2016 at 8:49 PM, Milin korath <milin.kor...@impelsys.com> wrote:

Hi 


I have a Spark DataFrame with the following structure:

    id  flag  price  date
    a   0     100    2015
    a   0     50     2015
    a   1     200    2014
    a   1     300    2013
    a   0     400    2012

I need to create a DataFrame where the most recent flag=1 price is filled into 
the flag=0 rows:

    id  flag  price  date  new_column
    a   0     100    2015  200
    a   0     50     2015  200
    a   1     200    2014  null
    a   1     300    2013  null
    a   0     400    2012  null

Take the first flag=0 row: two flag=1 values (200 and 300) are available, and I 
take the most recent one, 200 (2014). The last row has no flag=1 value before 
it, so it is updated with null.

Looking for a solution using Scala. Any help would be appreciated.
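
For reference, the sample data above as a DataFrame (assuming a SparkSession 
spark is in scope):

    import spark.implicits._

    val df = Seq(
      ("a", 0, 100, 2015),
      ("a", 0, 50,  2015),
      ("a", 1, 200, 2014),
      ("a", 1, 300, 2013),
      ("a", 0, 400, 2012)
    ).toDF("id", "flag", "price", "date")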


Thanks Milin
