I have a Spark DataFrame with the following structure:

  id  flag  price  date
  a   0     100    2015
  a   0     50     2015
  a   1     200    2014
  a   1     300    2013
  a   0     400    2012
I need to create a DataFrame where each flag = 0 row is updated with the most recent flag = 1 price in a new column:

  id  flag  price  date  new_column
  a   0     100    2015  200
  a   0     50     2015  200
  a   1     200    2014  null
  a   1     300    2013  null
  a   0     400    2012  null
The first two rows have flag = 0. For each of them there are two flag = 1 prices (200 and 300), and I take the most recent one, 200 (from 2014). The last row (flag = 0, 2012) has no flag = 1 value before it, so new_column is null.
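For reproducibility, the sample above can be built with something like the following (a minimal sketch; the SparkSession setup and app name are just for illustration, and date is treated as a plain integer year):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("recent-flag-value").getOrCreate()
import spark.implicits._

// Sample data with the structure shown above
val df = Seq(
  ("a", 0, 100, 2015),
  ("a", 0, 50,  2015),
  ("a", 1, 200, 2014),
  ("a", 1, 300, 2013),
  ("a", 0, 400, 2012)
).toDF("id", "flag", "price", "date")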

I found a solution with a left join, but my dataset has around 400M records and the join causes a lot of shuffling. Is there a better way to find the recent value?


I'm looking for a solution in Scala. Any help would be appreciated.
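To make the logic concrete, here is a window-based sketch of what I'm after (assuming Spark 2.x window functions and the df built above; I don't know yet whether it behaves any better than the join at 400M rows):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{last, when}

// For every row, look only at strictly earlier rows of the same id
// (ordered by date) and carry forward the latest flag = 1 price seen so far.
val w = Window
  .partitionBy("id")
  .orderBy("date")
  .rowsBetween(Window.unboundedPreceding, -1)

val lastFlag1Price = last(when($"flag" === 1, $"price"), ignoreNulls = true).over(w)

// Only flag = 0 rows get the looked-up value; flag = 1 rows stay null
val result = df.withColumn("new_column", when($"flag" === 0, lastFlag1Price))

result.orderBy($"date".desc).show()

This still shuffles each id's rows to one partition for the window, and it would need a tie-breaking sort column if a flag = 1 row ever shared a date with a flag = 0 row.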


