You can use analytic (window) functions in Spark SQL. Something like: select * from (select *, row_number() over (partition by id order by timestamp) as rn from root where isValid) where rn = 1
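A minimal PySpark sketch of the same idea, assuming a DataFrame with the schema from your mail and some made-up sample rows (names like "earliest_valid" are just for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the (id, timestamp, isValid, ...) schema.
df = spark.createDataFrame(
    [(1, 100, False), (1, 200, True), (1, 300, True), (2, 150, True)],
    ["id", "timestamp", "isValid"],
)

# Keep only valid records, rank each id's rows by timestamp,
# and take the first (earliest) row per id.
w = Window.partitionBy("id").orderBy("timestamp")
earliest_valid = (
    df.filter(F.col("isValid"))
      .withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)

earliest_valid.show()
```

The same thing in SQL, if you register the DataFrame as a temp view first (view name "records" is just an example):

```python
df.createOrReplaceTempView("records")
spark.sql("""
    SELECT * FROM (
        SELECT *, row_number() OVER (PARTITION BY id ORDER BY timestamp) AS rn
        FROM records
        WHERE isValid
    ) WHERE rn = 1
""").show()
```

This avoids the groupBy-plus-join and the Scala UDAF entirely; the window function keeps all the metric columns in one pass.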
On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal <nownik...@gmail.com> wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid: Boolean, .... other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
>  .....
>
> I need to find the earliest valid record per id. In the RDD world I can do groupBy 'id' and find the earliest one, but I am not sure how I can do it in SQL. Since I am doing this in PySpark I cannot really use the Dataset API for this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available, and then join with the original dataframe to get the right record (all the metrics).
>
> Or I can create a single column with all the records and then implement a UDAF in Scala and use it in PySpark.
>
> Neither solution seems straightforward. Is there a simpler solution to this?
>
> Thanks
> Nikhil

--
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com