You can use analytic (window) functions in Spark SQL.

Something like:

select * from (
  select *, row_number() over (partition by id order by timestamp) as rn
  from root
  where isValid
) where rn = 1
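
For completeness, here is a minimal sketch of the same idea with the PySpark
DataFrame API (the DataFrame name df and loading it from a table called "root"
are assumptions based on the schema in your question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.table("root")  # assumption: however you already load the DataFrame

# Rank records within each id by timestamp, keeping only valid ones,
# then take the first row per id.
w = Window.partitionBy("id").orderBy("timestamp")
earliest_valid = (
    df.where(F.col("isValid"))
      .withColumn("rn", F.row_number().over(w))
      .where(F.col("rn") == 1)
      .drop("rn")
)

This avoids both the self-join and the UDAF, and keeps all the metric columns
on the resulting rows.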

On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal <nownik...@gmail.com> wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
> Boolean, .... other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
> .....
>
> I need to find the earliest valid record per id. In RDD world I can do
> groupBy 'id' and find the earliest one but I am not sure how I can do it in
> SQL. Since I am doing this in PySpark I cannot really use DataSet API for
> this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available
> and then join with the original dataframe to get the right record (all the
> metrics).
>
> Or I can create a single column with all the records and then implement a
> UDAF in scala and use it in pyspark.
>
> Both solutions don't seem to be straightforward. Is there a simpler
> solution to this?
>
> Thanks
> Nikhil
>


-- 
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com
