Re: Need help with SparkSQL Query
You can use analytic (window) functions in Spark SQL. Something like the following (substitute your own table or view name for root):

    select *
    from (
        select *,
               row_number() over (partition by id order by timestamp) as rn
        from root
        where isValid
    ) t
    where rn = 1

On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid: Boolean, other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
>  .
>
> I need to find the earliest valid record per id. In the RDD world I can do groupBy 'id' and find the earliest one, but I am not sure how I can do it in SQL. Since I am doing this in PySpark I cannot really use the Dataset API for this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available, and then join with the original dataframe to get the right record (all the metrics).
>
> Or I can create a single column with all the records and then implement a UDAF in Scala and use it in PySpark.
>
> Neither solution seems straightforward. Is there a simpler solution to this?
>
> Thanks
> Nikhil

--
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com
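For completeness, a minimal PySpark sketch of the same window-function query, assuming the DataFrame has been registered as a temporary view; the names record_df and records below are placeholders, not from the original thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # record_df is assumed to be the DataFrame described in the question.
    record_df.createOrReplaceTempView("records")

    # Keep only valid rows, rank them by timestamp within each id,
    # take the first row per id, then drop the helper column.
    earliest_valid = spark.sql("""
        SELECT *
        FROM (
            SELECT *,
                   row_number() OVER (PARTITION BY id ORDER BY timestamp) AS rn
            FROM records
            WHERE isValid
        ) ranked
        WHERE rn = 1
    """).drop("rn")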
Re: Need help with SparkSQL Query
Untested, but something like the below should work:

    from pyspark.sql import functions as F
    from pyspark.sql import window as W

    (record
        # keep only valid records, per the original question
        .filter(F.col('isValid'))
        .withColumn('ts_rank',
                    F.dense_rank().over(W.Window.partitionBy('id').orderBy('timestamp')))
        # rank 1 == earliest timestamp for that id
        # (dense_rank keeps ties; use row_number() if exactly one row per id is needed)
        .filter(F.col('ts_rank') == 1)
        .drop('ts_rank')
    )
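For comparison, a rough, untested sketch of the groupBy-then-join approach mentioned in the original question (record is the same DataFrame as above; ties on the minimum timestamp will yield multiple rows per id):

    from pyspark.sql import functions as F

    # Keep only valid records.
    valid = record.filter(F.col('isValid'))

    # Earliest timestamp per id among the valid records.
    earliest = valid.groupBy('id').agg(F.min('timestamp').alias('timestamp'))

    # Join back on (id, timestamp) to recover the full row with all its metrics.
    result = valid.join(earliest, on=['id', 'timestamp'], how='inner')

Either way, no Scala UDAF is needed; everything stays in PySpark.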