Hi guys, I have a DataFrame of type Record (id: Long, timestamp: Long, isValid: Boolean, ... other metrics).
The schema looks like this:

root
 |-- id: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- isValid: boolean (nullable = true)
 ...

I need to find the earliest valid record per id. In the RDD world I could groupBy 'id' and find the earliest one, but I am not sure how to do that in SQL. Since I am doing this in PySpark, I cannot really use the Dataset API.

One option is to groupBy 'id', find the earliest timestamp, and then join back with the original DataFrame to pick up the right record (all the metrics). Alternatively, I could pack all the columns into a single column, implement a UDAF in Scala, and call it from PySpark. Neither solution seems straightforward. Is there a simpler way to do this?

Thanks
Nikhil