Hi Xinh,

A co-worker also found that solution, but I thought it was possibly overkill/brittle, so I looked into UDAFs (user-defined aggregate functions). I don't have code handy, but Databricks has a post with an example: https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html. From that, I was able to write a MinLongByTimestamp function, but I was having a hard time writing a generic aggregate over any column, ordered by any orderable column.

Does anyone know how you might go about using generics in a UDAF, or something that would mimic union types to express that any orderable Spark SQL type is allowed?

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 8, 2016 at 6:06:32 PM, Xinh Huynh (xinh.hu...@gmail.com) wrote:

Hi Pedro,

I could not think of a way using an aggregate. It's possible with a window 
function, partitioned on user and ordered by time:

// Assuming "df" holds your dataframe ...

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
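// Note: the $"..." column syntax also needs the implicits in scope,
// e.g. import sqlContext.implicits._ (or spark.implicits._ on Spark 2.0).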
val wSpec = Window.partitionBy("user").orderBy("time")
df.select($"user", $"time", rank().over(wSpec).as("rank"))
  .where($"rank" === 1)

Xinh

On Fri, Jul 8, 2016 at 12:57 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> 
wrote:
Is there a way, on a GroupedData (from groupBy on a DataFrame), to have an aggregate that returns column A based on the min of column B? For example, I have a list of sites visited by a given user, and I would like to find the event with the minimum time (the first event).
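
The straightforward two-step version I can see is a min-time aggregate joined back against the original rows; a sketch only, assuming columns named user/site/time and a Spark recent enough to join on a Seq of column names:

import org.apache.spark.sql.functions.min

// Compute each user's minimum time, then join back to recover the
// full event row. Assumes df has columns user, site, time.
val firstTimes  = df.groupBy("user").agg(min("time").as("time"))
val firstEvents = df.join(firstTimes, Seq("user", "time"))

I'm hoping for something more direct than the extra join.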

Thanks,
--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: 
https://www.linkedin.com/in/pedrorodriguezscience

