I implemented a more generic version which I posted here: https://gist.github.com/EntilZha/3951769a011389fef25e930258c20a2a

I think I could generalize this by pattern matching on DataType to use different getLong/getDouble/etc functions (not trying to use getAs[] because getting T from Array[T] seems hard). Is there a way to go further and make the arguments unnecessary or inferable at runtime, particularly for the valueType, since it doesn't matter what it is? DataType is abstract so I can't instantiate it; is there a way to define the method so that it pulls the type from the user input at runtime?

Thanks,

--
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni
pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 9, 2016 at 1:33:18 AM, Pedro Rodriguez (ski.rodrig...@gmail.com) wrote:

Hi Xinh,

A co-worker also found that solution, but I thought it was possibly overkill/brittle, so I looked into UDAFs (user defined aggregate functions). I don't have code, but Databricks has a post with an example: https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html. From that, I was able to write a MinLongByTimestamp function, but was having a hard time writing a generic aggregate of any column by an orderable column.

Does anyone know how you might go about using generics in a UDAF, or something that would mimic union types to express that only orderable Spark SQL types are allowed?

--
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni
pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 8, 2016 at 6:06:32 PM, Xinh Huynh (xinh.hu...@gmail.com) wrote:

Hi Pedro,

I could not think of a way using an aggregate. It's possible with a window function, partitioned on user and ordered by time:

// Assuming "df" holds your dataframe ...
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val wSpec = Window.partitionBy("user").orderBy("time")
df.select($"user", $"time", rank().over(wSpec).as("rank"))
  .where($"rank" === 1)

Xinh

On Fri, Jul 8, 2016 at 12:57 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

Is there a way, on a GroupedData (from groupBy on a DataFrame), to have an aggregate that returns column A based on a min of column B? For example, I have a list of sites visited by a given user, and I would like to find the event with the minimum time (the first event).

Thanks,

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
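
For reference, here is a minimal sketch of the "pattern match on DataType" UDAF approach described at the top of the thread. It is not the linked gist verbatim; the class name, supported types, and column names are illustrative. Note that only the ordering column needs typed dispatch; the value column can be moved around with the untyped get/update, which is why the valueType "doesn't matter":

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Aggregates to the "value" from the row with the smallest "ordering" value.
class MinValueByOrdering(valueType: DataType, orderingType: DataType)
    extends UserDefinedAggregateFunction {

  def inputSchema: StructType = StructType(
    StructField("value", valueType) :: StructField("ordering", orderingType) :: Nil)

  def bufferSchema: StructType = inputSchema

  def dataType: DataType = valueType

  def deterministic: Boolean = true

  // A null ordering slot means "no row seen yet", which avoids needing a
  // type-specific sentinel such as Long.MaxValue.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0, null)
    buffer.update(1, null)
  }

  // Dispatch on the ordering column's DataType to pick the typed getter,
  // sidestepping the getAs[T] erasure problem mentioned above.
  private def lessThan(a: Row, ai: Int, b: Row, bi: Int): Boolean = orderingType match {
    case LongType      => a.getLong(ai) < b.getLong(bi)
    case IntegerType   => a.getInt(ai) < b.getInt(bi)
    case DoubleType    => a.getDouble(ai) < b.getDouble(bi)
    case TimestampType => a.getTimestamp(ai).before(b.getTimestamp(bi))
    case other => throw new IllegalArgumentException(s"Unsupported ordering type: $other")
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(1) && (buffer.isNullAt(1) || lessThan(input, 1, buffer, 1))) {
      buffer.update(0, input.get(0)) // value is copied untyped; its type never matters
      buffer.update(1, input.get(1))
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    if (!buffer2.isNullAt(1) && (buffer1.isNullAt(1) || lessThan(buffer2, 1, buffer1, 1))) {
      buffer1.update(0, buffer2.get(0))
      buffer1.update(1, buffer2.get(1))
    }
  }

  def evaluate(buffer: Row): Any = buffer.get(0)
}

Usage, for the "first site per user" example (assuming a string site column and a long time column):

val firstSite = new MinValueByOrdering(StringType, LongType)
df.groupBy($"user").agg(firstSite($"site", $"time").as("first_site"))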
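
On the original question itself, another common trick (assuming columns named "user", "time", and "site", and a Spark version that supports ordering on struct types) is to take min over a struct whose first field is the ordering column; structs compare field by field, so the value column rides along for free:

import org.apache.spark.sql.functions._

df.groupBy($"user")
  .agg(min(struct($"time", $"site")).as("first_event"))
  .select($"user", $"first_event.time", $"first_event.site")

Unlike the window-function solution, this stays a single aggregation, and unlike the UDAF it needs no type dispatch, since struct comparison handles the ordering type.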
I think I could generalize this by pattern matching on DataType to use different getLong/getDouble/etc functions ( not trying to use getAs[] because getting T from Array[T] is hard it seems). Is there a way to go further and make the arguments unnecessary or inferable at runtime, particularly for the valueType since it doesn’t matter what it is? DataType is abstract so I can’t instantiate it, is there a way to define the method so that it pulls from the user input at runtime? Thanks, — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 9, 2016 at 1:33:18 AM, Pedro Rodriguez (ski.rodrig...@gmail.com) wrote: Hi Xinh, A co-worker also found that solution but I thought it was possibly overkill/brittle so looks into UDAFs (user defined aggregate functions). I don’t have code, but Databricks has a post that has an example https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html. From that, I was able to write a MinLongByTimestamp function, but was having a hard time writing a generic aggregate to any column by an order able column. Anyone know how you might go about using generics in a UDAF, or something that would mimic union types to express that order able spark sql types are allowed? — Pedro Rodriguez PhD Student in Large-Scale Machine Learning | CU Boulder Systems Oriented Data Scientist UC Berkeley AMPLab Alumni pedrorodriguez.io | 909-353-4423 github.com/EntilZha | LinkedIn On July 8, 2016 at 6:06:32 PM, Xinh Huynh (xinh.hu...@gmail.com) wrote: Hi Pedro, I could not think of a way using an aggregate. It's possible with a window function, partitioned on user and ordered by time: // Assuming "df" holds your dataframe ... import org.apache.spark.sql.functions._ import org.apache.spark.sql.expressions.Window val wSpec = Window.partitionBy("user").orderBy("time") df.select($"user", $"time", rank().over(wSpec).as("rank")) .where($"rank" === 1) Xinh On Fri, Jul 8, 2016 at 12:57 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote: Is there a way to on a GroupedData (from groupBy in DataFrame) to have an aggregate that returns column A based on a min of column B? For example, I have a list of sites visited by a given user and I would like to find the event with the minimum time (first event) Thanks, -- Pedro Rodriguez PhD Student in Distributed Machine Learning | CU Boulder UC Berkeley AMPLab Alumni ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience