short answer: PySpark does not support UDAF (user defined aggregate function) for now.
On Tue, Feb 9, 2016 at 11:44 PM, Viktor ARDELEAN <viktor0...@gmail.com> wrote: > Hello, > > I am using following transformations on RDD: > > rddAgg = df.map(lambda l: (Row(a = l.a, b= l.b, c = l.c), l))\ > .aggregateByKey([], lambda accumulatorList, value: accumulatorList > + [value], lambda list1, list2: [list1] + [list2]) > > I want to use the dataframe groupBy + agg transformation instead of map + > aggregateByKey because as far as I know dataframe transformations are faster > than RDD transformations. > > I just can't figure out how to use custom aggregate functions with agg. > > *First step is clear:* > > groupedData = df.groupBy("a","b","c") > > *Second step is not very clear to me:* > > dfAgg = groupedData.agg(<I should call here a UDF that transforms each row to > a list and merges it?>) > > The agg documentations says the following: > agg(**exprs*) > <https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.GroupedData.agg> > > Compute aggregates and returns the result as a DataFrame > <https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.DataFrame> > . > > The available aggregate functions are avg, max, min, sum, count. > > If exprs is a single dict mapping from string to string, then the key is > the column to perform aggregation on, and the value is the aggregate > function. > > Alternatively, exprs can also be a list of aggregate Column > <https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.Column> > expressions. > Parameters: *exprs* – a dict mapping from column name (string) to > aggregate functions (string), or a list of Column > <https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.Column> > . > > Thanks for help! > -- > Viktor > > *P* Don't print this email, unless it's really necessary. Take care of > the environment. >