My guess: it is the same as the `collect_set` UDAF built into Hive. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)

Yong
From: sliznmail...@gmail.com
Date: Wed, 14 Oct 2015 02:45:48 +0000
Subject: Re: Spark DataFrame GroupBy into List
To: mich...@databricks.com
CC: user@spark.apache.org

Hi Michael,

Can you be more specific on `collect_set`? Is it a built-in function or, if it is a UDF, how is it defined?

BR,
Todd Leo

On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <mich...@databricks.com> wrote:

import org.apache.spark.sql.functions._

df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))

On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmail...@gmail.com> wrote:

Hey Spark users,

I'm trying to group a DataFrame by a column while collecting the occurrences into a list, instead of counting them. Say we have a DataFrame as shown below:

| category | id |
| -------- |:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |

Ideally, after some magic group-by (a reverse explode?):

| category | id_list |
| -------- | ------- |
| A        | 1,2     |
| B        | 3,4     |
| C        | 5       |

Any tricks to achieve that? The Scala Spark API is preferred. =D

BR,
Todd Leo
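[For readers skimming the archive: the "group into list" semantics being asked for can be sketched in plain Scala collections, no Spark required. This is an illustrative sketch of the desired transform only, not the Spark answer itself; the `rows` data below just mirrors the example table from the question.]

// Plain Scala sketch of the "reverse explode": group (category, id)
// pairs by category and collect the ids of each group into a list.
val rows = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))

val grouped: Map[String, Seq[Int]] =
  rows.groupBy { case (category, _) => category }
      .map { case (category, pairs) => category -> pairs.map(_._2) }

// grouped("A") == Seq(1, 2); grouped("C") == Seq(5)

[In Spark this is what the `collect_set`/`collect_list` aggregate does per group, as in Michael's `callUDF("collect_set", ...)` snippet above; the plain-collections version is only meant to make the expected output table concrete.]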