My guess is it is the same as the collect_set UDAF in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
Yong

From: sliznmail...@gmail.com
Date: Wed, 14 Oct 2015 02:45:48 +0000
Subject: Re: Spark DataFrame GroupBy into List
To: mich...@databricks.com
CC: user@spark.apache.org

Hi Michael, 
Can you be more specific on `collect_set`? Is it a built-in function or, if it
is a UDF, how is it defined?
BR,
Todd Leo
On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <mich...@databricks.com> wrote:
import org.apache.spark.sql.functions._

df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))
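A fuller sketch of the same idea, assuming a local SparkSession (a hypothetical setup for illustration; in Spark 1.6+ `collect_set` is also exposed directly in `org.apache.spark.sql.functions`, so `callUDF` is not strictly needed there):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_set

// Hypothetical local session, just for demonstration
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-set-demo")
  .getOrCreate()
import spark.implicits._

// The example DataFrame from the original question
val df = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))
  .toDF("category", "id")

// Group by category and gather each group's ids into an array column
val grouped = df.groupBy("category")
  .agg(collect_set($"id").as("id_list"))

grouped.show()
```

Note that `collect_set` drops duplicates; if repeated ids should be kept, `collect_list` is the usual alternative.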
On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmail...@gmail.com> wrote:
Hey Spark users,
I'm trying to group a DataFrame, appending each group's occurrences into a list
instead of counting them.
Let's say we have a DataFrame as shown below:

| category | id |
| -------- |:--:|
| A        | 1  |
| A        | 2  |
| B        | 3  |
| B        | 4  |
| C        | 5  |
Ideally, after some magic group-by (reverse explode?):

| category | id_list  |
| -------- | -------- |
| A        | 1,2      |
| B        | 3,4      |
| C        | 5        |
Any tricks to achieve that? The Scala Spark API is preferred. =D

BR,
Todd Leo




