Hi Michael,

Can you be more specific on `collect_set`? Is it a built-in function or, if it is a UDF, how is it defined?
BR,
Todd Leo

On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <mich...@databricks.com> wrote:

> import org.apache.spark.sql.functions._
>
> df.groupBy("category")
>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>
> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>
>> Hey Spark users,
>>
>> I'm trying to group a dataframe, appending each group's occurrences into a list instead of counting them.
>>
>> Let's say we have a dataframe as shown below:
>>
>> | category | id |
>> | -------- |:--:|
>> | A        | 1  |
>> | A        | 2  |
>> | B        | 3  |
>> | B        | 4  |
>> | C        | 5  |
>>
>> Ideally, after some magic group-by (reverse explode?):
>>
>> | category | id_list |
>> | -------- | ------- |
>> | A        | 1,2     |
>> | B        | 3,4     |
>> | C        | 5       |
>>
>> Any tricks to achieve that? The Scala Spark API is preferred. =D
>>
>> BR,
>> Todd Leo
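To answer the question above: in the Spark version current at the time of this thread, `collect_set` was not a native DataFrame function but a Hive aggregate function, which is why Michael's snippet reaches it through `callUDF`. In later Spark releases it became available directly as `org.apache.spark.sql.functions.collect_set`. A self-contained sketch of the grouping using that newer API (the `local[*]` master and app name are arbitrary choices for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_set

object CollectSetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect-set-example")
      .master("local[*]") // run locally for illustration
      .getOrCreate()
    import spark.implicits._

    // The sample data from the thread
    val df = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))
      .toDF("category", "id")

    // Gather each group's ids into an array column named id_list
    val grouped = df.groupBy("category")
      .agg(collect_set($"id").as("id_list"))

    grouped.show()
    spark.stop()
  }
}
```

Note that `collect_set` deduplicates values within each group; if duplicate occurrences should be preserved, `collect_list` is the variant to use.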