collect_set and collect_list are built-in Hive UDFs (aggregate functions); see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
On 14 October 2015 at 03:45, SLiZn Liu <sliznmail...@gmail.com> wrote:
> Hi Michael,
>
> Can you be more specific on `collect_set`? Is it a built-in function or,
> if it is a UDF, how is it defined?
>
> BR,
> Todd Leo
>
> On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <mich...@databricks.com> wrote:
>
>> import org.apache.spark.sql.functions._
>>
>> df.groupBy("category")
>>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>>
>> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>>
>>> Hey Spark users,
>>>
>>> I'm trying to group a dataframe by appending occurrences into a list
>>> instead of counting them.
>>>
>>> Let's say we have a dataframe as shown below:
>>>
>>> | category | id |
>>> | -------- |:--:|
>>> | A        | 1  |
>>> | A        | 2  |
>>> | B        | 3  |
>>> | B        | 4  |
>>> | C        | 5  |
>>>
>>> Ideally, after some magic group by (reverse explode?):
>>>
>>> | category | id_list |
>>> | -------- | ------- |
>>> | A        | 1,2     |
>>> | B        | 3,4     |
>>> | C        | 5       |
>>>
>>> Any tricks to achieve that? Scala Spark API is preferred. =D
>>>
>>> BR,
>>> Todd Leo
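For anyone following along without a Spark cluster at hand, the "reverse explode" semantics that `callUDF("collect_set", ...)` provides can be sketched with plain Scala collections. This is only an illustration of the grouping behaviour, not the Spark implementation; the sample data and the `CollectListSketch` object name are mine:

```scala
// Sketch only: mimics collect_list-style grouping with plain Scala
// collections, using the sample rows from the question above.
object CollectListSketch {
  def main(args: Array[String]): Unit = {
    val rows = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))

    // groupBy gathers the rows per category (preserving input order
    // within each group); the map then keeps only the ids.
    val idLists: Map[String, Seq[Int]] =
      rows.groupBy(_._1).map { case (cat, pairs) => cat -> pairs.map(_._2) }

    // Print categories in sorted order, ids comma-joined as in the
    // desired output table.
    idLists.toSeq.sortBy(_._1).foreach { case (cat, ids) =>
      println(s"$cat -> ${ids.mkString(",")}")
    }
  }
}
```

Note that, like `collect_list`, this keeps duplicates; `collect_set` would additionally deduplicate (e.g. `pairs.map(_._2).distinct` here), and neither guarantees any particular ordering once the work is distributed across partitions in real Spark.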