collect_set and collect_list are built-in Hive aggregate functions (UDAFs); see
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
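
The grouping semantics asked about below can be sketched with plain Scala collections (no Spark required); this is only an illustration of what collect_list and collect_set compute per group, not the DataFrame API itself:

```scala
// Rows mirroring the example table in this thread: (category, id) pairs.
val rows = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))

// groupBy plays the role of df.groupBy("category"); the map over each
// group plays the role of the aggregate.
val idList = rows.groupBy(_._1).map { case (cat, pairs) =>
  cat -> pairs.map(_._2)        // like collect_list: keeps duplicates, one list per key
}

val idSet = rows.groupBy(_._1).map { case (cat, pairs) =>
  cat -> pairs.map(_._2).toSet  // like collect_set: de-duplicates per key
}

// Print each category with its collected ids, e.g. "A -> List(1, 2)".
idList.toSeq.sortBy(_._1).foreach { case (cat, ids) =>
  println(s"$cat -> $ids")
}
```

In Spark itself the same shape is produced by the `callUDF("collect_set", ...)` aggregation shown later in this thread.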

On 14 October 2015 at 03:45, SLiZn Liu <sliznmail...@gmail.com> wrote:

> Hi Michael,
>
> Can you be more specific about `collect_set`? Is it a built-in function or,
> if it is a UDF, how is it defined?
>
> BR,
> Todd Leo
>
> On Wed, Oct 14, 2015 at 2:12 AM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> import org.apache.spark.sql.functions._
>>
>> df.groupBy("category")
>>   .agg(callUDF("collect_set", df("id")).as("id_list"))
>>
>> On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu <sliznmail...@gmail.com>
>> wrote:
>>
>>> Hey Spark users,
>>>
>>> I'm trying to group by a dataframe, by appending occurrences into a list
>>> instead of count.
>>>
>>> Let's say we have a dataframe as shown below:
>>>
>>> | category | id |
>>> | -------- |:--:|
>>> | A        | 1  |
>>> | A        | 2  |
>>> | B        | 3  |
>>> | B        | 4  |
>>> | C        | 5  |
>>>
>>> ideally, after some magic group by (reverse explode?):
>>>
>>> | category | id_list  |
>>> | -------- | -------- |
>>> | A        | 1,2      |
>>> | B        | 3,4      |
>>> | C        | 5        |
>>>
>>> Any tricks to achieve that? The Scala Spark API is preferred. =D
>>>
>>> BR,
>>> Todd Leo
>>>
>>>
>>>
>>>
>>