[ 
https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652723#comment-15652723
 ] 

Michael Armbrust commented on SPARK-17691:
------------------------------------------

I think you should be able to use mutable buffers with the Aggregator 
interface (and if performance is bad there, we should fix it).  Depending 
on what you are trying to do, I'd imagine groupByKey and mapGroups would also 
be fast.  You could also collect the top N per group using window functions.
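
As a rough illustration of the mutable-buffer idea, here is a minimal sketch 
in plain Python. It is not Spark code. It only mirrors the zero / reduce / 
merge / finish shape of Spark's Aggregator, using a bounded min-heap as the 
buffer so that each group never holds more than N elements. The names 
TopNAggregator and aggregate_by_key are hypothetical and exist only for this 
example.

```python
import heapq
from typing import Dict, Hashable, Iterable, List, Tuple


class TopNAggregator:
    """Hypothetical bounded aggregator: keeps only the n largest values.

    The four methods mirror the zero/reduce/merge/finish contract of an
    aggregator; the buffer is a min-heap that is mutated in place.
    """

    def __init__(self, n: int) -> None:
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n

    def zero(self) -> List[int]:
        # Empty buffer for a new group.
        return []

    def reduce(self, buf: List[int], value: int) -> List[int]:
        # Fold one input value into the buffer, never exceeding n elements.
        if len(buf) < self.n:
            heapq.heappush(buf, value)
        elif value > buf[0]:
            # buf[0] is the smallest value kept so far; swap it out.
            heapq.heapreplace(buf, value)
        return buf

    def merge(self, left: List[int], right: List[int]) -> List[int]:
        # Combine two partial buffers (e.g. from different partitions).
        for value in right:
            self.reduce(left, value)
        return left

    def finish(self, buf: List[int]) -> List[int]:
        # Produce the final result, largest value first.
        return sorted(buf, reverse=True)


def aggregate_by_key(
    pairs: Iterable[Tuple[Hashable, int]], agg: TopNAggregator
) -> Dict[Hashable, List[int]]:
    """Build one bounded buffer per key from (key, value) pairs."""
    buffers: Dict[Hashable, List[int]] = {}
    for key, value in pairs:
        buf = buffers.get(key)
        if buf is None:
            buf = agg.zero()
        buffers[key] = agg.reduce(buf, value)
    return buffers


if __name__ == "__main__":
    agg = TopNAggregator(2)
    # Two lists standing in for two partitions of the same dataset.
    part1 = aggregate_by_key([("a", 3), ("a", 9), ("b", 1), ("a", 5)], agg)
    part2 = aggregate_by_key([("a", 7), ("b", 4), ("b", 2), ("a", 1)], agg)
    for key in part1:
        merged = agg.merge(part1[key], part2.get(key, agg.zero()))
        print(key, agg.finish(merged))
```

Because the buffer is capped at N entries, memory per group stays bounded 
regardless of input size, which is the property the unbounded collect_list 
lacks. A "bottom N" variant could be sketched the same way by negating the 
values before they enter the heap.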

Basically, this function sounds pretty specific (correct me if I'm wrong and 
this is a common thing that other systems support).  So I think it makes more 
sense to find fast, general mechanisms that let you build something specific 
like this, rather than adding yet another aggregate function.

> Add aggregate function to collect list with maximum number of elements
> ----------------------------------------------------------------------
>
>                 Key: SPARK-17691
>                 URL: https://issues.apache.org/jira/browse/SPARK-17691
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Assaf Mendelson
>            Priority: Minor
>
> One of the aggregate functions we have today is the collect_list function. 
> This is a useful tool to do a "catch all" aggregation which doesn't really 
> fit anywhere else.
> The problem with collect_list is that it is unbounded. I would like to see a 
> means to do a collect_list where we limit the maximum number of elements.
> I would expect the input for this to be the maximum number of elements 
> to use and the method of choosing (pick whatever, pick the top N, pick the 
> bottom N).


