[ 
https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192588#comment-15192588
 ] 

Ruslan Dautkhanov commented on SPARK-13335:
-------------------------------------------

It would be great to have this optimization in.
In our workflow we use Spark for roughly 95% of cases; the exception is COLLECT_SET,
for which we switch to Hive because it works fine there, while in Spark 1.5 the
executors keep running out of memory even with 8 GB each. The workload is an
11-billion-record dataset OUTER JOINed to a ~1-billion-record dataset.

> Optimize Data Frames collect_list and collect_set with declarative aggregates
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13335
>                 URL: https://issues.apache.org/jira/browse/SPARK-13335
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Matt Cheah
>            Priority: Minor
>
> Based on discussion from SPARK-9301, we can optimize collect_set and 
> collect_list with declarative aggregate expressions, as opposed to using Hive 
> UDAFs. The problem with Hive UDAFs is that they require converting the data 
> items from catalyst types back to external types repeatedly. We can get 
> around this by implementing declarative aggregate expressions.
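
The idea in the description above — keep a partial-aggregation buffer in the
engine's internal representation and convert to an external value only once at
the end, instead of converting every row as a Hive UDAF does — can be sketched
in plain Python. This is a toy model only; the function names (update / merge /
finish) are illustrative and are not Spark's actual Catalyst API.

```python
# Toy model of a buffer-based aggregate (the declarative-aggregate idea):
# each partition updates a local buffer per input row with no per-row
# external-type conversion, partial buffers are merged across partitions,
# and the conversion to the final external value happens exactly once.

def collect_set_update(buffer, row_value):
    """Per-row update: mutate the internal buffer in place."""
    buffer.add(row_value)
    return buffer

def collect_set_merge(buffer1, buffer2):
    """Combine partial buffers coming from two partitions."""
    buffer1 |= buffer2
    return buffer1

def collect_set_finish(buffer):
    """Single conversion from the internal buffer to the external result."""
    return sorted(buffer)

# Partition-wise aggregation over two example partitions:
partitions = [[1, 2, 2, 3], [3, 4, 4]]
partials = []
for part in partitions:
    buf = set()
    for v in part:
        collect_set_update(buf, v)
    partials.append(buf)

merged = set()
for p in partials:
    collect_set_merge(merged, p)

result = collect_set_finish(merged)
print(result)  # [1, 2, 3, 4]
```

The contrast with the Hive-UDAF path is that there every row's value would be
converted from the internal (Catalyst) type to an external type before being
added to the collection, which is exactly the repeated conversion cost the
issue proposes to eliminate.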



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
