[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723786#comment-15723786
 ] 

Mansur Ashraf commented on SPARK-18728:
---------------------------------------

Alex,

Thanks for opening the issue. Let me add some more detail to it. 

We have tons of jobs on Spark 1.6 that use Algebird Aggregators through the 
`aggregateByKey` or `combineByKey` functions on RDD. Since Algebird aggregators 
are composable (meaning you can combine any number of aggregators into a single 
combined aggregator), our jobs combine 10+ aggregators and do single-pass 
aggregations using aggregateByKey/combineByKey. As we upgrade to Spark 2.0.0 
and the new Dataset API 
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset), 
we found that aggregateByKey/combineByKey are gone, so we can't pass Algebird 
aggregators directly. Instead there is a new Aggregator API, based on Algebird, 
which (as far as I can tell) does not allow joining multiple aggregators and 
limits the number of aggregators to 4.
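
As a rough illustration of the composition described above, here is a minimal 
sketch in plain Scala. It assumes a hypothetical `Agg` trait that only mimics 
the prepare/combine/present shape of an Algebird-style aggregator; it is not 
Algebird's or Spark's actual API, and the local `aggregateByKey` helper is just 
a stand-in for the RDD method of the same name.

```scala
// Hypothetical stand-in for an Algebird-style aggregator (not the real API):
// prepare maps an input to an intermediate value, combine merges two
// intermediates, and present turns the final intermediate into a result.
trait Agg[A, B, C] { self =>
  def prepare(a: A): B
  def combine(l: B, r: B): B
  def present(b: B): C

  // Composition: run this aggregator and `that` side by side on the same input.
  def join[B2, C2](that: Agg[A, B2, C2]): Agg[A, (B, B2), (C, C2)] =
    new Agg[A, (B, B2), (C, C2)] {
      def prepare(a: A): (B, B2) = (self.prepare(a), that.prepare(a))
      def combine(l: (B, B2), r: (B, B2)): (B, B2) =
        (self.combine(l._1, r._1), that.combine(l._2, r._2))
      def present(b: (B, B2)): (C, C2) =
        (self.present(b._1), that.present(b._2))
    }
}

object AggSketch {
  val count: Agg[Double, Long, Long] = new Agg[Double, Long, Long] {
    def prepare(a: Double): Long = 1L
    def combine(l: Long, r: Long): Long = l + r
    def present(b: Long): Long = b
  }

  val sum: Agg[Double, Double, Double] = new Agg[Double, Double, Double] {
    def prepare(a: Double): Double = a
    def combine(l: Double, r: Double): Double = l + r
    def present(b: Double): Double = b
  }

  val max: Agg[Double, Double, Double] = new Agg[Double, Double, Double] {
    def prepare(a: Double): Double = a
    def combine(l: Double, r: Double): Double = math.max(l, r)
    def present(b: Double): Double = b
  }

  // Three aggregators joined into one; more can be chained the same way.
  val combined = count.join(sum).join(max)

  // Local stand-in for a per-key aggregation: one pass over the data,
  // folding each value into the intermediate state held for its key.
  def aggregateByKey[K, A, B, C](data: Seq[(K, A)], agg: Agg[A, B, C]): Map[K, C] =
    data
      .foldLeft(Map.empty[K, B]) { case (acc, (k, a)) =>
        val b = agg.prepare(a)
        acc.updated(k, acc.get(k).map(prev => agg.combine(prev, b)).getOrElse(b))
      }
      .map { case (k, b) => k -> agg.present(b) }

  // Each key ends up with ((count, sum), max) computed in a single pass.
  val result = aggregateByKey(Seq("a" -> 1.0, "a" -> 3.0, "b" -> 2.0), combined)
}
```

The point of the sketch is that the joined aggregator is itself just another 
aggregator, so there is no built-in limit on how many can be combined.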

It would be really nice if Spark used Algebird aggregators instead of creating 
its own, or allowed users to pass Algebird aggregators in the Dataset API in 
addition to Spark aggregators.

Thanks

> Consider using Algebird's Aggregator instead of 
> org.apache.spark.sql.expressions.Aggregator
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18728
>                 URL: https://issues.apache.org/jira/browse/SPARK-18728
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Alex Levenson
>            Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in 
> spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: 
> https://github.com/twitter/algebird"
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, 
> would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and 
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

