[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator
[ https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726986#comment-15726986 ]

Alex Levenson commented on SPARK-18728:
---

I think my comment above lists some concrete benefits. Algebird is a very light dependency, and if you see anything wrong with its (small) set of transitive dependencies, I think we'd be open to figuring out how to fix those sorts of issues.

> Consider using Algebird's Aggregator instead of
> org.apache.spark.sql.expressions.Aggregator
> ---
>
> Key: SPARK-18728
> URL: https://issues.apache.org/jira/browse/SPARK-18728
> Project: Spark
> Issue Type: Improvement
> Reporter: Alex Levenson
> Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in
> Spark's Aggregator here:
> "Based loosely on Aggregator from Algebird:
> https://github.com/twitter/algebird"
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering: given that this API is still experimental,
> would you consider using Algebird's Aggregator API directly instead?
> The Algebird API is not coupled with any implementation details, and
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator
[ https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726647#comment-15726647 ]

Sean Owen commented on SPARK-18728:
---

It's pretty much what I mentioned there: another dependency in the tree. That's not a deal-breaker, it's just the question. Note that we wouldn't be able to include third-party types in a public API.
[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator
[ https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726142#comment-15726142 ]

Mansur Ashraf commented on SPARK-18728:
---

Hi Sean,

The Dataset API has removed `aggregateByKey` and its variants and only allows passing up to 4 aggregators via ds.select(...), which is a downgrade in user experience from RDD. What we gain by making this change is that there is no arbitrary limit on the number of custom aggregators a user can pass.

We are already using CountMinSketch, QTrees, TopK, and HLL++ from Algebird with Spark, in addition to other aggregators built in house. Since Spark's aggregators are inspired by Algebird's aggregators (based on the comment in the code), why not use Algebird's aggregators instead of copying the trait?

The Dataset API in its current form is not usable for us at Apple Inc. due to the limitation listed above.
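The single-pass composition Mansur describes can be sketched without Algebird itself. The following is a minimal, self-contained model of a composable aggregator; the trait shape and names (`Agg`, `join`) are illustrative, not Algebird's exact API (its real Aggregator carries a Semigroup rather than a raw reduce method):

```scala
// Minimal sketch of an Algebird-style composable aggregator.
// Illustrative only: Algebird's real Aggregator exposes a Semigroup
// and composes via join/GeneratedTupleAggregator.
trait Agg[-A, B, +C] {
  def prepare(a: A): B       // map one input element into the reduction type
  def reduce(l: B, r: B): B  // associative combine of intermediates
  def present(b: B): C       // finish into the output type

  // Compose two aggregators into one that traverses the data once.
  def join[A2 <: A, B2, C2](other: Agg[A2, B2, C2]): Agg[A2, (B, B2), (C, C2)] = {
    val self = this
    new Agg[A2, (B, B2), (C, C2)] {
      def prepare(a: A2) = (self.prepare(a), other.prepare(a))
      def reduce(l: (B, B2), r: (B, B2)) =
        (self.reduce(l._1, r._1), other.reduce(l._2, r._2))
      def present(b: (B, B2)) = (self.present(b._1), other.present(b._2))
    }
  }

  // Run over a (non-empty) collection.
  def apply(xs: Iterable[A]): C = present(xs.map(prepare).reduce(reduce))
}

val minAgg = new Agg[Int, Int, Int] {
  def prepare(a: Int) = a
  def reduce(l: Int, r: Int) = math.min(l, r)
  def present(b: Int) = b
}
val maxAgg = new Agg[Int, Int, Int] {
  def prepare(a: Int) = a
  def reduce(l: Int, r: Int) = math.max(l, r)
  def present(b: Int) = b
}

// One traversal of the data yields both results.
val minAndMax = minAgg.join(maxAgg)
println(minAndMax(List(3, 1, 4, 1, 5)))  // (1,5)
```

Composing `join` repeatedly is what removes the fixed cap on the number of aggregations per pass.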
[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator
[ https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723866#comment-15723866 ]

Alex Levenson commented on SPARK-18728:
---

I think the main selling points of Algebird aggregators are:

1) They are composable: you can take a Min aggregator and combine it with a Max aggregator to get an aggregator that computes both the min and the max in one pass. As [~mashraf] points out, you can compose many times to get lots of aggregations in one pass.

2) They have the option of efficient addition methods: they use Algebird's Semigroup, which has both plus(a, b) for adding 2 items and sumOption(iter: TraversableOnce[T]) for adding N items. This allows opting in to efficient additions without having a mutable API (sumOption can be mutable internally, but it has to be referentially transparent).

3) There are many existing implementations of Aggregator in Algebird, for common types as well as probabilistic data structures.
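The plus/sumOption distinction above can be illustrated with a self-contained sketch. This models the design in plain Scala (it is not Algebird's code; Algebird's sumOption signature takes a TraversableOnce, while Iterable is used here for brevity): a bulk sum may use a mutable accumulator internally as long as callers can never observe the mutation.

```scala
// Sketch of a Semigroup with pairwise plus and a bulk sumOption,
// modeled on Algebird's design (illustrative, not the library's code).
trait Semigroup[T] {
  def plus(a: T, b: T): T
  // Default bulk sum in terms of plus; implementations may override
  // this with something faster.
  def sumOption(items: Iterable[T]): Option[T] =
    items.reduceOption(plus)
}

// List concatenation: sumOption avoids repeated `++` by appending into
// one mutable buffer internally. The buffer never escapes the method,
// so the API stays referentially transparent.
object ListConcat extends Semigroup[List[Int]] {
  def plus(a: List[Int], b: List[Int]) = a ++ b
  override def sumOption(items: Iterable[List[Int]]): Option[List[Int]] = {
    val buf = scala.collection.mutable.ListBuffer.empty[Int]
    var nonEmpty = false
    items.foreach { xs => nonEmpty = true; buf ++= xs }
    if (nonEmpty) Some(buf.toList) else None
  }
}

println(ListConcat.plus(List(1, 2), List(3)))           // List(1, 2, 3)
println(ListConcat.sumOption(Seq(List(1), List(2, 3)))) // Some(List(1, 2, 3))
println(ListConcat.sumOption(Nil))                      // None
```

The point is that callers only ever see a pure function from a collection to an Option; the efficiency is an implementation detail.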
[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator
[ https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723786#comment-15723786 ]

Mansur Ashraf commented on SPARK-18728:
---

Alex, thanks for opening the issue. Let me add some more detail.

We have tons of jobs on Spark 1.6 that use Algebird aggregators through the `aggregateByKey` or `combineByKey` functions on RDD. Since Algebird aggregators are composable (meaning you can combine X aggregators into 1 combined aggregator), our jobs combine 10+ aggregators and do single-pass aggregations using aggregateByKey/combineByKey.

As we upgrade to Spark 2.0.0 and the new Dataset API (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset), we found that aggregateByKey/combineByKey are gone, so we can't pass Algebird aggregators directly. Instead there is a new aggregator API based on Algebird's, except that (as far as I can tell) it does not allow joining multiple aggregators, limiting the number of aggregators to 4.

It would be really nice if Spark used Algebird aggregators instead of creating its own, or allowed users to pass Algebird aggregators in the Dataset API in addition to Spark aggregators.

Thanks
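For context, the RDD-era workflow described in this thread can be mimicked in plain Scala without Spark or Algebird. The sketch below is illustrative only (the function names and the SumCount intermediate are made up for this example): an Algebird-style prepare/reduce/present aggregator applied per key, the way one combined aggregator runs in a single pass under `aggregateByKey`.

```scala
// Plain-Scala mock of the aggregateByKey pattern with an Algebird-style
// aggregator (illustrative; not Spark's or Algebird's API).
final case class SumCount(sum: Long, count: Long)

// prepare: turn one value into the intermediate type.
def prepare(v: Int): SumCount = SumCount(v, 1)

// reduce: associative merge of intermediates. Carrying sum and count
// together is the "one pass" trick: two aggregations ride in one value.
def reduce(a: SumCount, b: SumCount): SumCount =
  SumCount(a.sum + b.sum, a.count + b.count)

// present: finish into the output type, here the mean.
def present(r: SumCount): Double = r.sum.toDouble / r.count

// aggregateByKey-like driver: one traversal per key group.
def aggregateByKey(pairs: Seq[(String, Int)]): Map[String, Double] =
  pairs.groupBy(_._1).map { case (k, kvs) =>
    k -> present(kvs.map(p => prepare(p._2)).reduce(reduce))
  }

val means = aggregateByKey(Seq("a" -> 1, "a" -> 3, "b" -> 10))
println(means("a"))  // 2.0
println(means("b"))  // 10.0
```

Because the merge step is a single associative function, it is the natural hook where a composed aggregator (min, max, sketch, etc., joined into one) would plug in.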