Thanks, Koert, for the great email. They are all great points. We should probably create an umbrella JIRA for easier tracking.
On Saturday, July 2, 2016, Koert Kuipers <ko...@tresata.com> wrote:

> after working with the Dataset and Aggregator apis for a few weeks porting
> some fairly complex RDD algos (an overall pleasant experience) i wanted to
> summarize the pain points and some suggestions for improvement given my
> experience. all of these are already mentioned on mailing list or jira, but
> i figured its good to put them in one place.
> see below.
> best,
> koert
>
> *) a lot of practical aggregation functions do not have a zero. this can
> be dealt with correctly using null or None as the zero for Aggregator. in
> algebird for example this is expressed as converting an algebird.Aggregator
> (which does not have a zero) into a algebird.MonoidAggregator (which does
> have a zero, so similar to spark Aggregator) by lifting it. see:
>
> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L420
> something similar should be possible in spark. however currently
> Aggregator does not like its zero to be null or an Option, making this
> approach difficult. see:
> https://www.mail-archive.com/user@spark.apache.org/msg53106.html
> https://issues.apache.org/jira/browse/SPARK-15810
>
> *) KeyValueGroupedDataset.reduceGroups needs to be efficient, probably
> using an Aggregator (with null or None as the zero) under the hood. the
> current implementation does a flatMapGroups which is suboptimal.
>
> *) KeyValueGroupedDataset needs mapValues. without this porting many algos
> from RDD to Dataset is difficult and clumsy. see:
> https://issues.apache.org/jira/browse/SPARK-15780
>
> *) Aggregators need to also work within DataFrames (so
> RelationalGroupedDataset) without having to fall back on using Row objects
> as input. otherwise all code ends up being written twice, once for
> Aggregator and once for UserDefinedAggregateFunction/UDAF. this doesn't
> make sense to me.
> my attempt at addressing this:
> https://issues.apache.org/jira/browse/SPARK-15769
> https://github.com/apache/spark/pull/13512
>
> best,
> koert
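
For anyone following the first point, here is a minimal plain-Scala sketch of the "lifting" idea: an aggregation that has no zero is wrapped so that None acts as the zero and the buffer becomes an Option. The names ReduceAggregator, ZeroAggregator, Lift and MaxInt are hypothetical and made up for illustration. This is neither the Spark Aggregator API nor the Algebird one, and it deliberately leaves out encoders, which is where the Option-as-zero trouble described above shows up in Spark.

```scala
// Hypothetical types for illustration only; not the Spark or Algebird API.

// An aggregation without a zero: it can only prepare an input and
// combine two existing buffers.
trait ReduceAggregator[A, B, C] {
  def prepare(a: A): B
  def combine(l: B, r: B): B
  def present(b: B): C
}

// An aggregation with a zero, shaped like a fold over the inputs.
trait ZeroAggregator[A, B, C] {
  def zero: B
  def reduce(b: B, a: A): B
  def merge(l: B, r: B): B
  def finish(b: B): C
}

object Lift {
  // Wrap the buffer in Option so that None can serve as the zero.
  def lift[A, B, C](agg: ReduceAggregator[A, B, C]): ZeroAggregator[A, Option[B], Option[C]] =
    new ZeroAggregator[A, Option[B], Option[C]] {
      def zero: Option[B] = None
      def reduce(b: Option[B], a: A): Option[B] = merge(b, Some(agg.prepare(a)))
      def merge(l: Option[B], r: Option[B]): Option[B] = (l, r) match {
        case (Some(x), Some(y)) => Some(agg.combine(x, y))
        case (x, None)          => x
        case (None, y)          => y
      }
      def finish(b: Option[B]): Option[C] = b.map(agg.present)
    }
}

// Max over Ints has no natural zero if the input may be empty.
object MaxInt extends ReduceAggregator[Int, Int, Int] {
  def prepare(a: Int): Int = a
  def combine(l: Int, r: Int): Int = math.max(l, r)
  def present(b: Int): Int = b
}

object LiftExample {
  def main(args: Array[String]): Unit = {
    val lifted = Lift.lift(MaxInt)
    val buffer = List(3, 7, 5).foldLeft(lifted.zero)(lifted.reduce)
    println(lifted.finish(buffer))
    // An empty input finishes as None rather than a made-up sentinel value.
    println(lifted.finish(lifted.zero))
  }
}
```

The same shape would presumably cover the reduceGroups point as well: a plain reduce function is a ReduceAggregator whose prepare and present are the identity.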