I would be in favor of both changes, if no one else minded. I love that Scala allows us to express the TraversableOnce concept, and I'm all for using it in this instance.
While we're on the subject of combiners, I was toying with the idea of a GenericAggregator, which would basically encapsulate a combine + a map so that you could do things like averages. Oh, and slightly related, I hope you saw where I pulled in Algebird for monoid calculations in Scrunch in https://issues.apache.org/jira/browse/CRUNCH-424 J J On Wed, Jul 30, 2014 at 6:52 AM, David Whiting <[email protected]> wrote: > The Scrunch version of combine accepts a function Iterable[V] => V . This > causes a lot of unexpected behaviour because the iterable that is wrapped > is actually a SingleUseIterable, and much of Scala's collection function > implementations actually try and access the underlying iterator multiple > times if they know that it's possible. This leads to often having to write > code like this: > > ... > .groupByKey() > .combine { _.iterator reduce { _ + _ } } > > This is a silly example of course, because there's an Aggregator for > summation, but if your reduce function is more complex you have to do this > indirection via iterator in order to get correct behaviour. > > Possible fixes: > a) Change combine to accept a function TraversableOnce[V] => V or > Iterator[V] => V, better reflecting the single-use nature of the underlying > Iterable > b) Given that most custom combines will in fact be folds over monoids, we > could promote the notion of reduce or fold up the the PGroupedTable itself, > so you can do .groupByKey().foldValues(_+_) > -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
