Re: CombineFn vs. Aggregator

Matthias Friedrich Tue, 06 Nov 2012 10:51:52 -0800

Hi,

thanks a lot for your review! I think you're right, removing
combineValues(CombinerFn) really is premature since my refactoring of
the Aggregator interface that I proposed below didn't work out. The
simplified Aggregator isn't composable, so it can't be used for
aggregating tuples.


I'll post a new version that also includes documentation later this
week.

Regards,
  Matthias

On Sunday, 2012-11-04, Josh Wills wrote:
> I looked at the review board post for this, and it looks great-- much
> cleaner than the old way.
> 
> I think my only objection was w/marking combineValues(CombineFn) as
> deprecated, which struck me as premature. I'd be curious to hear the
> reasoning on that.
> 
> Josh
> 
> On Sun, Nov 4, 2012 at 4:46 AM, Matthias Friedrich <[email protected]> wrote:
> 
> > Hi,
> >
> > having looked at every single CombineFn and Aggregator in Crunch's
> > code base and in light of CRUNCH-106 [1], I wonder if we could make
> > things even easier to use. How about changing the Aggregator interface
> > to this:
> >
> > public interface Aggregator<T> extends Serializable {
> >   void initialize(Configuration conf);
> >   Iterable<T> aggregate(Iterable<T> values);
> > }
> >
> > This would make writing anonymous inner classes for combineValues()
> > much easier and shorter. Implementations could still keep a buffer
> > between calls to aggregate() if they want to optimize memory
> > management.
> >
> > Is there anything I'm overlooking?
> >
> > Regards,
> >   Matthias
> >
> > [1] https://issues.apache.org/jira/browse/CRUNCH-106
> >
> > On Friday, 2012-11-02, Matthias Friedrich wrote:
> > > Hi,
> > >
> > > in our last discussion on refactoring we agreed (I think) to clean
> > > up CombineFn and move implementations to .fn.CombineFns. While playing
> > > with the API, I noticed that it's harder than necessary to get your
> > > generics in line. Example:
> > >
> > >   PGroupedTable<String, Integer> table = /* ... */;
> > >   table.combineValues(CombineFn.<String>SUM_INTS());
> > >
> > > You need to specify the String type or you get a compilation error.
> > > This leaks an implementation detail that isn't obvious to users; you
> > > need to know how CombineFn works to know that you have to specify the
> > > key type. And it's just plain ugly.
> > >
> > > How about using Aggregator more prominently by adding a
> > > combineValues(Aggregator<V> a) method to PGroupedTable? We could
> > > simplify the code and make it easier to understand:
> > >
> > >   PGroupedTable<String, Integer> table = /* ... */;
> > >   table.combineValues(Aggregators.SUM_INTS());
> > >
> > > Writing custom aggregations would be easier, too, as you don't need to
> > > decorate your aggregator with AggregatorCombineFn yourself anymore. We
> > > could go even further and remove combineValues(CombineFn), because it
> > > doesn't add much value anymore. If you need maximum flexibility (that
> > > is, your aggregation depends on the table's key somehow), you can
> > > still use parallelDo() with CombineFn directly. I don't think that's
> > > a common use case though.
> > >
> > > Here's what I propose:
> > >
> > >   1) Leave CombineFn and Aggregator in base
> > >   2) Remove all static factories (SUM_INTS etc.) from CombineFn
> > >   3) Create an Aggregators class in .fn with static factories
> > >      for everything removed from CombineFn
> > >   4) Make the implementation classes private, they just take up
> > >      lots of space in javaodcs
> > >   5) Bonus: Remove combineValues(CombineFn)
> > >
> > > One thing I don't understand: Is there really a need for
> > > AggregatorFactory? Am I overlooking something? It's not like we have
> > > to create Aggregator instances for each key -- that's what
> > > Aggregator.reset() is for.
> > >
> > > Regards,
> > >   Matthias
> >
> 
> 
> 
> -- 
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: CombineFn vs. Aggregator

Reply via email to