I looked at the review board post for this, and it looks great-- much cleaner than the old way.
I think my only objection was w/marking combineValues(CombineFn) as deprecated, which struck me as premature. I'd be curious to hear the reasoning on that. Josh On Sun, Nov 4, 2012 at 4:46 AM, Matthias Friedrich <[email protected]> wrote: > Hi, > > having looked at every single CombineFn and Aggregator in Crunch's > code base and in light of CRUNCH-106 [1], I wonder if we could make > things even easier to use. How about changing the Aggregator interface > to this: > > public interface Aggregator<T> extends Serializable { > void initialize(Configuration conf); > Iterable<T> aggregate(Iterable<T> values); > } > > This would make writing anonymous inner classes for combineValues() > much easier and shorter. Implementations could still keep a buffer > between calls to aggregate() if they want to optimize memory > management. > > Is there anything I'm overlooking? > > Regards, > Matthias > > [1] https://issues.apache.org/jira/browse/CRUNCH-106 > > On Friday, 2012-11-02, Matthias Friedrich wrote: > > Hi, > > > > in our last discussion on refactoring we agreed (I think) to clean > > up CombineFn and move implementations to .fn.CombineFns. While playing > > with the API, I noticed that it's harder than necessary to get your > > generics in line. Example: > > > > PGroupedTable<String, Integer> table = /* ... */; > > table.combineValues(CombineFn.<String>SUM_INTS()); > > > > You need to specify the String type or you get a compilation error. > > This leaks an implementation detail that isn't obvious to users; you > > need to know how CombineFn works to know that you have to specify the > > key type. And it's just plain ugly. > > > > How about using Aggregator more prominently by adding a > > combineValues(Aggregator<V> a) method to PGroupedTable? We could > > simplify the code and make it easier to understand: > > > > PGroupedTable<String, Integer> table = /* ... */; > > table.combineValues(Aggregators.SUM_INTS()); > > > > Writing custom aggregations would be easier, too, as you don't need to > > decorate your aggregator with AggregatorCombineFn yourself anymore. We > > could go even further and remove combineValues(CombineFn), because it > > doesn't add much value anymore. If you need maximum flexibility (that > > is, your aggregation depends on the table's key somehow), you can > > still use parallelDo() with CombineFn directly. I don't think that's > > a common use case though. > > > > Here's what I propose: > > > > 1) Leave CombineFn and Aggregator in base > > 2) Remove all static factories (SUM_INTS etc.) from CombineFn > > 3) Create an Aggregators class in .fn with static factories > > for everything removed from CombineFn > > 4) Make the implementation classes private, they just take up > > lots of space in javaodcs > > 5) Bonus: Remove combineValues(CombineFn) > > > > One thing I don't understand: Is there really a need for > > AggregatorFactory? Am I overlooking something? It's not like we have > > to create Aggregator instances for each key -- that's what > > Aggregator.reset() is for. > > > > Regards, > > Matthias > -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
