Hi, thanks a lot for your review! I think you're right, removing combineValues(CombinerFn) really is premature since my refactoring of the Aggregator interface that I proposed below didn't work out. The simplified Aggregator isn't composable, so it can't be used for aggregating tuples.
I'll post a new version that also includes documentation later this week. Regards, Matthias On Sunday, 2012-11-04, Josh Wills wrote: > I looked at the review board post for this, and it looks great-- much > cleaner than the old way. > > I think my only objection was w/marking combineValues(CombineFn) as > deprecated, which struck me as premature. I'd be curious to hear the > reasoning on that. > > Josh > > On Sun, Nov 4, 2012 at 4:46 AM, Matthias Friedrich <[email protected]> wrote: > > > Hi, > > > > having looked at every single CombineFn and Aggregator in Crunch's > > code base and in light of CRUNCH-106 [1], I wonder if we could make > > things even easier to use. How about changing the Aggregator interface > > to this: > > > > public interface Aggregator<T> extends Serializable { > > void initialize(Configuration conf); > > Iterable<T> aggregate(Iterable<T> values); > > } > > > > This would make writing anonymous inner classes for combineValues() > > much easier and shorter. Implementations could still keep a buffer > > between calls to aggregate() if they want to optimize memory > > management. > > > > Is there anything I'm overlooking? > > > > Regards, > > Matthias > > > > [1] https://issues.apache.org/jira/browse/CRUNCH-106 > > > > On Friday, 2012-11-02, Matthias Friedrich wrote: > > > Hi, > > > > > > in our last discussion on refactoring we agreed (I think) to clean > > > up CombineFn and move implementations to .fn.CombineFns. While playing > > > with the API, I noticed that it's harder than necessary to get your > > > generics in line. Example: > > > > > > PGroupedTable<String, Integer> table = /* ... */; > > > table.combineValues(CombineFn.<String>SUM_INTS()); > > > > > > You need to specify the String type or you get a compilation error. > > > This leaks an implementation detail that isn't obvious to users; you > > > need to know how CombineFn works to know that you have to specify the > > > key type. And it's just plain ugly. > > > > > > How about using Aggregator more prominently by adding a > > > combineValues(Aggregator<V> a) method to PGroupedTable? We could > > > simplify the code and make it easier to understand: > > > > > > PGroupedTable<String, Integer> table = /* ... */; > > > table.combineValues(Aggregators.SUM_INTS()); > > > > > > Writing custom aggregations would be easier, too, as you don't need to > > > decorate your aggregator with AggregatorCombineFn yourself anymore. We > > > could go even further and remove combineValues(CombineFn), because it > > > doesn't add much value anymore. If you need maximum flexibility (that > > > is, your aggregation depends on the table's key somehow), you can > > > still use parallelDo() with CombineFn directly. I don't think that's > > > a common use case though. > > > > > > Here's what I propose: > > > > > > 1) Leave CombineFn and Aggregator in base > > > 2) Remove all static factories (SUM_INTS etc.) from CombineFn > > > 3) Create an Aggregators class in .fn with static factories > > > for everything removed from CombineFn > > > 4) Make the implementation classes private, they just take up > > > lots of space in javaodcs > > > 5) Bonus: Remove combineValues(CombineFn) > > > > > > One thing I don't understand: Is there really a need for > > > AggregatorFactory? Am I overlooking something? It's not like we have > > > to create Aggregator instances for each key -- that's what > > > Aggregator.reset() is for. > > > > > > Regards, > > > Matthias > > > > > > -- > Director of Data Science > Cloudera <http://www.cloudera.com> > Twitter: @josh_wills <http://twitter.com/josh_wills>
