Hi,
having looked at every single CombineFn and Aggregator in Crunch's
code base and in light of CRUNCH-106 [1], I wonder if we could make
things even easier to use. How about changing the Aggregator interface
to this:
public interface Aggregator<T> extends Serializable {
void initialize(Configuration conf);
Iterable<T> aggregate(Iterable<T> values);
}
This would make writing anonymous inner classes for combineValues()
much easier and shorter. Implementations could still keep a buffer
between calls to aggregate() if they want to optimize memory
management.
Is there anything I'm overlooking?
Regards,
Matthias
[1] https://issues.apache.org/jira/browse/CRUNCH-106
On Friday, 2012-11-02, Matthias Friedrich wrote:
> Hi,
>
> in our last discussion on refactoring we agreed (I think) to clean
> up CombineFn and move implementations to .fn.CombineFns. While playing
> with the API, I noticed that it's harder than necessary to get your
> generics in line. Example:
>
> PGroupedTable<String, Integer> table = /* ... */;
> table.combineValues(CombineFn.<String>SUM_INTS());
>
> You need to specify the String type or you get a compilation error.
> This leaks an implementation detail that isn't obvious to users; you
> need to know how CombineFn works to know that you have to specify the
> key type. And it's just plain ugly.
>
> How about using Aggregator more prominently by adding a
> combineValues(Aggregator<V> a) method to PGroupedTable? We could
> simplify the code and make it easier to understand:
>
> PGroupedTable<String, Integer> table = /* ... */;
> table.combineValues(Aggregators.SUM_INTS());
>
> Writing custom aggregations would be easier, too, as you don't need to
> decorate your aggregator with AggregatorCombineFn yourself anymore. We
> could go even further and remove combineValues(CombineFn), because it
> doesn't add much value anymore. If you need maximum flexibility (that
> is, your aggregation depends on the table's key somehow), you can
> still use parallelDo() with CombineFn directly. I don't think that's
> a common use case though.
>
> Here's what I propose:
>
> 1) Leave CombineFn and Aggregator in base
> 2) Remove all static factories (SUM_INTS etc.) from CombineFn
> 3) Create an Aggregators class in .fn with static factories
> for everything removed from CombineFn
> 4) Make the implementation classes private, they just take up
> lots of space in javaodcs
> 5) Bonus: Remove combineValues(CombineFn)
>
> One thing I don't understand: Is there really a need for
> AggregatorFactory? Am I overlooking something? It's not like we have
> to create Aggregator instances for each key -- that's what
> Aggregator.reset() is for.
>
> Regards,
> Matthias