[
https://issues.apache.org/jira/browse/CRUNCH-133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539093#comment-13539093
]
Josh Wills commented on CRUNCH-133:
-----------------------------------
Gabriel, thanks for the review. I agree with the issues you point out, and that
the complexity this patch introduces isn't clearly worth the benefit.
We could go back to returning AggregatorFactory instances instead of Aggregator
instances from the factory methods, but again, that imposes a cognitive cost
that may not be worthwhile for the one use case. It would seem simpler to me to
limit the collections() and maps() aggregator methods to taking
AggregatorFactory instances and to leave everything else (i.e., the general
case) alone. This is one of those times where Java's lack of first-class
functions is a real pain. :)
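To make the maps() point concrete, here is a minimal sketch in plain Java. It is
not the Crunch API: SimpleAggregator, SumLongs, MapAggregator, and sumByCity are
hypothetical names, and java.util.function.Supplier (Java 8) stands in for
AggregatorFactory. The assumption it illustrates is that a map aggregator needs
a fresh stateful aggregator for each secondary key, which is presumably why it
would take a factory rather than a single shared instance.

```java
import java.util.HashMap;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

public class MapAggregatorSketch {

  /** Hypothetical, simplified aggregator; NOT Crunch's Aggregator interface. */
  interface SimpleAggregator<T> {
    void update(T value);
    T result();
  }

  /** Stateful sum of longs: each instance carries its own running total. */
  static final class SumLongs implements SimpleAggregator<Long> {
    private long sum = 0L;
    @Override public void update(Long value) { sum += value; }
    @Override public Long result() { return sum; }
  }

  /** Aggregates map values per secondary key, creating one aggregator per key. */
  static final class MapAggregator<V> implements SimpleAggregator<Map<String, V>> {
    private final Supplier<SimpleAggregator<V>> factory; // stand-in for AggregatorFactory
    private final Map<String, SimpleAggregator<V>> perKey = new TreeMap<>();

    MapAggregator(Supplier<SimpleAggregator<V>> factory) {
      this.factory = factory;
    }

    @Override public void update(Map<String, V> value) {
      for (Map.Entry<String, V> e : value.entrySet()) {
        // A new aggregator is created the first time a secondary key is seen.
        perKey.computeIfAbsent(e.getKey(), k -> factory.get()).update(e.getValue());
      }
    }

    @Override public Map<String, V> result() {
      Map<String, V> out = new TreeMap<>();
      for (Map.Entry<String, SimpleAggregator<V>> e : perKey.entrySet()) {
        out.put(e.getKey(), e.getValue().result());
      }
      return out;
    }
  }

  /** Sums the values for one primary key, broken out by secondary key. */
  static Map<String, Long> sumByCity(List<Map<String, Long>> values) {
    MapAggregator<Long> agg = new MapAggregator<>(SumLongs::new);
    for (Map<String, Long> v : values) {
      agg.update(v);
    }
    return agg.result();
  }

  public static void main(String[] args) {
    Map<String, Long> first = new HashMap<>();
    first.put("SF", 3L);
    first.put("NYC", 5L);
    Map<String, Long> second = new HashMap<>();
    second.put("SF", 4L);

    List<Map<String, Long>> values = new ArrayList<>();
    values.add(first);
    values.add(second);

    System.out.println(sumByCity(values)); // prints {NYC=5, SF=7}
  }
}
```

Passing a single shared SumLongs instance instead of SumLongs::new would mix
the totals for every secondary key into one running sum, which is the problem
the factory avoids in this sketch.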
> Add Aggregator support for combineValues ops on secondary keys via maps and
> collections
> ---------------------------------------------------------------------------------------
>
> Key: CRUNCH-133
> URL: https://issues.apache.org/jira/browse/CRUNCH-133
> Project: Crunch
> Issue Type: New Feature
> Reporter: Josh Wills
> Attachments: CRUNCH-133.patch
>
>
> Sawzall has a neat trick where you can do aggregations on secondary keys via
> maps, which is useful in cases where you might want to aggregate some data at
> (for example) both a country and at a city level within a single MapReduce
> job. We had a thread on crunch-user about this pattern:
> http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201212.mbox/%3CCAH29n6O-aHXTPHCRpSuAkAGUjvDR%3D56%3D-OLq9K9mZje%2BwVB4-Q%40mail.gmail.com%3E
> The pattern ends up looking something like this:
> // Define a table that has long values at both the K and the <K, String>
> // levels.
> PTable<K, Pair<Long, Map<String, Long>>> in = ...;
> // Define and apply an Aggregator that can handle sums at both levels within
> // a single MR job.
> Aggregator<Pair<Long, Map<String, Long>>> a =
>     pairAggregator(SUM_LONGS(), map(Aggregators.SUM_LONGS()));
> PTable<K, Pair<Long, Map<String, Long>>> out =
>     in.groupByKey().combineValues(a);
> ...which would run substantially faster than executing two dependent MR jobs,
> one that did the city aggregation and then a second follow-up job that did
> the country aggregation.