[jira] [Commented] (CRUNCH-133) Add Aggregator support for combineValues ops on secondary keys via maps and collections

Josh Wills (JIRA) Sun, 23 Dec 2012 12:58:14 -0800

    [ 
https://issues.apache.org/jira/browse/CRUNCH-133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539093#comment-13539093
 ]


Josh Wills commented on CRUNCH-133:
-----------------------------------

Gabriel, thanks for the review. I agree with the issues you point out, and that 
the complexity this patch introduces isn't clearly worth the benefit.

We could go back to returning AggregatorFactory instances instead of Aggregator 
instances from the factory methods, but again, that imposes a cognitive cost 
that may not be worthwhile for the one use case. It would seem simpler to me to 
limit the collections() and maps() aggregator methods to taking in 
AggregatorFactory instances and to leave everything else (i.e., the general 
case) alone. This is one of those times where Java's lack of first-class 
functions is a real pain. :)
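
For illustration only, here is a minimal sketch of why a map-level aggregator would want a factory rather than a single instance: each map key needs its own running state, so the aggregator has to be able to create a fresh per-key aggregator on demand. The names SimpleAggregator, SimpleAggregatorFactory, SumLongs and MapAggregator are hypothetical stand-ins, not the Crunch API or the contents of CRUNCH-133.patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified interfaces; NOT the Crunch Aggregator API.
interface SimpleAggregator<T> {
    void reset();
    void update(T value);
    T result();
}

// Hypothetical factory: hands out a fresh, independent aggregator each call.
interface SimpleAggregatorFactory<T> {
    SimpleAggregator<T> create();
}

// A stateful sum; sharing one instance across map keys would mix their totals.
final class SumLongs implements SimpleAggregator<Long> {
    private long sum;
    public void reset() { sum = 0L; }
    public void update(Long value) { sum += value; }
    public Long result() { return sum; }
}

// Aggregates Map<K, V> values by keeping one aggregator per map key.
final class MapAggregator<K, V> implements SimpleAggregator<Map<K, V>> {
    private final SimpleAggregatorFactory<V> factory;
    private final Map<K, SimpleAggregator<V>> perKey = new HashMap<>();

    MapAggregator(SimpleAggregatorFactory<V> factory) {
        this.factory = factory;
    }

    public void reset() { perKey.clear(); }

    public void update(Map<K, V> value) {
        for (Map.Entry<K, V> e : value.entrySet()) {
            SimpleAggregator<V> agg = perKey.get(e.getKey());
            if (agg == null) {
                // First time this key is seen: ask the factory for new state.
                agg = factory.create();
                agg.reset();
                perKey.put(e.getKey(), agg);
            }
            agg.update(e.getValue());
        }
    }

    public Map<K, V> result() {
        Map<K, V> out = new HashMap<>();
        for (Map.Entry<K, SimpleAggregator<V>> e : perKey.entrySet()) {
            out.put(e.getKey(), e.getValue().result());
        }
        return out;
    }
}
```

With Java 8 or later (which postdates this thread), the factory can be passed as a constructor reference, e.g. `new MapAggregator<String, Long>(SumLongs::new)`; in earlier versions it would need an anonymous class, which is the verbosity complained about above.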
                
> Add Aggregator support for combineValues ops on secondary keys via maps and 
> collections
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-133
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-133
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Josh Wills
>         Attachments: CRUNCH-133.patch
>
>
> Sawzall has a neat trick where you can do aggregations on secondary keys via 
> maps, which is useful in cases where you might want to aggregate some data at 
> both (for example) a country and a city level within a single MapReduce 
> job. We had a thread on crunch-user about this pattern:
> http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201212.mbox/%3CCAH29n6O-aHXTPHCRpSuAkAGUjvDR%3D56%3D-OLq9K9mZje%2BwVB4-Q%40mail.gmail.com%3E
> The pattern ends up looking something like this:
> // Define a table that has long values at both the K and the <K, String>
> // levels.
> PTable<K, Pair<Long, Map<String, Long>>> in = ...;
> // Define and apply an Aggregator that can handle sums at both levels within
> // a single MR job.
> Aggregator<Pair<Long, Map<String, Long>>> a =
>     pairAggregator(SUM_LONGS(), map(Aggregators.SUM_LONGS()));
> PTable<K, Pair<Long, Map<String, Long>>> out =
>     in.groupByKey().combineValues(a);
> ...which would run substantially faster than executing two dependent MR jobs, 
> one that did the city aggregation and then a second follow-up job that did 
> the country aggregation.
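
As a rough illustration of the two-level combine described in the quoted issue, the following plain-Java sketch folds (total, per-city map) values into one result in a single pass. CountryTotals is a hypothetical stand-in for Pair<Long, Map<String, Long>>; it does not use Crunch and is not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical value type standing in for Pair<Long, Map<String, Long>>.
final class CountryTotals {
    // Sum at the K (e.g. country) level.
    long total;
    // Sums at the <K, String> (e.g. city) level.
    final Map<String, Long> byCity = new HashMap<>();

    // Fold one (count, per-city counts) value into the running result,
    // updating both levels in the same pass.
    void add(long count, Map<String, Long> cityCounts) {
        total += count;
        for (Map.Entry<String, Long> e : cityCounts.entrySet()) {
            byCity.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }
}
```

Because both sums are associative, the same fold could in principle run in a combiner as well as a reducer, which is what lets the pattern fit in one job.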

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
