[ 
https://issues.apache.org/jira/browse/CRUNCH-133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539002#comment-13539002
 ] 

Gabriel Reid commented on CRUNCH-133:
-------------------------------------

Sorry for the super slow uptake on this Josh. I started looking at this last 
week, and it took a while to grok it, and by then I was out of time to really 
review it -- and then it just didn't happen. Thanks for the reminder anyhow.

I've added some comments (or actually repeated the same comment over and over) 
on RB, but I also wanted to comment on the general approach here.

I can get that this is a use case that could be useful, but (in what seems to 
be a common thread for me lately) I'm a bit hesitant to add the arity and 
especially the copy methods to Aggregator, as these seem very specific to this 
use case, and it seems a shame to add two methods to what is otherwise a pretty 
simple and clean interface for this one use case.

I was thinking (although I'm not totally sure) that the Aggregator interface 
could possibly be split out into two interfaces, something like UnaryAggregator 
(which doesn't include an arity method, and could probably just be called 
Aggregator) and an ArityNAggregator, which would have an arity method (and 
would need a better name than ArityNAggregator). The CollectionAggregator and 
MapAggregator would then only use UnaryAggregators. Anyhow, it's just an idea, 
and I might be missing some important details.

I'm actually more worried about the copy method, because it feels like that can 
be a risky/tricky one to do properly. As I mentioned on RB, I think that a call 
to copy on a decorating Aggregator should probably call copy on the underlying 
Aggregators instead of using the same reference. Even then, I worry about 
issues around transient state that comes from Configurations being properly 
copied in implementations of copy, etc.

As an alternative, how about just using an AggregatorFactory (or 
UnaryAggregatorFactory) in the MapAggregator and CollectionAggregator? Then 
there's no need to call copy everywhere.


                
> Add Aggregator support for combineValues ops on secondary keys via maps and 
> collections
> ---------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-133
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-133
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Josh Wills
>         Attachments: CRUNCH-133.patch
>
>
> Sawzall has a neat trick where you can do aggregations on secondary keys via 
> maps, which is useful in cases where you might want to aggregate some data at 
> (for example) both a country and at a city level within a single MapReduce 
> job. We had a thread on crunch-user about this pattern:
> http://mail-archives.apache.org/mod_mbox/incubator-crunch-user/201212.mbox/%3CCAH29n6O-aHXTPHCRpSuAkAGUjvDR%3D56%3D-OLq9K9mZje%2BwVB4-Q%40mail.gmail.com%3E
> The pattern ends up looking something like this:
> // Define a table that has long values at both the K and the <K, String> 
> levels.
> PTable<K, Pair<Long, Map<String, Long>>> in = ...;
> // Define and apply an Aggregator that can handle sums at both levels within 
> a single MR job.
> Aggregator<Pair<Long, Map<String, Long>>> a = pairAggregator(SUM_LONGS(), 
> map(Aggregators.SUM_LONGS()));
> PTable<K, Pair<Long, Map<String, Long>>> out = 
> in.groupByKey().combineValues(a);
> ...which would run substantially faster than executing two dependent MR jobs, 
> one that did the city aggregation and then a second follow-up job that did 
> the country aggregation.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to