Re: map side ("replicated") joins in Crunch

Josh Wills Wed, 20 Jun 2012 15:32:24 -0700

I want to close the loop on this, since it's clearly a popular
feature. I know we don't have JIRA yet, but does anyone want to
volunteer to own map-side join development? I'm happy to help out with
the work, but I have a lot of Crunch documentation to write and I want
to make that my primary focus.


On Sat, Jun 16, 2012 at 1:49 AM, Gabriel Reid <[email protected]> wrote:
>>> One of the functions that I find most useful in Pig is the map side
>>> join; Pig will put a file in the distributed cache, load it into
>>> memory, and do a join from the mappers. I'd like to add this to
>>> Crunch, but wasn't sure what the best way to do this would be. Do any
>>> of you guys have any thoughts on this?
>>
>> I have a few, but they're not quite baked yet. We should have some
>> other folks weigh in.
>>
>
> Map-side joins is definitely #1 on my wish list of things for Crunch,
> and it's also something I've been thinking about a lot lately in terms
> of how to implement it.
>
> One of the ideas that I've had about this is adding an overload of the
> join method on PTable to allow supplying join settings, for example
> something like this:
>
>    JoinSettings joinSettings = new JoinSettings();
>    joinSettings.setJoinOperation(JoinSettings.LeftOuterJoin);
>    joinSettings.allowMapsideJoin();
>    PTable<K, Pair<U,V>> joined = tableA.join(tableB, joinSettings);
>
> The idea is that you could let Crunch decide (at the time of job
> creation) if a join would be done in memory or not, depending on the
> size of (one of) the incoming tables or any other heuristics. If a
> join is performed with a JoinSettings that has allowMapsideJoin set,
> then obviously the developer needs to be aware that there is a good
> chance that the joined table won't be sorted (which will be the case
> if a standard join is used).
>
> Obviously this approach removes some control from the user in terms of
> what exactly happens under the covers, so that's something that we
> would need to take into account. However, in my day job situations
> come up quite often where we're using the same code to deal with both
> large joins and small joins depending on the dataset, so it would be
> nice to use the same Crunch flow for all cases. Of course, it's also
> an option to just write this explicitly instead of baking it directly
> into Crunch.
>
> In any case, I'm definitely in favor of having map-side joins be
> possible (and easy) with PCollections, and not only with
> java.util.Collections. There are definitely use cases where you have a
> huge dataset that you want to reduce/aggregate down to a small dataset
> and then join with another huge dataset.
>
> Definitely happy to hear that other people are interested in having
> map-side joins as well!
>
> - Gabriel



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Re: map side ("replicated") joins in Crunch

Reply via email to