Re: map side ("replicated") joins in Crunch

Gabriel Reid Tue, 03 Jul 2012 00:23:41 -0700

Hi guys,

Attached (hopefully) is a patch for an initial implementation of map
side joins. It's currently implemented as a static method in a class
called MapsideJoin, with the same interface as the existing Join class
(with only inner joins being implemented for now). The way it works is
that the right-side PTable of the join is put in the distributed cache
and then read by the join function at runtime.
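
To illustrate the idea for anyone who hasn't looked at the patch yet:
the following is not the code from the patch, just a rough sketch of
the same technique in plain Java with no Crunch or Hadoop dependencies.
The class name MapsideJoinSketch and the method innerJoin are made up
for this example, and the in-memory "right" list stands in for the
table that would be read back from the distributed cache. It assumes
the right side fits in memory, which is the whole premise of a map
side join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapsideJoinSketch {

    // Hypothetical helper (not part of Crunch): inner-joins two lists of
    // key/value entries by loading the right side into memory and then
    // streaming over the left side.
    public static <K, U, V> List<Map.Entry<K, Map.Entry<U, V>>> innerJoin(
            List<Map.Entry<K, U>> left, List<Map.Entry<K, V>> right) {

        // Build phase: index the (small) right side by key. In a real job
        // this data would be read from the distributed cache in each mapper.
        Map<K, List<V>> index = new HashMap<>();
        for (Map.Entry<K, V> r : right) {
            index.computeIfAbsent(r.getKey(), k -> new ArrayList<>())
                 .add(r.getValue());
        }

        // Probe phase: for each left-side record, look up matching
        // right-side values and emit one output pair per match.
        List<Map.Entry<K, Map.Entry<U, V>>> out = new ArrayList<>();
        for (Map.Entry<K, U> l : left) {
            List<V> matches = index.get(l.getKey());
            if (matches == null) {
                continue; // inner join: left records without a match are dropped
            }
            for (V v : matches) {
                out.add(Map.entry(l.getKey(), Map.entry(l.getValue(), v)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> left = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("c", 3));
        List<Map.Entry<String, String>> right = List.of(
                Map.entry("a", "x"), Map.entry("a", "y"), Map.entry("c", "z"));

        for (Map.Entry<String, Map.Entry<Integer, String>> e : innerJoin(left, right)) {
            System.out.println(e.getKey() + " -> ("
                    + e.getValue().getKey() + ", " + e.getValue().getValue() + ")");
        }
    }
}
```

Since no shuffle or reduce phase is needed for the join itself, the
trade-off is purely memory: every map task holds its own copy of the
right side.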


There's one spot where I can see a potentially interesting
optimization -- MRPipeline#run is called once for each map side join
that is set up, but if the setup of the joins were done within
MRPipeline, then we could set up multiple map side joins in parallel
with a single call to MRPipeline#run. OTOH, running a whole bunch of
map side joins in parallel probably isn't that common an operation.

If anyone feels like taking a look at the patch, any feedback is
appreciated. If nobody sees anything in the patch that needs serious
changes, I'll commit it.

- Gabriel


On Thu, Jun 21, 2012 at 9:09 AM, Gabriel Reid <[email protected]> wrote:
> Replying to all...
>
> On Thu, Jun 21, 2012 at 8:40 AM, Josh Wills <[email protected]> wrote:
>>
>> So there's a philosophical issue here: should Crunch ever make
>> decisions about how to do something itself based on its estimates of
>> the size of the data sets, or should it always do exactly what the
>> developer indicates?
>>
>> I can make a case either way, but I think that no matter what, we
>> would want to have explicit functions for performing a join that reads
>> one data set into memory, so I think we can proceed w/the
>> implementation while folks weigh in on what their preferences are for
>> the default join() behavior (e.g., just do a reduce-side join, or try
>> to figure out the best join given information about the input data and
>> some configuration parameters.)
>>
>
> I definitely agree on needing to have an explicit way to invoke one or
> the other -- and in general I don't like having magic behind the
> scenes to decide on behaviour (especially considering Crunch is
> generally intended to be closer to the metal than Pig and Hive). I'm
> not sure if the runtime decision is something specific to some of my
> use cases or if it could be useful to a wider audience.
>
> The ability to dynamically decide at runtime whether a map side join
> should be used can also easily be tacked on outside of Crunch, and
> won't impact the underlying implementation (as you pointed out), so I
> definitely also agree on focusing on the underlying implementation
> first, and we can worry about the options used for exposing it later
> on.
>
> - Gabriel
