Re: map side ("replicated") joins in Crunch

Gabriel Reid Thu, 21 Jun 2012 00:09:52 -0700

Replying to all...

On Thu, Jun 21, 2012 at 8:40 AM, Josh Wills <[email protected]> wrote:
>
> So there's a philosophical issue here: should Crunch ever make
> decisions about how to do something itself based on its estimates of
> the size of the data sets, or should it always do exactly what the
> developer indicates?
>
> I can make a case either way, but I think that no matter what, we
> would want to have explicit functions for performing a join that reads
> one data set into memory, so I think we can proceed w/the
> implementation while folks weigh in on what their preferences are for
> the default join() behavior (e.g., just do a reduce-side join, or try
> to figure out the best join given information about the input data and
> some configuration parameters.)
>


I definitely agree on needing to have an explicit way to invoke one or
the other -- and in general I don't like having magic behind the
scenes to decide on behaviour (especially considering Crunch is
generally intended to be closer to the metal than Pig and Hive). I'm
not sure if the runtime decision is something specific to some of my
use cases or if it could be useful to a wider audience.

The ability to dynamically decide at runtime whether a map side join
should be used can also easily be tacked on outside of Crunch, and
won't impact the underlying implementation (as you pointed out), so I
definitely also agree on focusing on the underlying implementation
first, and we can worry about the options used for exposing it later
on.
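
To make the "tacked on outside of Crunch" idea concrete, here is a minimal
sketch in plain Java. It deliberately uses no Crunch API: JoinChooser, choose(),
mapSideJoin() and the byte threshold are all hypothetical names for
illustration, and the in-memory collections are only stand-ins for the two
inputs. It assumes the caller already has a size estimate for the smaller side
and that keys on that side are unique.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper living outside of Crunch; none of these names are Crunch API. */
final class JoinChooser {

    enum Strategy { MAP_SIDE, REDUCE_SIDE }

    /**
     * Picks a strategy from a caller-supplied size estimate of the smaller input.
     * The threshold is an assumed configuration parameter, not a Crunch setting.
     */
    static Strategy choose(long estimatedSmallSideBytes, long maxInMemoryBytes) {
        return estimatedSmallSideBytes <= maxInMemoryBytes
                ? Strategy.MAP_SIDE
                : Strategy.REDUCE_SIDE;
    }

    /**
     * In-memory stand-in for a map-side (replicated) inner join: the small side
     * is held as a hash map and the large side is streamed past it.
     * Assumes each key appears at most once on the small side.
     */
    static <K, U, V> List<String> mapSideJoin(Map<K, U> small, List<Map.Entry<K, V>> large) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<K, V> row : large) {
            U match = small.get(row.getKey());
            if (match != null) {
                // Emit key, small-side value, large-side value.
                joined.add(row.getKey() + ":" + match + "," + row.getValue());
            }
        }
        return joined;
    }
}
```

The caller would check choose() first and only then invoke whichever explicit
join function applies, so the join implementations themselves stay free of any
size-based magic.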

- Gabriel
