Replying to all...

On Thu, Jun 21, 2012 at 8:40 AM, Josh Wills <[email protected]> wrote:
>
> So there's a philosophical issue here: should Crunch ever make
> decisions about how to do something itself based on its estimates of
> the size of the data sets, or should it always do exactly what the
> developer indicates?
>
> I can make a case either way, but I think that no matter what, we
> would want to have explicit functions for performing a join that reads
> one data set into memory, so I think we can proceed w/the
> implementation while folks weigh in on what their preferences are for
> the default join() behavior (e.g., just do a reduce-side join, or try
> to figure out the best join given information about the input data and
> some configuration parameters.)
>
I definitely agree on needing to have an explicit way to invoke one or the other -- and in general I don't like having magic behind the scenes to decide on behaviour (especially considering Crunch is generally intended to be closer to the metal than Pig and Hive).

I'm not sure if the runtime decision is something specific to some of my use cases or if it could be useful to a wider audience. The ability to dynamically decide at runtime whether a map side join should be used can also easily be tacked on outside of Crunch, and won't impact the underlying implementation (as you pointed out), so I definitely also agree on focusing on the underlying implementation first, and we can worry about the options used for exposing it later on.

- Gabriel
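
For illustration only, here is a rough sketch of how a runtime decision could sit outside of Crunch, as discussed above. Nothing in it is Crunch API: `JoinChooser`, `Strategy`, `choose`, `mapSideJoin` and the `maxInMemoryBytes` threshold are all hypothetical names, and the in-memory hash join only stands in for "a join that reads one data set into memory". The size estimates are assumed to come from wherever the caller can get them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper living outside Crunch; none of these names are Crunch API.
class JoinChooser {
    enum Strategy { MAP_SIDE, REDUCE_SIDE }

    // Pick a map-side join only when the smaller input is estimated to fit
    // within the caller-supplied memory budget; otherwise fall back to reduce-side.
    static Strategy choose(long leftBytes, long rightBytes, long maxInMemoryBytes) {
        long smaller = Math.min(leftBytes, rightBytes);
        return smaller <= maxInMemoryBytes ? Strategy.MAP_SIDE : Strategy.REDUCE_SIDE;
    }

    // In-memory stand-in for a map-side (hash) join: load the small side into a
    // lookup table, stream the large side, and emit "key:largeValue:smallValue"
    // for every key present on both sides (inner join).
    static List<String> mapSideJoin(Map<String, String> small, List<String[]> large) {
        Map<String, String> lookup = new HashMap<>(small);
        List<String> out = new ArrayList<>();
        for (String[] kv : large) {
            String match = lookup.get(kv[0]);
            if (match != null) {
                out.add(kv[0] + ":" + kv[1] + ":" + match);
            }
        }
        return out;
    }
}
```

The explicit join functions stay the primitives here. A caller who wants the automatic choice calls `choose` and dispatches accordingly, and everyone else simply invokes the join they want.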
