Hi guys,

Attached (hopefully) is a patch for an initial implementation of map side joins. It's currently implemented as a static method in a class called MapsideJoin, with the same interface as the existing Join class (only inner joins are implemented for now). The way it works is that the right-side PTable of the join is put in the distributed cache and then loaded by the join function at runtime, so the right side needs to be small enough to fit in memory.
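
To make the idea concrete, here is a rough sketch of the general technique in plain Java, with no Crunch or Hadoop dependencies. It is not the code from the patch: the class name MapsideJoinSketch and the method innerJoin are made up for illustration, and the right side is simply passed in as a list rather than read from the distributed cache.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the MapsideJoin class from the patch. */
public class MapsideJoinSketch {

    /**
     * Inner-joins two keyed collections by holding the right side in memory.
     * Assumption: the right side is small enough to fit in memory. In a real
     * job it would be read from the distributed cache by each map task.
     */
    public static <K, U, V> List<Map.Entry<K, Map.Entry<U, V>>> innerJoin(
            List<Map.Entry<K, U>> left, List<Map.Entry<K, V>> right) {
        // Index the (small) right side by key; a key may have several values.
        Map<K, List<V>> rightByKey = new HashMap<>();
        for (Map.Entry<K, V> r : right) {
            rightByKey.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }

        // Stream the (large) left side and emit one pair per matching right value.
        // Left records without a match are dropped, which makes this an inner join.
        List<Map.Entry<K, Map.Entry<U, V>>> joined = new ArrayList<>();
        for (Map.Entry<K, U> l : left) {
            List<V> matches = rightByKey.getOrDefault(l.getKey(), Collections.<V>emptyList());
            for (V rv : matches) {
                Map.Entry<U, V> pair = new SimpleImmutableEntry<>(l.getValue(), rv);
                joined.add(new SimpleImmutableEntry<>(l.getKey(), pair));
            }
        }
        return joined;
    }
}
```

Because the lookup table is built once and the left side is only streamed, no shuffle or reduce phase is needed, which is the whole point of doing the join on the map side.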
There's one spot that I can see for a potentially interesting optimization -- MRPipeline#run is called once for each map side join that is set up, but if the setup of the joins was done within MRPipeline, then we could set up multiple map side joins in parallel with a single call to MRPipeline#run. OTOH, a whole bunch of map side joins in parallel probably isn't that common of an operation.

If anyone feels like taking a look at the patch, any feedback is appreciated. If nobody sees something that needs serious changes in the patch, I'll commit it.

- Gabriel

On Thu, Jun 21, 2012 at 9:09 AM, Gabriel Reid <[email protected]> wrote:
> Replying to all...
>
> On Thu, Jun 21, 2012 at 8:40 AM, Josh Wills <[email protected]> wrote:
>>
>> So there's a philosophical issue here: should Crunch ever make
>> decisions about how to do something itself based on its estimates of
>> the size of the data sets, or should it always do exactly what the
>> developer indicates?
>>
>> I can make a case either way, but I think that no matter what, we
>> would want to have explicit functions for performing a join that reads
>> one data set into memory, so I think we can proceed w/the
>> implementation while folks weigh in on what their preferences are for
>> the default join() behavior (e.g., just do a reduce-side join, or try
>> to figure out the best join given information about the input data and
>> some configuration parameters.)
>>
>
> I definitely agree on needing to have an explicit way to invoke one or
> the other -- and in general I don't like having magic behind the
> scenes to decide on behaviour (especially considering Crunch is
> generally intended to be closer to the metal than Pig and Hive). I'm
> not sure if the runtime decision is something specific to some of my
> use cases or if it could be useful to a wider audience.
>
> The ability to dynamically decide at runtime whether a map side join
> should be used can also easily be tacked on outside of Crunch, and
> won't impact the underlying implementation (as you pointed out), so I
> definitely also agree on focusing on the underlying implementation
> first, and we can worry about the options used for exposing it later
> on.
>
> - Gabriel
