I want to close the loop on this, since it's clearly a popular feature. I know we don't have JIRA yet, but does anyone want to volunteer to own map-side join development? I'm happy to help out with the work, but I have a lot of Crunch documentation to write and I want to make that my primary focus.
On Sat, Jun 16, 2012 at 1:49 AM, Gabriel Reid <[email protected]> wrote: >>> One of the functions that I find most useful in Pig is the map side >>> join; Pig will put a file in the distributed cache, load it into >>> memory, and do a join from the mappers. I'd like to add this to >>> Crunch, but wasn't sure what the best way to do this would be. Do any >>> of you guys have any thoughts on this? >> >> I have a few, but they're not quite baked yet. We should have some >> other folks weigh in. >> > > Map-side joins is definitely #1 on my wish list of things for Crunch, > and it's also something I've been thinking about a lot lately in terms > of how to implement it. > > One of the ideas that I've had about this is adding an overload of the > join method on PTable to allow supplying join settings, for example > something like this: > > JoinSettings joinSettings = new JoinSettings(); > joinSettings.setJoinOperation(JoinSettings.LeftOuterJoin); > joinSettings.allowMapsideJoin(); > PTable<K, Pair<U,V>> joined = tableA.join(tableB, joinSettings); > > The idea is that you could let Crunch decide (at the time of job > creation) if a join would be done in memory or not, depending on the > size of (one of) the incoming tables or any other heuristics. If a > join is performed with a JoinSettings that has allowMapsideJoin set, > then obviously the developer needs to be aware that there is a good > chance that the joined table won't be sorted (which will be the case > if a standard join is used). > > Obviously this approach removes some control from the user in terms of > what exactly happens under the covers, so that's something that we > would need to take into account. However, in my day job situations > come up quite often where we're using the same code to deal with both > large joins and small joins depending on the dataset, so it would be > nice to use the same Crunch flow for all cases. Of course, it's also > an option to just write this explicitly instead of baking it directly > into Crunch. > > In any case, I'm definitely in favor of having map-side joins be > possible (and easy) with PCollections, and not only with > java.util.Collections. There are definitely use cases where you have a > huge dataset that you want to reduce/aggregate down to a small dataset > and then join with another huge dataset. > > Definitely happy to hear that other people are interested in having > map-side joins as well! > > - Gabriel -- Director of Data Science Cloudera Twitter: @josh_wills
