Gabriel should probably weigh in, but I believe that reversing your PTables and using the RIGHT_OUTER_JOIN strategy should work correctly. I believe the reason the left outer join isn't supported is because each value in the "left" arg to ShadedJoinStrategy.join(left, right, type) is replicated N times (where N is the sharding number), whereas each value in the right arg is only written once (and randomly goes to one of the N sharded replicas generated for the left-side values.) That means that we expect some of the left-side values to not have right-side entries just b/c of the way the algorithm is implemented, and not because of the actual values in the underlying data, so we wouldn't be able to tell if a left-side value was missing a right-side entry b/c of the algorithm or b/c of the data.
J On Tue, Sep 29, 2015 at 4:46 PM, Robert Sanek <[email protected]> wrote: > I'm trying to use the ShardedJoinStrategy for a join where ~80% of the > underlying data has the same key; using DefaultJoinStrategy results in > one reducer holding up the entire pipeline. My intuition was that I could > use the ShardedJoinStrategy to work around the duplicate key problem. > > The issue that I am having is that when I try to do a left join, an > UnsupportedOperationException is throw. Looking into the code reveals > that a right join will not throw this exception. Why can I not use the > LEFT_OUTER_JOIN > JoinType, but Crunch allows using a RIGHT_OUTER_JOIN? If I restructure my > code to functionally do the same thing but use a right join instead, will I > see performance degradation? > > Possibly relevant: I also found this article > <https://labs.spotify.com/2014/12/19/torching-our-reducers-taught-us-this-essential-lesson/> > from Spotify Engineering that described how they did a sharded left join > "from scratch" using Crunch; they claim they are not using > ShardedJoinStrategy because of an "underlying bug". > > Note: I posted this > <https://groups.google.com/a/cloudera.org/forum/#!topic/crunch-users/hEQuCMzkT-Y> > on Google Groups as well before seeing that this mailing list was more > active. > > -- > Rob Sanek > -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
