I'm trying to use the ShardedJoinStrategy for a join where ~80% of the underlying data has the same key; using DefaultJoinStrategy results in one reducer holding up the entire pipeline. My intuition was that I could use the ShardedJoinStrategy to work around the duplicate key problem.

The issue I am having is that when I try to do a left join, an UnsupportedOperationException is thrown. Looking into the code reveals that a right join will not throw this exception. Why can I not use the LEFT_OUTER_JOIN JoinType, when Crunch allows using a RIGHT_OUTER_JOIN? If I restructure my code to do functionally the same thing but with a right join instead, will I see performance degradation?

Possibly relevant: I also found this article <https://labs.spotify.com/2014/12/19/torching-our-reducers-taught-us-this-essential-lesson/> from Spotify Engineering that describes how they did a sharded left join "from scratch" using Crunch; they claim they are not using ShardedJoinStrategy because of an "underlying bug".

Note: I posted this <https://groups.google.com/a/cloudera.org/forum/#!topic/crunch-users/hEQuCMzkT-Y> on Google Groups as well, before seeing that this mailing list was more active.

-- Rob Sanek
