M/R is a useful but sometimes leaky abstraction. :) 2012/10/15 Alberto Cordioli <cordioli.albe...@gmail.com>
> Ok, I've found.... > > I was using values that return all the same value % number of reducers. > For an unfortunate case I tested always multiple values......Ohhh, my > fault. > > > Cheers, > Alberto > > > On 13 October 2012 13:32, Alberto Cordioli <cordioli.albe...@gmail.com> > wrote: > > With 'key' you mean the join key, right? > > If so, something doesn't work. I have 24 distinct values for my join > > key, but with a parallel degree of 20, all the values go to the same > > reducer. > > > > > > Alberto. > > > > > > On 12 October 2012 22:17, Dmitriy Ryaboy <dvrya...@gmail.com> wrote: > >> The default partitioning algorithm is basically this: > >> > >> reducer_id = key.hashCode() % num_reducers > >> > >> If you are joining on values that all map to the same reducer_id using > >> this function, they will go to the same reducer. But if you have a > >> reasonable hash code distribution and a decent volume of unique keys, > >> they should be evenly distributed. > >> > >> Dmitriy > >> > >> On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli > >> <cordioli.albe...@gmail.com> wrote: > >>> Hi all, > >>> > >>> I have a simple question about join in Pig. > >>> I want to do a simple self join on a relation in Pig. So I load two > >>> instances of the same relation in this way: > >>> > >>> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); > >>> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); > >>> > >>> Then, I join by the second field: > >>> x = JOIN I1 BY b, I2 BY b PARALLEL 20; > >>> > >>> > >>> I would expect that Pig assigns to a reducer tuples with the same join > key. > >>> For example if my relation I1=I2 is: > >>> > >>> x 1 10 > >>> x 2 15 > >>> y 1 4 > >>> y 4 7 > >>> > >>> I expect one reducer join first and third tuples and another the other > two. > >>> What happens instead is that a single reducer do the join for all the > >>> tuples. This results in 19 useless reducer and 1 overloaded. > >>> > >>> > >>> Can someone explain me why this happens? The standard Pig join does > >>> not parallelize the work by join key? > >>> > >>> > >>> Thanks, > >>> Alberto > >>> > >>> -- > >>> Alberto Cordioli > > > > > > > > -- > > Alberto Cordioli > > > > -- > Alberto Cordioli >