I think I figured it out using replicated join.

My initial understanding of the Pig M/R plan was incorrect.  It was
actually performing a reduce-side join, like so:

        Map1.1 (LOAD A) 

        Map1.2 (LOAD B) -> Reduce1 (CROSS, FILTER) -> Map2 (seemingly useless) -> Reduce2 (COUNT)



Since one of my relations is small enough to fit in memory, I can force Pig
to use a map-side (replicated) join.  Now the plan looks like this:

        Map(LOAD A, LOAD B, JOIN, FILTER) -> Combine(COUNT) -> Reduce(COUNT)
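
For reference, here is a rough sketch of what the replicated-join version of
the script could look like.  The input paths, the schemas, the join key k and
the filter condition are all placeholders I made up for illustration (the
original script does not show them), so treat this as an outline rather than
the exact code:

        -- Hypothetical inputs and schemas; substitute the real ones.
        A = load 'a_input' as (x:chararray, k:int);  -- large relation
        B = load 'b_input' as (k:int, v:int);        -- small relation, must fit in memory

        -- The small relation is listed last; Pig loads it into memory in each map task.
        C = join A by k, B by k using 'replicated';
        C = filter C by B::v > 0;                    -- placeholder for the real filter condition
        G = group C by A::x;
        G = foreach G generate group, COUNT(C);      -- COUNT is algebraic, so the combiner can be used

Note that this only applies when the filter contains an equality condition
that can serve as the join key.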


On 2/9/14 12:53 PM, "Enns, Steven" <sae...@a9.com> wrote:

>I am trying to aggregate on the cross product of two relations.  It can be
>done using a single M/R job but pig is using two.  The pig code looks like
>this:
>
>       C = cross A, B;
>       C = filter C by ...;
>       G = group C by x;
>       G = foreach G generate group, COUNT(C);
>
>The resulting M/R plan is this:
>
>       Map1 (LOAD, CROSS) -> Reduce1 (FILTER) -> Map2 (seemingly useless) -> Reduce2 (COUNT)
>
>Of course, the IO between Reduce1 and Map2 is massive.  This job can only
>be done efficiently if done like so:
>
>       Map1 (LOAD, CROSS) -> Combine1(FILTER, COUNT) -> Reduce1(COUNT)
>
>Is there some way to force pig to use this M/R plan?  Or do I have to
>write my own M/R job?
>
>Thanks!
>
>
>
