I am trying to aggregate over the cross product of two relations.  This can
be done in a single M/R job, but Pig generates two.  The Pig code looks like
this:

        C = cross A, B;
        C = filter C by ...;
        G = group C by x;
        G = foreach G generate group, COUNT(C);

The resulting M/R plan is this:

        Map1 (LOAD, CROSS) -> Reduce1 (FILTER) -> Map2 (seemingly useless)
            -> Reduce2 (COUNT)

Of course, the I/O between Reduce1 and Map2 is massive.  The job can only
run efficiently with a plan like this:

        Map1 (LOAD, CROSS) -> Combine1(FILTER, COUNT) -> Reduce1(COUNT)
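
To make the intended data flow concrete, here is a minimal simulation in plain Python (not Pig, and not real Hadoop code). The sample relations, the split boundaries, and the `keep` predicate are all made up for illustration, since the actual filter condition is not shown. The only point is that each map task emits one small partial count per key instead of the full filtered cross product.

```python
from collections import Counter
from itertools import product

# Hypothetical stand-ins for relations A and B; each tuple is (x, value).
A = [("a", 1), ("b", 2), ("a", 3)]
B = [("p", 10), ("q", 20)]


def keep(row):
    # Placeholder predicate (an assumption); the real filter is elided above.
    (_, a_val), (_, b_val) = row
    return a_val * b_val >= 30


def map_and_combine(a_split, b_all):
    """One map task: cross its split of A with B, filter, pre-count per x."""
    partial = Counter()
    for row in product(a_split, b_all):
        if keep(row):
            partial[row[0][0]] += 1  # group key x comes from the A side
    return partial


def reduce_counts(partials):
    """Reduce: sum the partial counts emitted for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Two simulated map tasks, each handling one split of A.
splits = [A[:2], A[2:]]
result = reduce_counts(map_and_combine(split, B) for split in splits)
print(dict(result))
```

Because COUNT is algebraic (partial counts can simply be added), the filtered tuples never need to leave the map side; only the per-key counts do.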

Is there some way to force Pig to use this M/R plan?  Or do I have to
write my own M/R job?

Thanks!


