I am trying to aggregate over the cross product of two relations. This can be done in a single M/R job, but Pig is using two. The Pig code looks like this:

    C = cross A, B;
    C = filter C by ...;
    G = group C by x;
    G = foreach G generate group, COUNT(C);

The resulting M/R plan is this:

    Map1 (LOAD, CROSS) -> Reduce1 (FILTER) -> Map2 (seemingly useless) -> Reduce2 (COUNT)

Of course, the I/O between Reduce1 and Map2 is massive. The job can only run efficiently with a plan like this:

    Map1 (LOAD, CROSS) -> Combine1 (FILTER, COUNT) -> Reduce1 (COUNT)

Is there some way to force Pig to use this M/R plan? Or do I have to write my own M/R job?

Thanks!
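
For reference, the single-job plan I have in mind can be sketched in plain Python. This is only a toy, in-memory simulation of the three stages, not Pig or Hadoop code. The data, the filter predicate, and the grouping key are all made-up stand-ins, and it assumes B is small enough to be available to every mapper.

```python
from collections import Counter
from itertools import product


def map_cross(a_split, b_rows):
    """Map1: emit every (a, b) pair for one input split of A.

    Assumption: B is available in full to every mapper.
    """
    for a, b in product(a_split, b_rows):
        yield (a, b)


def combine(pairs, keep, key):
    """Combine1: apply the filter and count locally per group key."""
    partial = Counter()
    for pair in pairs:
        if keep(pair):
            partial[key(pair)] += 1
    return partial


def reduce_counts(partials):
    """Reduce1: sum the partial counts coming from all combiners."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical data: A is split across two mappers, B is shared.
A = [1, 2, 3, 4]
B = [10, 20]
splits = [A[:2], A[2:]]

# Hypothetical stand-ins for the filter condition and the group key x.
keep = lambda pair: pair[0] + pair[1] > 12
key = lambda pair: pair[1]

partials = [combine(map_cross(split, B), keep, key) for split in splits]
result = reduce_counts(partials)
```

The point is that only the small per-key partial counts leave each mapper, so the full cross product never has to be written out between stages.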