I think I figured it out using a replicated join. My initial understanding of the Pig M/R plan was incorrect. It was performing a reduce-side join like so:

Map1.1 (LOAD A), Map1.2 (LOAD B) -> Reduce1 (CROSS, FILTER) -> Map2 (seemingly useless) -> Reduce2 (COUNT)

Since one of my relations is small enough to fit in memory, I can force it to use a map side (replicated) join. Now the plan looks like this:

Map(LOAD A, LOAD B, JOIN, FILTER) -> Combine(COUNT) -> Reduce(COUNT)

On 2/9/14 12:53 PM, "Enns, Steven" <sae...@a9.com> wrote:

>I am trying to aggregate on the cross product of two relations. It can be
>done using a single M/R job but pig is using two. The pig code looks like
>this:
>
>    C = cross A, B;
>    C = filter C by ...;
>    G = group C by x;
>    G = foreach G generate group, COUNT(G);
>
>The resulting M/R plan is this:
>
>    Map1 (LOAD, CROSS) -> Reduce1 (FILTER) -> Map2 (seemingly useless) ->
>Reduce2 (COUNT)
>
>Of course, the IO between Reduce1 and Map2 is massive. This job can only
>be done efficiently if done like so:
>
>    Map1 (LOAD, CROSS) -> Combine1(FILTER, COUNT) -> Reduce1(COUNT)
>
>Is there some way to force pig to use this M/R plan? Or do I have to
>write my own M/R job?
>
>Thanks!
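
For anyone finding this thread later, here is a rough sketch of what the replicated-join version of the script might look like. The load paths, schemas, the constant join key `k`, the aliases `A1`, `B1`, `F` and `R`, and the filter condition are all made up for illustration (the real filter is elided in the original message); only the relation names A, B, C and G come from the thread. It assumes B is the relation that fits in memory, since Pig loads the relation listed last in a replicated join into memory on each mapper.

```pig
-- Hypothetical inputs and schemas, not taken from the thread.
A = LOAD 'input/a' AS (x:chararray, u:long);
B = LOAD 'input/b' AS (y:chararray, v:long);   -- assumed small enough for memory

-- Assumption: emulate the cross product by joining on a constant key.
A1 = FOREACH A GENERATE x, u, 1 AS k;
B1 = FOREACH B GENERATE y, v, 1 AS k;

-- Map-side join: the last relation listed (B1) is replicated to every mapper.
C = JOIN A1 BY k, B1 BY k USING 'replicated';

-- Placeholder condition; substitute the real filter here.
F = FILTER C BY A1::u < B1::v;

-- Inside the FOREACH, the bag in each group is named after the grouped relation (F).
G = GROUP F BY A1::x;
R = FOREACH G GENERATE group, COUNT(F);

STORE R INTO 'output/counts';
```

Because COUNT is an algebraic function, Pig can usually apply it in the combiner, which would give the Map -> Combine -> Reduce shape described above. Running `EXPLAIN R;` should show whether the combiner is actually used for a given script and Pig version.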