If you want to add a new operator, the right place for the translation logic should be the LogicalPlanBuilder.
Just a question, are you sure this code is correct? I can't understand how it works.

cubed = foreach rel generate flatten(CubeDimensions(a, b));
cube = foreach (group rel by $0) generate flatten(group) as (a, b), COUNT_STAR(rel);

Cheers,
--
Gianmarco De Francisci Morales

On Thu, Jul 14, 2011 at 03:05, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> Arnab has a really interesting presentation at the post-hadoop-summit
> Pig meeting about how Cubing could work in Map-Reduce, and suggested a
> straightforward path to integrating into Pig. Arnab, do you have the
> presentation posted somewhere?
>
> In any case, I started mucking around a little with this, trying to
> hack in the naive solution.
>
> So far, one interesting result, followed by a question:
>
> I manually cubed by writing a bunch of group-bys, like so (using pig 8) :
>
> ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b),
> COUNT_STAR(rel) as cnt;
> a_only = foreach (group rel by (a, null)) generate flatten(group) as
> (a, b), COUNT_STAR(rel) as cnt;
> b_only = foreach (group rel by (null, b)) generate flatten(group) as
> (a, b), COUNT_STAR(rel) as cnt;
> ab = foreach (group rel by (null, null)) generate flatten(group) as
> (a, b), COUNT_STAR(rel) as cnt;
> cube = union ab, a_only, b_only, ab;
> store cube ....
>
> Except for extra fun, I did this with 3 dimensions and therefore 8
> groupings. This generated 4 MR jobs, the first of which moved all the
> data across the wire despite the fact that COUNT_STAR is algebraic. On
> my test dataset, the work took 18 minutes.
>
> I then wrote a UDF that given a tuple, created all the cube dimensions
> of the tuple -- so CubeDimensions(a, b) returns { (a, b), (a, null),
> (null, b), (null, null) }, and this works on any number of dimensions.
> The naive cube then simply becomes this:
>
> cubed = foreach rel generate flatten(CubeDimensions(a, b));
> cube = foreach (group rel by $0) generate flatten(group) as (a, b),
> COUNT_STAR(rel);
>
> On the same dataset, this generated only 1 MR job, and ran in 3
> minutes because we were able to take advantage of the combiners!
>
> Assuming algebraic aggregations, this is actually pretty good given
> how little work it involves.
>
> I looked at adding a new operator that would be (for now) syntactic
> sugar around this pattern -- basically, "CUBE rel by (a, b, c)" would
> insert the operators equivalent to the code above.
>
> I can muddle my way through the grammar. What's the appropriate place
> to put the translation logic? Logical to physical compiler? Optimizer?
> The LogicalPlanBuilder?
>
> D
>
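
For anyone following the thread, here is a rough sketch of the expand-then-group idea from the quoted message. It is written in plain Python, not Pig, and it is not Dmitriy's UDF: the names cube_dimensions and naive_cube and the sample rows are made up for illustration. It assumes the aggregate is a simple count and that real dimension values are never None, since None stands in for the "all values" marker here.

```python
from collections import Counter
from itertools import product


def cube_dimensions(*dims):
    """Return every variant of the tuple with each dimension kept or set to None."""
    # For each dimension, pick either the real value or None ("all values").
    # With n dimensions this yields 2**n tuples.
    return [tuple(choice) for choice in product(*[(d, None) for d in dims])]


def naive_cube(rows):
    """Count rows per cube cell by expanding each row, then grouping on the key."""
    counts = Counter()
    for row in rows:
        # Equivalent in spirit to flattening the expanded bag and grouping on it.
        for key in cube_dimensions(*row):
            counts[key] += 1
    return counts


# Hypothetical sample relation with two dimensions (a, b).
rel = [("x", 1), ("x", 2), ("y", 1)]
cube = naive_cube(rel)
```

Each input row contributes one record per grouping, so a single group-and-count pass covers all the groupings that the manual version built with separate group-bys and a union.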