I think showing the data at every step will help.

rel:
(green,tall)
(red,short)
(green,short)

cubed:
(green,tall)
(green,)
(,tall)
(,)
(red,short)
(red,)
(,short)
(,)
(green,short)
(green,)
(,short)
(,)

cube:
I did mess up typing the code in the email -- it should look more like this:

cube = foreach (group cubed by (color, height)) generate
    flatten(group) as (color, height), COUNT_STAR(cubed);

(red,short,1)
(red,,1)
(green,tall,1)
(green,short,1)
(green,,2)
(,tall,1)
(,short,2)
(,,3)

On Thu, Jul 14, 2011 at 6:13 AM, Gianmarco <gianmarco....@gmail.com> wrote:
> If you want to add a new operator the right place to add the logic should be
> LogicalPlanBuilder.
>
> Just a question, are you sure this code is correct? I can't understand how
> it works.
>
> cubed = foreach rel generate flatten(CubeDimensions(a, b));
> cube = foreach (group rel by $0) generate flatten(group) as (a, b),
> COUNT_STAR(rel);
>
>
> Cheers,
> --
> Gianmarco De Francisci Morales
>
>
> On Thu, Jul 14, 2011 at 03:05, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>
>> Arnab has a really interesting presentation at the post-hadoop-summit
>> Pig meeting about how Cubing could work in Map-Reduce, and suggested a
>> straightforward path to integrating into Pig. Arnab, do you have the
>> presentation posted somewhere?
>>
>> In any case, I started mucking around a little with this, trying to
>> hack in the naive solution.
>>
>> So far, one interesting result, followed by a question:
>>
>> I manually cubed by writing a bunch of group-bys, like so (using pig 8) :
>>
>> ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b),
>> COUNT_STAR(rel) as cnt;
>> a_only = foreach (group rel by (a, null)) generate flatten(group) as
>> (a, b), COUNT_STAR(rel) as cnt;
>> b_only = foreach (group rel by (null, b)) generate flatten(group) as
>> (a, b), COUNT_STAR(rel) as cnt;
>> ab = foreach (group rel by (null, null)) generate flatten(group) as
>> (a, b), COUNT_STAR(rel) as cnt;
>> cube = union ab, a_only, b_only, ab;
>> store cube ....
>>
>> Except for extra fun, I did this with 3 dimensions and therefore 8
>> groupings. This generated 4 MR jobs, the first of which moved all the
>> data across the wire despite the fact that COUNT_STAR is algebraic. On
>> my test dataset, the work took 18 minutes.
>>
>> I then wrote a UDF that given a tuple, created all the cube dimensions
>> of the tuple -- so CubeDimensions(a, b) returns { (a, b), (a, null),
>> (null, b), (null, null) }, and this works on any number of dimensions.
>> The naive cube then simply becomes this:
>>
>> cubed = foreach rel generate flatten(CubeDimensions(a, b));
>> cube = foreach (group rel by $0) generate flatten(group) as (a, b),
>> COUNT_STAR(rel);
>>
>> On the same dataset, this generated only 1 MR job, and ran in 3
>> minutes because we were able to take advantage of the combiners!
>>
>> Assuming algebraic aggregations, this is actually pretty good given
>> how little work it involves.
>>
>> I looked at adding a new operator that would be (for now) syntactic
>> sugar around this pattern -- basically, "CUBE rel by (a, b, c)" would
>> insert the operators equivalent to the code above.
>>
>> I can muddle my way through the grammar. What's the appropriate place
>> to put the translation logic? Logical to physical compiler? Optimizer?
>> The LogicalPlanBuilder?
>>
>> D
>>
>
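
For anyone who wants to check the rel / cubed / cube walk-through without a Pig install, here is a minimal Python sketch of the same three steps. It is only a simulation under stated assumptions: cube_dimensions is a hypothetical stand-in for the CubeDimensions UDF (whose actual source is not shown in this thread), None plays the role of null, and a Counter stands in for the group-by plus COUNT_STAR.

```python
from collections import Counter
from itertools import product


def cube_dimensions(*dims):
    """Hypothetical stand-in for the CubeDimensions UDF.

    Returns every combination of keeping each dimension or replacing
    it with None (null), so n dimensions yield 2**n tuples.
    """
    return [
        tuple(d if keep else None for d, keep in zip(dims, mask))
        for mask in product((True, False), repeat=len(dims))
    ]


# The sample relation from the walk-through.
rel = [("green", "tall"), ("red", "short"), ("green", "short")]

# cubed = foreach rel generate flatten(CubeDimensions(color, height));
cubed = [t for row in rel for t in cube_dimensions(*row)]

# cube = foreach (group cubed by (color, height)) generate
#     flatten(group) as (color, height), COUNT_STAR(cubed);
cube = Counter(cubed)

for (color, height), cnt in cube.items():
    # Print nulls as empty fields, the way Pig's dump does.
    print(f"({color or ''},{height or ''},{cnt})")
```

The point the sketch makes is the same one as the corrected script: the grouping has to run over cubed (the exploded tuples), not over rel, otherwise the null-padded combinations never reach the group-by.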