Jon, I ran the right script, I just wrote out the wrong one in the email :-). I also compared results of both computations to ensure correctness.
Arnab posted his slides: http://pdf.cx/44wrk

My approach is the "naive approach" described in slides 11-17.

D

On Thu, Jul 14, 2011 at 11:54 AM, Jonathan Coveney <[email protected]> wrote:

> Dmitriy, a quick point on your approach...
>
> I assume you meant to do the following, replacing rel with cubed? If you ran
> what you pasted, you don't actually make reference to the cubed that you
> output, which may have influenced run time.
>
> cubed = foreach rel generate flatten(CubeDimensions(a, b));
> cube = foreach (group cubed by $0) generate flatten(group) as (a, b),
>     COUNT_STAR(cubed);
>
> Gianmarco:
>
> Let's say that rel looks like this:
>
> 1,1
> 1,2
> 2,2
>
> your results from grouping on (a,b):
> (1,1): 1
> (1,2): 1
> (2,2): 1
>
> grouping on (a,null):
> (1,null): 2
> (2,null): 1
>
> grouping on (null,b):
> (null,1): 1
> (null,2): 2
>
> grouping on (null,null):
> (null,null): 3
>
> Here is what cubed would look like:
>
> {(1,1),(1,null),(null,1),(null,null)}
> {(1,2),(1,null),(null,2),(null,null)}
> {(2,2),(2,null),(null,2),(null,null)}
>
> When you flatten it out, you'll have
>
> (1,1)
> (1,null)
> (null,1)
> (null,null)
> (1,2)
> (1,null)
> (null,2)
> (null,null)
> (2,2)
> (2,null)
> (null,2)
> (null,null)
>
> Now we group on the value, of which the possibilities/counts are...
>
> (1,1): 1
> (1,2): 1
> (2,2): 1
> (1,null): 2
> (2,null): 1
> (null,1): 1
> (null,2): 2
> (null,null): 3
>
> The same. What you're doing is blowing up the intermediate info.
>
> Now a point on methodology:
>
> To implement the CUBE command, might it be faster to do this in the map job
> itself? I.e., when you hit a row, you emit all of the combinations. This is
> essentially the same thing, just at a lower level. Of course, for big cubes
> the issue is going to be the exponential increase in space.
>
>
> 2011/7/14 Gianmarco <[email protected]>
>
>> If you want to add a new operator, the right place to add the logic should
>> be LogicalPlanBuilder.
>>
>> Just a question, are you sure this code is correct?
>> I can't understand how it works.
>>
>> cubed = foreach rel generate flatten(CubeDimensions(a, b));
>> cube = foreach (group rel by $0) generate flatten(group) as (a, b),
>>     COUNT_STAR(rel);
>>
>>
>> Cheers,
>> --
>> Gianmarco De Francisci Morales
>>
>>
>> On Thu, Jul 14, 2011 at 03:05, Dmitriy Ryaboy <[email protected]> wrote:
>>
>> > Arnab gave a really interesting presentation at the post-Hadoop-Summit
>> > Pig meeting about how Cubing could work in Map-Reduce, and suggested a
>> > straightforward path to integrating it into Pig. Arnab, do you have the
>> > presentation posted somewhere?
>> >
>> > In any case, I started mucking around a little with this, trying to
>> > hack in the naive solution.
>> >
>> > So far, one interesting result, followed by a question:
>> >
>> > I manually cubed by writing a bunch of group-bys, like so (using pig 8):
>> >
>> > ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b),
>> >     COUNT_STAR(rel) as cnt;
>> > a_only = foreach (group rel by (a, null)) generate flatten(group) as
>> >     (a, b), COUNT_STAR(rel) as cnt;
>> > b_only = foreach (group rel by (null, b)) generate flatten(group) as
>> >     (a, b), COUNT_STAR(rel) as cnt;
>> > total = foreach (group rel by (null, null)) generate flatten(group) as
>> >     (a, b), COUNT_STAR(rel) as cnt;
>> > cube = union ab, a_only, b_only, total;
>> > store cube ....
>> >
>> > Except for extra fun, I did this with 3 dimensions and therefore 8
>> > groupings. This generated 4 MR jobs, the first of which moved all the
>> > data across the wire despite the fact that COUNT_STAR is algebraic. On
>> > my test dataset, the work took 18 minutes.
>> >
>> > I then wrote a UDF that, given a tuple, creates all the cube dimensions
>> > of the tuple -- so CubeDimensions(a, b) returns { (a, b), (a, null),
>> > (null, b), (null, null) }, and this works on any number of dimensions.
>> > The naive cube then simply becomes this:
>> >
>> > cubed = foreach rel generate flatten(CubeDimensions(a, b));
>> > cube = foreach (group rel by $0) generate flatten(group) as (a, b),
>> >     COUNT_STAR(rel);
>> >
>> > On the same dataset, this generated only 1 MR job, and ran in 3
>> > minutes because we were able to take advantage of the combiners!
>> >
>> > Assuming algebraic aggregations, this is actually pretty good given
>> > how little work it involves.
>> >
>> > I looked at adding a new operator that would be (for now) syntactic
>> > sugar around this pattern -- basically, "CUBE rel by (a, b, c)" would
>> > insert the operators equivalent to the code above.
>> >
>> > I can muddle my way through the grammar. What's the appropriate place
>> > to put the translation logic? Logical to physical compiler? Optimizer?
>> > The LogicalPlanBuilder?
>> >
>> > D
>> >
>> >
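
For readers who want to check the arithmetic in this thread, here is a minimal Python sketch of the naive approach: expand each row into all of its cube projections, then group on the expanded tuple and count. It is not the actual CubeDimensions UDF, whose source does not appear in the thread. The names `cube_dimensions` and `naive_cube` are hypothetical, `None` stands in for Pig's null, and the input values are assumed to be non-null.

```python
from collections import Counter
from itertools import product


def cube_dimensions(row):
    """Return every cube projection of row.

    Each field is either kept or replaced by None, so a row with
    n fields yields 2**n tuples.
    """
    return [
        tuple(value if keep else None for value, keep in zip(row, mask))
        for mask in product((True, False), repeat=len(row))
    ]


def naive_cube(rows):
    """Count rows per cube cell: expand each row, then group on the result."""
    counts = Counter()
    for row in rows:
        # Equivalent in spirit to flatten(CubeDimensions(...)) followed by
        # a group-by on the expanded tuple and COUNT_STAR.
        counts.update(cube_dimensions(row))
    return dict(counts)


# The three-row relation from Jonathan's walkthrough.
rel = [(1, 1), (1, 2), (2, 2)]
cube = naive_cube(rel)

# Sort on repr so that None and int never need to be compared directly.
for cell in sorted(cube, key=repr):
    print(cell, cube[cell])
```

Each input row turns into 2**n intermediate tuples, which is the exponential growth in space raised above. Because counting is algebraic, partial counts for the same cell can be summed before the final grouping, which is the property the combiners rely on.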
