Re: Cubing in Pig

Jonathan Coveney Thu, 14 Jul 2011 09:54:46 -0700

Dmitriy, a quick point on your approach...

I assume you meant to do the following, replacing rel with cubed? If you ran
what you pasted, you never actually reference the cubed relation you generate,
which may have influenced the run time.


cubed = foreach rel generate flatten(CubeDimensions(a, b));
cube = foreach (group cubed by $0) generate flatten(group) as (a,
b), COUNT_STAR(cubed);

Gianmarco:

Let's say that rel looks like this:

1,1
1,2
2,2

your results from grouping on (a,b):
(1,1):1
(1,2):1
(2,2):1

grouping on (a,null):
(1,null): 2
(2,null): 1

grouping on (null,b):
(null,1): 1
(null,2): 2

group on (null,null):
(null,null): 3

here is what cubed would look like

{(1,1),(1,null),(null,1),(null,null)}
{(1,2),(1,null),(null,2),(null,null)}
{(2,2),(2,null),(null,2),(null,null)}

When you flatten it out, you'll have

(1,1)
(1,null)
(null,1)
(null,null)
(1,2)
(1,null)
(null,2)
(null,null)
(2,2)
(2,null)
(null,2)
(null,null)

now we group on the value, of which the possibilities/counts are...

(1,1):1
(1,2):1
(2,2):1
(1,null): 2
(2,null): 1
(null,1): 1
(null,2): 2
(null,null): 3

The same as the separate group-bys. What you're doing is blowing up the
intermediate data.
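
To make the equivalence concrete, here is a minimal sketch in plain Python (not Pig). The function name cube_dimensions and the use of None for null are my own stand-ins for the CubeDimensions UDF, whose actual implementation isn't shown in this thread. It expands each row into all of its cube dimensions, then counts by the resulting key:

```python
from collections import Counter
from itertools import product

def cube_dimensions(row):
    """Return every combination of the row's values with some replaced by None."""
    # For each field, choose either the real value or None (the rolled-up marker).
    return [tuple(choice) for choice in product(*[(v, None) for v in row])]

# The example relation from above.
rel = [(1, 1), (1, 2), (2, 2)]

# Rough equivalent of: foreach rel generate flatten(CubeDimensions(a, b))
cubed = [dims for row in rel for dims in cube_dimensions(row)]

# Rough equivalent of grouping cubed on the full (a, b) key and counting.
cube = Counter(cubed)

for key, count in cube.items():
    print(key, count)
```

This assumes the input itself contains no nulls; a real null in a or b would be indistinguishable from a rolled-up dimension in this sketch.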

Now a point on methodology:

To implement the CUBE command, might it be faster to do this in the map job
itself? I.e., when you hit a row, you emit all of the combinations. This is
essentially the same thing, just at a lower level. Of course, for big cubes
the issue is going to be the exponential increase in space: a row with n
dimensions emits 2^n records.
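
A rough sketch of what I mean, again as a plain Python simulation rather than actual Pig or Hadoop code (map_row, combine and reduce_counts are hypothetical names, and the two input splits are made up for illustration). Each mapper emits a count of 1 for every combination of a row, a combiner sums them locally, and the reducer merges the partial counts:

```python
from collections import Counter
from itertools import product

def map_row(row):
    """Emit (key, 1) for each of the 2**n roll-up combinations of an n-field row."""
    for key in product(*[(v, None) for v in row]):
        yield key, 1

def combine(pairs):
    """Sum counts per key locally, as a combiner would on one mapper's output."""
    partial = Counter()
    for key, count in pairs:
        partial[key] += count
    return partial

def reduce_counts(partials):
    """Merge the partial counts coming from all mappers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two hypothetical input splits, each handled by its own mapper.
splits = [[(1, 1), (1, 2)], [(2, 2)]]
partials = [combine(kv for row in split for kv in map_row(row)) for split in splits]
cube = reduce_counts(partials)

# Records emitted per input row grow as 2**n with the number of dimensions n.
emitted_per_row = {n: len(list(map_row(tuple(range(n))))) for n in (2, 3, 10)}
```

The combiner shrinks what goes over the wire when many rows share keys, but the mapper still produces 2^n records per row, which is the blow-up I'm worried about for big cubes.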


2011/7/14 Gianmarco <[email protected]>

> If you want to add a new operator, the right place to add the logic should
> be LogicalPlanBuilder.
>
> Just a question, are you sure this code is correct? I can't understand how
> it works.
>
> cubed = foreach rel generate flatten(CubeDimensions(a, b));
> cube = foreach (group rel by $0) generate flatten(group) as (a, b),
> COUNT_STAR(rel);
>
>
> Cheers,
> --
> Gianmarco De Francisci Morales
>
>
> On Thu, Jul 14, 2011 at 03:05, Dmitriy Ryaboy <[email protected]> wrote:
>
> > Arnab has a really interesting presentation at the post-hadoop-summit
> > Pig meeting about how Cubing could work in Map-Reduce, and suggested a
> > straightforward path to integrating into Pig. Arnab, do you have the
> > presentation posted somewhere?
> >
> > In any case, I started mucking around a little with this, trying to
> > hack in the naive solution.
> >
> > So far, one interesting result, followed by a question:
> >
> > I manually cubed by writing a bunch of group-bys, like so (using pig 8) :
> >
> > ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b),
> > COUNT_STAR(rel) as cnt;
> > a_only = foreach (group rel by (a, null)) generate flatten(group) as
> > (a, b), COUNT_STAR(rel) as cnt;
> > b_only = foreach (group rel by (null, b)) generate flatten(group) as
> > (a, b), COUNT_STAR(rel) as cnt;
> > all_rows = foreach (group rel by (null, null)) generate flatten(group) as
> > (a, b), COUNT_STAR(rel) as cnt;
> > cube = union ab, a_only, b_only, all_rows;
> > store cube ....
> >
> > Except for extra fun, I did this with 3 dimensions and therefore 8
> > groupings. This generated 4 MR jobs, the first of which moved all the
> > data across the wire despite the fact that COUNT_STAR is algebraic. On
> > my test dataset, the work took 18 minutes.
> >
> > I then wrote a UDF that given a tuple, created all the cube dimensions
> > of the tuple -- so CubeDimensions(a, b) returns { (a, b), (a, null),
> > (null, b), (null, null) }, and this works on any number of dimensions.
> > The naive cube then simply becomes this:
> >
> > cubed = foreach rel generate flatten(CubeDimensions(a, b));
> > cube = foreach (group rel by $0) generate flatten(group) as (a, b),
> > COUNT_STAR(rel);
> >
> > On the same dataset, this generated only 1 MR job, and ran in 3
> > minutes because we were able to take advantage of the combiners!
> >
> > Assuming algebraic aggregations, this is actually pretty good given
> > how little work it involves.
> >
> > I looked at adding a new operator that would be (for now) syntactic
> > sugar around this pattern -- basically, "CUBE rel by (a, b, c)" would
> > insert the operators equivalent to the code above.
> >
> > I can muddle my way through the grammar. What's the appropriate place
> > to put the translation logic? Logical to physical compiler? Optimizer?
> > The LogicalPlanBuilder?
> >
> > D
> >
>
