I think showing the data at every step will help.

rel:

(green,tall)
(red,short)
(green,short)

cubed:
(green,tall)
(green,)
(,tall)
(,)
(red,short)
(red,)
(,short)
(,)
(green,short)
(green,)
(,short)
(,)

cube: I did mess up typing the code in the email -- it should look
more like this:
cube = foreach (group cubed by (color, height)) generate
flatten(group) as (color, height), COUNT_STAR(cubed);

(red,short,1)
(red,,1)
(green,tall,1)
(green,short,1)
(green,,2)
(,tall,1)
(,short,2)
(,,3)
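
In case it is useful, here is a rough sketch of what a CubeDimensions-style UDF could look like. This is illustrative only (the implementation details, such as the bitmask enumeration and variable names, are my own assumptions, not necessarily how the actual UDF is written):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Given a tuple of n dimensions, emit a bag with all 2^n combinations,
// substituting null for each rolled-up dimension.
public class CubeDimensions extends EvalFunc<DataBag> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        int n = input.size();
        DataBag output = bagFactory.newDefaultBag();
        // Each bit of mask decides whether dimension i is kept or nulled out.
        for (int mask = 0; mask < (1 << n); mask++) {
            Tuple t = tupleFactory.newTuple(n);
            for (int i = 0; i < n; i++) {
                t.set(i, ((mask >> i) & 1) == 1 ? input.get(i) : null);
            }
            output.add(t);
        }
        return output;
    }
}

Flattening the bag this returns for each row of rel is what produces the cubed relation shown above.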


On Thu, Jul 14, 2011 at 6:13 AM, Gianmarco <gianmarco....@gmail.com> wrote:
> If you want to add a new operator the right place to add the logic should be
> LogicalPlanBuilder.
>
> Just a question, are you sure this code is correct? I can't understand how
> it works.
>
> cubed = foreach rel generate flatten(CubeDimensions(a, b));
> cube = foreach (group rel by $0) generate flatten(group) as (a, b),
> COUNT_STAR(rel);
>
>
> Cheers,
> --
> Gianmarco De Francisci Morales
>
>
> On Thu, Jul 14, 2011 at 03:05, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>
>> Arnab gave a really interesting presentation at the post-Hadoop-Summit
>> Pig meeting about how cubing could work in MapReduce, and suggested a
>> straightforward path to integrating it into Pig. Arnab, do you have the
>> presentation posted somewhere?
>>
>> In any case, I started mucking around a little with this, trying to
>> hack in the naive solution.
>>
>> So far, one interesting result, followed by a question:
>>
>> I manually cubed by writing a bunch of group-bys, like so (using Pig 0.8):
>>
>> ab = foreach (group rel by (a, b)) generate flatten(group) as (a, b),
>> COUNT_STAR(rel) as cnt;
>> a_only = foreach (group rel by (a, null)) generate flatten(group) as
>> (a, b), COUNT_STAR(rel) as cnt;
>> b_only = foreach (group rel by (null, b)) generate flatten(group) as
>> (a, b), COUNT_STAR(rel) as cnt;
>> neither = foreach (group rel by (null, null)) generate flatten(group)
>> as (a, b), COUNT_STAR(rel) as cnt;
>> cube = union ab, a_only, b_only, neither;
>> store cube ....
>>
>> Except, for extra fun, I did this with 3 dimensions and therefore 8
>> groupings. This generated 4 MR jobs, the first of which moved all the
>> data across the wire despite the fact that COUNT_STAR is algebraic. On
>> my test dataset, the work took 18 minutes.
>>
>> I then wrote a UDF that, given a tuple, creates all the cube dimensions
>> of the tuple -- so CubeDimensions(a, b) returns { (a, b), (a, null),
>> (null, b), (null, null) }, and this works on any number of dimensions.
>> The naive cube then simply becomes this:
>>
>> cubed = foreach rel generate flatten(CubeDimensions(a, b));
>> cube = foreach (group rel by $0) generate flatten(group) as (a, b),
>> COUNT_STAR(rel);
>>
>> On the same dataset, this generated only 1 MR job, and ran in 3
>> minutes because we were able to take advantage of the combiners!
>>
>> Assuming algebraic aggregations, this is actually pretty good given
>> how little work it involves.
>>
>> I looked at adding a new operator that would be (for now) syntactic
>> sugar around this pattern -- basically, "CUBE rel by (a, b, c)" would
>> insert the operators equivalent to the code above.
>>
>> I can muddle my way through the grammar. What's the appropriate place
>> to put the translation logic? Logical to physical compiler? Optimizer?
>> The LogicalPlanBuilder?
>>
>> D
>>
>
