GROUPING SETS syntax

Prasanth J Thu, 21 Jun 2012 13:29:30 -0700

Hello all 

I initially implemented ROLLUP as a separate operation with the following syntax


a = ROLLUP inp BY (x,y);

which does the same thing as CUBE (inserting foreach + group-by in the logical 
plan) except that it uses the RollupDimensions UDF. The issue with this 
approach is that we cannot mix CUBE and ROLLUP operations in the same 
statement, which is a typical use case. SQL/Oracle supports using CUBE and 
ROLLUP together like

GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

So I modified the Pig grammar to support similar usage. We can now use a 
syntax similar to SQL:

out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);

In this approach, the logical plan should introduce a Cartesian product between 
the bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) to generate the 
final output. But the documentation 
(http://pig.apache.org/docs/r0.10.0/basic.html#cross) says that the CROSS 
operator is expensive and advises using it sparingly. 
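
For illustration only, here is a small Python sketch (not Pig code) of the
group-by combinations that the statement above implies. It assumes SQL-style
semantics, where comma-separated CUBE and ROLLUP clauses are combined by
taking the cross product of their grouping sets. The helper names cube_sets,
rollup_sets and combine are hypothetical and do not correspond to anything in
the Pig code base.

```python
from itertools import combinations, product


def cube_sets(dims):
    """All 2^n subsets of dims, largest first (what CUBE(dims) groups by)."""
    return [combo
            for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]


def rollup_sets(dims):
    """The n+1 prefixes of dims, longest first (what ROLLUP(dims) groups by)."""
    return [tuple(dims[:size]) for size in range(len(dims), -1, -1)]


def combine(*clauses):
    """Cross product of the clauses' grouping sets.

    Each combination is concatenated into one grouping set; a dimension
    that appears in more than one clause is kept only once.
    """
    combined = []
    for parts in product(*clauses):
        merged = []
        for part in parts:
            for dim in part:
                if dim not in merged:
                    merged.append(dim)
        combined.append(tuple(merged))
    return combined


# CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f)
sets = combine(cube_sets(("a", "b", "c")),
               rollup_sets(("c", "d")),
               cube_sets(("e", "f")))
print(len(sets))  # 8 * 3 * 4 = 96 combinations before removing duplicates
```

Because c appears in two clauses, some of the 96 combinations collapse to the
same grouping set. Whether to de-duplicate them is a design choice that this
sketch leaves open.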

Is there a less expensive way to achieve the Cartesian product? Also, does 
anyone have thoughts about this new syntax? 

Thanks
-- Prasanth

On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:

> As far as the underlying implementation, if they all use the same
> optimizations that you use in cube, then it can be LOCube. If they have
> their own optimizations etc (or could), it may be worth them having their
> own Logical operators (which might just be LOCube with flags for the time
> being) that allows us more flexibility. But I suppose that's between you,
> eclipse, and your GSOC mentor.
> 
> 2012/5/30 Prasanth J <[email protected]>
> 
>> Thanks Alan and Jon for expressing your views.
>> 
>> I agree with Jon's point: if the syntax contains CUBE, then the user expects
>> it to perform a CUBE operation. So Jon's syntax seems more meaningful and
>> concise:
>> 
>> rel = CUBE rel BY (dims);
>> rel = ROLLUP rel BY (dims);
>> rel = GROUPING_SET rel BY (dims);
>> 
>> The two reasons why I prefer not to use the SQL syntax are:
>> 1) I do not want to break into the existing Group operator implementation :)
>> 2) The syntax gets longer in the case of partial hierarchical cubing/rollups.
>> For example:
>> 
>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>> ROLLUP(dim7,dim8,dim9);
>> 
>> whereas the same thing can be expressed as
>> 
>> rel = ROLLUP rel BY dim0,
>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>> 
>> Thanks Alan for pointing out the way for independently managing the
>> operators in parser and logical/physical plan. So for all these operators
>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>> differentiate between these three operations.
>> 
>> But yes, we are proliferating operators in this case.
>> 
>> Thanks
>> -- Prasanth
>> 
>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>> 
>>> 
>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>> 
>>>> I was going to say the same thing Alan said w.r.t. operators: operators
>>>> in the grammar can correspond to whatever logical and physical operators
>>>> you want.
>>>> 
>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>> already pretty astonishing. I don't know why we would bend over
>>>> backwards to make the syntax so similar in this case when even getting
>>>> to the point of doing a CUBE means understanding an object model that is
>>>> pretty different from SQL.
>>>> 
>>>> On that note,
>>>> 
>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>> 
>>>> seems really confusing. I'd much rather overload the group operator
>>>> than the cube operator. If I see "cube," I expect a cube. If you start
>>>> doing rollups etc, that's not a cube, it's a group. Or it's just a
>>>> rollup. Pig Latin is simple enough that I don't think having a rollup,
>>>> group_set, etc operator will be so confusing, because they're already
>>>> going to be typing that stuff in the context of
>>>>
>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>> more, confusing syntax for the sake of creating parallels with a
>>>> language we now share very little with.
>>> 
>>> Fair points.
>>> 
>>>> 
>>>> But I won't beat it any further... if people prefer a different syntax,
>>>> that's fine. Just excited to have the features in Pig!
>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>> Jon's).
>>> 
>>> Alan.
>>> 
>>>> Jon
>>>> 
>>>> 2012/5/30 Alan Gates <[email protected]>
>>>> 
>>>>> Some thoughts on this:
>>>>> 
>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>> 
>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>> logical plan versus physical plan.  The choices you make in the
>>>>> grammar are totally independent of the other two.  That is, you could
>>>>> choose the syntax:
>>>>> 
>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>> 
>>>>> and still have a separate POCube operator.  When the parser sees GROUP
>>>>> BY CUBE it will generate an LOCube operator for the logical plan rather
>>>>> than an LOGroup operator.  You can still have a separate POCube physical
>>>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube
>>>>> and POGroup vs. POCube.
>>>>> 
>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
>>>>> possible and for the syntax proposed by Prasanth.  The argument for
>>>>> sticking close to SQL is it conforms to the law of least astonishment.
>>>>> It wouldn't be exactly SQL, as it would end up looking like:
>>>>> 
>>>>> rel = GROUP rel BY CUBE (cols)
>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>> 
>>>>> The argument I see for sticking with Prasanth's approach is that GROUP
>>>>> is really short for COGROUP in Pig Latin, and I don't think we're
>>>>> proposing doing COGROUP rel BY CUBE, nor can I see a case where you'd
>>>>> want to do such a thing.  This makes CUBE really a separate operation.
>>>>> But if we go this route I agree with Prasanth we should do CUBE rel BY
>>>>> ROLLUP and CUBE rel BY GROUPING SETS.  Let's not proliferate operators.
>>>>> 
>>>>> Alan.
>>>>> 
>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>> 
>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>> 
>>>>>> The reason why I came up with a clause rather than a separate operator
>>>>>> was to avoid adding additional operators to the grammar.
>>>>>> Adding ROLLUP and GROUPING_SET would need separate logical operators,
>>>>>> adding to the complexity. I am planning to keep everything under the
>>>>>> cube operator, so only the LOCube and POCube operators will be added.
>>>>>> And as you and Dmitriy have mentioned, the purpose of a HAVING clause
>>>>>> is the same as FILTER, so we do not need a separate HAVING clause.
>>>>>> 
>>>>>> I will give a quick recap of cube related operations and multiple
>>>>>> syntax options for achieving the same. I am also adding partial cubing
>>>>>> and rollup in this discussion.
>>>>>> 
>>>>>> 1) CUBE
>>>>>> 
>>>>>> Current syntax:
>>>>>> alias = CUBE rel BY (a, b);
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a, b)
>>>>>> (a)
>>>>>> (b)
>>>>>> ()
>>>>>> 
>>>>>> 2) Partial CUBE
>>>>>> 
>>>>>> Proposed syntax:
>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a, b, c)
>>>>>> (a, b)
>>>>>> (a, c)
>>>>>> (a)
>>>>>> 
>>>>>> 3) ROLLUP
>>>>>> 
>>>>>> Proposed syntax 1:
>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>> 
>>>>>> Proposed syntax 2:
>>>>>> alias = CUBE rel BY (a::b);
>>>>>> 
>>>>>> Proposed syntax 3:
>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a, b)
>>>>>> (a)
>>>>>> ()
>>>>>> 
>>>>>> 4) Partial ROLLUP
>>>>>> 
>>>>>> Proposed syntax 1:
>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>> 
>>>>>> Proposed syntax 2:
>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>> 
>>>>>> Proposed syntax 3:
>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a, b, c)
>>>>>> (a, b)
>>>>>> (a)
>>>>>> 
>>>>>> 5) GROUPING SETS
>>>>>> 
>>>>>> Proposed syntax 1:
>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>> 
>>>>>> Proposed syntax 2:
>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>> 
>>>>>> Proposed syntax 3:
>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>> 
>>>>>> Following group-by's will be computed:
>>>>>> (a)
>>>>>> (b, c)
>>>>>> (c)
>>>>>> 
>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>> before I start hacking the grammar file.
>>>>>> 
>>>>>> Thanks
>>>>>> -- Prasanth
>>>>>> 
>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>> 
>>>>>>> Hey Prasanth, happy hacking.
>>>>>>> 
>>>>>>> My opinion:
>>>>>>> 
>>>>>>> CUBE:
>>>>>>> 
>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>> 
>>>>>>> 
>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>> 
>>>>>>> 
>>>>>>> ROLLUP:
>>>>>>> 
>>>>>>> 
>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>> 
>>>>>>> 
>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did
>>>>>>> with CUBE, i.e.
>>>>>>> 
>>>>>>> 
>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>> 
>>>>>>> 
>>>>>>> GROUPING SETS:
>>>>>>> 
>>>>>>> 
>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>> 
>>>>>>> 
>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
>>>>>>> Maybe following the same pattern you could do something like:
>>>>>>> 
>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>> 
>>>>>>> As far as having, is there an optimization that can be done with a
>>>>>>> HAVING clause that can't be done based on the logical plan that comes
>>>>>>> afterwards? That seems odd to me. Since you have to materialize the
>>>>>>> result anyway, can't the having clause just be a FILTER that comes
>>>>>>> after the cube? I don't know why we need a special syntax.
>>>>>>> 
>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>> Jon
>>>>>>> 
>>>>>>> 2012/5/27 Prasanth J <[email protected]>
>>>>>>> 
>>>>>>>> Hello everyone
>>>>>>>> 
>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
>>>>>>>> can share their opinion for operator syntax. Please have a look at
>>>>>>>> the syntax proposal at the link below and let me know your opinion
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> -- Prasanth
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>>
