GROUPING SETS syntax

Prasanth J Wed, 30 May 2012 17:03:33 -0700

Thanks Alan and Jon for expressing your views. 

I agree with Jon's point: if the syntax contains CUBE, the user expects it to 
perform a CUBE operation. So Jon's syntax seems more meaningful and concise:


rel = CUBE rel BY (dims);
rel = ROLLUP rel BY (dims);
rel = GROUPING_SET rel BY (dims);

Two reasons why I prefer not to use the SQL syntax:
1) I do not want to break into the existing Group operator implementation :)
2) The syntax gets longer in the case of partial hierarchical cubing/rollups.
For example:

rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6), 
ROLLUP(dim7,dim8,dim9);

whereas the same thing can be expressed as

rel = ROLLUP rel BY dim0, (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
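
To make the size of that expansion concrete, here is a rough sketch in plain Python (not Pig code) of the group-bys such a partial rollup would produce, assuming each parenthesized group is rolled up independently and dim0 appears in every grouping. The function names are made up for illustration only.

```python
from itertools import product

def rollup_prefixes(dims):
    """Return the prefixes of dims, longest first, ending with the empty tuple."""
    return [tuple(dims[:n]) for n in range(len(dims), -1, -1)]

def partial_rollup_sets(fixed, *groups):
    """Combine the always-present dims with one prefix from each rolled-up group."""
    combos = product(*(rollup_prefixes(g) for g in groups))
    return [tuple(fixed) + sum(combo, ()) for combo in combos]

sets = partial_rollup_sets(
    ["dim0"],
    ["dim1", "dim2", "dim3"],
    ["dim4", "dim5", "dim6"],
    ["dim7", "dim8", "dim9"],
)
print(len(sets))   # 4 prefixes per group, so 4 * 4 * 4 grouping sets
print(sets[0])     # the finest grouping: all ten dims
print(sets[-1])    # the coarsest grouping: only dim0
```

Under the same assumption, `partial_rollup_sets(["a"], ["b", "c"])` gives (a, b, c), (a, b), (a), which matches the partial ROLLUP case in my earlier recap.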

Thanks Alan for pointing out the way for independently managing the operators 
in parser and logical/physical plan. So for all these operators (CUBE, ROLLUP, 
GROUPING_SET) I can just generate LOCube and use flags to differentiate between 
these three operations.
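
A rough sketch of that idea, in Python rather than Pig's Java internals: a single operator object carries a mode flag, and the flag decides which group-bys get computed. The names `CubeMode` and `CubeOp` are hypothetical and are not the actual LOCube API; this only illustrates the flag approach.

```python
from enum import Enum
from itertools import combinations

class CubeMode(Enum):
    CUBE = "cube"
    ROLLUP = "rollup"
    GROUPING_SET = "grouping_set"

class CubeOp:
    """Hypothetical stand-in for one logical operator shared by all three syntaxes."""

    def __init__(self, mode, dims):
        self.mode = mode
        self.dims = dims

    def grouping_sets(self):
        dims = self.dims
        if self.mode is CubeMode.CUBE:
            # every subset of dims, largest first
            return [c for n in range(len(dims), -1, -1)
                    for c in combinations(dims, n)]
        if self.mode is CubeMode.ROLLUP:
            # every prefix of dims, longest first
            return [tuple(dims[:n]) for n in range(len(dims), -1, -1)]
        # GROUPING_SET: dims is already the explicit list of group-bys
        return [tuple(g) for g in dims]

print(CubeOp(CubeMode.CUBE, ["a", "b"]).grouping_sets())
print(CubeOp(CubeMode.ROLLUP, ["a", "b"]).grouping_sets())
```

The parser would pick the mode from the keyword it sees (CUBE, ROLLUP or GROUPING_SET), so the grammar can grow without adding a new logical operator per keyword.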

But yes, we are proliferating operators in this case. 

Thanks
-- Prasanth

On May 30, 2012, at 4:42 PM, Alan Gates wrote:

> 
> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
> 
>> I was going to say the same thing Alan said w.r.t. operators: operators in
>> the grammar can correspond to whatever logical and physical operators you
>> want.
>> 
>> As far as the principle of least astonishment compared to SQL... Pig is
>> already pretty astonishing. I don't know why we would bend over backwards
>> to make the syntax so similar in this case when even getting to the point
>> of doing a CUBE means understanding an object model that is pretty
>> different from SQL.
>> 
>> On that note,
>> 
>> rel = CUBE rel BY GROUPING SETS(cols);
>> 
>> seems really confusing. I'd much rather overload the group operator than
>> the cube operator. If I see "cube," I expect a cube. If you start doing
>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>> latin is simple enough that I don't think having a rollup, group_set, etc
>> operator will be so confusing, because they're already going to be typing
>> that stuff in the context of
>> 
>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>> more, confusing syntax for the sake of creating parallels with a language
>> we now share very little with.
> 
> Fair points.
> 
>> 
>> But I won't beat it any further... if people prefer a different syntax,
>> that's fine. Just excited to have the features in Pig!
> +1, I can live with any of the 3 syntax choices (near SQL, original, and 
> Jon's).
> 
> Alan.
> 
>> Jon
>> 
>> 2012/5/30 Alan Gates <[email protected]>
>> 
>>> Some thoughts on this:
>>> 
>>> 1) +1 to what Dmitriy said on HAVING
>>> 
>>> 2) We need to be clear about separating operators in the grammar versus
>>> logical plan versus physical plan.  The choices you make in the grammar are
>>> totally independent of the other two.  That is, you could choose the syntax:
>>> 
>>> rel = GROUP rel BY CUBE (a, b, c)
>>> 
>>> and still have a separate POCube operator.  When the parser sees GROUP BY
>>> CUBE it will generate an LOCube operator for the logical plan rather than
>>> an LOGroup operator.  You can still have a separate POCube physical
>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube and
>>> POGroup vs. POCube.
>>> 
>>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>>> and for the syntax proposed by Prasanth.  The argument for sticking close
>>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>>> exactly SQL, as it would end up looking like:
>>> 
>>> rel = GROUP rel BY CUBE (cols)
>>> rel = GROUP rel BY ROLLUP (cols)
>>> rel = GROUP rel BY GROUPING SETS(cols);
>>> 
>>> The argument I see for sticking with Prasanth's approach is that GROUP is
>>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>>> a thing.  This makes CUBE really a separate operation.  But if we go this
>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>>> GROUPING SETS.  Let's not proliferate operators.
>>> 
>>> Alan.
>>> 
>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>> 
>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>> 
>>>> The reason why I came up with a clause rather than a separate operator was
>>>> to avoid adding additional operators to the grammar.
>>>> Adding ROLLUP and GROUPING_SET would need separate logical operators,
>>>> adding to the complexity. I am planning to keep everything under the cube
>>>> operator, so only the LOCube and POCube operators will be added.
>>>> And as you and Dmitriy have mentioned, the purpose of a HAVING clause is the
>>>> same as FILTER, so we do not need a separate HAVING clause.
>>>> 
>>>> I will give a quick recap of cube-related operations and multiple syntax
>>>> options for achieving the same. I am also adding partial cubing and rollup
>>>> to this discussion.
>>>> 
>>>> 1) CUBE
>>>> 
>>>> Current syntax:
>>>> alias = CUBE rel BY (a, b);
>>>> 
>>>> Following group-by's will be computed:
>>>> (a, b)
>>>> (a)
>>>> (b)
>>>> ()
>>>> 
>>>> 2) Partial CUBE
>>>> 
>>>> Proposed syntax:
>>>> alias = CUBE rel BY a, (b, c);
>>>> 
>>>> Following group-by's will be computed:
>>>> (a, b, c)
>>>> (a, b)
>>>> (a, c)
>>>> (a)
>>>> 
>>>> 3) ROLLUP
>>>> 
>>>> Proposed syntax 1:
>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>> 
>>>> Proposed syntax 2:
>>>> alias = CUBE rel BY (a::b);
>>>> 
>>>> Proposed syntax 3:
>>>> alias = ROLLUP rel BY (a, b);
>>>> 
>>>> Following group-by's will be computed:
>>>> (a, b)
>>>> (a)
>>>> ()
>>>> 
>>>> 4) Partial ROLLUP
>>>> 
>>>> Proposed syntax 1:
>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>> 
>>>> Proposed syntax 2:
>>>> alias = CUBE rel BY (a, b::c);
>>>> 
>>>> Proposed syntax 3:
>>>> alias = ROLLUP rel BY a, (b, c);
>>>> 
>>>> Following group-by's will be computed:
>>>> (a, b, c)
>>>> (a, b)
>>>> (a)
>>>> 
>>>> 5) GROUPING SETS
>>>> 
>>>> Proposed syntax 1:
>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>> 
>>>> Proposed syntax 2:
>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>> 
>>>> Proposed syntax 3:
>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>> 
>>>> Following group-by's will be computed:
>>>> (a)
>>>> (b, c)
>>>> (c)
>>>> 
>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>> before I start hacking the grammar file.
>>>> 
>>>> Thanks
>>>> -- Prasanth
>>>> 
>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>> 
>>>>> Hey Prasanth, happy hacking.
>>>>> 
>>>>> My opinion:
>>>>> 
>>>>> CUBE:
>>>>> 
>>>>> alias = CUBE rel BY (a,b,c);
>>>>> 
>>>>> 
>>>>> I like that syntax. It's unambiguous what is going on.
>>>>> 
>>>>> 
>>>>> ROLLUP:
>>>>> 
>>>>> 
>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>> 
>>>>> 
>>>>> I never liked that syntax in SQL. I suggest we just do what we did with
>>>>> CUBE, i.e.:
>>>>> 
>>>>> 
>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>> 
>>>>> 
>>>>> GROUPING SETS:
>>>>> 
>>>>> 
>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>> 
>>>>> 
>>>>> I don't like this. The cube vs. grouping sets distinction is confusing to
>>>>> me. Maybe following the same pattern you could do something like:
>>>>> 
>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>> 
>>>>> As far as HAVING goes, is there an optimization that can be done with a
>>>>> HAVING clause that can't be done based on the logical plan that comes
>>>>> afterwards? That seems odd to me. Since you have to materialize the result
>>>>> anyway, can't the HAVING clause just be a FILTER that comes after the cube?
>>>>> I don't know why we need a special syntax.
>>>>> 
>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>> Jon
>>>>> 
>>>>> 2012/5/27 Prasanth J <[email protected]>
>>>>> 
>>>>>> Hello everyone
>>>>>> 
>>>>>> I am looking for feedback from the community about the syntax for
>>>>>> CUBE/ROLLUP/GROUPING SETS operations in Pig.
>>>>>> I am moving the discussion from JIRA to the dev list so that everyone can
>>>>>> share their opinion on the operator syntax. Please have a look at the
>>>>>> syntax proposal at the link below and let me know your opinion.
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>> 
>>>>>> Thanks
>>>>>> -- Prasanth
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>

Re: CUBE/ROLLUP/GROUPING SETS syntax
