GROUPING SETS syntax

Alan Gates Thu, 21 Jun 2012 14:11:35 -0700

I think I'm missing something here.  The result of the "out =" line is three 
bags, correct?  If that's the case, the cross product you want is achieved by 
doing:


result = foreach out generate flatten($0), flatten($1), flatten($2);

This is not the same as CROSS, which would be expensive.
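
To make the flatten behaviour concrete, here is a minimal Python sketch (not Pig code) of what flattening several bags in one generate does to a single record. It assumes each record holds three bags of tuples; the function name and the sample data are made up for illustration.

```python
from itertools import product


def flatten_bags(record):
    """Mimic `generate flatten($0), flatten($1), flatten($2)` for one record.

    `record` is a sequence of bags, and each bag is a list of tuples.
    Flattening several bags in the same generate emits one output row
    per combination of elements, i.e. a cross product that stays inside
    the record. An empty bag therefore produces no rows at all.
    """
    return [sum(combo, ()) for combo in product(*record)]


# Hypothetical record: three bags with 2, 1 and 2 tuples.
example = (
    [("a1",), ("a2",)],
    [("c1",)],
    [("e1",), ("e2",)],
)

for row in flatten_bags(example):
    print(row)
```

The example record yields 2 x 1 x 2 = 4 rows. The product is taken per record, over bags that are already in memory together, which is why it avoids the cost of crossing whole relations with CROSS.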

Alan.

On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:

> Hello all 
> 
> I initially implemented ROLLUP as a separate operation with the following 
> syntax
> 
> a = ROLLUP inp BY (x,y);
> 
> which does the same thing as CUBE (inserting foreach + group-by in logical 
> plan) except that it uses RollupDimensions UDF. But the issue with this 
> approach is that we cannot mix CUBE and ROLLUP operations together in the 
> same syntax which is a typical case. SQL/Oracle supports using CUBE and 
> ROLLUP together like
> 
> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> 
> So I modified the Pig grammar to support similar usage. Now we can use 
> a syntax similar to SQL:
> 
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> 
> In this approach, the logical plan should introduce cartesian product between 
> bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the 
> final output. But the documentation 
> (http://pig.apache.org/docs/r0.10.0/basic.html#cross) says that CROSS is 
> an expensive operator and advises using it sparingly. 
> 
> Is there any other way to achieve the cartesian product in a less expensive 
> way? Also, does anyone have thoughts about this new syntax? 
> 
> Thanks
> -- Prasanth
> 
> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
> 
>> As far as the underlying implementation, if they all use the same
>> optimizations that you use in cube, then it can be LOCube. If they have
>> their own optimizations etc (or could), it may be worth them having their
>> own Logical operators (which might just be LOCube with flags for the time
>> being) that allows us more flexibility. But I suppose that's between you,
>> eclipse, and your GSOC mentor.
>> 
>> 2012/5/30 Prasanth J
>> 
>>> Thanks Alan and Jon for expressing your views.
>>> 
>>> I agree with Jon's point, if the syntax contains CUBE then user expects it
>>> to perform CUBE operation. So Jon's syntax seems more meaningful and concise
>>> 
>>> rel = CUBE rel BY (dims);
>>> rel = ROLLUP rel BY (dims);
>>> rel = GROUPING_SET rel BY (dims);
>>> 
>>> The two reasons I prefer not to use the SQL syntax are:
>>> 1) I do not want to break into the existing Group operator implementation :)
>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups.
>>> For example:
>>> 
>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>> ROLLUP(dim7,dim8,dim9);
>>> 
>>> whereas the same thing can be expressed as
>>> 
>>> rel = ROLLUP rel BY dim0,
>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>> 
>>> Thanks Alan for pointing out the way for independently managing the
>>> operators in parser and logical/physical plan. So for all these operators
>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>> differentiate between these three operations.
>>> 
>>> But, yes we are proliferating operators in this case.
>>> 
>>> Thanks
>>> -- Prasanth
>>> 
>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>> 
>>>> 
>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>> 
>>>>> I was going to say the same thing Alan said w.r.t. operators: operators in
>>>>> the grammar can correspond to whatever logical and physical operators you
>>>>> want.
>>>>> 
>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>> already pretty astonishing. I don't know why we would bend over backwards
>>>>> to make the syntax so similar in this case when even getting to the point
>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>> different from SQL.
>>>>> 
>>>>> On that note,
>>>>> 
>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>> 
>>>>> seems really confusing. I'd much rather overload the group operator than
>>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>>> Latin is simple enough that I don't think having a rollup, group_set, etc
>>>>> operator will be so confusing, because they're already going to be typing
>>>>> that stuff in the context of
>>>>> 
>>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>>> more, confusing syntax for the sake of creating parallels with a language
>>>>> we now share very little with.
>>>> 
>>>> Fair points.
>>>> 
>>>>> 
>>>>> But I won't beat it any further... if people prefer a different syntax,
>>>>> that's fine. Just excited to have the features in Pig!
>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>>> Jon's).
>>>> 
>>>> Alan.
>>>> 
>>>>> Jon
>>>>> 
>>>>> 2012/5/30 Alan Gates
>>>>> 
>>>>>> Some thoughts on this:
>>>>>> 
>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>> 
>>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>>> logical plan versus physical plan.  The choices you make in the grammar
>>>>>> are totally independent of the other two.  That is, you could choose the
>>>>>> syntax:
>>>>>> 
>>>>>> rel = GROUP rel BY CUBE (a, b, c);
>>>>>> 
>>>>>> and still have a separate POCube operator.  When the parser sees GROUP BY
>>>>>> CUBE it will generate an LOCube operator for the logical plan rather than
>>>>>> an LOGroup operator.  You can still have a separate POCube physical
>>>>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube and
>>>>>> POGroup vs. POCube.
>>>>>> 
>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking close
>>>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>>>>>> exactly SQL, as it would end up looking like:
>>>>>> 
>>>>>> rel = GROUP rel BY CUBE (cols);
>>>>>> rel = GROUP rel BY ROLLUP (cols);
>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>> 
>>>>>> The argument I see for sticking with Prasanth's approach is that GROUP is
>>>>>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>>>>>> a thing.  This makes CUBE really a separate operation.  But if we go this
>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>>>>>> GROUPING SETS.  Let's not proliferate operators.
>>>>>> 
>>>>>> Alan.
>>>>>> 
>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>> 
>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>> 
>>>>>>> The reason why I came up with a clause rather than a separate operator
>>>>>>> was to avoid adding additional operators to the grammar.
>>>>>>> Adding ROLLUP and GROUPING_SET would need separate logical operators,
>>>>>>> adding to the complexity. I am planning to keep everything under the cube
>>>>>>> operator, so only the LOCube and POCube operators will be added.
>>>>>>> And as you and Dmitriy have mentioned, the purpose of the HAVING clause is
>>>>>>> the same as FILTER, so we do not need a separate HAVING clause.
>>>>>>> 
>>>>>>> I will give a quick recap of cube related operations and multiple syntax
>>>>>>> options for achieving the same. I am also adding partial cubing and rollup
>>>>>>> in this discussion.
>>>>>>> 
>>>>>>> 1) CUBE
>>>>>>> 
>>>>>>> Current syntax:
>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>> (b)
>>>>>>> ()
>>>>>>> 
>>>>>>> 2) Partial CUBE
>>>>>>> 
>>>>>>> Proposed syntax:
>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b, c)
>>>>>>> (a, b)
>>>>>>> (a, c)
>>>>>>> (a)
>>>>>>> 
>>>>>>> 3) ROLLUP
>>>>>>> 
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>> 
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>> 
>>>>>>> Proposed syntax 3:
>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>> ()
>>>>>>> 
>>>>>>> 4) Partial ROLLUP
>>>>>>> 
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>> 
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>> 
>>>>>>> Proposed syntax 3:
>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b, c)
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>> 
>>>>>>> 5) GROUPING SETS
>>>>>>> 
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c));
>>>>>>> 
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)};
>>>>>>> 
>>>>>>> Proposed syntax 3:
>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c));
>>>>>>> 
>>>>>>> Following group-by's will be computed:
>>>>>>> (a)
>>>>>>> (b, c)
>>>>>>> (c)
>>>>>>> 
>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>>> before I start hacking the grammar file.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> -- Prasanth
>>>>>>> 
>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>> 
>>>>>>>> Hey Prasanth, happy hacking.
>>>>>>>> 
>>>>>>>> My opinion:
>>>>>>>> 
>>>>>>>> CUBE:
>>>>>>>> 
>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ROLLUP:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did with
>>>>>>>> CUBE, i.e.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>> 
>>>>>>>> 
>>>>>>>> GROUPING SETS:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me. Maybe
>>>>>>>> following the same pattern you could do something like:
>>>>>>>> 
>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>> 
>>>>>>>> As far as having, is there an optimization that can be done with a HAVING
>>>>>>>> clause that can't be done based on the logical plan that comes afterwards?
>>>>>>>> That seems odd to me. Since you have to materialize the result anyway,
>>>>>>>> can't the having clause just be a FILTER that comes after the cube? I
>>>>>>>> don't know why we need a special syntax.
>>>>>>>> 
>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>> Jon
>>>>>>>> 
>>>>>>>> 2012/5/27 Prasanth J
>>>>>>>> 
>>>>>>>>> Hello everyone
>>>>>>>>> 
>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone can
>>>>>>>>> share their opinion for operator syntax. Please have a look at the
>>>>>>>>> syntax proposal at the link below and let me know your opinion
>>>>>>>>> 
>>>>>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> -- Prasanth
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>
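
The group-by lists in the quoted recap (CUBE, ROLLUP and their partial forms) follow two simple rules: a cube takes every subset of its dimensions, and a rollup takes every prefix. The Python sketch below enumerates them. It only illustrates the semantics discussed in the thread and is not how Pig implements them; the function names are invented for this example.

```python
from itertools import combinations


def cube_sets(dims):
    """Every subset of dims, largest first: 2**n grouping sets."""
    n = len(dims)
    return [c for k in range(n, -1, -1) for c in combinations(dims, k)]


def rollup_sets(dims):
    """Every prefix of dims, from the full list down to (): n + 1 sets."""
    return [tuple(dims[:k]) for k in range(len(dims), -1, -1)]


def partial(fixed, sets):
    """Partial cube/rollup: the fixed dimensions appear in every set."""
    return [tuple(fixed) + s for s in sets]


print(cube_sets(("a", "b")))                      # CUBE rel BY (a, b)
print(rollup_sets(("a", "b")))                    # rollup of (a, b)
print(partial(("a",), cube_sets(("b", "c"))))     # partial cube: a, (b, c)
print(partial(("a",), rollup_sets(("b", "c"))))   # partial rollup: a, (b, c)
```

Each call prints the grouping sets in the same order as the corresponding list in the recap, with () standing for the grand total.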

Re: CUBE/ROLLUP/GROUPING SETS syntax