GROUPING SETS syntax

Dmitriy Ryaboy Fri, 22 Jun 2012 13:15:05 -0700

One happens on the mapper.


On Thu, Jun 21, 2012 at 2:52 PM, Prasanth J <[email protected]> wrote:
> Thanks Alan.
> Your suggestion looks correct.
>
> I think with this I can achieve what I wanted in the same syntax
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> Just curious to know.
> How is this different from CROSS? and why is CROSS expensive when compared to 
> flatten?
>
> Thanks
> -- Prasanth
>
> On Jun 21, 2012, at 5:11 PM, Alan Gates wrote:
>
>> I think I'm missing something here.  The result of the "out =" line is three 
>> bags, correct?  If that's the case, the cross product you want is achieved 
>> by doing:
>>
>> result = foreach out generate flatten($0), flatten($1), flatten($2)
>>
>> This is not the same as CROSS, which would be expensive.
>>
>> Alan.
>>
>> On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:
>>
>>> Hello all
>>>
>>> I initially implemented ROLLUP as a separate operation with the following 
>>> syntax
>>>
>>> a = ROLLUP inp BY (x,y);
>>>
>>> which does the same thing as CUBE (inserting foreach + group-by in logical 
>>> plan) except that it uses RollupDimensions UDF. But the issue with this 
>>> approach is that we cannot mix CUBE and ROLLUP operations together in the 
>>> same syntax which is a typical case. SQL/Oracle supports using CUBE and 
>>> ROLLUP together like
>>>
>>> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>>
>>> so I modified the pig grammar to support the similar usage. So now we can 
>>> use a syntax similar to SQL
>>>
>>> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>>
>>> In this approach, the logical plan should introduce cartesian product 
>>> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for 
>>> generating the final output. But I read from the documentation 
>>> (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator 
>>> is an expensive operator and advices to use it sparingly.
>>>
>>> Is there any other way to achieve the cartesian product in a less expensive 
>>> way? Also, does anyone have thoughts about this new syntax?
>>>
>>> Thanks
>>> -- Prasanth
>>>
>>> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>>>
>>>> As far as the underlying implementation, if they all use the same
>>>> optimizations that you use in cube, then it can be LOCube. If they have
>>>> their own optimizations etc (or could), it may be worth them having their
>>>> own Logical operators (which might just be LOCube with flags for the time
>>>> being) that allows us more flexibilty. But I suppose that's between you,
>>>> eclipse, and your GSOC mentor.
>>>>
>>>> 2012/5/30 Prasanth J <[email protected]>
>>>>
>>>>> Thanks Alan and Jon for expressing your views.
>>>>>
>>>>> I agree with Jon's point, if the syntax contains CUBE then user expects it
>>>>> to perform CUBE operation. So Jon's syntax seems more meaningful and 
>>>>> concise
>>>>>
>>>>> rel = CUBE rel BY (dims);
>>>>> rel = ROLLUP rel BY (dims);
>>>>> rel = GROUPING_SET rel BY (dims);
>>>>>
>>>>> 2 reasons why I do not prefer using SQL syntax is
>>>>> 1) I do not want to break into existing Group operator implementation :)
>>>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>>>> For ex:
>>>>>
>>>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>>>> ROLLUP(dim7,dim8,dim9);
>>>>>
>>>>> whereas same thing can be expressed like
>>>>>
>>>>> rel = ROLLUP rel BY dim0,
>>>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>>>
>>>>> Thanks Alan for pointing out the way for independently managing the
>>>>> operators in parser and logical/physical plan. So for all these operators
>>>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>>>> differentiate between these three operations.
>>>>>
>>>>> But, yes we are proliferating operators in this case.
>>>>>
>>>>> Thanks
>>>>> -- Prasanth
>>>>>
>>>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>>>
>>>>>>
>>>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>>>
>>>>>>> I was going to say the same thing Alan said w.r.t. operators: operators
>>>>> in
>>>>>>> the grammar can correspond to whatever logical and physical operators
>>>>> you
>>>>>>> want.
>>>>>>>
>>>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>>>> already pretty astonishing. I don't know why we would bend over
>>>>> backwards
>>>>>>> to make the syntax so similar in this case when even getting to the
>>>>> point
>>>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>>>> different from SQL.
>>>>>>>
>>>>>>> On that note,
>>>>>>>
>>>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>>>
>>>>>>> seems really confusing. I'd much rather overload the group operating
>>>>> than
>>>>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>>>>> latin is simple enough that I don't think having a rollup, group_set,
>>>>> etc
>>>>>>> operator will be so confusing, because they're already going to be
>>>>> typing
>>>>>>> that stuff in the conext of
>>>>>>>
>>>>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>>>>> more, confusing syntax for the sake of creating parallels with a
>>>>> language
>>>>>>> we now share very little with.
>>>>>>
>>>>>> Fair points.
>>>>>>
>>>>>>>
>>>>>>> But I won't beat it any further... if people prefer a different syntax,
>>>>>>> that's fine. Just excited to have the features in Pig!
>>>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>>>> Jon's).
>>>>>>
>>>>>> Alan.
>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> 2012/5/30 Alan Gates <[email protected]>
>>>>>>>
>>>>>>>> Some thoughts on this:
>>>>>>>>
>>>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>>>>
>>>>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>>>>> logical plan versus physical plan.  The choices you make in the
>>>>> grammar are
>>>>>>>> totally independent of the other two.  That is, you could choose the
>>>>> syntax:
>>>>>>>>
>>>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>>>>
>>>>>>>> and still have a separate POCube operator.  When the parser sees GROUP
>>>>> BY
>>>>>>>> CUBE it will generate an LOCube operator for the logical plan rather
>>>>> than
>>>>>>>> an LOGroup operator.  You can still have a separate POCube physical
>>>>>>>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube
>>>>> and
>>>>>>>> POGroup vs. POCube.
>>>>>>>>
>>>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
>>>>> possible
>>>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
>>>>> close
>>>>>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>>>>>>>> exactly SQL, as it would end up looking like:
>>>>>>>>
>>>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>>>>
>>>>>>>> The argument I see for sticking with Prasanth's approach is that GROUP
>>>>> is
>>>>>>>> really short for COGROUP in Pig Latin, and I don't think we're
>>>>> proposing
>>>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do
>>>>> such
>>>>>>>> a thing.  This makes CUBE really a separate operation.  But if we go
>>>>> this
>>>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE
>>>>> rel BY
>>>>>>>> GROUPING SETS.  Let's not proliferate operators.
>>>>>>>>
>>>>>>>> Alan.
>>>>>>>>
>>>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>>>>
>>>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>>>>
>>>>>>>>> The reason why I came with a clause rather than a separate operator
>>>>> was
>>>>>>>> to avoid adding additional operators to the grammar.
>>>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>>>>> operator, so only LOCube and POCube operators will be added
>>>>> additionally.
>>>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is
>>>>> the
>>>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>>>>
>>>>>>>>> I will give a quick recap of cube related operations and multiple
>>>>> syntax
>>>>>>>> options for achieving the same. I am also adding partial cubing and
>>>>> rollup
>>>>>>>> in this discussion.
>>>>>>>>>
>>>>>>>>> 1) CUBE
>>>>>>>>>
>>>>>>>>> Current syntax:
>>>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>> (b)
>>>>>>>>> ()
>>>>>>>>>
>>>>>>>>> 2) Partial CUBE
>>>>>>>>>
>>>>>>>>> Proposed syntax:
>>>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b, c)
>>>>>>>>> (a, b)
>>>>>>>>> (a, c)
>>>>>>>>> (a)
>>>>>>>>>
>>>>>>>>> 3) ROLLUP
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>> ()
>>>>>>>>>
>>>>>>>>> 4) Partial ROLLUP
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b, c)
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>>
>>>>>>>>> 5) GROUPING SETS
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a)
>>>>>>>>> (b, c)
>>>>>>>>> (c)
>>>>>>>>>
>>>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>>>> before I start hacking the grammar file.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -- Prasanth
>>>>>>>>>
>>>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>>>>
>>>>>>>>>> Hey Prashanth, happy hacking.
>>>>>>>>>>
>>>>>>>>>> My opinion:
>>>>>>>>>>
>>>>>>>>>> CUBE:
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ROLLUP:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did
>>>>> with
>>>>>>>> CUBE. IE
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> GROUPING SETS:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
>>>>> maybe
>>>>>>>>>> following the
>>>>>>>>>> same pattern you could do something like:
>>>>>>>>>>
>>>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>>>>
>>>>>>>>>> As far as having, is there an optimization that can be done with a
>>>>>>>> HAVING
>>>>>>>>>> clause that can't be done based on the logical plan that comes
>>>>>>>> afterwards?
>>>>>>>>>> That seems odd to me. Since you have to materialize the result
>>>>> anyway,
>>>>>>>>>> can't the having clause just be a FILTER that comes after the cube? I
>>>>>>>> don't
>>>>>>>>>> know why we need a special syntax.
>>>>>>>>>>
>>>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> 2012/5/27 Prasanth J <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> Hello everyone
>>>>>>>>>>>
>>>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
>>>>> can
>>>>>>>>>>> share their opinion for operator syntax. Please have a look at the
>>>>>>>> syntax
>>>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> -- Prasanth
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>
>

Re: CUBE/ROLLUP/GROUPING SETS syntax

Reply via email to