GROUPING SETS syntax

Jonathan Coveney Thu, 21 Jun 2012 13:42:16 -0700

Just to make sure I understand this correctly, is

out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);


equivalent to:

out1 = CUBE rel BY (a,b,c);
out2 = ROLLUP rel BY (c,d);
out3 = CUBE rel BY (e,f);

out = CROSS out1, out2, out3;

?
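
For reference, in SQL a list of concatenated grouping clauses such as CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f) is combined at the level of grouping sets: each grouping set of one clause is merged with each grouping set of the others. The sketch below only enumerates those grouping sets in Python; it does not model Pig's CROSS on the output relations, and the helper names (cube, rollup, concat_groupings) are illustrative, not part of Pig.

```python
from itertools import combinations, product


def cube(dims):
    """All subsets of dims, largest first: 2**n grouping sets."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def rollup(dims):
    """All prefixes of dims, longest first: n + 1 grouping sets."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def concat_groupings(*clauses):
    """Cross product of the clauses' grouping sets.

    Each combination is merged into a single grouping set; a column that
    appears in more than one clause (here: c) is kept only once.
    """
    result = []
    for combo in product(*clauses):
        merged = []
        for part in combo:
            for col in part:
                if col not in merged:
                    merged.append(col)
        result.append(tuple(merged))
    return result


sets = concat_groupings(
    cube(["a", "b", "c"]),
    rollup(["c", "d"]),
    cube(["e", "f"]),
)
print(len(sets))                          # combinations: 8 * 3 * 4
print(len({frozenset(s) for s in sets}))  # distinct grouping sets
```

Because c occurs in both CUBE(a,b,c) and ROLLUP(c,d), several combinations collapse to the same grouping set, so the number of distinct group-bys is smaller than the number of combinations.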

2012/6/21 Prasanth J <[email protected]>

> Hello all
>
> I initially implemented ROLLUP as a separate operation with the following
> syntax
>
> a = ROLLUP inp BY (x,y);
>
> which does the same thing as CUBE (inserting foreach + group-by in logical
> plan) except that it uses RollupDimensions UDF. But the issue with this
> approach is that we cannot mix CUBE and ROLLUP operations together in the
> same syntax which is a typical case. SQL/Oracle supports using CUBE and
> ROLLUP together like
>
> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> So I modified the Pig grammar to support similar usage. Now we can
> use a syntax similar to SQL
>
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> In this approach, the logical plan should introduce cartesian product
> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for
> generating the final output. But I read in the documentation (
> http://pig.apache.org/docs/r0.10.0/basic.html#cross) that the CROSS operator
> is expensive, and it advises using it sparingly.
>
> Is there any other way to achieve the cartesian product in a less
> expensive way? Also, does anyone have thoughts about this new syntax?
>
> Thanks
> -- Prasanth
>
> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>
> > As far as the underlying implementation, if they all use the same
> > optimizations that you use in cube, then it can be LOCube. If they have
> > their own optimizations etc (or could), it may be worth them having their
> > own Logical operators (which might just be LOCube with flags for the time
> > being) that allow us more flexibility. But I suppose that's between you,
> > eclipse, and your GSOC mentor.
> >
> > 2012/5/30 Prasanth J <[email protected]>
> >
> >> Thanks Alan and Jon for expressing your views.
> >>
> >> I agree with Jon's point: if the syntax contains CUBE, then the user expects it
> >> to perform a CUBE operation. So Jon's syntax seems more meaningful and concise.
> >>
> >> rel = CUBE rel BY (dims);
> >> rel = ROLLUP rel BY (dims);
> >> rel = GROUPING_SET rel BY (dims);
> >>
> >> The 2 reasons why I prefer not to use the SQL syntax are
> >> 1) I do not want to break into the existing Group operator implementation :)
> >> 2) The syntax gets longer in the case of partial hierarchical cubing/rollups
> >> For example:
> >>
> >> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
> >> ROLLUP(dim7,dim8,dim9);
> >>
> >> whereas same thing can be expressed like
> >>
> >> rel = ROLLUP rel BY dim0,
> >> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
> >>
> >> Thanks Alan for pointing out the way for independently managing the
> >> operators in parser and logical/physical plan. So for all these operators
> >> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
> >> differentiate between these three operations.
> >>
> >> But, yes we are proliferating operators in this case.
> >>
> >> Thanks
> >> -- Prasanth
> >>
> >> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
> >>
> >>>
> >>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
> >>>
> >>>> I was going to say the same thing Alan said w.r.t. operators: operators in
> >>>> the grammar can correspond to whatever logical and physical operators you
> >>>> want.
> >>>>
> >>>> As far as the principle of least astonishment compared to SQL... Pig
> is
> >>>> already pretty astonishing. I don't know why we would bend over
> >> backwards
> >>>> to make the syntax so similar in this case when even getting to the
> >> point
> >>>> of doing a CUBE means understanding an object model that is pretty
> >>>> different from SQL.
> >>>>
> >>>> On that note,
> >>>>
> >>>> rel = CUBE rel BY GROUPING SETS(cols);
> >>>>
> >>>> seems really confusing. I'd much rather overload the group operator than
> >>>> the cube operator. If I see "cube," I expect a cube. If you start doing
> >>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
> >>>> Latin is simple enough that I don't think having a rollup, group_set, etc
> >>>> operator will be so confusing, because they're already going to be typing
> >>>> that stuff in the context of
> >>>>
> >>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
> >>>> more, confusing syntax for the sake of creating parallels with a language
> >>>> we now share very little with.
> >>>
> >>> Fair points.
> >>>
> >>>>
> >>>> But I won't beat it any further... if people prefer a different syntax,
> >>>> that's fine. Just excited to have the features in Pig!
> >>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
> >>> Jon's).
> >>>
> >>> Alan.
> >>>
> >>>> Jon
> >>>>
> >>>> 2012/5/30 Alan Gates <[email protected]>
> >>>>
> >>>>> Some thoughts on this:
> >>>>>
> >>>>> 1) +1 to what Dmitriy said on HAVING
> >>>>>
> >>>>> 2) We need to be clear about separating operators in the grammar
> versus
> >>>>> logical plan versus physical plan.  The choices you make in the
> >> grammar are
> >>>>> totally independent of the other two.  That is, you could choose the
> >> syntax:
> >>>>>
> >>>>> rel = GROUP rel BY CUBE (a, b, c)
> >>>>>
> >>>>> and still have a separate POCube operator.  When the parser sees
> GROUP
> >> BY
> >>>>> CUBE it will generate an LOCube operator for the logical plan rather
> >> than
> >>>>> an LOGroup operator.  You can still have a separate POCube physical
> >>>>> operator.  Separate optimizations can be applied to LOGroup vs.
> LOCube
> >> and
> >>>>> POGroup vs. POCube.
> >>>>>
> >>>>> 3) On syntax I can see arguments for keeping as close to SQL as
> >> possible
> >>>>> and for the syntax proposed by Prasanth.  The argument for sticking
> >> close
> >>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't
> be
> >>>>> exactly SQL, as it would end up looking like:
> >>>>>
> >>>>> rel = GROUP rel BY CUBE (cols)
> >>>>> rel = GROUP rel BY ROLLUP (cols)
> >>>>> rel = GROUP rel BY GROUPING SETS(cols);
> >>>>>
> >>>>> The argument I see for sticking with Prasanth's approach is that
> GROUP
> >> is
> >>>>> really short for COGROUP in Pig Latin, and I don't think we're
> >> proposing
> >>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to
> do
> >> such
> >>>>> a thing.  This makes CUBE really a separate operation.  But if we go
> >> this
> >>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE
> >> rel BY
> >>>>> GROUPING SETS.  Let's not proliferate operators.
> >>>>>
> >>>>> Alan.
> >>>>>
> >>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
> >>>>>
> >>>>>> Thanks Jonathan for looking into it and for your suggestions.
> >>>>>>
> >>>>>> The reason why I came up with a clause rather than a separate operator
> >>>>>> was to avoid adding additional operators to the grammar.
> >>>>>> Adding ROLLUP and GROUPING_SET would need separate logical operators,
> >>>>>> adding to the complexity. I am planning to keep everything under the cube
> >>>>>> operator, so only the LOCube and POCube operators will be added.
> >>>>>> And as you and Dmitriy have mentioned, the purpose of a HAVING clause is the
> >>>>>> same as FILTER, so we do not need a separate HAVING clause.
> >>>>>>
> >>>>>> I will give a quick recap of cube-related operations and multiple syntax
> >>>>>> options for achieving the same. I am also adding partial cubing and rollup
> >>>>>> to this discussion.
> >>>>>>
> >>>>>> 1) CUBE
> >>>>>>
> >>>>>> Current syntax:
> >>>>>> alias = CUBE rel BY (a, b);
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a, b)
> >>>>>> (a)
> >>>>>> (b)
> >>>>>> ()
> >>>>>>
> >>>>>> 2) Partial CUBE
> >>>>>>
> >>>>>> Proposed syntax:
> >>>>>> alias = CUBE rel BY a, (b, c);
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a, b, c)
> >>>>>> (a, b)
> >>>>>> (a, c)
> >>>>>> (a)
> >>>>>>
> >>>>>> 3) ROLLUP
> >>>>>>
> >>>>>> Proposed syntax 1:
> >>>>>> alias = CUBE rel BY ROLLUP(a, b);
> >>>>>>
> >>>>>> Proposed syntax 2:
> >>>>>> alias = CUBE rel BY (a::b);
> >>>>>>
> >>>>>> Proposed syntax 3:
> >>>>>> alias = ROLLUP rel BY (a, b);
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a, b)
> >>>>>> (a)
> >>>>>> ()
> >>>>>>
> >>>>>> 4) Partial ROLLUP
> >>>>>>
> >>>>>> Proposed syntax 1:
> >>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
> >>>>>>
> >>>>>> Proposed syntax 2:
> >>>>>> alias = CUBE rel BY (a, b::c);
> >>>>>>
> >>>>>> Proposed syntax 3:
> >>>>>> alias = ROLLUP rel BY a, (b, c);
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a, b, c)
> >>>>>> (a, b)
> >>>>>> (a)
> >>>>>>
> >>>>>> 5) GROUPING SETS
> >>>>>>
> >>>>>> Proposed syntax 1:
> >>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
> >>>>>>
> >>>>>> Proposed syntax 2:
> >>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
> >>>>>>
> >>>>>> Proposed syntax 3:
> >>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
> >>>>>>
> >>>>>> Following group-by's will be computed:
> >>>>>> (a)
> >>>>>> (b, c)
> >>>>>> (c)
> >>>>>>
> >>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
> >>>>>> before I start hacking the grammar file.
> >>>>>>
> >>>>>> Thanks
> >>>>>> -- Prasanth
> >>>>>>
> >>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
> >>>>>>
> >>>>>>> Hey Prasanth, happy hacking.
> >>>>>>>
> >>>>>>> My opinion:
> >>>>>>>
> >>>>>>> CUBE:
> >>>>>>>
> >>>>>>> alias = CUBE rel BY (a,b,c);
> >>>>>>>
> >>>>>>>
> >>>>>>> I like that syntax. It's unambiguous what is going on.
> >>>>>>>
> >>>>>>>
> >>>>>>> ROLLUP:
> >>>>>>>
> >>>>>>>
> >>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
> >>>>>>>
> >>>>>>>
> >>>>>>> I never liked that syntax in SQL. I suggest we just do what we did with
> >>>>>>> CUBE, i.e.
> >>>>>>>
> >>>>>>>
> >>>>>>> alias = ROLLUP rel BY (a,b,c);
> >>>>>>>
> >>>>>>>
> >>>>>>> GROUPING SETS:
> >>>>>>>
> >>>>>>>
> >>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
> >>>>>>>
> >>>>>>>
> >>>>>>> I don't like this. The cube vs. grouping sets is confusing to me. Maybe
> >>>>>>> following the same pattern you could do something like:
> >>>>>>>
> >>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
> >>>>>>>
> >>>>>>> As far as HAVING, is there an optimization that can be done with a HAVING
> >>>>>>> clause that can't be done based on the logical plan that comes afterwards?
> >>>>>>> That seems odd to me. Since you have to materialize the result anyway,
> >>>>>>> can't the HAVING clause just be a FILTER that comes after the cube? I don't
> >>>>>>> know why we need a special syntax.
> >>>>>>>
> >>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
> >>>>>>> Jon
> >>>>>>>
> >>>>>>> 2012/5/27 Prasanth J <[email protected]>
> >>>>>>>
> >>>>>>>> Hello everyone
> >>>>>>>>
> >>>>>>>> I am looking for feedback from the community about the syntax for
> >>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
> >>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone
> >> can
> >>>>>>>> share their opinion for operator syntax. Please have a look at the
> >>>>> syntax
> >>>>>>>> proposal at the link below and let me know your opinion
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> -- Prasanth
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
>
>
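
The group-by lists in Prasanth's recap (items 1 to 4: CUBE, partial CUBE, ROLLUP, partial ROLLUP) can be reproduced with a few lines of Python. This is only a sketch of the enumeration, assuming that a "partial" operation keeps the leading dimensions fixed in every grouping; cube, rollup and partial are hypothetical helper names, not Pig functions or UDFs.

```python
from itertools import combinations


def cube(dims):
    """All subsets of dims, largest first."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def rollup(dims):
    """All prefixes of dims, longest first."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def partial(fixed, sets):
    """Prepend the always-grouped dimensions to every grouping set."""
    return [tuple(fixed) + s for s in sets]


# 1) CUBE rel BY (a, b)
print(cube(["a", "b"]))
# 2) CUBE rel BY a, (b, c)
print(partial(["a"], cube(["b", "c"])))
# 3) ROLLUP rel BY (a, b)
print(rollup(["a", "b"]))
# 4) ROLLUP rel BY a, (b, c)
print(partial(["a"], rollup(["b", "c"])))
```

GROUPING SETS (item 5) needs no enumeration, since the user lists the grouping sets explicitly.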
