GROUPING SETS syntax

Jonathan Coveney Thu, 21 Jun 2012 13:51:15 -0700

IMHO, I don't know that it's worth overloading the operator to support
that. If that's what it is, then they can just do what I said. I would
rather the focus be on getting ROLLUP, CUBE, etc implemented and efficient.


Curious what others think.

2012/6/21 Prasanth J <[email protected]>

> Yeah you are right.
>
> Thanks
> -- Prasanth
>
> On Jun 21, 2012, at 4:41 PM, Jonathan Coveney wrote:
>
> > Just to make sure I understand this correctly, is
> >
> > out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> >
> > equivalent to:
> >
> > out1 = CUBE rel BY (a,b,c);
> > out2 = ROLLUP rel BY (c,d);
> > out3 = CUBY rel BY (e,f);
> >
> > out = CROSS out1, out2, out3;
> >
> > ?
> >
> > 2012/6/21 Prasanth J <[email protected]>
> >
> >> Hello all
> >>
> >> I initially implemented ROLLUP as a separate operation with the
> following
> >> syntax
> >>
> >> a = ROLLUP inp BY (x,y);
> >>
> >> which does the same thing as CUBE (inserting foreach + group-by in
> logical
> >> plan) except that it uses RollupDimensions UDF. But the issue with this
> >> approach is that we cannot mix CUBE and ROLLUP operations together in
> the
> >> same syntax which is a typical case. SQL/Oracle supports using CUBE and
> >> ROLLUP together like
> >>
> >> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> >>
> >> so I modified the pig grammar to support the similar usage. So now we
> can
> >> use a syntax similar to SQL
> >>
> >> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> >>
> >> In this approach, the logical plan should introduce cartesian product
> >> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for
> >> generating the final output. But I read from the documentation (
> >> http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS
> operator
> >> is an expensive operator and advices to use it sparingly.
> >>
> >> Is there any other way to achieve the cartesian product in a less
> >> expensive way? Also, does anyone have thoughts about this new syntax?
> >>
> >> Thanks
> >> -- Prasanth
> >>
> >> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
> >>
> >>> As far as the underlying implementation, if they all use the same
> >>> optimizations that you use in cube, then it can be LOCube. If they have
> >>> their own optimizations etc (or could), it may be worth them having
> their
> >>> own Logical operators (which might just be LOCube with flags for the
> time
> >>> being) that allows us more flexibilty. But I suppose that's between
> you,
> >>> eclipse, and your GSOC mentor.
> >>>
> >>> 2012/5/30 Prasanth J <[email protected]>
> >>>
> >>>> Thanks Alan and Jon for expressing your views.
> >>>>
> >>>> I agree with Jon's point, if the syntax contains CUBE then user
> expects
> >> it
> >>>> to perform CUBE operation. So Jon's syntax seems more meaningful and
> >> concise
> >>>>
> >>>> rel = CUBE rel BY (dims);
> >>>> rel = ROLLUP rel BY (dims);
> >>>> rel = GROUPING_SET rel BY (dims);
> >>>>
> >>>> 2 reasons why I do not prefer using SQL syntax is
> >>>> 1) I do not want to break into existing Group operator implementation
> :)
> >>>> 2) The syntax gets longer in case of partial hierarchical
> cubing/rollups
> >>>> For ex:
> >>>>
> >>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3),
> >> ROLLUP(dim4,dim5,dim6),
> >>>> ROLLUP(dim7,dim8,dim9);
> >>>>
> >>>> whereas same thing can be expressed like
> >>>>
> >>>> rel = ROLLUP rel BY dim0,
> >>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
> >>>>
> >>>> Thanks Alan for pointing out the way for independently managing the
> >>>> operators in parser and logical/physical plan. So for all these
> >> operators
> >>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags
> to
> >>>> differentiate between these three operations.
> >>>>
> >>>> But, yes we are proliferating operators in this case.
> >>>>
> >>>> Thanks
> >>>> -- Prasanth
> >>>>
> >>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
> >>>>
> >>>>>
> >>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
> >>>>>
> >>>>>> I was going to say the same thing Alan said w.r.t. operators:
> >> operators
> >>>> in
> >>>>>> the grammar can correspond to whatever logical and physical
> operators
> >>>> you
> >>>>>> want.
> >>>>>>
> >>>>>> As far as the principle of least astonishment compared to SQL... Pig
> >> is
> >>>>>> already pretty astonishing. I don't know why we would bend over
> >>>> backwards
> >>>>>> to make the syntax so similar in this case when even getting to the
> >>>> point
> >>>>>> of doing a CUBE means understanding an object model that is pretty
> >>>>>> different from SQL.
> >>>>>>
> >>>>>> On that note,
> >>>>>>
> >>>>>> rel = CUBE rel BY GROUPING SETS(cols);
> >>>>>>
> >>>>>> seems really confusing. I'd much rather overload the group operating
> >>>> than
> >>>>>> the cube operator. If I see "cube," I expect a cube. If you start
> >> doing
> >>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup.
> >> Pig
> >>>>>> latin is simple enough that I don't think having a rollup,
> group_set,
> >>>> etc
> >>>>>> operator will be so confusing, because they're already going to be
> >>>> typing
> >>>>>> that stuff in the conext of
> >>>>>>
> >>>>>> group rel by rollup(cols); and so on. I don't see how it's worth
> >> adding
> >>>>>> more, confusing syntax for the sake of creating parallels with a
> >>>> language
> >>>>>> we now share very little with.
> >>>>>
> >>>>> Fair points.
> >>>>>
> >>>>>>
> >>>>>> But I won't beat it any further... if people prefer a different
> >> syntax,
> >>>>>> that's fine. Just excited to have the features in Pig!
> >>>>> +1, I can live with any of the 3 syntax choices (near SQL, original,
> >> and
> >>>> Jon's).
> >>>>>
> >>>>> Alan.
> >>>>>
> >>>>>> Jon
> >>>>>>
> >>>>>> 2012/5/30 Alan Gates <[email protected]>
> >>>>>>
> >>>>>>> Some thoughts on this:
> >>>>>>>
> >>>>>>> 1) +1 to what Dmitriy said on HAVING
> >>>>>>>
> >>>>>>> 2) We need to be clear about separating operators in the grammar
> >> versus
> >>>>>>> logical plan versus physical plan.  The choices you make in the
> >>>> grammar are
> >>>>>>> totally independent of the other two.  That is, you could choose
> the
> >>>> syntax:
> >>>>>>>
> >>>>>>> rel = GROUP rel BY CUBE (a, b, c)
> >>>>>>>
> >>>>>>> and still have a separate POCube operator.  When the parser sees
> >> GROUP
> >>>> BY
> >>>>>>> CUBE it will generate an LOCube operator for the logical plan
> rather
> >>>> than
> >>>>>>> an LOGroup operator.  You can still have a separate POCube physical
> >>>>>>> operator.  Separate optimizations can be applied to LOGroup vs.
> >> LOCube
> >>>> and
> >>>>>>> POGroup vs. POCube.
> >>>>>>>
> >>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as
> >>>> possible
> >>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
> >>>> close
> >>>>>>> to SQL is it conforms to the law of least astonishment.  It
> wouldn't
> >> be
> >>>>>>> exactly SQL, as it would end up looking like:
> >>>>>>>
> >>>>>>> rel = GROUP rel BY CUBE (cols)
> >>>>>>> rel = GROUP rel BY ROLLUP (cols)
> >>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
> >>>>>>>
> >>>>>>> The argument I see for sticking with Prasanth's approach is that
> >> GROUP
> >>>> is
> >>>>>>> really short for COGROUP in Pig Latin, and I don't think we're
> >>>> proposing
> >>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to
> >> do
> >>>> such
> >>>>>>> a thing.  This makes CUBE really a separate operation.  But if we
> go
> >>>> this
> >>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and
> CUBE
> >>>> rel BY
> >>>>>>> GROUPING SETS.  Let's not proliferate operators.
> >>>>>>>
> >>>>>>> Alan.
> >>>>>>>
> >>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
> >>>>>>>
> >>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
> >>>>>>>>
> >>>>>>>> The reason why I came with a clause rather than a separate
> operator
> >>>> was
> >>>>>>> to avoid adding additional operators to the grammar.
> >>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical
> operators
> >>>>>>> adding to the complexity. I am planning to keep everything under
> cube
> >>>>>>> operator, so only LOCube and POCube operators will be added
> >>>> additionally.
> >>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause
> is
> >>>> the
> >>>>>>> same as FILTER so we do not need a separate HAVING clause.
> >>>>>>>>
> >>>>>>>> I will give a quick recap of cube related operations and multiple
> >>>> syntax
> >>>>>>> options for achieving the same. I am also adding partial cubing and
> >>>> rollup
> >>>>>>> in this discussion.
> >>>>>>>>
> >>>>>>>> 1) CUBE
> >>>>>>>>
> >>>>>>>> Current syntax:
> >>>>>>>> alias = CUBE rel BY (a, b);
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a, b)
> >>>>>>>> (a)
> >>>>>>>> (b)
> >>>>>>>> ()
> >>>>>>>>
> >>>>>>>> 2) Partial CUBE
> >>>>>>>>
> >>>>>>>> Proposed syntax:
> >>>>>>>> alias = CUBE rel BY a, (b, c);
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a, b, c)
> >>>>>>>> (a, b)
> >>>>>>>> (a, c)
> >>>>>>>> (a)
> >>>>>>>>
> >>>>>>>> 3) ROLLUP
> >>>>>>>>
> >>>>>>>> Proposed syntax 1:
> >>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
> >>>>>>>>
> >>>>>>>> Proposed syntax 2:
> >>>>>>>> alias = CUBE rel BY (a::b);
> >>>>>>>>
> >>>>>>>> Proposed syntax 3:
> >>>>>>>> alias = ROLLUP rel BY (a, b);
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a, b)
> >>>>>>>> (a)
> >>>>>>>> ()
> >>>>>>>>
> >>>>>>>> 4) Partial ROLLUP
> >>>>>>>>
> >>>>>>>> Proposed syntax 1:
> >>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
> >>>>>>>>
> >>>>>>>> Proposed syntax 2:
> >>>>>>>> alias = CUBE rel BY (a, b::c);
> >>>>>>>>
> >>>>>>>> Proposed syntax 3:
> >>>>>>>> alias = ROLLUP rel BY a, (b, c);
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a, b, c)
> >>>>>>>> (a, b)
> >>>>>>>> (a)
> >>>>>>>>
> >>>>>>>> 5) GROUPING SETS
> >>>>>>>>
> >>>>>>>> Proposed syntax 1:
> >>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
> >>>>>>>>
> >>>>>>>> Proposed syntax 2:
> >>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
> >>>>>>>>
> >>>>>>>> Proposed syntax 3:
> >>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
> >>>>>>>>
> >>>>>>>> Following group-by's will be computed:
> >>>>>>>> (a)
> >>>>>>>> (b, c)
> >>>>>>>> (c)
> >>>>>>>>
> >>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a
> consensus
> >>>>>>> before I start hacking the grammar file.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> -- Prasanth
> >>>>>>>>
> >>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
> >>>>>>>>
> >>>>>>>>> Hey Prashanth, happy hacking.
> >>>>>>>>>
> >>>>>>>>> My opinion:
> >>>>>>>>>
> >>>>>>>>> CUBE:
> >>>>>>>>>
> >>>>>>>>> alias = CUBE rel BY (a,b,c);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I like that syntax. It's unambiguous what is going on.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ROLLUP:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we
> did
> >>>> with
> >>>>>>> CUBE. IE
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> alias = ROLLUP rel BY (a,b,c);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> GROUPING SETS:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me.
> >>>> maybe
> >>>>>>>>> following the
> >>>>>>>>> same pattern you could do something like:
> >>>>>>>>>
> >>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
> >>>>>>>>>
> >>>>>>>>> As far as having, is there an optimization that can be done with
> a
> >>>>>>> HAVING
> >>>>>>>>> clause that can't be done based on the logical plan that comes
> >>>>>>> afterwards?
> >>>>>>>>> That seems odd to me. Since you have to materialize the result
> >>>> anyway,
> >>>>>>>>> can't the having clause just be a FILTER that comes after the
> >> cube? I
> >>>>>>> don't
> >>>>>>>>> know why we need a special syntax.
> >>>>>>>>>
> >>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
> >>>>>>>>> Jon
> >>>>>>>>>
> >>>>>>>>> 2012/5/27 Prasanth J <[email protected]>
> >>>>>>>>>
> >>>>>>>>>> Hello everyone
> >>>>>>>>>>
> >>>>>>>>>> I am looking for feedback from the community about the syntax
> for
> >>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
> >>>>>>>>>> I am moving the discussion from JIRA to dev-list so that
> everyone
> >>>> can
> >>>>>>>>>> share their opinion for operator syntax. Please have a look at
> the
> >>>>>>> syntax
> >>>>>>>>>> proposal at the link below and let me know your opinion
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> -- Prasanth
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: CUBE/ROLLUP/GROUPING SETS syntax

Reply via email to