GROUPING SETS syntax

Jonathan Coveney Wed, 30 May 2012 10:43:28 -0700

I was going to say the same thing Alan said w.r.t. operators: operators in
the grammar can correspond to whatever logical and physical operators you
want.
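
To make that separation concrete, here is a toy sketch in Python. It is not Pig's parser or planner: the lookup tables and the `plan_for` function are hypothetical, and the operator names are simply the ones Alan uses in his message below. It only shows that two different surface syntaxes can land on the same logical and physical operators.

```python
# Toy illustration only, not Pig's actual parser or planner.
# Operator names are taken from this thread; the tables and the
# plan_for() helper are hypothetical.
LOGICAL_OP = {
    ("GROUP", None): "LOGroup",    # rel = GROUP rel BY (a, b, c)
    ("GROUP", "CUBE"): "LOCube",   # rel = GROUP rel BY CUBE (a, b, c)
    ("CUBE", None): "LOCube",      # rel = CUBE rel BY (a, b, c)
}
PHYSICAL_OP = {"LOGroup": "POGroup", "LOCube": "POCube"}


def plan_for(keyword, by_clause=None):
    """Return the (logical, physical) operator names for a surface syntax."""
    key = (keyword.upper(), by_clause.upper() if by_clause else None)
    logical = LOGICAL_OP[key]
    return logical, PHYSICAL_OP[logical]
```

Under these assumptions, `plan_for("GROUP", "CUBE")` and `plan_for("CUBE")` give the same pair, so the choice of keyword is a readability question and not a planning one.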


As far as the principle of least astonishment compared to SQL... Pig is
already pretty astonishing. I don't know why we would bend over backwards
to make the syntax so similar in this case when even getting to the point
of doing a CUBE means understanding an object model that is pretty
different from SQL.

On that note,

rel = CUBE rel BY GROUPING SETS(cols);

seems really confusing. I'd much rather overload the group operator than
the cube operator. If I see "cube," I expect a cube. If you start doing
rollups etc., that's not a cube, it's a group. Or it's just a rollup. Pig
Latin is simple enough that I don't think having a rollup, group_set, etc.
operator will be so confusing, because they're already going to be typing
that stuff in the context of

group rel by rollup(cols); and so on. I don't see how it's worth adding
more, confusing syntax for the sake of creating parallels with a language
we now share very little with.

But I won't beat it any further... if people prefer a different syntax,
that's fine. Just excited to have the features in Pig!
Jon

2012/5/30 Alan Gates <[email protected]>

> Some thoughts on this:
>
> 1) +1 to what Dmitriy said on HAVING
>
> 2) We need to be clear about separating operators in the grammar versus
> logical plan versus physical plan.  The choices you make in the grammar are
> totally independent of the other two.  That is, you could choose the syntax:
>
> rel = GROUP rel BY CUBE (a, b, c)
>
> and still have a separate POCube operator.  When the parser sees GROUP BY
> CUBE it will generate an LOCube operator for the logical plan rather than
> an LOGroup operator.  You can still have a separate POCube physical
> operator.  Separate optimizations can be applied to LOGroup vs. LOCube and
> POGroup vs. POCube.
>
> 3) On syntax I can see arguments for keeping as close to SQL as possible
> and for the syntax proposed by Prasanth.  The argument for sticking close
> to SQL is it conforms to the law of least astonishment.  It wouldn't be
> exactly SQL, as it would end up looking like:
>
> rel = GROUP rel BY CUBE (cols)
> rel = GROUP rel BY ROLLUP (cols)
> rel = GROUP rel BY GROUPING SETS(cols);
>
> The argument I see for sticking with Prasanth's approach is that GROUP is
> really short for COGROUP in Pig Latin, and I don't think we're proposing
> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
> a thing.  This makes CUBE really a separate operation.  But if we go this
> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
> GROUPING SETS.  Let's not proliferate operators.
>
> Alan.
>
> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>
> > Thanks Jonathan for looking into it and for your suggestions.
> >
> > The reason I came up with a clause rather than a separate operator was
> > to avoid adding additional operators to the grammar.
> > Adding ROLLUP and GROUPING_SET operators would need separate logical
> > operators, adding to the complexity. I am planning to keep everything
> > under the cube operator, so only the LOCube and POCube operators will be
> > added. And as you and Dmitriy have mentioned, the purpose of a HAVING
> > clause is the same as FILTER, so we do not need a separate HAVING clause.
> >
> > I will give a quick recap of the cube-related operations and the multiple
> > syntax options for achieving the same. I am also adding partial cubing
> > and rollup to this discussion.
> >
> > 1) CUBE
> >
> > Current syntax:
> > alias = CUBE rel BY (a, b);
> >
> > Following group-by's will be computed:
> > (a, b)
> > (a)
> > (b)
> > ()
> >
> > 2) Partial CUBE
> >
> > Proposed syntax:
> > alias = CUBE rel BY a, (b, c);
> >
> > Following group-by's will be computed:
> > (a, b, c)
> > (a, b)
> > (a, c)
> > (a)
> >
> > 3) ROLLUP
> >
> > Proposed syntax 1:
> > alias = CUBE rel BY ROLLUP(a, b);
> >
> > Proposed syntax 2:
> > alias = CUBE rel BY (a::b);
> >
> > Proposed syntax 3:
> > alias = ROLLUP rel BY (a, b);
> >
> > Following group-by's will be computed:
> > (a, b)
> > (a)
> > ()
> >
> > 4) Partial ROLLUP
> >
> > Proposed syntax 1:
> > alias = CUBE rel BY a, ROLLUP(b, c);
> >
> > Proposed syntax 2:
> > alias = CUBE rel BY (a, b::c);
> >
> > Proposed syntax 3:
> > alias = ROLLUP rel BY a, (b, c);
> >
> > Following group-by's will be computed:
> > (a, b, c)
> > (a, b)
> > (a)
> >
> > 5) GROUPING SETS
> >
> > Proposed syntax 1:
> > alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
> >
> > Proposed syntax 2:
> > alias = CUBE rel BY {(a), (b, c), (c)}
> >
> > Proposed syntax 3:
> > alias = GROUPING_SET rel BY ((a), (b, c), (c))
> >
> > Following group-by's will be computed:
> > (a)
> > (b, c)
> > (c)
> >
> > Please vote for syntax 1, 2 or 3 so that we can come to a consensus
> > before I start hacking the grammar file.
> >
> > Thanks
> > -- Prasanth
> >
> > On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
> >
> >> Hey Prashanth, happy hacking.
> >>
> >> My opinion:
> >>
> >> CUBE:
> >>
> >> alias = CUBE rel BY (a,b,c);
> >>
> >>
> >> I like that syntax. It's unambiguous what is going on.
> >>
> >>
> >> ROLLUP:
> >>
> >>
> >> alias = CUBE rel BY ROLLUP(a,b,c);
> >>
> >>
> >> I never liked that syntax in SQL. I suggest we just do what we did with
> >> CUBE, i.e.
> >>
> >>
> >> alias = ROLLUP rel BY (a,b,c);
> >>
> >>
> >> GROUPING SETS:
> >>
> >>
> >> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
> >>
> >>
> >> I don't like this. The cube vs. grouping sets is confusing to me. Maybe
> >> following the same pattern you could do something like:
> >>
> >> alias = GROUPING_SET rel BY ((a,b),(b),());
> >>
> >> As far as HAVING goes, is there an optimization that can be done with a
> >> HAVING clause that can't be done based on the logical plan that comes
> >> afterwards? That seems odd to me. Since you have to materialize the
> >> result anyway, can't the HAVING clause just be a FILTER that comes after
> >> the cube? I don't know why we need a special syntax.
> >>
> >> My opinion. Forgive janky formatting, gmail + paste = pain.
> >> Jon
> >>
> >> 2012/5/27 Prasanth J <[email protected]>
> >>
> >>> Hello everyone
> >>>
> >>> I am looking for feedback from the community about the syntax for
> >>> CUBE/ROLLUP/GROUPING SETS operations in Pig.
> >>> I am moving the discussion from JIRA to the dev list so that everyone
> >>> can share their opinion on operator syntax. Please have a look at the
> >>> syntax proposal at the link below and let me know your opinion.
> >>>
> >>>
> >>>
> >>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
> >>>
> >>> Thanks
> >>> -- Prasanth
> >>>
> >>>
> >
>
>
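
For reference, the group-by lists in Prasanth's recap above (CUBE, partial CUBE, ROLLUP, partial ROLLUP) can be reproduced with a short Python sketch. This is only an illustration of the semantics described in the thread, not how Pig computes them; the helper names `cube`, `rollup` and `partial` are made up for this example. An explicit GROUPING SETS clause needs no helper, since the list is simply what the user writes.

```python
from itertools import combinations


def cube(cols):
    """All subsets of cols, largest first, keeping column order."""
    cols = tuple(cols)
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]


def rollup(cols):
    """All prefixes of cols, from the full list down to the empty group."""
    cols = tuple(cols)
    return [cols[:n] for n in range(len(cols), -1, -1)]


def partial(fixed, sets):
    """Prepend the always-grouped columns to every grouping set."""
    fixed = tuple(fixed)
    return [fixed + s for s in sets]


if __name__ == "__main__":
    print(cube(("a", "b")))                      # CUBE rel BY (a, b)
    print(partial(("a",), cube(("b", "c"))))     # CUBE rel BY a, (b, c)
    print(rollup(("a", "b")))                    # ROLLUP over (a, b)
    print(partial(("a",), rollup(("b", "c"))))   # partial ROLLUP: a, then (b, c)
```

Seen this way, CUBE and ROLLUP are both shorthand for a particular list of grouping sets, which is the common ground behind both the "keep it under CUBE" and the "separate operators" positions.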

Re: CUBE/ROLLUP/GROUPING SETS syntax
