I was going to say the same thing Alan said w.r.t. operators: operators in the grammar can correspond to whatever logical and physical operators you want.
As far as the principle of least astonishment compared to SQL... Pig is already pretty astonishing. I don't know why we would bend over backwards to make the syntax so similar in this case when even getting to the point of doing a CUBE means understanding an object model that is pretty different from SQL.

On that note, rel = CUBE rel BY GROUPING SETS(cols); seems really confusing. I'd much rather overload the group operator than the cube operator. If I see "cube," I expect a cube. If you start doing rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig Latin is simple enough that I don't think having a rollup, group_set, etc operator will be so confusing, because they're already going to be typing that stuff in the context of group rel by rollup(cols); and so on. I don't see how it's worth adding more, confusing syntax for the sake of creating parallels with a language we now share very little with.

But I won't beat it any further... if people prefer a different syntax, that's fine. Just excited to have the features in Pig!

Jon

2012/5/30 Alan Gates <ga...@hortonworks.com>

> Some thoughts on this:
>
> 1) +1 to what Dmitriy said on HAVING
>
> 2) We need to be clear about separating operators in the grammar versus
> logical plan versus physical plan. The choices you make in the grammar are
> totally independent of the other two. That is, you could choose the syntax:
>
> rel = GROUP rel BY CUBE (a, b, c)
>
> and still have a separate POCube operator. When the parser sees GROUP BY
> CUBE it will generate an LOCube operator for the logical plan rather than
> an LOGroup operator. You can still have a separate POCube physical
> operator. Separate optimizations can be applied to LOGroup vs. LOCube and
> POGroup vs. POCube.
>
> 3) On syntax I can see arguments for keeping as close to SQL as possible
> and for the syntax proposed by Prasanth. The argument for sticking close
> to SQL is it conforms to the law of least astonishment.
> It wouldn't be exactly SQL, as it would end up looking like:
>
> rel = GROUP rel BY CUBE (cols)
> rel = GROUP rel BY ROLLUP (cols)
> rel = GROUP rel BY GROUPING SETS(cols);
>
> The argument I see for sticking with Prasanth's approach is that GROUP is
> really short for COGROUP in Pig Latin, and I don't think we're proposing
> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
> a thing. This makes CUBE really a separate operation. But if we go this
> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
> GROUPING SETS. Let's not proliferate operators.
>
> Alan.
>
> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>
> > Thanks Jonathan for looking into it and for your suggestions.
> >
> > The reason why I came with a clause rather than a separate operator was
> > to avoid adding additional operators to the grammar.
> > So adding ROLLUP, GROUPING_SET will need separate logical operators
> > adding to the complexity. I am planning to keep everything under cube
> > operator, so only LOCube and POCube operators will be added additionally.
> > And as you and Dmitriy have mentioned the purpose of HAVING clause is the
> > same as FILTER so we do not need a separate HAVING clause.
> >
> > I will give a quick recap of cube related operations and multiple syntax
> > options for achieving the same. I am also adding partial cubing and rollup
> > in this discussion.
> >
> > 1) CUBE
> >
> > Current syntax:
> > alias = CUBE rel BY (a, b);
> >
> > Following group-by's will be computed:
> > (a, b)
> > (a)
> > (b)
> > ()
> >
> > 2) Partial CUBE
> >
> > Proposed syntax:
> > alias = CUBE rel BY a, (b, c);
> >
> > Following group-by's will be computed:
> > (a, b, c)
> > (a, b)
> > (a, c)
> > (a)
> >
> > 3) ROLLUP
> >
> > Proposed syntax 1:
> > alias = CUBE rel BY ROLLUP(a, b);
> >
> > Proposed syntax 2:
> > alias = CUBE rel BY (a::b);
> >
> > Proposed syntax 3:
> > alias = ROLLUP rel BY (a, b);
> >
> > Following group-by's will be computed:
> > (a, b)
> > (a)
> > ()
> >
> > 4) Partial ROLLUP
> >
> > Proposed syntax 1:
> > alias = CUBE rel BY a, ROLLUP(b, c);
> >
> > Proposed syntax 2:
> > alias = CUBE rel BY (a, b::c);
> >
> > Proposed syntax 3:
> > alias = ROLLUP rel BY a, (b, c);
> >
> > Following group-by's will be computed:
> > (a, b, c)
> > (a, b)
> > (a)
> >
> > 5) GROUPING SETS
> >
> > Proposed syntax 1:
> > alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
> >
> > Proposed syntax 2:
> > alias = CUBE rel BY {(a), (b, c), (c)}
> >
> > Proposed syntax 3:
> > alias = GROUPING_SET rel BY ((a), (b, c), (c))
> >
> > Following group-by's will be computed:
> > (a)
> > (b, c)
> > (c)
> >
> > Please vote for syntax 1, 2 or 3 so that we can come to a consensus
> > before I start hacking the grammar file.
> >
> > Thanks
> > -- Prasanth
> >
> > On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
> >
> >> Hey Prashanth, happy hacking.
> >>
> >> My opinion:
> >>
> >> CUBE:
> >>
> >> alias = CUBE rel BY (a,b,c);
> >>
> >> I like that syntax. It's unambiguous what is going on.
> >>
> >> ROLLUP:
> >>
> >> alias = CUBE rel BY ROLLUP(a,b,c);
> >>
> >> I never liked that syntax in SQL. I suggest we just do what we did with
> >> CUBE. IE
> >>
> >> alias = ROLLUP rel BY (a,b,c);
> >>
> >> GROUPING SETS:
> >>
> >> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
> >>
> >> I don't like this. The cube vs.
> >> grouping sets is confusing to me. maybe following the
> >> same pattern you could do something like:
> >>
> >> alias = GROUPING_SET rel BY ((a,b),(b),());
> >>
> >> As far as having, is there an optimization that can be done with a HAVING
> >> clause that can't be done based on the logical plan that comes afterwards?
> >> That seems odd to me. Since you have to materialize the result anyway,
> >> can't the having clause just be a FILTER that comes after the cube? I don't
> >> know why we need a special syntax.
> >>
> >> My opinion. Forgive janky formatting, gmail + paste = pain.
> >> Jon
> >>
> >> 2012/5/27 Prasanth J <buckeye.prasa...@gmail.com>
> >>
> >>> Hello everyone
> >>>
> >>> I am looking for feedback from the community about the syntax for
> >>> CUBE/ROLLUP/GROUPING SETS operations in pig.
> >>> I am moving the discussion from JIRA to dev-list so that everyone can
> >>> share their opinion for operator syntax. Please have a look at the syntax
> >>> proposal at the link below and let me know your opinion
> >>>
> >>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
> >>>
> >>> Thanks
> >>> -- Prasanth
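
For anyone checking the group-by lists in Prasanth's recap, they can be reproduced with a few lines of Python. This is only an illustrative sketch and not Pig code: the helper names cube, rollup and partial are made up for this example, and the ordering (largest grouping first) is chosen simply to mirror the recap.

```python
from itertools import combinations


def cube(cols):
    """Hypothetical helper: every subset of cols, largest first."""
    cols = list(cols)
    return [
        combo
        for size in range(len(cols), -1, -1)
        for combo in combinations(cols, size)
    ]


def rollup(cols):
    """Hypothetical helper: prefixes of cols, down to the empty grouping."""
    cols = list(cols)
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def partial(fixed, groupings):
    """Hypothetical helper: keep the fixed columns in every grouping."""
    return [tuple(fixed) + g for g in groupings]


if __name__ == "__main__":
    print(cube(["a", "b"]))                    # 1) CUBE
    print(partial(["a"], cube(["b", "c"])))    # 2) Partial CUBE
    print(rollup(["a", "b"]))                  # 3) ROLLUP
    print(partial(["a"], rollup(["b", "c"])))  # 4) Partial ROLLUP
```

GROUPING SETS needs no helper, since the user lists the groupings explicitly. Whichever surface syntax wins the vote, the expansion itself is the same, which is in line with Alan's point that the grammar choice is independent of the logical and physical operators.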