On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:

> I was going to say the same thing Alan said w.r.t. operators: operators in
> the grammar can correspond to whatever logical and physical operators you
> want.
>
> As far as the principle of least astonishment compared to SQL... Pig is
> already pretty astonishing. I don't know why we would bend over backwards
> to make the syntax so similar in this case when even getting to the point
> of doing a CUBE means understanding an object model that is pretty
> different from SQL.
>
> On that note,
>
>     rel = CUBE rel BY GROUPING SETS(cols);
>
> seems really confusing. I'd much rather overload the group operator than
> the cube operator. If I see "cube," I expect a cube. If you start doing
> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
> Latin is simple enough that I don't think having a rollup, group_set, etc
> operator will be so confusing, because they're already going to be typing
> that stuff in the context of
>
>     group rel by rollup(cols);
>
> and so on. I don't see how it's worth adding
> more, confusing syntax for the sake of creating parallels with a language
> we now share very little with.
Fair points.

>
> But I won't beat it any further... if people prefer a different syntax,
> that's fine. Just excited to have the features in Pig!

+1, I can live with any of the 3 syntax choices (near SQL, original, and Jon's).

Alan.

> Jon
>
> 2012/5/30 Alan Gates <[email protected]>
>
>> Some thoughts on this:
>>
>> 1) +1 to what Dmitriy said on HAVING
>>
>> 2) We need to be clear about separating operators in the grammar versus
>> logical plan versus physical plan. The choices you make in the grammar are
>> totally independent of the other two. That is, you could choose the syntax:
>>
>>     rel = GROUP rel BY CUBE (a, b, c)
>>
>> and still have a separate POCube operator. When the parser sees GROUP BY
>> CUBE it will generate an LOCube operator for the logical plan rather than
>> an LOGroup operator. You can still have a separate POCube physical
>> operator. Separate optimizations can be applied to LOGroup vs. LOCube and
>> POGroup vs. POCube.
>>
>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>> and for the syntax proposed by Prasanth. The argument for sticking close
>> to SQL is it conforms to the law of least astonishment. It wouldn't be
>> exactly SQL, as it would end up looking like:
>>
>>     rel = GROUP rel BY CUBE (cols)
>>     rel = GROUP rel BY ROLLUP (cols)
>>     rel = GROUP rel BY GROUPING SETS(cols);
>>
>> The argument I see for sticking with Prasanth's approach is that GROUP is
>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>> a thing. This makes CUBE really a separate operation. But if we go this
>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>> GROUPING SETS. Let's not proliferate operators.
>>
>> Alan.
>>
>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>
>>> Thanks Jonathan for looking into it and for your suggestions.
>>>
>>> The reason why I came up with a clause rather than a separate operator was
>>> to avoid adding additional operators to the grammar.
>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>> adding to the complexity. I am planning to keep everything under cube
>>> operator, so only LOCube and POCube operators will be added additionally.
>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is the
>>> same as FILTER so we do not need a separate HAVING clause.
>>>
>>> I will give a quick recap of cube related operations and multiple syntax
>>> options for achieving the same. I am also adding partial cubing and rollup
>>> in this discussion.
>>>
>>> 1) CUBE
>>>
>>> Current syntax:
>>>     alias = CUBE rel BY (a, b);
>>>
>>> Following group-by's will be computed:
>>>     (a, b)
>>>     (a)
>>>     (b)
>>>     ()
>>>
>>> 2) Partial CUBE
>>>
>>> Proposed syntax:
>>>     alias = CUBE rel BY a, (b, c);
>>>
>>> Following group-by's will be computed:
>>>     (a, b, c)
>>>     (a, b)
>>>     (a, c)
>>>     (a)
>>>
>>> 3) ROLLUP
>>>
>>> Proposed syntax 1:
>>>     alias = CUBE rel BY ROLLUP(a, b);
>>>
>>> Proposed syntax 2:
>>>     alias = CUBE rel BY (a::b);
>>>
>>> Proposed syntax 3:
>>>     alias = ROLLUP rel BY (a, b);
>>>
>>> Following group-by's will be computed:
>>>     (a, b)
>>>     (a)
>>>     ()
>>>
>>> 4) Partial ROLLUP
>>>
>>> Proposed syntax 1:
>>>     alias = CUBE rel BY a, ROLLUP(b, c);
>>>
>>> Proposed syntax 2:
>>>     alias = CUBE rel BY (a, b::c);
>>>
>>> Proposed syntax 3:
>>>     alias = ROLLUP rel BY a, (b, c);
>>>
>>> Following group-by's will be computed:
>>>     (a, b, c)
>>>     (a, b)
>>>     (a)
>>>
>>> 5) GROUPING SETS
>>>
>>> Proposed syntax 1:
>>>     alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>
>>> Proposed syntax 2:
>>>     alias = CUBE rel BY {(a), (b, c), (c)}
>>>
>>> Proposed syntax 3:
>>>     alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>
>>> Following group-by's will be computed:
>>>     (a)
>>>     (b, c)
>>>     (c)
>>>
>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus before
>>> I start hacking the grammar file.
>>>
>>> Thanks
>>> -- Prasanth
>>>
>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>
>>>> Hey Prashanth, happy hacking.
>>>>
>>>> My opinion:
>>>>
>>>> CUBE:
>>>>
>>>>     alias = CUBE rel BY (a,b,c);
>>>>
>>>> I like that syntax. It's unambiguous what is going on.
>>>>
>>>> ROLLUP:
>>>>
>>>>     alias = CUBE rel BY ROLLUP(a,b,c);
>>>>
>>>> I never liked that syntax in SQL. I suggest we just do what we did with
>>>> CUBE. IE
>>>>
>>>>     alias = ROLLUP rel BY (a,b,c);
>>>>
>>>> GROUPING SETS:
>>>>
>>>>     alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>
>>>> I don't like this. The cube vs. grouping sets is confusing to me. maybe
>>>> following the same pattern you could do something like:
>>>>
>>>>     alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>
>>>> As far as having, is there an optimization that can be done with a HAVING
>>>> clause that can't be done based on the logical plan that comes afterwards?
>>>> That seems odd to me. Since you have to materialize the result anyway,
>>>> can't the having clause just be a FILTER that comes after the cube? I don't
>>>> know why we need a special syntax.
>>>>
>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>> Jon
>>>>
>>>> 2012/5/27 Prasanth J <[email protected]>
>>>>
>>>>> Hello everyone
>>>>>
>>>>> I am looking for feedback from the community about the syntax for
>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>> I am moving the discussion from JIRA to dev-list so that everyone can
>>>>> share their opinion for operator syntax. Please have a look at the syntax
>>>>> proposal at the link below and let me know your opinion
>>>>>
>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>
>>>>> Thanks
>>>>> -- Prasanth
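
For readers following the thread, the group-by expansions listed in Prasanth's recap (CUBE, partial CUBE, ROLLUP, partial ROLLUP, GROUPING SETS) can be sketched in a few lines of Python. This is only an illustration of the semantics discussed above, not how Pig implements them; the function names `cube`, `rollup` and `grouping_sets` and the `prefix` parameter are hypothetical and chosen for this sketch.

```python
from itertools import combinations


def cube(cols, prefix=()):
    """Group-bys for a CUBE over cols.

    Columns in prefix (hypothetical parameter) appear in every group-by,
    which models the "partial CUBE" case: CUBE rel BY a, (b, c).
    """
    cols = tuple(cols)
    result = []
    # Enumerate every subset of cols, largest first, keeping column order.
    for size in range(len(cols), -1, -1):
        for subset in combinations(cols, size):
            result.append(tuple(prefix) + subset)
    return result


def rollup(cols, prefix=()):
    """Group-bys for a ROLLUP over cols: successively shorter leading prefixes.

    Columns in prefix are kept in every group-by ("partial ROLLUP").
    """
    cols = tuple(cols)
    return [tuple(prefix) + cols[:n] for n in range(len(cols), -1, -1)]


def grouping_sets(sets):
    """GROUPING SETS computes exactly the listed group-bys, nothing more."""
    return [tuple(s) for s in sets]


if __name__ == "__main__":
    print(cube(("a", "b")))                    # CUBE rel BY (a, b)
    print(cube(("b", "c"), prefix=("a",)))     # CUBE rel BY a, (b, c)
    print(rollup(("a", "b")))                  # ROLLUP over (a, b)
    print(rollup(("b", "c"), prefix=("a",)))   # partial ROLLUP: a, (b, c)
    print(grouping_sets([("a",), ("b", "c"), ("c",)]))
```

Seen this way, CUBE and ROLLUP are both shorthand for particular grouping sets, which is the common ground behind all three syntax proposals.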
