Thanks Alan and Jon for expressing your views. I agree with Jon's point: if the syntax contains CUBE, then the user expects it to perform a CUBE operation. So Jon's syntax seems more meaningful and concise:

rel = CUBE rel BY (dims);
rel = ROLLUP rel BY (dims);
rel = GROUPING_SET rel BY (dims);

The 2 reasons why I prefer not to use the SQL syntax are:
1) I do not want to break into the existing GROUP operator implementation :)
2) The syntax gets longer in the case of partial hierarchical cubing/rollups.

For example:
rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6), ROLLUP(dim7,dim8,dim9);

whereas the same thing can be expressed like
rel = ROLLUP rel BY dim0, (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);

Thanks Alan for pointing out the way to manage the operators independently in the parser and in the logical/physical plans. So for all these operators (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to differentiate between these three operations. But, yes, we are proliferating operators in this case.

Thanks
-- Prasanth

On May 30, 2012, at 4:42 PM, Alan Gates wrote:

>
> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>
>> I was going to say the same thing Alan said w.r.t. operators: operators in
>> the grammar can correspond to whatever logical and physical operators you
>> want.
>>
>> As far as the principle of least astonishment compared to SQL... Pig is
>> already pretty astonishing. I don't know why we would bend over backwards
>> to make the syntax so similar in this case when even getting to the point
>> of doing a CUBE means understanding an object model that is pretty
>> different from SQL.
>>
>> On that note,
>>
>> rel = CUBE rel BY GROUPING SETS(cols);
>>
>> seems really confusing. I'd much rather overload the group operator than
>> the cube operator. If I see "cube," I expect a cube. If you start doing
>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>> latin is simple enough that I don't think having a rollup, group_set, etc
>> operator will be so confusing, because they're already going to be typing
>> that stuff in the context of
>>
>> group rel by rollup(cols); and so on.
>> I don't see how it's worth adding
>> more, confusing syntax for the sake of creating parallels with a language
>> we now share very little with.
>
> Fair points.
>
>>
>> But I won't beat it any further... if people prefer a different syntax,
>> that's fine. Just excited to have the features in Pig!
> +1, I can live with any of the 3 syntax choices (near SQL, original, and
> Jon's).
>
> Alan.
>
>> Jon
>>
>> 2012/5/30 Alan Gates <[email protected]>
>>
>>> Some thoughts on this:
>>>
>>> 1) +1 to what Dmitriy said on HAVING
>>>
>>> 2) We need to be clear about separating operators in the grammar versus
>>> logical plan versus physical plan. The choices you make in the grammar are
>>> totally independent of the other two. That is, you could choose the syntax:
>>>
>>> rel = GROUP rel BY CUBE (a, b, c)
>>>
>>> and still have a separate POCube operator. When the parser sees GROUP BY
>>> CUBE it will generate an LOCube operator for the logical plan rather than
>>> an LOGroup operator. You can still have a separate POCube physical
>>> operator. Separate optimizations can be applied to LOGroup vs. LOCube and
>>> POGroup vs. POCube.
>>>
>>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>>> and for the syntax proposed by Prasanth. The argument for sticking close
>>> to SQL is it conforms to the law of least astonishment. It wouldn't be
>>> exactly SQL, as it would end up looking like:
>>>
>>> rel = GROUP rel BY CUBE (cols)
>>> rel = GROUP rel BY ROLLUP (cols)
>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>
>>> The argument I see for sticking with Prasanth's approach is that GROUP is
>>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>>> a thing. This makes CUBE really a separate operation. But if we go this
>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>>> GROUPING SETS. Let's not proliferate operators.
>>>
>>> Alan.
>>>
>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>
>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>
>>>> The reason why I came with a clause rather than a separate operator was
>>>> to avoid adding additional operators to the grammar.
>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>> adding to the complexity. I am planning to keep everything under cube
>>>> operator, so only LOCube and POCube operators will be added additionally.
>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is the
>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>
>>>> I will give a quick recap of cube related operations and multiple syntax
>>>> options for achieving the same. I am also adding partial cubing and rollup
>>>> in this discussion.
>>>>
>>>> 1) CUBE
>>>>
>>>> Current syntax:
>>>> alias = CUBE rel BY (a, b);
>>>>
>>>> Following group-by's will be computed:
>>>> (a, b)
>>>> (a)
>>>> (b)
>>>> ()
>>>>
>>>> 2) Partial CUBE
>>>>
>>>> Proposed syntax:
>>>> alias = CUBE rel BY a, (b, c);
>>>>
>>>> Following group-by's will be computed:
>>>> (a, b, c)
>>>> (a, b)
>>>> (a, c)
>>>> (a)
>>>>
>>>> 3) ROLLUP
>>>>
>>>> Proposed syntax 1:
>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>
>>>> Proposed syntax 2:
>>>> alias = CUBE rel BY (a::b);
>>>>
>>>> Proposed syntax 3:
>>>> alias = ROLLUP rel BY (a, b);
>>>>
>>>> Following group-by's will be computed:
>>>> (a, b)
>>>> (a)
>>>> ()
>>>>
>>>> 4) Partial ROLLUP
>>>>
>>>> Proposed syntax 1:
>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>
>>>> Proposed syntax 2:
>>>> alias = CUBE rel BY (a, b::c);
>>>>
>>>> Proposed syntax 3:
>>>> alias = ROLLUP rel BY a, (b, c);
>>>>
>>>> Following group-by's will be computed:
>>>> (a, b, c)
>>>> (a, b)
>>>> (a)
>>>>
>>>> 5) GROUPING SETS
>>>>
>>>> Proposed syntax 1:
>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>
>>>> Proposed syntax 2:
>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>
>>>> Proposed syntax 3:
>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>
>>>> Following group-by's will be computed:
>>>> (a)
>>>> (b, c)
>>>> (c)
>>>>
>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>> before I start hacking the grammar file.
>>>>
>>>> Thanks
>>>> -- Prasanth
>>>>
>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>
>>>>> Hey Prasanth, happy hacking.
>>>>>
>>>>> My opinion:
>>>>>
>>>>> CUBE:
>>>>>
>>>>> alias = CUBE rel BY (a,b,c);
>>>>>
>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>
>>>>> ROLLUP:
>>>>>
>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>
>>>>> I never liked that syntax in SQL. I suggest we just do what we did with
>>>>> CUBE. IE
>>>>>
>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>
>>>>> GROUPING SETS:
>>>>>
>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>
>>>>> I don't like this. The cube vs. grouping sets is confusing to me. Maybe
>>>>> following the same pattern you could do something like:
>>>>>
>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>
>>>>> As far as having, is there an optimization that can be done with a
>>>>> HAVING clause that can't be done based on the logical plan that comes
>>>>> afterwards? That seems odd to me. Since you have to materialize the
>>>>> result anyway, can't the having clause just be a FILTER that comes
>>>>> after the cube? I don't know why we need a special syntax.
>>>>>
>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>> Jon
>>>>>
>>>>> 2012/5/27 Prasanth J <[email protected]>
>>>>>
>>>>>> Hello everyone
>>>>>>
>>>>>> I am looking for feedback from the community about the syntax for
>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>> I am moving the discussion from JIRA to dev-list so that everyone can
>>>>>> share their opinion for operator syntax.
>>>>>> Please have a look at the syntax
>>>>>> proposal at the link below and let me know your opinion
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>
>>>>>> Thanks
>>>>>> -- Prasanth
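
For reference, the group-by lists in items 1) to 4) of Prasanth's May 29 recap can be enumerated mechanically. The following is only a minimal Python sketch of that enumeration, not Pig code and not a description of how LOCube/POCube would work. The names cube_sets and rollup_sets and the prefix parameter are made up for illustration; prefix stands for the dimensions written outside the parentheses in the partial forms, e.g. the "a" in "CUBE rel BY a, (b, c)".

```python
from itertools import combinations


def cube_sets(dims, prefix=()):
    """Return every subset of dims, largest first, each preceded by prefix.

    prefix holds the dimensions that appear in every group-by
    (hypothetical parameter, used here for the partial CUBE form).
    """
    dims = tuple(dims)
    result = []
    for size in range(len(dims), -1, -1):
        # combinations() keeps the original order of dims inside each subset
        for combo in combinations(dims, size):
            result.append(tuple(prefix) + combo)
    return result


def rollup_sets(dims, prefix=()):
    """Return the hierarchical prefixes of dims, longest first, after prefix."""
    dims = tuple(dims)
    return [tuple(prefix) + dims[:n] for n in range(len(dims), -1, -1)]


if __name__ == "__main__":
    print(cube_sets(("a", "b")))            # CUBE rel BY (a, b)
    print(cube_sets(("b", "c"), ("a",)))    # CUBE rel BY a, (b, c)
    print(rollup_sets(("a", "b")))          # ROLLUP rel BY (a, b)
    print(rollup_sets(("b", "c"), ("a",)))  # ROLLUP rel BY a, (b, c)
```

The four calls yield the same group-by combinations, in the same order, as items 1) to 4) of the recap. GROUPING SETS needs no enumeration, since the sets are listed explicitly in the statement.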
