Hello all I initially implemented ROLLUP as a separate operation with the following syntax
a = ROLLUP inp BY (x,y); which does the same thing as CUBE (inserting foreach + group-by in logical plan) except that it uses RollupDimensions UDF. But the issue with this approach is that we cannot mix CUBE and ROLLUP operations together in the same syntax which is a typical case. SQL/Oracle supports using CUBE and ROLLUP together like GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f); so I modified the pig grammar to support the similar usage. So now we can use a syntax similar to SQL out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f); In this approach, the logical plan should introduce cartesian product between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the final output. But I read from the documentation (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator is an expensive operator and advices to use it sparingly. Is there any other way to achieve the cartesian product in a less expensive way? Also, does anyone have thoughts about this new syntax? Thanks -- Prasanth On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote: > As far as the underlying implementation, if they all use the same > optimizations that you use in cube, then it can be LOCube. If they have > their own optimizations etc (or could), it may be worth them having their > own Logical operators (which might just be LOCube with flags for the time > being) that allows us more flexibilty. But I suppose that's between you, > eclipse, and your GSOC mentor. > > 2012/5/30 Prasanth J <[email protected]> > >> Thanks Alan and Jon for expressing your views. >> >> I agree with Jon's point, if the syntax contains CUBE then user expects it >> to perform CUBE operation. So Jon's syntax seems more meaningful and concise >> >> rel = CUBE rel BY (dims); >> rel = ROLLUP rel BY (dims); >> rel = GROUPING_SET rel BY (dims); >> >> 2 reasons why I do not prefer using SQL syntax is >> 1) I do not want to break into existing Group operator implementation :) >> 2) The syntax gets longer in case of partial hierarchical cubing/rollups >> For ex: >> >> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6), >> ROLLUP(dim7,dim8,dim9); >> >> whereas same thing can be expressed like >> >> rel = ROLLUP rel BY dim0, >> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9); >> >> Thanks Alan for pointing out the way for independently managing the >> operators in parser and logical/physical plan. So for all these operators >> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to >> differentiate between these three operations. >> >> But, yes we are proliferating operators in this case. >> >> Thanks >> -- Prasanth >> >> On May 30, 2012, at 4:42 PM, Alan Gates wrote: >> >>> >>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote: >>> >>>> I was going to say the same thing Alan said w.r.t. operators: operators >> in >>>> the grammar can correspond to whatever logical and physical operators >> you >>>> want. >>>> >>>> As far as the principle of least astonishment compared to SQL... Pig is >>>> already pretty astonishing. I don't know why we would bend over >> backwards >>>> to make the syntax so similar in this case when even getting to the >> point >>>> of doing a CUBE means understanding an object model that is pretty >>>> different from SQL. >>>> >>>> On that note, >>>> >>>> rel = CUBE rel BY GROUPING SETS(cols); >>>> >>>> seems really confusing. I'd much rather overload the group operating >> than >>>> the cube operator. If I see "cube," I expect a cube. If you start doing >>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig >>>> latin is simple enough that I don't think having a rollup, group_set, >> etc >>>> operator will be so confusing, because they're already going to be >> typing >>>> that stuff in the conext of >>>> >>>> group rel by rollup(cols); and so on. I don't see how it's worth adding >>>> more, confusing syntax for the sake of creating parallels with a >> language >>>> we now share very little with. >>> >>> Fair points. >>> >>>> >>>> But I won't beat it any further... if people prefer a different syntax, >>>> that's fine. Just excited to have the features in Pig! >>> +1, I can live with any of the 3 syntax choices (near SQL, original, and >> Jon's). >>> >>> Alan. >>> >>>> Jon >>>> >>>> 2012/5/30 Alan Gates <[email protected]> >>>> >>>>> Some thoughts on this: >>>>> >>>>> 1) +1 to what Dmitriy said on HAVING >>>>> >>>>> 2) We need to be clear about separating operators in the grammar versus >>>>> logical plan versus physical plan. The choices you make in the >> grammar are >>>>> totally independent of the other two. That is, you could choose the >> syntax: >>>>> >>>>> rel = GROUP rel BY CUBE (a, b, c) >>>>> >>>>> and still have a separate POCube operator. When the parser sees GROUP >> BY >>>>> CUBE it will generate an LOCube operator for the logical plan rather >> than >>>>> an LOGroup operator. You can still have a separate POCube physical >>>>> operator. Separate optimizations can be applied to LOGroup vs. LOCube >> and >>>>> POGroup vs. POCube. >>>>> >>>>> 3) On syntax I can see arguments for keeping as close to SQL as >> possible >>>>> and for the syntax proposed by Prasanth. The argument for sticking >> close >>>>> to SQL is it conforms to the law of least astonishment. It wouldn't be >>>>> exactly SQL, as it would end up looking like: >>>>> >>>>> rel = GROUP rel BY CUBE (cols) >>>>> rel = GROUP rel BY ROLLUP (cols) >>>>> rel = GROUP rel BY GROUPING SETS(cols); >>>>> >>>>> The argument I see for sticking with Prasanth's approach is that GROUP >> is >>>>> really short for COGROUP in Pig Latin, and I don't think we're >> proposing >>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do >> such >>>>> a thing. This makes CUBE really a separate operation. But if we go >> this >>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE >> rel BY >>>>> GROUPING SETS. Let's not proliferate operators. >>>>> >>>>> Alan. >>>>> >>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote: >>>>> >>>>>> Thanks Jonathan for looking into it and for your suggestions. >>>>>> >>>>>> The reason why I came with a clause rather than a separate operator >> was >>>>> to avoid adding additional operators to the grammar. >>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators >>>>> adding to the complexity. I am planning to keep everything under cube >>>>> operator, so only LOCube and POCube operators will be added >> additionally. >>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is >> the >>>>> same as FILTER so we do not need a separate HAVING clause. >>>>>> >>>>>> I will give a quick recap of cube related operations and multiple >> syntax >>>>> options for achieving the same. I am also adding partial cubing and >> rollup >>>>> in this discussion. >>>>>> >>>>>> 1) CUBE >>>>>> >>>>>> Current syntax: >>>>>> alias = CUBE rel BY (a, b); >>>>>> >>>>>> Following group-by's will be computed: >>>>>> (a, b) >>>>>> (a) >>>>>> (b) >>>>>> () >>>>>> >>>>>> 2) Partial CUBE >>>>>> >>>>>> Proposed syntax: >>>>>> alias = CUBE rel BY a, (b, c); >>>>>> >>>>>> Following group-by's will be computed: >>>>>> (a, b, c) >>>>>> (a, b) >>>>>> (a, c) >>>>>> (a) >>>>>> >>>>>> 3) ROLLUP >>>>>> >>>>>> Proposed syntax 1: >>>>>> alias = CUBE rel BY ROLLUP(a, b); >>>>>> >>>>>> Proposed syntax 2: >>>>>> alias = CUBE rel BY (a::b); >>>>>> >>>>>> Proposed syntax 3: >>>>>> alias = ROLLUP rel BY (a, b); >>>>>> >>>>>> Following group-by's will be computed: >>>>>> (a, b) >>>>>> (a) >>>>>> () >>>>>> >>>>>> 4) Partial ROLLUP >>>>>> >>>>>> Proposed syntax 1: >>>>>> alias = CUBE rel BY a, ROLLUP(b, c); >>>>>> >>>>>> Proposed syntax 2: >>>>>> alias = CUBE rel BY (a, b::c); >>>>>> >>>>>> Proposed syntax 3: >>>>>> alias = ROLLUP rel BY a, (b, c); >>>>>> >>>>>> Following group-by's will be computed: >>>>>> (a, b, c) >>>>>> (a, b) >>>>>> (a) >>>>>> >>>>>> 5) GROUPING SETS >>>>>> >>>>>> Proposed syntax 1: >>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c)) >>>>>> >>>>>> Proposed syntax 2: >>>>>> alias = CUBE rel BY {(a), (b, c), (c)} >>>>>> >>>>>> Proposed syntax 3: >>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c)) >>>>>> >>>>>> Following group-by's will be computed: >>>>>> (a) >>>>>> (b, c) >>>>>> (c) >>>>>> >>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus >>>>> before I start hacking the grammar file. >>>>>> >>>>>> Thanks >>>>>> -- Prasanth >>>>>> >>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote: >>>>>> >>>>>>> Hey Prashanth, happy hacking. >>>>>>> >>>>>>> My opinion: >>>>>>> >>>>>>> CUBE: >>>>>>> >>>>>>> alias = CUBE rel BY (a,b,c); >>>>>>> >>>>>>> >>>>>>> I like that syntax. It's unambiguous what is going on. >>>>>>> >>>>>>> >>>>>>> ROLLUP: >>>>>>> >>>>>>> >>>>>>> alias = CUBE rel BY ROLLUP(a,b,c); >>>>>>> >>>>>>> >>>>>>> I never liked that syntax in SQL. I suggest we just do what we did >> with >>>>> CUBE. IE >>>>>>> >>>>>>> >>>>>>> alias = ROLLUP rel BY (a,b,c); >>>>>>> >>>>>>> >>>>>>> GROUPING SETS: >>>>>>> >>>>>>> >>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),()); >>>>>>> >>>>>>> >>>>>>> I don't like this. The cube vs. grouping sets is confusing to me. >> maybe >>>>>>> following the >>>>>>> same pattern you could do something like: >>>>>>> >>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),()); >>>>>>> >>>>>>> As far as having, is there an optimization that can be done with a >>>>> HAVING >>>>>>> clause that can't be done based on the logical plan that comes >>>>> afterwards? >>>>>>> That seems odd to me. Since you have to materialize the result >> anyway, >>>>>>> can't the having clause just be a FILTER that comes after the cube? I >>>>> don't >>>>>>> know why we need a special syntax. >>>>>>> >>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain. >>>>>>> Jon >>>>>>> >>>>>>> 2012/5/27 Prasanth J <[email protected]> >>>>>>> >>>>>>>> Hello everyone >>>>>>>> >>>>>>>> I am looking for feedback from the community about the syntax for >>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig. >>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone >> can >>>>>>>> share their opinion for operator syntax. Please have a look at the >>>>> syntax >>>>>>>> proposal at the link below and let me know your opinion >>>>>>>> >>>>>>>> >>>>>>>> >>>>> >> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644 >>>>>>>> >>>>>>>> Thanks >>>>>>>> -- Prasanth >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>> >>> >> >>
