Just to make sure I understand this correctly, is out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
equivalent to: out1 = CUBE rel BY (a,b,c); out2 = ROLLUP rel BY (c,d); out3 = CUBY rel BY (e,f); out = CROSS out1, out2, out3; ? 2012/6/21 Prasanth J <[email protected]> > Hello all > > I initially implemented ROLLUP as a separate operation with the following > syntax > > a = ROLLUP inp BY (x,y); > > which does the same thing as CUBE (inserting foreach + group-by in logical > plan) except that it uses RollupDimensions UDF. But the issue with this > approach is that we cannot mix CUBE and ROLLUP operations together in the > same syntax which is a typical case. SQL/Oracle supports using CUBE and > ROLLUP together like > > GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f); > > so I modified the pig grammar to support the similar usage. So now we can > use a syntax similar to SQL > > out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f); > > In this approach, the logical plan should introduce cartesian product > between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for > generating the final output. But I read from the documentation ( > http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator > is an expensive operator and advices to use it sparingly. > > Is there any other way to achieve the cartesian product in a less > expensive way? Also, does anyone have thoughts about this new syntax? > > Thanks > -- Prasanth > > On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote: > > > As far as the underlying implementation, if they all use the same > > optimizations that you use in cube, then it can be LOCube. If they have > > their own optimizations etc (or could), it may be worth them having their > > own Logical operators (which might just be LOCube with flags for the time > > being) that allows us more flexibilty. But I suppose that's between you, > > eclipse, and your GSOC mentor. > > > > 2012/5/30 Prasanth J <[email protected]> > > > >> Thanks Alan and Jon for expressing your views. > >> > >> I agree with Jon's point, if the syntax contains CUBE then user expects > it > >> to perform CUBE operation. So Jon's syntax seems more meaningful and > concise > >> > >> rel = CUBE rel BY (dims); > >> rel = ROLLUP rel BY (dims); > >> rel = GROUPING_SET rel BY (dims); > >> > >> 2 reasons why I do not prefer using SQL syntax is > >> 1) I do not want to break into existing Group operator implementation :) > >> 2) The syntax gets longer in case of partial hierarchical cubing/rollups > >> For ex: > >> > >> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), > ROLLUP(dim4,dim5,dim6), > >> ROLLUP(dim7,dim8,dim9); > >> > >> whereas same thing can be expressed like > >> > >> rel = ROLLUP rel BY dim0, > >> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9); > >> > >> Thanks Alan for pointing out the way for independently managing the > >> operators in parser and logical/physical plan. So for all these > operators > >> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to > >> differentiate between these three operations. > >> > >> But, yes we are proliferating operators in this case. > >> > >> Thanks > >> -- Prasanth > >> > >> On May 30, 2012, at 4:42 PM, Alan Gates wrote: > >> > >>> > >>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote: > >>> > >>>> I was going to say the same thing Alan said w.r.t. operators: > operators > >> in > >>>> the grammar can correspond to whatever logical and physical operators > >> you > >>>> want. > >>>> > >>>> As far as the principle of least astonishment compared to SQL... Pig > is > >>>> already pretty astonishing. I don't know why we would bend over > >> backwards > >>>> to make the syntax so similar in this case when even getting to the > >> point > >>>> of doing a CUBE means understanding an object model that is pretty > >>>> different from SQL. > >>>> > >>>> On that note, > >>>> > >>>> rel = CUBE rel BY GROUPING SETS(cols); > >>>> > >>>> seems really confusing. I'd much rather overload the group operating > >> than > >>>> the cube operator. If I see "cube," I expect a cube. If you start > doing > >>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. > Pig > >>>> latin is simple enough that I don't think having a rollup, group_set, > >> etc > >>>> operator will be so confusing, because they're already going to be > >> typing > >>>> that stuff in the conext of > >>>> > >>>> group rel by rollup(cols); and so on. I don't see how it's worth > adding > >>>> more, confusing syntax for the sake of creating parallels with a > >> language > >>>> we now share very little with. > >>> > >>> Fair points. > >>> > >>>> > >>>> But I won't beat it any further... if people prefer a different > syntax, > >>>> that's fine. Just excited to have the features in Pig! > >>> +1, I can live with any of the 3 syntax choices (near SQL, original, > and > >> Jon's). > >>> > >>> Alan. > >>> > >>>> Jon > >>>> > >>>> 2012/5/30 Alan Gates <[email protected]> > >>>> > >>>>> Some thoughts on this: > >>>>> > >>>>> 1) +1 to what Dmitriy said on HAVING > >>>>> > >>>>> 2) We need to be clear about separating operators in the grammar > versus > >>>>> logical plan versus physical plan. The choices you make in the > >> grammar are > >>>>> totally independent of the other two. That is, you could choose the > >> syntax: > >>>>> > >>>>> rel = GROUP rel BY CUBE (a, b, c) > >>>>> > >>>>> and still have a separate POCube operator. When the parser sees > GROUP > >> BY > >>>>> CUBE it will generate an LOCube operator for the logical plan rather > >> than > >>>>> an LOGroup operator. You can still have a separate POCube physical > >>>>> operator. Separate optimizations can be applied to LOGroup vs. > LOCube > >> and > >>>>> POGroup vs. POCube. > >>>>> > >>>>> 3) On syntax I can see arguments for keeping as close to SQL as > >> possible > >>>>> and for the syntax proposed by Prasanth. The argument for sticking > >> close > >>>>> to SQL is it conforms to the law of least astonishment. It wouldn't > be > >>>>> exactly SQL, as it would end up looking like: > >>>>> > >>>>> rel = GROUP rel BY CUBE (cols) > >>>>> rel = GROUP rel BY ROLLUP (cols) > >>>>> rel = GROUP rel BY GROUPING SETS(cols); > >>>>> > >>>>> The argument I see for sticking with Prasanth's approach is that > GROUP > >> is > >>>>> really short for COGROUP in Pig Latin, and I don't think we're > >> proposing > >>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to > do > >> such > >>>>> a thing. This makes CUBE really a separate operation. But if we go > >> this > >>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE > >> rel BY > >>>>> GROUPING SETS. Let's not proliferate operators. > >>>>> > >>>>> Alan. > >>>>> > >>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote: > >>>>> > >>>>>> Thanks Jonathan for looking into it and for your suggestions. > >>>>>> > >>>>>> The reason why I came with a clause rather than a separate operator > >> was > >>>>> to avoid adding additional operators to the grammar. > >>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators > >>>>> adding to the complexity. I am planning to keep everything under cube > >>>>> operator, so only LOCube and POCube operators will be added > >> additionally. > >>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is > >> the > >>>>> same as FILTER so we do not need a separate HAVING clause. > >>>>>> > >>>>>> I will give a quick recap of cube related operations and multiple > >> syntax > >>>>> options for achieving the same. I am also adding partial cubing and > >> rollup > >>>>> in this discussion. > >>>>>> > >>>>>> 1) CUBE > >>>>>> > >>>>>> Current syntax: > >>>>>> alias = CUBE rel BY (a, b); > >>>>>> > >>>>>> Following group-by's will be computed: > >>>>>> (a, b) > >>>>>> (a) > >>>>>> (b) > >>>>>> () > >>>>>> > >>>>>> 2) Partial CUBE > >>>>>> > >>>>>> Proposed syntax: > >>>>>> alias = CUBE rel BY a, (b, c); > >>>>>> > >>>>>> Following group-by's will be computed: > >>>>>> (a, b, c) > >>>>>> (a, b) > >>>>>> (a, c) > >>>>>> (a) > >>>>>> > >>>>>> 3) ROLLUP > >>>>>> > >>>>>> Proposed syntax 1: > >>>>>> alias = CUBE rel BY ROLLUP(a, b); > >>>>>> > >>>>>> Proposed syntax 2: > >>>>>> alias = CUBE rel BY (a::b); > >>>>>> > >>>>>> Proposed syntax 3: > >>>>>> alias = ROLLUP rel BY (a, b); > >>>>>> > >>>>>> Following group-by's will be computed: > >>>>>> (a, b) > >>>>>> (a) > >>>>>> () > >>>>>> > >>>>>> 4) Partial ROLLUP > >>>>>> > >>>>>> Proposed syntax 1: > >>>>>> alias = CUBE rel BY a, ROLLUP(b, c); > >>>>>> > >>>>>> Proposed syntax 2: > >>>>>> alias = CUBE rel BY (a, b::c); > >>>>>> > >>>>>> Proposed syntax 3: > >>>>>> alias = ROLLUP rel BY a, (b, c); > >>>>>> > >>>>>> Following group-by's will be computed: > >>>>>> (a, b, c) > >>>>>> (a, b) > >>>>>> (a) > >>>>>> > >>>>>> 5) GROUPING SETS > >>>>>> > >>>>>> Proposed syntax 1: > >>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c)) > >>>>>> > >>>>>> Proposed syntax 2: > >>>>>> alias = CUBE rel BY {(a), (b, c), (c)} > >>>>>> > >>>>>> Proposed syntax 3: > >>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c)) > >>>>>> > >>>>>> Following group-by's will be computed: > >>>>>> (a) > >>>>>> (b, c) > >>>>>> (c) > >>>>>> > >>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus > >>>>> before I start hacking the grammar file. > >>>>>> > >>>>>> Thanks > >>>>>> -- Prasanth > >>>>>> > >>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote: > >>>>>> > >>>>>>> Hey Prashanth, happy hacking. > >>>>>>> > >>>>>>> My opinion: > >>>>>>> > >>>>>>> CUBE: > >>>>>>> > >>>>>>> alias = CUBE rel BY (a,b,c); > >>>>>>> > >>>>>>> > >>>>>>> I like that syntax. It's unambiguous what is going on. > >>>>>>> > >>>>>>> > >>>>>>> ROLLUP: > >>>>>>> > >>>>>>> > >>>>>>> alias = CUBE rel BY ROLLUP(a,b,c); > >>>>>>> > >>>>>>> > >>>>>>> I never liked that syntax in SQL. I suggest we just do what we did > >> with > >>>>> CUBE. IE > >>>>>>> > >>>>>>> > >>>>>>> alias = ROLLUP rel BY (a,b,c); > >>>>>>> > >>>>>>> > >>>>>>> GROUPING SETS: > >>>>>>> > >>>>>>> > >>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),()); > >>>>>>> > >>>>>>> > >>>>>>> I don't like this. The cube vs. grouping sets is confusing to me. > >> maybe > >>>>>>> following the > >>>>>>> same pattern you could do something like: > >>>>>>> > >>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),()); > >>>>>>> > >>>>>>> As far as having, is there an optimization that can be done with a > >>>>> HAVING > >>>>>>> clause that can't be done based on the logical plan that comes > >>>>> afterwards? > >>>>>>> That seems odd to me. Since you have to materialize the result > >> anyway, > >>>>>>> can't the having clause just be a FILTER that comes after the > cube? I > >>>>> don't > >>>>>>> know why we need a special syntax. > >>>>>>> > >>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain. > >>>>>>> Jon > >>>>>>> > >>>>>>> 2012/5/27 Prasanth J <[email protected]> > >>>>>>> > >>>>>>>> Hello everyone > >>>>>>>> > >>>>>>>> I am looking for feedback from the community about the syntax for > >>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig. > >>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone > >> can > >>>>>>>> share their opinion for operator syntax. Please have a look at the > >>>>> syntax > >>>>>>>> proposal at the link below and let me know your opinion > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>> > >> > https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644 > >>>>>>>> > >>>>>>>> Thanks > >>>>>>>> -- Prasanth > >>>>>>>> > >>>>>>>> > >>>>>> > >>>>> > >>>>> > >>> > >> > >> > >
