IMHO, I don't know that it's worth overloading the operator to support that. If that's what it is, then they can just do what I said. I would rather the focus be on getting ROLLUP, CUBE, etc implemented and efficient.
Curious what others think. 2012/6/21 Prasanth J <[email protected]> > Yeah you are right. > > Thanks > -- Prasanth > > On Jun 21, 2012, at 4:41 PM, Jonathan Coveney wrote: > > > Just to make sure I understand this correctly, is > > > > out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f); > > > > equivalent to: > > > > out1 = CUBE rel BY (a,b,c); > > out2 = ROLLUP rel BY (c,d); > > out3 = CUBY rel BY (e,f); > > > > out = CROSS out1, out2, out3; > > > > ? > > > > 2012/6/21 Prasanth J <[email protected]> > > > >> Hello all > >> > >> I initially implemented ROLLUP as a separate operation with the > following > >> syntax > >> > >> a = ROLLUP inp BY (x,y); > >> > >> which does the same thing as CUBE (inserting foreach + group-by in > logical > >> plan) except that it uses RollupDimensions UDF. But the issue with this > >> approach is that we cannot mix CUBE and ROLLUP operations together in > the > >> same syntax which is a typical case. SQL/Oracle supports using CUBE and > >> ROLLUP together like > >> > >> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f); > >> > >> so I modified the pig grammar to support the similar usage. So now we > can > >> use a syntax similar to SQL > >> > >> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f); > >> > >> In this approach, the logical plan should introduce cartesian product > >> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for > >> generating the final output. But I read from the documentation ( > >> http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS > operator > >> is an expensive operator and advices to use it sparingly. > >> > >> Is there any other way to achieve the cartesian product in a less > >> expensive way? Also, does anyone have thoughts about this new syntax? > >> > >> Thanks > >> -- Prasanth > >> > >> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote: > >> > >>> As far as the underlying implementation, if they all use the same > >>> optimizations that you use in cube, then it can be LOCube. If they have > >>> their own optimizations etc (or could), it may be worth them having > their > >>> own Logical operators (which might just be LOCube with flags for the > time > >>> being) that allows us more flexibilty. But I suppose that's between > you, > >>> eclipse, and your GSOC mentor. > >>> > >>> 2012/5/30 Prasanth J <[email protected]> > >>> > >>>> Thanks Alan and Jon for expressing your views. > >>>> > >>>> I agree with Jon's point, if the syntax contains CUBE then user > expects > >> it > >>>> to perform CUBE operation. So Jon's syntax seems more meaningful and > >> concise > >>>> > >>>> rel = CUBE rel BY (dims); > >>>> rel = ROLLUP rel BY (dims); > >>>> rel = GROUPING_SET rel BY (dims); > >>>> > >>>> 2 reasons why I do not prefer using SQL syntax is > >>>> 1) I do not want to break into existing Group operator implementation > :) > >>>> 2) The syntax gets longer in case of partial hierarchical > cubing/rollups > >>>> For ex: > >>>> > >>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), > >> ROLLUP(dim4,dim5,dim6), > >>>> ROLLUP(dim7,dim8,dim9); > >>>> > >>>> whereas same thing can be expressed like > >>>> > >>>> rel = ROLLUP rel BY dim0, > >>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9); > >>>> > >>>> Thanks Alan for pointing out the way for independently managing the > >>>> operators in parser and logical/physical plan. So for all these > >> operators > >>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags > to > >>>> differentiate between these three operations. > >>>> > >>>> But, yes we are proliferating operators in this case. > >>>> > >>>> Thanks > >>>> -- Prasanth > >>>> > >>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote: > >>>> > >>>>> > >>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote: > >>>>> > >>>>>> I was going to say the same thing Alan said w.r.t. operators: > >> operators > >>>> in > >>>>>> the grammar can correspond to whatever logical and physical > operators > >>>> you > >>>>>> want. > >>>>>> > >>>>>> As far as the principle of least astonishment compared to SQL... Pig > >> is > >>>>>> already pretty astonishing. I don't know why we would bend over > >>>> backwards > >>>>>> to make the syntax so similar in this case when even getting to the > >>>> point > >>>>>> of doing a CUBE means understanding an object model that is pretty > >>>>>> different from SQL. > >>>>>> > >>>>>> On that note, > >>>>>> > >>>>>> rel = CUBE rel BY GROUPING SETS(cols); > >>>>>> > >>>>>> seems really confusing. I'd much rather overload the group operating > >>>> than > >>>>>> the cube operator. If I see "cube," I expect a cube. If you start > >> doing > >>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. > >> Pig > >>>>>> latin is simple enough that I don't think having a rollup, > group_set, > >>>> etc > >>>>>> operator will be so confusing, because they're already going to be > >>>> typing > >>>>>> that stuff in the conext of > >>>>>> > >>>>>> group rel by rollup(cols); and so on. I don't see how it's worth > >> adding > >>>>>> more, confusing syntax for the sake of creating parallels with a > >>>> language > >>>>>> we now share very little with. > >>>>> > >>>>> Fair points. > >>>>> > >>>>>> > >>>>>> But I won't beat it any further... if people prefer a different > >> syntax, > >>>>>> that's fine. Just excited to have the features in Pig! > >>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, > >> and > >>>> Jon's). > >>>>> > >>>>> Alan. > >>>>> > >>>>>> Jon > >>>>>> > >>>>>> 2012/5/30 Alan Gates <[email protected]> > >>>>>> > >>>>>>> Some thoughts on this: > >>>>>>> > >>>>>>> 1) +1 to what Dmitriy said on HAVING > >>>>>>> > >>>>>>> 2) We need to be clear about separating operators in the grammar > >> versus > >>>>>>> logical plan versus physical plan. The choices you make in the > >>>> grammar are > >>>>>>> totally independent of the other two. That is, you could choose > the > >>>> syntax: > >>>>>>> > >>>>>>> rel = GROUP rel BY CUBE (a, b, c) > >>>>>>> > >>>>>>> and still have a separate POCube operator. When the parser sees > >> GROUP > >>>> BY > >>>>>>> CUBE it will generate an LOCube operator for the logical plan > rather > >>>> than > >>>>>>> an LOGroup operator. You can still have a separate POCube physical > >>>>>>> operator. Separate optimizations can be applied to LOGroup vs. > >> LOCube > >>>> and > >>>>>>> POGroup vs. POCube. > >>>>>>> > >>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as > >>>> possible > >>>>>>> and for the syntax proposed by Prasanth. The argument for sticking > >>>> close > >>>>>>> to SQL is it conforms to the law of least astonishment. It > wouldn't > >> be > >>>>>>> exactly SQL, as it would end up looking like: > >>>>>>> > >>>>>>> rel = GROUP rel BY CUBE (cols) > >>>>>>> rel = GROUP rel BY ROLLUP (cols) > >>>>>>> rel = GROUP rel BY GROUPING SETS(cols); > >>>>>>> > >>>>>>> The argument I see for sticking with Prasanth's approach is that > >> GROUP > >>>> is > >>>>>>> really short for COGROUP in Pig Latin, and I don't think we're > >>>> proposing > >>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to > >> do > >>>> such > >>>>>>> a thing. This makes CUBE really a separate operation. But if we > go > >>>> this > >>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and > CUBE > >>>> rel BY > >>>>>>> GROUPING SETS. Let's not proliferate operators. > >>>>>>> > >>>>>>> Alan. > >>>>>>> > >>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote: > >>>>>>> > >>>>>>>> Thanks Jonathan for looking into it and for your suggestions. > >>>>>>>> > >>>>>>>> The reason why I came with a clause rather than a separate > operator > >>>> was > >>>>>>> to avoid adding additional operators to the grammar. > >>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical > operators > >>>>>>> adding to the complexity. I am planning to keep everything under > cube > >>>>>>> operator, so only LOCube and POCube operators will be added > >>>> additionally. > >>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause > is > >>>> the > >>>>>>> same as FILTER so we do not need a separate HAVING clause. > >>>>>>>> > >>>>>>>> I will give a quick recap of cube related operations and multiple > >>>> syntax > >>>>>>> options for achieving the same. I am also adding partial cubing and > >>>> rollup > >>>>>>> in this discussion. > >>>>>>>> > >>>>>>>> 1) CUBE > >>>>>>>> > >>>>>>>> Current syntax: > >>>>>>>> alias = CUBE rel BY (a, b); > >>>>>>>> > >>>>>>>> Following group-by's will be computed: > >>>>>>>> (a, b) > >>>>>>>> (a) > >>>>>>>> (b) > >>>>>>>> () > >>>>>>>> > >>>>>>>> 2) Partial CUBE > >>>>>>>> > >>>>>>>> Proposed syntax: > >>>>>>>> alias = CUBE rel BY a, (b, c); > >>>>>>>> > >>>>>>>> Following group-by's will be computed: > >>>>>>>> (a, b, c) > >>>>>>>> (a, b) > >>>>>>>> (a, c) > >>>>>>>> (a) > >>>>>>>> > >>>>>>>> 3) ROLLUP > >>>>>>>> > >>>>>>>> Proposed syntax 1: > >>>>>>>> alias = CUBE rel BY ROLLUP(a, b); > >>>>>>>> > >>>>>>>> Proposed syntax 2: > >>>>>>>> alias = CUBE rel BY (a::b); > >>>>>>>> > >>>>>>>> Proposed syntax 3: > >>>>>>>> alias = ROLLUP rel BY (a, b); > >>>>>>>> > >>>>>>>> Following group-by's will be computed: > >>>>>>>> (a, b) > >>>>>>>> (a) > >>>>>>>> () > >>>>>>>> > >>>>>>>> 4) Partial ROLLUP > >>>>>>>> > >>>>>>>> Proposed syntax 1: > >>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c); > >>>>>>>> > >>>>>>>> Proposed syntax 2: > >>>>>>>> alias = CUBE rel BY (a, b::c); > >>>>>>>> > >>>>>>>> Proposed syntax 3: > >>>>>>>> alias = ROLLUP rel BY a, (b, c); > >>>>>>>> > >>>>>>>> Following group-by's will be computed: > >>>>>>>> (a, b, c) > >>>>>>>> (a, b) > >>>>>>>> (a) > >>>>>>>> > >>>>>>>> 5) GROUPING SETS > >>>>>>>> > >>>>>>>> Proposed syntax 1: > >>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c)) > >>>>>>>> > >>>>>>>> Proposed syntax 2: > >>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)} > >>>>>>>> > >>>>>>>> Proposed syntax 3: > >>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c)) > >>>>>>>> > >>>>>>>> Following group-by's will be computed: > >>>>>>>> (a) > >>>>>>>> (b, c) > >>>>>>>> (c) > >>>>>>>> > >>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a > consensus > >>>>>>> before I start hacking the grammar file. > >>>>>>>> > >>>>>>>> Thanks > >>>>>>>> -- Prasanth > >>>>>>>> > >>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote: > >>>>>>>> > >>>>>>>>> Hey Prashanth, happy hacking. > >>>>>>>>> > >>>>>>>>> My opinion: > >>>>>>>>> > >>>>>>>>> CUBE: > >>>>>>>>> > >>>>>>>>> alias = CUBE rel BY (a,b,c); > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I like that syntax. It's unambiguous what is going on. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> ROLLUP: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c); > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we > did > >>>> with > >>>>>>> CUBE. IE > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> alias = ROLLUP rel BY (a,b,c); > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> GROUPING SETS: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),()); > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me. > >>>> maybe > >>>>>>>>> following the > >>>>>>>>> same pattern you could do something like: > >>>>>>>>> > >>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),()); > >>>>>>>>> > >>>>>>>>> As far as having, is there an optimization that can be done with > a > >>>>>>> HAVING > >>>>>>>>> clause that can't be done based on the logical plan that comes > >>>>>>> afterwards? > >>>>>>>>> That seems odd to me. Since you have to materialize the result > >>>> anyway, > >>>>>>>>> can't the having clause just be a FILTER that comes after the > >> cube? I > >>>>>>> don't > >>>>>>>>> know why we need a special syntax. > >>>>>>>>> > >>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain. > >>>>>>>>> Jon > >>>>>>>>> > >>>>>>>>> 2012/5/27 Prasanth J <[email protected]> > >>>>>>>>> > >>>>>>>>>> Hello everyone > >>>>>>>>>> > >>>>>>>>>> I am looking for feedback from the community about the syntax > for > >>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig. > >>>>>>>>>> I am moving the discussion from JIRA to dev-list so that > everyone > >>>> can > >>>>>>>>>> share their opinion for operator syntax. Please have a look at > the > >>>>>>> syntax > >>>>>>>>>> proposal at the link below and let me know your opinion > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>> > >>>> > >> > https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644 > >>>>>>>>>> > >>>>>>>>>> Thanks > >>>>>>>>>> -- Prasanth > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>>> > >>>> > >> > >> > >
