I think I'm missing something here. The result of the "out =" line is three bags, correct? If that's the case, the cross product you want is achieved by doing:
result = foreach out generate flatten($0), flatten($1), flatten($2);

This is not the same as CROSS, which would be expensive.

Alan.

On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:

> Hello all
>
> I initially implemented ROLLUP as a separate operation with the following
> syntax
>
> a = ROLLUP inp BY (x,y);
>
> which does the same thing as CUBE (inserting foreach + group-by in logical
> plan) except that it uses RollupDimensions UDF. But the issue with this
> approach is that we cannot mix CUBE and ROLLUP operations together in the
> same syntax which is a typical case. SQL/Oracle supports using CUBE and
> ROLLUP together like
>
> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> so I modified the pig grammar to support the similar usage. So now we can use
> a syntax similar to SQL
>
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> In this approach, the logical plan should introduce cartesian product between
> bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for generating the
> final output. But I read from the documentation
> (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator is
> an expensive operator and advises using it sparingly.
>
> Is there any other way to achieve the cartesian product in a less expensive
> way? Also, does anyone have thoughts about this new syntax?
>
> Thanks
> -- Prasanth
>
> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>
>> As far as the underlying implementation, if they all use the same
>> optimizations that you use in cube, then it can be LOCube. If they have
>> their own optimizations etc (or could), it may be worth them having their
>> own Logical operators (which might just be LOCube with flags for the time
>> being) that allows us more flexibility. But I suppose that's between you,
>> eclipse, and your GSOC mentor.
>>
>> 2012/5/30 Prasanth J <[email protected]>
>>
>>> Thanks Alan and Jon for expressing your views.
>>>
>>> I agree with Jon's point, if the syntax contains CUBE then user expects it
>>> to perform CUBE operation. So Jon's syntax seems more meaningful and concise
>>>
>>> rel = CUBE rel BY (dims);
>>> rel = ROLLUP rel BY (dims);
>>> rel = GROUPING_SET rel BY (dims);
>>>
>>> 2 reasons why I do not prefer using SQL syntax are
>>> 1) I do not want to break into existing Group operator implementation :)
>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>> For ex:
>>>
>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>> ROLLUP(dim7,dim8,dim9);
>>>
>>> whereas same thing can be expressed like
>>>
>>> rel = ROLLUP rel BY dim0,
>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>
>>> Thanks Alan for pointing out the way for independently managing the
>>> operators in parser and logical/physical plan. So for all these operators
>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>> differentiate between these three operations.
>>>
>>> But, yes we are proliferating operators in this case.
>>>
>>> Thanks
>>> -- Prasanth
>>>
>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>
>>>>
>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>
>>>>> I was going to say the same thing Alan said w.r.t. operators: operators in
>>>>> the grammar can correspond to whatever logical and physical operators you
>>>>> want.
>>>>>
>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>> already pretty astonishing. I don't know why we would bend over backwards
>>>>> to make the syntax so similar in this case when even getting to the point
>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>> different from SQL.
>>>>>
>>>>> On that note,
>>>>>
>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>
>>>>> seems really confusing. I'd much rather overload the group operator than
>>>>> the cube operator.
>>>>> If I see "cube," I expect a cube. If you start doing
>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>>> latin is simple enough that I don't think having a rollup, group_set, etc
>>>>> operator will be so confusing, because they're already going to be typing
>>>>> that stuff in the context of
>>>>>
>>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>>> more, confusing syntax for the sake of creating parallels with a language
>>>>> we now share very little with.
>>>>
>>>> Fair points.
>>>>
>>>>>
>>>>> But I won't beat it any further... if people prefer a different syntax,
>>>>> that's fine. Just excited to have the features in Pig!
>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>>> Jon's).
>>>>
>>>> Alan.
>>>>
>>>>> Jon
>>>>>
>>>>> 2012/5/30 Alan Gates <[email protected]>
>>>>>
>>>>>> Some thoughts on this:
>>>>>>
>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>>
>>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>>> logical plan versus physical plan. The choices you make in the grammar are
>>>>>> totally independent of the other two. That is, you could choose the syntax:
>>>>>>
>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>>
>>>>>> and still have a separate POCube operator. When the parser sees GROUP BY
>>>>>> CUBE it will generate an LOCube operator for the logical plan rather than
>>>>>> an LOGroup operator. You can still have a separate POCube physical
>>>>>> operator. Separate optimizations can be applied to LOGroup vs. LOCube and
>>>>>> POGroup vs. POCube.
>>>>>>
>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>>>>>> and for the syntax proposed by Prasanth. The argument for sticking close
>>>>>> to SQL is it conforms to the law of least astonishment.
>>>>>> It wouldn't be
>>>>>> exactly SQL, as it would end up looking like:
>>>>>>
>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>>
>>>>>> The argument I see for sticking with Prasanth's approach is that GROUP is
>>>>>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>>>>>> a thing. This makes CUBE really a separate operation. But if we go this
>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>>>>>> GROUPING SETS. Let's not proliferate operators.
>>>>>>
>>>>>> Alan.
>>>>>>
>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>>
>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>>
>>>>>>> The reason why I came with a clause rather than a separate operator was
>>>>>>> to avoid adding additional operators to the grammar.
>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>>>> operator, so only LOCube and POCube operators will be added additionally.
>>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is the
>>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>>
>>>>>>> I will give a quick recap of cube related operations and multiple syntax
>>>>>>> options for achieving the same. I am also adding partial cubing and rollup
>>>>>>> in this discussion.
>>>>>>>
>>>>>>> 1) CUBE
>>>>>>>
>>>>>>> Current syntax:
>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>>
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>> (b)
>>>>>>> ()
>>>>>>>
>>>>>>> 2) Partial CUBE
>>>>>>>
>>>>>>> Proposed syntax:
>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>>
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b, c)
>>>>>>> (a, b)
>>>>>>> (a, c)
>>>>>>> (a)
>>>>>>>
>>>>>>> 3) ROLLUP
>>>>>>>
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>>
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>>
>>>>>>> Proposed syntax 3:
>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>>
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>> ()
>>>>>>>
>>>>>>> 4) Partial ROLLUP
>>>>>>>
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>>
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>>
>>>>>>> Proposed syntax 3:
>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>>
>>>>>>> Following group-by's will be computed:
>>>>>>> (a, b, c)
>>>>>>> (a, b)
>>>>>>> (a)
>>>>>>>
>>>>>>> 5) GROUPING SETS
>>>>>>>
>>>>>>> Proposed syntax 1:
>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>>>
>>>>>>> Proposed syntax 2:
>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>>>
>>>>>>> Proposed syntax 3:
>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>>>
>>>>>>> Following group-by's will be computed:
>>>>>>> (a)
>>>>>>> (b, c)
>>>>>>> (c)
>>>>>>>
>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>>> before I start hacking the grammar file.
>>>>>>>
>>>>>>> Thanks
>>>>>>> -- Prasanth
>>>>>>>
>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>>
>>>>>>>> Hey Prasanth, happy hacking.
>>>>>>>>
>>>>>>>> My opinion:
>>>>>>>>
>>>>>>>> CUBE:
>>>>>>>>
>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>>
>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>>
>>>>>>>> ROLLUP:
>>>>>>>>
>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>>
>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did with
>>>>>>>> CUBE. IE
>>>>>>>>
>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>>
>>>>>>>> GROUPING SETS:
>>>>>>>>
>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>>
>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me. maybe
>>>>>>>> following the same pattern you could do something like:
>>>>>>>>
>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>>
>>>>>>>> As far as having, is there an optimization that can be done with a HAVING
>>>>>>>> clause that can't be done based on the logical plan that comes afterwards?
>>>>>>>> That seems odd to me. Since you have to materialize the result anyway,
>>>>>>>> can't the having clause just be a FILTER that comes after the cube? I don't
>>>>>>>> know why we need a special syntax.
>>>>>>>>
>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> 2012/5/27 Prasanth J <[email protected]>
>>>>>>>>
>>>>>>>>> Hello everyone
>>>>>>>>>
>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone can
>>>>>>>>> share their opinion for operator syntax. Please have a look at the syntax
>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -- Prasanth
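
For anyone reading the thread later, here is a minimal Python sketch of the grouping sets discussed above. It is not Pig code and not how Pig implements CUBE or ROLLUP; the function names cube, rollup and combine are made up for illustration. It only shows which group-bys each operation produces, and that pairing every entry of one bag with every entry of the next (which is what the flatten($0), flatten($1), flatten($2) line does within each record) yields the combined grouping sets without a relation-level CROSS.

```python
from itertools import combinations, product


def cube(dims):
    """Return all 2^n subsets of dims, largest first, keeping dimension order."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def rollup(dims):
    """Return the n+1 prefixes of dims, longest first."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def combine(*bags):
    """Pair every entry of each bag with every entry of the others.

    Mimics flattening several bags side by side: the result is the product
    of the bags, with each combination concatenated into one tuple.
    Repeated dimensions (such as c below) are NOT de-duplicated here.
    """
    return [sum(parts, ()) for parts in product(*bags)]


# Group-bys from the recap in the thread.
print(cube(("a", "b")))                          # CUBE rel BY (a, b)
print(rollup(("a", "b")))                        # ROLLUP over (a, b)
print(combine([("a",)], cube(("b", "c"))))       # partial CUBE: a, (b, c)
print(combine([("a",)], rollup(("b", "c"))))     # partial ROLLUP: a, (b, c)

# CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f) combined.
combined = combine(cube(("a", "b", "c")), rollup(("c", "d")), cube(("e", "f")))
print(len(combined))                             # 8 * 3 * 4 combinations
```

The number of combined grouping sets is the product of the bag sizes, so the cost grows with the number of dimensions listed, not with the size of the input relation.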
