One happens on the mapper.
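
For anyone skimming the thread, here is a rough plain-Python sketch (not Pig code) of what the flatten suggestion quoted below amounts to. The function names, the sample record, and the use of None to mark an aggregated dimension are all assumptions made for illustration. They are not taken from the CubeDimensions/RollupDimensions UDFs. The point is that flattening several bags in one foreach only combines bags that already sit in the same record, whereas CROSS pairs every record of one relation with every record of another.

```python
from itertools import product


def cube_dimensions(values):
    """All 2^n combinations of a tuple; None marks an aggregated dimension."""
    combos = []
    for mask in product((True, False), repeat=len(values)):
        combos.append(tuple(v if keep else None for v, keep in zip(values, mask)))
    return combos


def rollup_dimensions(values):
    """The n+1 hierarchical prefixes of a tuple, padded with None."""
    n = len(values)
    return [tuple(values[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]


def flatten_bags(*bags):
    """Cross product of bags that belong to one record (hypothetical helper)."""
    return [sum(parts, ()) for parts in product(*bags)]


# One input record with hypothetical fields a, b, c, d.
record = {"a": 1, "b": 2, "c": 3, "d": 4}

cube_bag = cube_dimensions((record["a"], record["b"]))        # like CUBE(a,b)
rollup_bag = rollup_dimensions((record["c"], record["d"]))    # like ROLLUP(c,d)

# Every cube combination paired with every rollup combination,
# computed from this single record only.
rows = flatten_bags(cube_bag, rollup_bag)
for row in rows:
    print(row)
```

With two cubed and two rolled-up dimensions, each input record expands to 4 x 3 = 12 dimension tuples, and no other record is needed to compute them.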
On Thu, Jun 21, 2012 at 2:52 PM, Prasanth J <[email protected]> wrote:
> Thanks Alan.
> Your suggestion looks correct.
>
> I think with this I can achieve what I wanted in the same syntax
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>
> Just curious to know.
> How is this different from CROSS? and why is CROSS expensive when compared to
> flatten?
>
> Thanks
> -- Prasanth
>
> On Jun 21, 2012, at 5:11 PM, Alan Gates wrote:
>
>> I think I'm missing something here. The result of the "out =" line is three
>> bags, correct? If that's the case, the cross product you want is achieved
>> by doing:
>>
>> result = foreach out generate flatten($0), flatten($1), flatten($2)
>>
>> This is not the same as CROSS, which would be expensive.
>>
>> Alan.
>>
>> On Jun 21, 2012, at 1:28 PM, Prasanth J wrote:
>>
>>> Hello all
>>>
>>> I initially implemented ROLLUP as a separate operation with the following
>>> syntax
>>>
>>> a = ROLLUP inp BY (x,y);
>>>
>>> which does the same thing as CUBE (inserting foreach + group-by in logical
>>> plan) except that it uses RollupDimensions UDF. But the issue with this
>>> approach is that we cannot mix CUBE and ROLLUP operations together in the
>>> same syntax which is a typical case. SQL/Oracle supports using CUBE and
>>> ROLLUP together like
>>>
>>> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>>
>>> so I modified the pig grammar to support the similar usage. So now we can
>>> use a syntax similar to SQL
>>>
>>> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>>>
>>> In this approach, the logical plan should introduce cartesian product
>>> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for
>>> generating the final output. But I read from the documentation
>>> (http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator
>>> is an expensive operator and advises using it sparingly.
>>>
>>> Is there any other way to achieve the cartesian product in a less expensive
>>> way?
>>> Also, does anyone have thoughts about this new syntax?
>>>
>>> Thanks
>>> -- Prasanth
>>>
>>> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>>>
>>>> As far as the underlying implementation, if they all use the same
>>>> optimizations that you use in cube, then it can be LOCube. If they have
>>>> their own optimizations etc (or could), it may be worth them having their
>>>> own Logical operators (which might just be LOCube with flags for the time
>>>> being) that allows us more flexibility. But I suppose that's between you,
>>>> eclipse, and your GSOC mentor.
>>>>
>>>> 2012/5/30 Prasanth J <[email protected]>
>>>>
>>>>> Thanks Alan and Jon for expressing your views.
>>>>>
>>>>> I agree with Jon's point, if the syntax contains CUBE then user expects it
>>>>> to perform CUBE operation. So Jon's syntax seems more meaningful and
>>>>> concise
>>>>>
>>>>> rel = CUBE rel BY (dims);
>>>>> rel = ROLLUP rel BY (dims);
>>>>> rel = GROUPING_SET rel BY (dims);
>>>>>
>>>>> 2 reasons why I do not prefer using SQL syntax are
>>>>> 1) I do not want to break into existing Group operator implementation :)
>>>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>>>> For ex:
>>>>>
>>>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3), ROLLUP(dim4,dim5,dim6),
>>>>> ROLLUP(dim7,dim8,dim9);
>>>>>
>>>>> whereas same thing can be expressed like
>>>>>
>>>>> rel = ROLLUP rel BY dim0,
>>>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>>>
>>>>> Thanks Alan for pointing out the way for independently managing the
>>>>> operators in parser and logical/physical plan. So for all these operators
>>>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>>>> differentiate between these three operations.
>>>>>
>>>>> But, yes we are proliferating operators in this case.
>>>>>
>>>>> Thanks
>>>>> -- Prasanth
>>>>>
>>>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>>>
>>>>>>
>>>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>>>
>>>>>>> I was going to say the same thing Alan said w.r.t. operators: operators in
>>>>>>> the grammar can correspond to whatever logical and physical operators you
>>>>>>> want.
>>>>>>>
>>>>>>> As far as the principle of least astonishment compared to SQL... Pig is
>>>>>>> already pretty astonishing. I don't know why we would bend over backwards
>>>>>>> to make the syntax so similar in this case when even getting to the point
>>>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>>>> different from SQL.
>>>>>>>
>>>>>>> On that note,
>>>>>>>
>>>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>>>
>>>>>>> seems really confusing. I'd much rather overload the group operator than
>>>>>>> the cube operator. If I see "cube," I expect a cube. If you start doing
>>>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
>>>>>>> latin is simple enough that I don't think having a rollup, group_set, etc
>>>>>>> operator will be so confusing, because they're already going to be typing
>>>>>>> that stuff in the context of
>>>>>>>
>>>>>>> group rel by rollup(cols); and so on. I don't see how it's worth adding
>>>>>>> more, confusing syntax for the sake of creating parallels with a language
>>>>>>> we now share very little with.
>>>>>>
>>>>>> Fair points.
>>>>>>
>>>>>>>
>>>>>>> But I won't beat it any further... if people prefer a different syntax,
>>>>>>> that's fine. Just excited to have the features in Pig!
>>>>>> +1, I can live with any of the 3 syntax choices (near SQL, original, and
>>>>>> Jon's).
>>>>>>
>>>>>> Alan.
>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> 2012/5/30 Alan Gates <[email protected]>
>>>>>>>
>>>>>>>> Some thoughts on this:
>>>>>>>>
>>>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>>>>
>>>>>>>> 2) We need to be clear about separating operators in the grammar versus
>>>>>>>> logical plan versus physical plan. The choices you make in the grammar are
>>>>>>>> totally independent of the other two. That is, you could choose the syntax:
>>>>>>>>
>>>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>>>>
>>>>>>>> and still have a separate POCube operator. When the parser sees GROUP BY
>>>>>>>> CUBE it will generate an LOCube operator for the logical plan rather than
>>>>>>>> an LOGroup operator. You can still have a separate POCube physical
>>>>>>>> operator. Separate optimizations can be applied to LOGroup vs. LOCube and
>>>>>>>> POGroup vs. POCube.
>>>>>>>>
>>>>>>>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>>>>>>>> and for the syntax proposed by Prasanth. The argument for sticking close
>>>>>>>> to SQL is it conforms to the law of least astonishment. It wouldn't be
>>>>>>>> exactly SQL, as it would end up looking like:
>>>>>>>>
>>>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>>>>
>>>>>>>> The argument I see for sticking with Prasanth's approach is that GROUP is
>>>>>>>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>>>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>>>>>>>> a thing. This makes CUBE really a separate operation. But if we go this
>>>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>>>>>>>> GROUPING SETS. Let's not proliferate operators.
>>>>>>>>
>>>>>>>> Alan.
>>>>>>>>
>>>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>>>>
>>>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>>>>
>>>>>>>>> The reason why I came up with a clause rather than a separate operator was
>>>>>>>>> to avoid adding additional operators to the grammar.
>>>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>>>>>>>>> adding to the complexity. I am planning to keep everything under cube
>>>>>>>>> operator, so only LOCube and POCube operators will be added additionally.
>>>>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause is the
>>>>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>>>>
>>>>>>>>> I will give a quick recap of cube related operations and multiple syntax
>>>>>>>>> options for achieving the same. I am also adding partial cubing and rollup
>>>>>>>>> in this discussion.
>>>>>>>>>
>>>>>>>>> 1) CUBE
>>>>>>>>>
>>>>>>>>> Current syntax:
>>>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>> (b)
>>>>>>>>> ()
>>>>>>>>>
>>>>>>>>> 2) Partial CUBE
>>>>>>>>>
>>>>>>>>> Proposed syntax:
>>>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b, c)
>>>>>>>>> (a, b)
>>>>>>>>> (a, c)
>>>>>>>>> (a)
>>>>>>>>>
>>>>>>>>> 3) ROLLUP
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>> ()
>>>>>>>>>
>>>>>>>>> 4) Partial ROLLUP
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a, b, c)
>>>>>>>>> (a, b)
>>>>>>>>> (a)
>>>>>>>>>
>>>>>>>>> 5) GROUPING SETS
>>>>>>>>>
>>>>>>>>> Proposed syntax 1:
>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>>>>>
>>>>>>>>> Proposed syntax 2:
>>>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>>>>>
>>>>>>>>> Proposed syntax 3:
>>>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>>>>>
>>>>>>>>> Following group-by's will be computed:
>>>>>>>>> (a)
>>>>>>>>> (b, c)
>>>>>>>>> (c)
>>>>>>>>>
>>>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>>>>>>>>> before I start hacking the grammar file.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -- Prasanth
>>>>>>>>>
>>>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>>>>
>>>>>>>>>> Hey Prasanth, happy hacking.
>>>>>>>>>>
>>>>>>>>>> My opinion:
>>>>>>>>>>
>>>>>>>>>> CUBE:
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>>>>
>>>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>>>>
>>>>>>>>>> ROLLUP:
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>>>>
>>>>>>>>>> I never liked that syntax in SQL. I suggest we just do what we did with
>>>>>>>>>> CUBE. IE
>>>>>>>>>>
>>>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>>>>
>>>>>>>>>> GROUPING SETS:
>>>>>>>>>>
>>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>>>>
>>>>>>>>>> I don't like this. The cube vs. grouping sets is confusing to me. Maybe
>>>>>>>>>> following the same pattern you could do something like:
>>>>>>>>>>
>>>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>>>>
>>>>>>>>>> As far as having, is there an optimization that can be done with a HAVING
>>>>>>>>>> clause that can't be done based on the logical plan that comes afterwards?
>>>>>>>>>> That seems odd to me. Since you have to materialize the result anyway,
>>>>>>>>>> can't the having clause just be a FILTER that comes after the cube? I don't
>>>>>>>>>> know why we need a special syntax.
>>>>>>>>>>
>>>>>>>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> 2012/5/27 Prasanth J <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> Hello everyone
>>>>>>>>>>>
>>>>>>>>>>> I am looking for feedback from the community about the syntax for
>>>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>>>> I am moving the discussion from JIRA to dev-list so that everyone can
>>>>>>>>>>> share their opinion for operator syntax. Please have a look at the syntax
>>>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>>>>
>>>>>>>>>>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> -- Prasanth
