[ 
https://issues.apache.org/jira/browse/PIG-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277644#comment-13277644
 ] 

Prasanth J commented on PIG-2167:
---------------------------------

Hello everyone

I will be working with Dmitriy on this issue for GSoC 2012, which starts next 
week (May 21). Before getting started, I would like to hear from the community 
about the syntax for the CUBE and related operators. As part of this project I 
also plan to implement hierarchy-based grouping, partial grouping, and 
monotonicity-based pruning using a HAVING clause. 

Below is the list of tasks that require modifications to the grammar file. For 
each task I have included the SQL/Oracle syntax, the proposed syntax for Pig, 
and the corresponding output groups. I would like feedback on each proposed 
syntax so that we can finalize all of them.

1) *CUBE* operation (all combinations of grouping)
_SQL/Oracle syntax:_
{noformat}GROUP BY CUBE (a,b,c);{noformat}

_Current syntax in Pig:_
{noformat}alias = CUBE rel BY (a,b,c);{noformat}

_Output groups:_
{(a,b,c), (a,b), (b,c), (a,c), (a), (b), (c), ()}
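
To make the output groups above concrete, here is a minimal Python sketch (not Pig code, and not the planned implementation) that enumerates them. The function name cube_groupings is hypothetical and used only for illustration; it assumes CUBE over n dimensions yields every subset of those dimensions, that is 2^n groupings.

```python
from itertools import combinations

def cube_groupings(dims):
    """Return every subset of dims as a tuple, largest subsets first."""
    dims = tuple(dims)
    return [combo
            for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]

# Hypothetical usage: three dimensions give 2**3 = 8 groupings.
groups = cube_groupings(("a", "b", "c"))
for g in groups:
    print(g)
```

The empty tuple stands for the grand total, that is the () group in the list above.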

2) *ROLLUP* operation (hierarchy based grouping)
_SQL/Oracle syntax:_
{noformat}GROUP BY ROLLUP (a,b,c);{noformat}

_Proposed syntax for Pig:_
{noformat}alias = CUBE rel BY ROLLUP(a,b,c);{noformat}

_Output groups:_
{(a,b,c), (a,b), (a), ()}
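
The ROLLUP groups above are the prefixes of the dimension list, from the full list down to the empty one. A minimal Python sketch of that rule follows; rollup_groupings is a hypothetical name used only for illustration, not part of Pig.

```python
def rollup_groupings(dims):
    """Return the prefixes of dims, from the full tuple down to the empty one."""
    dims = tuple(dims)
    return [dims[:n] for n in range(len(dims), -1, -1)]

# Hypothetical usage: n dimensions give n + 1 groupings.
levels = rollup_groupings(("a", "b", "c"))
for level in levels:
    print(level)
```

Unlike CUBE, the order of the dimensions matters here: ROLLUP(a,b,c) and ROLLUP(c,b,a) produce different groups.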

3) *GROUPING SETS* operation (partial grouping)
_SQL/Oracle syntax:_
{noformat}GROUP BY GROUPING SETS ((a,b),(b),());{noformat}

_Proposed syntax for Pig:_
{noformat}alias = CUBE rel BY GROUPING SETS((a,b),(b),());{noformat}

_Output groups:_
{(a,b), (b), ()}
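
To show what aggregating over explicit grouping sets could produce, here is a naive Python sketch under a few assumptions: the measure is a simple SUM, the rows are small in-memory dictionaries, and None marks a dimension that is rolled up to "all" (mirroring the NULL convention mentioned in the issue description). The function aggregate_grouping_sets and the sample data are hypothetical.

```python
from collections import defaultdict

def aggregate_grouping_sets(rows, dims, grouping_sets, measure):
    """Sum `measure` once per grouping set; None marks a rolled-up dimension."""
    totals = defaultdict(int)
    for row in rows:
        for gset in grouping_sets:
            # Keep the value for grouped dimensions, replace the rest with None.
            key = tuple(row[d] if d in gset else None for d in dims)
            totals[key] += row[measure]
    return dict(totals)

# Hypothetical sample data with dimensions a, b and measure d.
rows = [
    {"a": 1, "b": "x", "d": 10},
    {"a": 1, "b": "y", "d": 20},
    {"a": 2, "b": "x", "d": 5},
]
result = aggregate_grouping_sets(rows, ("a", "b"), [("a", "b"), ("b",), ()], "d")
for key, total in result.items():
    print(key, total)
```

Note that this encoding is ambiguous if the data itself contains None in a dimension; a real implementation would need a way to tell the two cases apart.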

4) *HAVING* clause (for pruning the groups that do not satisfy the specified 
condition)
_SQL/Oracle syntax:_
{noformat}GROUP BY CUBE (x,y)  
HAVING SUM(z)>100{noformat}

_Proposed syntax for Pig:_
{noformat}alias = CUBE rel BY (a,b,c) HAVING SUM(d)>100;{noformat}

_Output groups:_
Only the groups that satisfy the specified condition appear in the output.
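
As an illustration of the HAVING semantics, the following Python sketch computes SUM for every cube group and then drops the groups that fail the condition. It is a naive post-filter, not the monotonicity-based pruning itself; cube_sum_having, the sample rows, and the threshold are all hypothetical, and None again marks a rolled-up dimension.

```python
from collections import defaultdict
from itertools import combinations

def cube_sum_having(rows, dims, measure, threshold):
    """Compute SUM(measure) for every cube group, keep groups with sum > threshold."""
    dims = tuple(dims)
    groupings = [c for n in range(len(dims), -1, -1)
                 for c in combinations(dims, n)]
    totals = defaultdict(int)
    for row in rows:
        for g in groupings:
            key = tuple(row[d] if d in g else None for d in dims)
            totals[key] += row[measure]
    # HAVING step: filter on the aggregated value.
    return {k: v for k, v in totals.items() if v > threshold}

# Hypothetical sample data with dimensions a, b and measure d.
rows = [
    {"a": 1, "b": "x", "d": 80},
    {"a": 1, "b": "y", "d": 30},
    {"a": 2, "b": "x", "d": 40},
]
kept = cube_sum_having(rows, ("a", "b"), "d", 100)
for key, total in kept.items():
    print(key, total)
```

When the measure is non-negative, SUM can only grow as a group gets coarser, so a coarse group that fails SUM(d)>100 implies that all of its finer sub-groups fail too. That property is what would allow pruning before the finer groups are computed.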

The primary reasons for having a separate CUBE operator are:
* Ease of implementation, without modifying the existing GROUP BY operator 
implementation
* Ability to have a separate physical operator (POCube)
* Cube-specific optimizations can be applied to the physical operator
* Some operations applicable to the GROUP operator do not apply to CUBE
** Constant expression evaluation
** Duplicate column projection

Please let me know your feedback/suggestions for the operator syntax. 

*References:*
* Oracle documentation for CUBE 
http://docs.oracle.com/cd/B19306_01/server.102/b14223/aggreg.htm#i1012749
* SQL server documentation for CUBE 
http://msdn.microsoft.com/en-us/library/ms177673.aspx
* Simple CUBE/ROLLUP operations in Oracle and SQL 
http://sqlfiddle.com/#!3/3bba2/11
http://sqlfiddle.com/#!4/32c81/26

                
> CUBE operation in Pig
> ---------------------
>
>                 Key: PIG-2167
>                 URL: https://issues.apache.org/jira/browse/PIG-2167
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>              Labels: gsoc2012, mentor
>         Attachments: PIG-2167.1.patch, PIG-2167.2.patch, PIG-2167.3.patch, 
> PIG-2167.4.patch, Pig-Cubing-Performance.png
>
>
> Computing aggregates over a cube of several dimensions is a common operation 
> in data warehousing.
> The standard SQL syntax is "GROUP relation BY dim1, dim2, dim3 WITH CUBE" -- 
> which in addition to all dim1-2-3, produces aggregations for just dim1, just 
> dim1 and dim2, etc. NULL is generally used to represent "all".
> A presentation by Arnab Nandi describes how one might implement efficient 
> cubing in Map-Reduce here: http://pdf.cx/44wrk
> We can start with the naive solution which only works for algebraic measures, 
> and work up from there.
> This is a candidate project for Google summer of code 2012. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012
