GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/9419

    [SPARK-11275][SQL][WIP] Rollup and Cube Generates the Incorrect Results 
when Aggregation Functions Use Group By Columns

    In the current implementation, Rollup and Cube are unable to generate the 
correct results for the following cases:
    
    When the aggregation functions use the group by key columns:
        sql("select b, a, sum(a), min(a), min(b+b) from mytable group by a, b with rollup").collect()
        sql("select a, b, sum(a), min(a), min(b+b) from mytable group by b, a with cube").collect()
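    As a point of comparison, here is a minimal pure-Python sketch of the result
    the rollup query above is expected to return (output columns ordered a, b).
    The contents of mytable are not given in this description, so ROWS is
    hypothetical sample data, and rollup_reference is an illustrative helper,
    not part of Spark.

```python
from collections import defaultdict

# Hypothetical contents of mytable as (a, b) pairs; the PR does not list the data.
ROWS = [(1, 10), (1, 20), (2, 30)]

def rollup_reference(rows):
    """Reference rows (a, b, sum(a), min(a), min(b+b)) for group by a, b with rollup.

    Every aggregate reads the original a and b values; only the output key
    columns are set to None for the rolled-up levels.
    """
    result = set()
    # ROLLUP over (a, b) means the grouping sets (a, b), (a) and ().
    for keep_a, keep_b in [(True, True), (True, False), (False, False)]:
        groups = defaultdict(list)
        for a, b in rows:
            groups[(a if keep_a else None, b if keep_b else None)].append((a, b))
        for (key_a, key_b), members in groups.items():
            result.add((
                key_a,
                key_b,
                sum(a for a, _ in members),
                min(a for a, _ in members),
                min(b + b for _, b in members),
            ))
    return result

expected = rollup_reference(ROWS)
```

    With this sample data the grand-total row should be (None, None, 4, 1, 20):
    the key columns are null, but sum(a), min(a) and min(b+b) still reflect the
    underlying values.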
    
    The problem becomes more complex when the group by clause contains
    expressions whose inputs also appear in the group by:
        sql("select a + b, b, sum(a - b) from mytable group by a + b, b with rollup").collect()
        sql("select a + b, b, sum(a - b) from mytable group by a + b, b with cube").collect()
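    The expected semantics of these two queries can be sketched the same way in
    plain Python. This is only a reference model under assumed sample data: the
    rows, the helper names (rollup_sets, cube_sets, aggregate) and the use of
    key positions to identify grouping sets are all illustrative and do not
    mirror Spark's internals.

```python
from itertools import combinations

def rollup_sets(n):
    """Grouping sets for ROLLUP over n keys: every prefix, longest first."""
    return [tuple(range(k)) for k in range(n, -1, -1)]

def cube_sets(n):
    """Grouping sets for CUBE over n keys: every subset of key positions."""
    return [c for k in range(n, -1, -1) for c in combinations(range(n), k)]

def aggregate(rows, key_exprs, grouping_sets, agg_expr):
    """Sum agg_expr per group for each grouping set.

    Keys outside the current grouping set are reported as None, while
    agg_expr is always evaluated on the original row.
    """
    result = {}
    for kept in grouping_sets:
        for row in rows:
            key = tuple(expr(row) if i in kept else None
                        for i, expr in enumerate(key_exprs))
            # The grouping set is part of the dictionary key so that groups
            # from different sets never merge.
            result[(kept, key)] = result.get((kept, key), 0) + agg_expr(row)
    return result

# Hypothetical (a, b) rows; the PR does not list the table contents.
rows = [(1, 10), (1, 20), (2, 30)]
key_exprs = [lambda r: r[0] + r[1], lambda r: r[1]]  # group by a + b, b
a_minus_b = lambda r: r[0] - r[1]                    # sum(a - b)

rollup_result = aggregate(rows, key_exprs, rollup_sets(2), a_minus_b)
cube_result = aggregate(rows, key_exprs, cube_sets(2), a_minus_b)
```

    Note that b is both a grouping key on its own and an input of the other
    grouping key a + b and of the aggregate a - b, which is what makes these
    cases harder than the first two.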
    
    The basic solution is to add an extra Projection when a query falls into
    one of the cases above. The projection duplicates the affected columns
    under alias names, so that their values are not removed when Expand is
    evaluated at runtime.
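    A toy model of the idea, in plain Python rather than Spark code: the
    expand and sum_by functions below are simplified stand-ins written for
    this illustration, the a_copy alias and the sample rows are hypothetical,
    and the actual operator in Spark differs in detail.

```python
def expand(rows, group_cols, grouping_sets):
    """Emit one copy of each row per grouping set.

    Group-by columns outside the set are replaced with None, and a grouping
    id (gid) is attached so the copies stay distinguishable.
    """
    out = []
    for gid, keep in enumerate(grouping_sets):
        for row in rows:
            copy = dict(row)
            for col in group_cols:
                if col not in keep:
                    copy[col] = None
            copy["gid"] = gid
            out.append(copy)
    return out

def sum_by(rows, key_cols, value_col):
    """Sum value_col per (key_cols + gid), ignoring None like SQL sum does."""
    totals = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols) + (row["gid"],)
        current = totals.get(key)
        value = row[value_col]
        if value is not None:
            current = value if current is None else current + value
        totals[key] = current
    return totals

# Hypothetical rows of mytable; the PR does not list the data.
rows = [{"a": 1, "b": 10}, {"a": 1, "b": 20}, {"a": 2, "b": 30}]
sets = [("a", "b"), ("a",), ()]  # group by a, b with rollup

# Without the extra projection: sum(a) reads the column that was nulled out.
naive = sum_by(expand(rows, ["a", "b"], sets), ["a", "b"], "a")

# With the extra projection: duplicate a under an alias first, then aggregate
# over the alias, which the expand step leaves untouched.
projected = [dict(row, a_copy=row["a"]) for row in rows]
fixed = sum_by(expand(projected, ["a", "b"], sets), ["a", "b"], "a_copy")
```

    In this model the grand-total group (None, None, gid 2) has no value left
    to sum in naive, while fixed yields 4 for the same group, because a_copy
    is not a group-by column and therefore survives the expand step.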
    
    Still working on the test cases; more cases will be added to the Hive
    golden buckets. Comments and suggestions are welcome. Thank you!
     

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark rollupCube

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9419
    
----
commit b10418e161d5809f3b1de92cf4a33b2f362cd2b4
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date:   2015-11-02T09:05:31Z

    Spark-11275

commit 7721442cdf65924af204d39e3b3b7bda6c41dfc6
Author: Xiao Li <xiaoli@xiaos-imac.local>
Date:   2015-11-02T09:51:16Z

    syntax cleaning

----


