[ 
https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530549#comment-16530549
 ] 

Ruslan Dautkhanov commented on IMPALA-7204:
-------------------------------------------

Thanks [~tarmstrong] !

Would be great if it also adds infrastructure to runs these most heavy 
operations with intra-node parallelism.

Most of our query execution times come from count(distinct) as we do mostly 
count(distinct) on highly-cardinal values.. 

cc [~twmarshall]

> Add support for GROUP BY ROLLUP
> -------------------------------
>
>                 Key: IMPALA-7204
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7204
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 3.0, Impala 2.12.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Critical
>              Labels: GROUP_BY, sql
>
> Now suppose that we'd like to analyze our sales data, to study the amount of 
> sales that is occurring for different products, in different states and 
> regions. Using the ROLLUP feature of SQL 2003, we could issue the query:
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history 
> group by rollup (region, state, product)
> {code}
> Semantically, the above query is equivalent to
>  
> {code:sql}
> select region, state, product, sum(sales) total_sales
> from sales_history 
> group by region, state, product
> union
> select region, state, null, sum(sales) total_sales
> from sales_history 
> group by region, state
> union
> select region, null, null, sum(sales) total_sales
> from sales_history 
> group by region
> union
> select null, null, null, sum(sales) total_sales
> from sales_history
>  
> {code}
> The query might produce results that looked something like:
> {noformat}
> REGION STATE PRODUCT TOTAL_SALES
> ------ ----- ------- -----------
> null null null 6200
> EAST MA BOATS 100
> EAST MA CARS 1500
> EAST MA null 1600
> EAST NY BOATS 150
> EAST NY CARS 1000
> EAST NY null 1150
> EAST null null 2750
> WEST CA BOATS 750
> WEST CA CARS 500
> WEST CA null 1250
> WEST AZ BOATS 2000
> WEST AZ CARS 200
> WEST AZ null 2200
> WEST null null 3450
> {noformat}
> We have a lot of production queries that work around this missing Impala 
> functionality by having three UNION ALLs. Physical execution plan shows 
> Impala actually reads full fact table three times. So it could be a three 
> times improvement (or more, depending on number of columns that are being 
> rolled up).
> I can't find another SQL on Hadoop engine that doesn't support this feature. 
>  *Checked Spark, Hive, PIG, Flink and some other engines - they all do 
> support this basic SQL feature*.
> Would be great to have a matching feature in Impala too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to