[ https://issues.apache.org/jira/browse/IMPALA-7204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530549#comment-16530549 ]
Ruslan Dautkhanov commented on IMPALA-7204: ------------------------------------------- Thanks [~tarmstrong] ! Would be great if it also adds infrastructure to runs these most heavy operations with intra-node parallelism. Most of our query execution times come from count(distinct) as we do mostly count(distinct) on highly-cardinal values.. cc [~twmarshall] > Add support for GROUP BY ROLLUP > ------------------------------- > > Key: IMPALA-7204 > URL: https://issues.apache.org/jira/browse/IMPALA-7204 > Project: IMPALA > Issue Type: Improvement > Components: Backend > Affects Versions: Impala 3.0, Impala 2.12.0 > Reporter: Ruslan Dautkhanov > Priority: Critical > Labels: GROUP_BY, sql > > Now suppose that we'd like to analyze our sales data, to study the amount of > sales that is occurring for different products, in different states and > regions. Using the ROLLUP feature of SQL 2003, we could issue the query: > {code:sql} > select region, state, product, sum(sales) total_sales > from sales_history > group by rollup (region, state, product) > {code} > Semantically, the above query is equivalent to > > {code:sql} > select region, state, product, sum(sales) total_sales > from sales_history > group by region, state, product > union > select region, state, null, sum(sales) total_sales > from sales_history > group by region, state > union > select region, null, null, sum(sales) total_sales > from sales_history > group by region > union > select null, null, null, sum(sales) total_sales > from sales_history > > {code} > The query might produce results that looked something like: > {noformat} > REGION STATE PRODUCT TOTAL_SALES > ------ ----- ------- ----------- > null null null 6200 > EAST MA BOATS 100 > EAST MA CARS 1500 > EAST MA null 1600 > EAST NY BOATS 150 > EAST NY CARS 1000 > EAST NY null 1150 > EAST null null 2750 > WEST CA BOATS 750 > WEST CA CARS 500 > WEST CA null 1250 > WEST AZ BOATS 2000 > WEST AZ CARS 200 > WEST AZ null 2200 > WEST null null 3450 > {noformat} > We have a lot of production queries that work around this missing Impala > functionality by having three UNION ALLs. Physical execution plan shows > Impala actually reads full fact table three times. So it could be a three > times improvement (or more, depending on number of columns that are being > rolled up). > I can't find another SQL on Hadoop engine that doesn't support this feature. > *Checked Spark, Hive, PIG, Flink and some other engines - they all do > support this basic SQL feature*. > Would be great to have a matching feature in Impala too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org