[ 
https://issues.apache.org/jira/browse/IMPALA-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tamas Mate updated IMPALA-10789:
--------------------------------
    Target Version: Impala 4.2.0  (was: Impala 4.1.0)

> Early materialize expressions in ScanNode
> -----------------------------------------
>
>                 Key: IMPALA-10789
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10789
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend, Frontend
>    Affects Versions: Impala 4.1.0
>            Reporter: Xianqing He
>            Assignee: Xianqing He
>            Priority: Major
>
> Impala uses the Late Materialize, to calculate expressions, for example,
> {code:java}
> SELECT SUM(col), COUNT(col), MIN(col), MAX(col)
> FROM (
>       SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col
>       FROM functional_parquet.alltypesagg
> ) t{code}
> The Plan likes
> {code:java}
> PLAN-ROOT SINK
> |
> 03:AGGREGATE [FINALIZE]
> |  output: sum:merge(col), count:merge(col), min:merge(col), max:merge(col)
> |  row-size=32B cardinality=1
> |
> 02:EXCHANGE [UNPARTITIONED]
> |
> 01:AGGREGATE
> |  output: sum(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), 
> count(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), 
> min(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), 
> max(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT))
> |  row-size=32B cardinality=1
> |
> 00:SCAN HDFS [functional_parquet.alltypesagg]
>    partitions=11/11 files=11 size=464.70KB
>    row-size=15B cardinality=11.00K
> {code}
> In the aggregation phase, the expression in the parameters of the aggregation 
> function is evaluated. In this way, the same expression that appears in 
> multiple aggregation functions needs to be evaluated multiple times, which 
> leads to long time consuming, especially for complex expressions, such as 
> regular expressions.
> For analytic functions and contains union all, 
>  
> {code:java}
> SELECT SUM(int_col) OVER (PARTITION BY id )
> FROM (
>     SELECT id
>         , CASE
>             WHEN id = 10 THEN tinyint_col
>             WHEN string_col LIKE '%6%' THEN smallint_col
>         END AS int_col
>     FROM functional_parquet.alltypesagg
>     UNION ALL
>     SELECT id
>         , CASE
>             WHEN id = 10 THEN tinyint_col
>             WHEN string_col LIKE '%6%' THEN smallint_col
>         END AS int_col
>     FROM functional_parquet.alltypes
> ) t
> {code}
> The plan likes
> {code:java}
> PLAN-ROOT SINK
> |
> 06:EXCHANGE [UNPARTITIONED]
> |
> 04:ANALYTIC
> |  functions: sum(int_col)
> |  partition by: id
> |  row-size=14B cardinality=18.30K
> |
> 03:SORT
> |  order by: id ASC NULLS FIRST
> |  row-size=6B cardinality=18.30K
> |
> 05:EXCHANGE [HASH(id)]
> |
> 00:UNION
> |  row-size=6B cardinality=18.30K
> |
> |--02:SCAN HDFS [functional_parquet.alltypes]
> |     partitions=24/24 files=24 size=189.91KB
> |     row-size=22B cardinality=7.30K
> |
> 01:SCAN HDFS [functional_parquet.alltypesagg]
>    partitions=11/11 files=11 size=464.70KB
>    row-size=24B cardinality=11.00K{code}
> In UnionNode, it will materialize expressions and prune columns.
> Currently UnionNode is single-threaded and ScanNode supports multi-threading, 
> it will  improve query performance if materialize expressions in ScanNode.
> We can specify which expressions require early materialize by hints, and 
> Impala internally determines if the expression can be evaluated in ScanNode.
> {code:java}
> SELECT SUM(col), COUNT(col), MIN(col), MAX(col)
> FROM (
>       SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) 
> col/*+materialize_expr*/
>       FROM functional_parquet.alltypesagg
> ) t
> {code}
> This can materialize in ScanNode, but like the follow can't use early 
> matrialize
> {code:java}
> SELECT SUM(col)
> FROM (
>       SELECT CASE 
>                       WHEN t1.id = 10 THEN t2.tinyint_col
>                       ELSE t2.smallint_col
>               END AS col/*+materialize_expr*/
>       FROM functional_parquet.alltypesagg t1
>               JOIN functional_parquet.alltypes t2 ON t1.id = t2.id
> ) t
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to