[ https://issues.apache.org/jira/browse/IMPALA-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tamas Mate updated IMPALA-10789: -------------------------------- Target Version: Impala 4.2.0 (was: Impala 4.1.0) > Early materialize expressions in ScanNode > ----------------------------------------- > > Key: IMPALA-10789 > URL: https://issues.apache.org/jira/browse/IMPALA-10789 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend > Affects Versions: Impala 4.1.0 > Reporter: Xianqing He > Assignee: Xianqing He > Priority: Major > > Impala uses the Late Materialize, to calculate expressions, for example, > {code:java} > SELECT SUM(col), COUNT(col), MIN(col), MAX(col) > FROM ( > SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col > FROM functional_parquet.alltypesagg > ) t{code} > The Plan likes > {code:java} > PLAN-ROOT SINK > | > 03:AGGREGATE [FINALIZE] > | output: sum:merge(col), count:merge(col), min:merge(col), max:merge(col) > | row-size=32B cardinality=1 > | > 02:EXCHANGE [UNPARTITIONED] > | > 01:AGGREGATE > | output: sum(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), > count(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), > min(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)), > max(CAST(regexp_extract(string_col, '(\\d+)', 0) AS BIGINT)) > | row-size=32B cardinality=1 > | > 00:SCAN HDFS [functional_parquet.alltypesagg] > partitions=11/11 files=11 size=464.70KB > row-size=15B cardinality=11.00K > {code} > In the aggregation phase, the expression in the parameters of the aggregation > function is evaluated. In this way, the same expression that appears in > multiple aggregation functions needs to be evaluated multiple times, which > leads to long time consuming, especially for complex expressions, such as > regular expressions. > For analytic functions and contains union all, > > {code:java} > SELECT SUM(int_col) OVER (PARTITION BY id ) > FROM ( > SELECT id > , CASE > WHEN id = 10 THEN tinyint_col > WHEN string_col LIKE '%6%' THEN smallint_col > END AS int_col > FROM functional_parquet.alltypesagg > UNION ALL > SELECT id > , CASE > WHEN id = 10 THEN tinyint_col > WHEN string_col LIKE '%6%' THEN smallint_col > END AS int_col > FROM functional_parquet.alltypes > ) t > {code} > The plan likes > {code:java} > PLAN-ROOT SINK > | > 06:EXCHANGE [UNPARTITIONED] > | > 04:ANALYTIC > | functions: sum(int_col) > | partition by: id > | row-size=14B cardinality=18.30K > | > 03:SORT > | order by: id ASC NULLS FIRST > | row-size=6B cardinality=18.30K > | > 05:EXCHANGE [HASH(id)] > | > 00:UNION > | row-size=6B cardinality=18.30K > | > |--02:SCAN HDFS [functional_parquet.alltypes] > | partitions=24/24 files=24 size=189.91KB > | row-size=22B cardinality=7.30K > | > 01:SCAN HDFS [functional_parquet.alltypesagg] > partitions=11/11 files=11 size=464.70KB > row-size=24B cardinality=11.00K{code} > In UnionNode, it will materialize expressions and prune columns. > Currently UnionNode is single-threaded and ScanNode supports multi-threading, > it will improve query performance if materialize expressions in ScanNode. > We can specify which expressions require early materialize by hints, and > Impala internally determines if the expression can be evaluated in ScanNode. > {code:java} > SELECT SUM(col), COUNT(col), MIN(col), MAX(col) > FROM ( > SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) > col/*+materialize_expr*/ > FROM functional_parquet.alltypesagg > ) t > {code} > This can materialize in ScanNode, but like the follow can't use early > matrialize > {code:java} > SELECT SUM(col) > FROM ( > SELECT CASE > WHEN t1.id = 10 THEN t2.tinyint_col > ELSE t2.smallint_col > END AS col/*+materialize_expr*/ > FROM functional_parquet.alltypesagg t1 > JOIN functional_parquet.alltypes t2 ON t1.id = t2.id > ) t > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org