Yeah, I agree costing expression evaluation can be hard and not necessary in general. For the query shown below that is referenced in https://issues.apache.org/jira/browse/IMPALA-10789 <https://issues.apache.org/jira/browse/IMPALA-10789>, part of the cost saving could be due to reducing the strings to BIGINT which can lead to row size reduction. Multiplying it by the # of rows from the scan, the total saving in data size probably can help reduce the total cost of the plan.
SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM functional_parquet.alltypesagg ) t On when to early materialization of expressions (aka expression pushdown). Normally, an expression E can be pushed down to a scan node if all inputs to E are available at S and can be evaluated at S properly. Thus we need to compute the inputs available at the scan node and test if a push-down expression can be satisfied by the available inputs. For the above query, the inputs to the expression "CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint)” are string_col and two constants. Since all are available, the expression can be evaluated at the scan. Thanks —Qifan > On Jul 21, 2021, at 1:02 AM, Tim Armstrong <tim.g.armstr...@gmail.com> wrote: > > I agree in that scenario where predicates are pushed down to the scan. It > can be more of a trade-off if there are selective predicates later in the > plan, e.g. after a join or aggregation node - in that case it depends on > how expensive the expression is and how selective the predicates are. > > I definitely think this is worth exploring, I think we probably just need > to think about the rules for when to do it. > > On Tue, 20 Jul 2021 at 01:01, Xianqing <hexianqing...@126.com> wrote: > >> "It is difficult to cost - late materialisation can be a lot cheaper if >> there are predicates"<br/>We can materialize the expression after the >> predicates is executed. For example,<br/>select count(expr), sum(expr) from >> (select a+b+c expr from T where a>10) T where T.expr>20<br/>1. Scan >> column a and execute a>10, then scan b, c<br/>2. Then Calculate a + b + >> c and execute expr>20 >> At 2021-07-20 12:14:18, "Tim Armstrong" <tim.g.armstr...@gmail.com> wrote: >>> I believe that this optimisation could be very beneficial. It is difficult >>> to cost - late materialisation can be a lot cheaper if there are >> predicates. >>> >>> Pruning columns would be very useful for some queries just on its own. We >>> tried to implement a related optimisation - >>> https://issues.apache.org/jira/browse/IMPALA-2138 - and it showed some >> good >>> performance benefits, but the implementation in the planner was quite >>> complex because it tried to insert UNION operators at many points in the >>> plan. >>> >>> We were hoping that https://issues.apache.org/jira/browse/IMPALA-3902 >> would >>> be the long-term solution to getting more parallelism so that it doesn't >>> matter whether expression evaluation is in the scan or not, but probably >> it >>> makes sense to get the community's opinion on that. >>> >>> On Mon, 19 Jul 2021 at 20:35, hexianqing <hexianqing...@126.com> wrote: >>> >>>> I have submitted the following issue, >>>> https://issues.apache.org/jira/browse/IMPALA-10789 < >>>> https://issues.apache.org/jira/browse/IMPALA-10789> >>>> >>>> In some cases, the performance is better that early materialize >>>> expressions in ScanNode. >>>> For example, >>>> SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT >>>> CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM >>>> functional_parquet.alltypesagg ) t >>>> The expression only needs to be evaluated once if materialize >> expressions >>>> in ScanNode. >>>> I have roughly implemented this feature and the performance has improved >>>> significantly. >>>> I would like to discuss whether this feature can be contributed. >>>> >>>> Thanks! >>>> Xianqing >>