Yeah, I agree it is diffcult to provide a cost model to trade off the early materialization of expressions in ScanNode and late materialization after join. Now we specify the early materialization by hint. If the following two issues are resolved, the early materialization improvement will be less obvious. https://issues.apache.org/jira/browse/IMPALA-2138 https://issues.apache.org/jira/browse/IMPALA-3902
Thanks! -Xianqing | | Xianqing | | hexianqing...@126.com | 签名由网易邮箱大师定制 On 07/21/2021 22:06,Qifan Chen<qc...@cloudera.com> wrote: Yeah, I agree costing expression evaluation can be hard and not necessary in general. For the query shown below that is referenced in https://issues.apache.org/jira/browse/IMPALA-10789 <https://issues.apache.org/jira/browse/IMPALA-10789>, part of the cost saving could be due to reducing the strings to BIGINT which can lead to row size reduction. Multiplying it by the # of rows from the scan, the total saving in data size probably can help reduce the total cost of the plan. SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM functional_parquet.alltypesagg ) t On when to early materialization of expressions (aka expression pushdown). Normally, an expression E can be pushed down to a scan node if all inputs to E are available at S and can be evaluated at S properly. Thus we need to compute the inputs available at the scan node and test if a push-down expression can be satisfied by the available inputs. For the above query, the inputs to the expression "CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint)” are string_col and two constants. Since all are available, the expression can be evaluated at the scan. Thanks —Qifan On Jul 21, 2021, at 1:02 AM, Tim Armstrong <tim.g.armstr...@gmail.com> wrote: I agree in that scenario where predicates are pushed down to the scan. It can be more of a trade-off if there are selective predicates later in the plan, e.g. after a join or aggregation node - in that case it depends on how expensive the expression is and how selective the predicates are. I definitely think this is worth exploring, I think we probably just need to think about the rules for when to do it. On Tue, 20 Jul 2021 at 01:01, Xianqing <hexianqing...@126.com> wrote: "It is difficult to cost - late materialisation can be a lot cheaper if there are predicates"<br/>We can materialize the expression after the predicates is executed. For example,<br/>select count(expr), sum(expr) from (select a+b+c expr from T where a>10) T where T.expr>20<br/>1. Scan column a and execute a>10, then scan b, c<br/>2. Then Calculate a + b + c and execute expr>20 At 2021-07-20 12:14:18, "Tim Armstrong" <tim.g.armstr...@gmail.com> wrote: I believe that this optimisation could be very beneficial. It is difficult to cost - late materialisation can be a lot cheaper if there are predicates. Pruning columns would be very useful for some queries just on its own. We tried to implement a related optimisation - https://issues.apache.org/jira/browse/IMPALA-2138 - and it showed some good performance benefits, but the implementation in the planner was quite complex because it tried to insert UNION operators at many points in the plan. We were hoping that https://issues.apache.org/jira/browse/IMPALA-3902 would be the long-term solution to getting more parallelism so that it doesn't matter whether expression evaluation is in the scan or not, but probably it makes sense to get the community's opinion on that. On Mon, 19 Jul 2021 at 20:35, hexianqing <hexianqing...@126.com> wrote: I have submitted the following issue, https://issues.apache.org/jira/browse/IMPALA-10789 < https://issues.apache.org/jira/browse/IMPALA-10789> In some cases, the performance is better that early materialize expressions in ScanNode. For example, SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM functional_parquet.alltypesagg ) t The expression only needs to be evaluated once if materialize expressions in ScanNode. I have roughly implemented this feature and the performance has improved significantly. I would like to discuss whether this feature can be contributed. Thanks! Xianqing