I agree in that scenario where predicates are pushed down to the scan. It can be more of a trade-off if there are selective predicates later in the plan, e.g. after a join or aggregation node - in that case it depends on how expensive the expression is and how selective the predicates are.
I definitely think this is worth exploring, I think we probably just need to think about the rules for when to do it. On Tue, 20 Jul 2021 at 01:01, Xianqing <hexianqing...@126.com> wrote: > "It is difficult to cost - late materialisation can be a lot cheaper if > there are predicates"<br/>We can materialize the expression after the > predicates is executed. For example,<br/>select count(expr), sum(expr) from > (select a+b+c expr from T where a>10) T where T.expr>20<br/>1. Scan > column a and execute a>10, then scan b, c<br/>2. Then Calculate a + b + > c and execute expr>20 > At 2021-07-20 12:14:18, "Tim Armstrong" <tim.g.armstr...@gmail.com> wrote: > >I believe that this optimisation could be very beneficial. It is difficult > >to cost - late materialisation can be a lot cheaper if there are > predicates. > > > >Pruning columns would be very useful for some queries just on its own. We > >tried to implement a related optimisation - > >https://issues.apache.org/jira/browse/IMPALA-2138 - and it showed some > good > >performance benefits, but the implementation in the planner was quite > >complex because it tried to insert UNION operators at many points in the > >plan. > > > >We were hoping that https://issues.apache.org/jira/browse/IMPALA-3902 > would > >be the long-term solution to getting more parallelism so that it doesn't > >matter whether expression evaluation is in the scan or not, but probably > it > >makes sense to get the community's opinion on that. > > > >On Mon, 19 Jul 2021 at 20:35, hexianqing <hexianqing...@126.com> wrote: > > > >> I have submitted the following issue, > >> https://issues.apache.org/jira/browse/IMPALA-10789 < > >> https://issues.apache.org/jira/browse/IMPALA-10789> > >> > >> In some cases, the performance is better that early materialize > >> expressions in ScanNode. > >> For example, > >> SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT > >> CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM > >> functional_parquet.alltypesagg ) t > >> The expression only needs to be evaluated once if materialize > expressions > >> in ScanNode. > >> I have roughly implemented this feature and the performance has improved > >> significantly. > >> I would like to discuss whether this feature can be contributed. > >> > >> Thanks! > >> Xianqing >