I agree in that scenario where predicates are pushed down to the scan. It
can be more of a trade-off if there are selective predicates later in the
plan, e.g. after a join or aggregation node - in that case it depends on
how expensive the expression is and how selective the predicates are.

I definitely think this is worth exploring, I think we probably just need
to think about the rules for when to do it.

On Tue, 20 Jul 2021 at 01:01, Xianqing <hexianqing...@126.com> wrote:

> "It is difficult to cost - late materialisation can be a lot cheaper if
> there are predicates"<br/>We can materialize the expression after the
> predicates is executed. For example,<br/>select count(expr), sum(expr) from
> (select a+b+c expr from T where a&gt;10) T where T.expr&gt;20<br/>1. Scan
> column a and execute a&gt;10, then scan b, c<br/>2. Then Calculate a + b +
> c and execute expr&gt;20
> At 2021-07-20 12:14:18, "Tim Armstrong" <tim.g.armstr...@gmail.com> wrote:
> >I believe that this optimisation could be very beneficial. It is difficult
> >to cost - late materialisation can be a lot cheaper if there are
> predicates.
> >
> >Pruning columns would be very useful for some queries just on its own. We
> >tried to implement a related optimisation -
> >https://issues.apache.org/jira/browse/IMPALA-2138 - and it showed some
> good
> >performance benefits, but the implementation in the planner was quite
> >complex because it tried to insert UNION operators at many points in the
> >plan.
> >
> >We were hoping that https://issues.apache.org/jira/browse/IMPALA-3902
> would
> >be the long-term solution to getting more parallelism so that it doesn't
> >matter whether expression evaluation is in the scan or not, but probably
> it
> >makes sense to get the community's opinion on that.
> >
> >On Mon, 19 Jul 2021 at 20:35, hexianqing <hexianqing...@126.com> wrote:
> >
> >> I have submitted the following issue,
> >> https://issues.apache.org/jira/browse/IMPALA-10789 <
> >> https://issues.apache.org/jira/browse/IMPALA-10789>
> >>
> >> In some cases, the performance is better that early materialize
> >> expressions in ScanNode.
> >> For example,
> >> SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT
> >> CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM
> >> functional_parquet.alltypesagg ) t
> >> The expression only needs to be evaluated once if materialize
> expressions
> >> in ScanNode.
> >> I have roughly implemented this feature and the performance has improved
> >> significantly.
> >> I would like to discuss whether this feature can be contributed.
> >>
> >> Thanks!
> >> Xianqing
>

Reply via email to