Yeah, I agree costing expression evaluation can be hard and not necessary in 
general. For the query shown below that is referenced in 
https://issues.apache.org/jira/browse/IMPALA-10789 
<https://issues.apache.org/jira/browse/IMPALA-10789>, part of the cost saving 
could be due to reducing the strings to BIGINT which can lead to row size 
reduction. Multiplying it by the # of rows from the scan,  the total saving in 
data size probably can help reduce the total cost of the plan. 

SELECT SUM(col), COUNT(col), MIN(col), MAX(col)
FROM (
        SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col
        FROM functional_parquet.alltypesagg
) t

On when to early materialization of expressions (aka expression pushdown). 
Normally, an expression E can be pushed down to a scan node if all inputs to E 
are available at S and can be evaluated at S properly. Thus we need to compute 
the inputs available at the scan node and test if a push-down expression can be 
satisfied by the available inputs. 

For the above query, the inputs to the expression 
"CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint)” are string_col and 
two constants. Since all are available, the expression can be evaluated at the 
scan. 

Thanks —Qifan



> On Jul 21, 2021, at 1:02 AM, Tim Armstrong <tim.g.armstr...@gmail.com> wrote:
> 
> I agree in that scenario where predicates are pushed down to the scan. It
> can be more of a trade-off if there are selective predicates later in the
> plan, e.g. after a join or aggregation node - in that case it depends on
> how expensive the expression is and how selective the predicates are.
> 
> I definitely think this is worth exploring, I think we probably just need
> to think about the rules for when to do it.
> 
> On Tue, 20 Jul 2021 at 01:01, Xianqing <hexianqing...@126.com> wrote:
> 
>> "It is difficult to cost - late materialisation can be a lot cheaper if
>> there are predicates"<br/>We can materialize the expression after the
>> predicates is executed. For example,<br/>select count(expr), sum(expr) from
>> (select a+b+c expr from T where a&gt;10) T where T.expr&gt;20<br/>1. Scan
>> column a and execute a&gt;10, then scan b, c<br/>2. Then Calculate a + b +
>> c and execute expr&gt;20
>> At 2021-07-20 12:14:18, "Tim Armstrong" <tim.g.armstr...@gmail.com> wrote:
>>> I believe that this optimisation could be very beneficial. It is difficult
>>> to cost - late materialisation can be a lot cheaper if there are
>> predicates.
>>> 
>>> Pruning columns would be very useful for some queries just on its own. We
>>> tried to implement a related optimisation -
>>> https://issues.apache.org/jira/browse/IMPALA-2138 - and it showed some
>> good
>>> performance benefits, but the implementation in the planner was quite
>>> complex because it tried to insert UNION operators at many points in the
>>> plan.
>>> 
>>> We were hoping that https://issues.apache.org/jira/browse/IMPALA-3902
>> would
>>> be the long-term solution to getting more parallelism so that it doesn't
>>> matter whether expression evaluation is in the scan or not, but probably
>> it
>>> makes sense to get the community's opinion on that.
>>> 
>>> On Mon, 19 Jul 2021 at 20:35, hexianqing <hexianqing...@126.com> wrote:
>>> 
>>>> I have submitted the following issue,
>>>> https://issues.apache.org/jira/browse/IMPALA-10789 <
>>>> https://issues.apache.org/jira/browse/IMPALA-10789>
>>>> 
>>>> In some cases, the performance is better that early materialize
>>>> expressions in ScanNode.
>>>> For example,
>>>> SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT
>>>> CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM
>>>> functional_parquet.alltypesagg ) t
>>>> The expression only needs to be evaluated once if materialize
>> expressions
>>>> in ScanNode.
>>>> I have roughly implemented this feature and the performance has improved
>>>> significantly.
>>>> I would like to discuss whether this feature can be contributed.
>>>> 
>>>> Thanks!
>>>> Xianqing
>> 

Reply via email to