Re: Discussion about IMPALA-10789: Early materialize expressions in ScanNode

Xianqing Tue, 27 Jul 2021 00:14:29 -0700

Yeah, I agree it is diffcult to provide a cost model to trade off the early 
materialization of expressions in ScanNode and late materialization after join.
Now we specify the early materialization by hint.
If the following two issues are resolved, the early materialization improvement 
will be less obvious.
https://issues.apache.org/jira/browse/IMPALA-2138
https://issues.apache.org/jira/browse/IMPALA-3902



Thanks!  -Xianqing


| |
Xianqing
|
|
hexianqing...@126.com
|
签名由网易邮箱大师定制


On 07/21/2021 22:06，Qifan Chen<qc...@cloudera.com> wrote：
Yeah, I agree costing expression evaluation can be hard and not necessary in 
general. For the query shown below that is referenced in 
https://issues.apache.org/jira/browse/IMPALA-10789 
<https://issues.apache.org/jira/browse/IMPALA-10789>, part of the cost saving 
could be due to reducing the strings to BIGINT which can lead to row size 
reduction. Multiplying it by the # of rows from the scan,  the total saving in 
data size probably can help reduce the total cost of the plan.

SELECT SUM(col), COUNT(col), MIN(col), MAX(col)
FROM (
SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col
FROM functional_parquet.alltypesagg
) t

On when to early materialization of expressions (aka expression pushdown). 
Normally, an expression E can be pushed down to a scan node if all inputs to E 
are available at S and can be evaluated at S properly. Thus we need to compute 
the inputs available at the scan node and test if a push-down expression can be 
satisfied by the available inputs.

For the above query, the inputs to the expression 
"CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint)” are string_col and 
two constants. Since all are available, the expression can be evaluated at the 
scan.

Thanks —Qifan



On Jul 21, 2021, at 1:02 AM, Tim Armstrong <tim.g.armstr...@gmail.com> wrote:

I agree in that scenario where predicates are pushed down to the scan. It
can be more of a trade-off if there are selective predicates later in the
plan, e.g. after a join or aggregation node - in that case it depends on
how expensive the expression is and how selective the predicates are.

I definitely think this is worth exploring, I think we probably just need
to think about the rules for when to do it.

On Tue, 20 Jul 2021 at 01:01, Xianqing <hexianqing...@126.com> wrote:

"It is difficult to cost - late materialisation can be a lot cheaper if
there are predicates"<br/>We can materialize the expression after the
predicates is executed. For example,<br/>select count(expr), sum(expr) from
(select a+b+c expr from T where a&gt;10) T where T.expr&gt;20<br/>1. Scan
column a and execute a&gt;10, then scan b, c<br/>2. Then Calculate a + b +
c and execute expr&gt;20
At 2021-07-20 12:14:18, "Tim Armstrong" <tim.g.armstr...@gmail.com> wrote:
I believe that this optimisation could be very beneficial. It is difficult
to cost - late materialisation can be a lot cheaper if there are
predicates.

Pruning columns would be very useful for some queries just on its own. We
tried to implement a related optimisation -
https://issues.apache.org/jira/browse/IMPALA-2138 - and it showed some
good
performance benefits, but the implementation in the planner was quite
complex because it tried to insert UNION operators at many points in the
plan.

We were hoping that https://issues.apache.org/jira/browse/IMPALA-3902
would
be the long-term solution to getting more parallelism so that it doesn't
matter whether expression evaluation is in the scan or not, but probably
it
makes sense to get the community's opinion on that.

On Mon, 19 Jul 2021 at 20:35, hexianqing <hexianqing...@126.com> wrote:

I have submitted the following issue,
https://issues.apache.org/jira/browse/IMPALA-10789 <
https://issues.apache.org/jira/browse/IMPALA-10789>

In some cases, the performance is better that early materialize
expressions in ScanNode.
For example,
SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT
CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM
functional_parquet.alltypesagg ) t
The expression only needs to be evaluated once if materialize
expressions
in ScanNode.
I have roughly implemented this feature and the performance has improved
significantly.
I would like to discuss whether this feature can be contributed.

Thanks!
Xianqing

Re: Discussion about IMPALA-10789: Early materialize expressions in ScanNode

Reply via email to