[ 
https://issues.apache.org/jira/browse/IMPALA-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662835#comment-16662835
 ] 

Paul Rogers edited comment on IMPALA-7655 at 10/25/18 3:28 AM:
---------------------------------------------------------------

The devil, on this kind of change, is always in the details. Here are some that 
must be resolved

h4. Rule Ordering Dependencies

Prior to this change, several functions were rewritten to use {{if}}, and 
{{if}} was simplified to handle degenerate cases. Now, those functions are 
rewritten to use {{CASE}}. This then leads to a rule-ordering issue. If we 
simplify {{ifNull(null, y)}} to {{CASE WHEN null is not null THEN null ELSE 
y}}, we'd expect this to further simplify to {{y}}.

In unit test path, each simplification rule is applied once. The 
constant-folding rule is applied before the if-rewrite. Thus, if we generate 
{{null is null}}, the constant folding rule won't be applied a second time. 
Instead, the {{ifnull()}} rewrite code must handle these optimizations.

Oddly, outside of the rule rewrite unit tests, rules are applied repeatedly, 
and so the upstream rewrites might be done. This seems a limitation in the 
testing framework.

In general, the conditional function rewrites must handle their own null-based 
simplifications; they can't assume {{CASE}} rewrites will handle them.

h4. Special Aggregate Handling

Another issue is how IMPALA-5125 was implemented: if rewriting an expression 
drops the last aggregate, the fix simply throws away the rewrite. This is fine 
for simplifications, but not for the conditional rewrites. Since there is no BE 
implementation for the conditional functions, we can't revert to the original 
form. This means the the conditional rewrites must be aware of aggregates.

h4. Conditional Rewrites Must Always Be Enabled

Next, {{BETWEEN}} is rewritten as {{IF()}}, but appears to be done *after* the 
conditional function rewrite rules, leaving an {{if()}} call in the final plan.

Turns out that this issue can occur because, in the {{Analyzer}}, the user can 
disable optimization. At present, the conditional rewrite is part of the 
{{SimplifyConditionalsRule}} which is optional. But, the conditional rewrite is 
not optional, it must always be done whether optimizations are on or not.

Further, if optimizations are on, we want to apply constant folding and other 
rules before the conditional rewrite.

As a result, the conditional rewrites need to be pulled out into a new rule. 
And, that rule must always be applied. If optimizations are on, apply the above 
optimizations first. Otherwise, just apply the rewrites directly.

h4. Handling {{NVL2()}} in the Function Registry

Finally, the change creates function registry entries for {{nvl2()}} so that it 
can be rewritten in the same place as the other conditional functions (directly 
to {{CASE}} without first going through {{if()}}). Since the signature of this 
function is n^2 in types, the function entry would be huge (which is probably 
why it was rewritten very early in the parse process originally.) A special 
handling of {{NVL2()}} (and, later, {{DECODE()}}, see IMPALA-7747) is needed. 
The special handling can exploit the fact that {{NVL2()}} is a useful fiction; 
a short-hand for a {{CASE}}, rather than being a real function that must 
enforce stricter rules.

h4. Issues with IS [NOT] DISTINCT FROM Optimizations

See IMPALA-7755.

h4. Incomplete Analysis After Rules Fire

See IMPALA-7754.

h4. Rewrite of the Wrong Order By Expressions

See IMPALA-7753.

h4. Invalid Logic to Reset a SELECT Statement

See IMPALA-7752.


was (Author: paul.rogers):
The devil, on this kind of change, is always in the details. Here are some that 
must be resolved

h4. Rule Ordering Dependencies

Prior to this change, several functions were rewritten to use {{if}}, and 
{{if}} was simplified to handle degenerate cases. Now, those functions are 
rewritten to use {{CASE}}. This then leads to a rule-ordering issue. If we 
simplify {{ifNull(null, y)}} to {{CASE WHEN null is not null THEN null ELSE 
y}}, we'd expect this to further simplify to {{y}}.

In unit test path, each simplification rule is applied once. The 
constant-folding rule is applied before the if-rewrite. Thus, if we generate 
{{null is null}}, the constant folding rule won't be applied a second time. 
Instead, the {{ifnull()}} rewrite code must handle these optimizations.

Oddly, outside of the rule rewrite unit tests, rules are applied repeatedly, 
and so the upstream rewrites might be done. This seems a limitation in the 
testing framework.

In general, the conditional function rewrites must handle their own null-based 
simplifications; they can't assume {{CASE}} rewrites will handle them.

h4. Special Aggregate Handling

Another issue is how IMPALA-5125 was implemented: if rewriting an expression 
drops the last aggregate, the fix simply throws away the rewrite. This is fine 
for simplifications, but not for the conditional rewrites. Since there is no BE 
implementation for the conditional functions, we can't revert to the original 
form. This means the the conditional rewrites must be aware of aggregates.

h4. Conditional Rewrites Always Enabled

Next, {{BETWEEN}} is rewritten as {{IF()}}, but appears to be done *after* the 
conditional function rewrite rules, leaving an {{if()}} call in the final plan.

Turns out that this issue can occur because, in the {{Analyzer}}, the user can 
disable optimization. At present, the conditional rewrite is part of the 
{{SimplifyConditionalsRule}} which is optional. But, the conditional rewrite is 
not optional, it must always be done whether optimizations are on or not.

Further, if optimizations are on, we want to apply constant folding and other 
rules before the conditional rewrite.

As a result, the conditional rewrites need to be pulled out into a new rule. 
And, that rule must always be applied. If optimizations are on, apply the above 
optimizations first. Otherwise, just apply the rewrites directly.

h4. Handling {{NVL2()}} in the Function Registry

Finally, the change creates function registry entries for {{nvl2()}} so that it 
can be rewritten in the same place as the other conditional functions (directly 
to {{CASE}} without first going through {{if()}}). Since the signature of this 
function is n^2 in types, the function entry would be huge (which is probably 
why it was rewritten very early in the parse process originally.) A special 
handling of {{NVL2()}} (and, later, {{DECODE()}}, see IMPALA-7747) is needed. 
The special handling can exploit the fact that {{NVL2()}} is a useful fiction; 
a short-hand for a {{CASE}}, rather than being a real function that must 
enforce stricter rules.

h4. Issues with IS [NOT] DISTINCT FROM Optimizations

Testing revealed one issue and one limitation of the {{IS [NOT] DISTINCT FROM}} 
optimizations in {{SimplifyDistinctFromRule}}:

* The rule can simplify {{sum(id) <=> sum(id)}} to {{false}}. But, according to 
IMPALA-5125, the rewrite cannot be applied if it would remove the only 
aggregate. This is a bug.
* A simplification is missing: {{id IS DISTINCT FROM NULL}} &rarr; {{id IS NOT 
NULL}}.

h4. Incomplete Analysis After Rules Fire

The analyzer has a chain of rules which fire in order without (as noted above) 
repeats. The result of rule A (rewriting conditional functions) is fed into 
rule B (simplify CASE). Each rule requires that analysis be done so that 
attributes of expressions can be picked out.

As it turns out, in the current code, this is rather ad-hoc. The 
{{SimplifyConditionalsRule}} re-analyzes its result as part of the fix for 
IMPALA-5125, but others do not, leading to optimizations not working. In 
particular, in a chain of rewrites for {{IS DISTINCT FROM}}, certain rules 
didn't fire because previous rules left new expressions in an un-analyzed 
state. This is a bug.

The fix is to analyze the result any time a rule fires, before passing the 
result to the next rule.Then, in rules that simply refused to run if an 
expression is to analyzed:

{code:java}
public class SimplifyDistinctFromRule implements ExprRewriteRule {
  public Expr apply(Expr expr, Analyzer analyzer) {
    if (!expr.isAnalyzed()) return expr;
{code}

Replace this with an assertion that analysis must have been done:

{code:java}
public class SimplifyDistinctFromRule implements ExprRewriteRule {
  public Expr apply(Expr expr, Analyzer analyzer) {
    assert expr.isAnalyzed();
{code}

To be safe, the assertion fires only in "debug" mode, not in production.

h4. Rewrite of the Wrong Order By Expressions

See IMPALA-7753.

h4. Invalid Logic to Reset a SELECT Statement

See IMPALA-7752.

> Codegen output for conditional functions (if,isnull, coalesce) is very 
> suboptimal
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7655
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7655
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: codegen, perf, performance
>
> https://gerrit.cloudera.org/#/c/11565/ provided a clue that an aggregation 
> involving an if() function was very slow, 10x slower than the equivalent 
> version using a case:
> {noformat}
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select count(case 
> when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(case when l_orderkey is NULL then 1 else NULL end) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:17:31 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=274b2a6f35cefe31:95a1964200000000
> +----------------------------------------------------------+
> | count(case when l_orderkey is null then 1 else null end) |
> +----------------------------------------------------------+
> | 0                                                        |
> +----------------------------------------------------------+
> Fetched 1 row(s) in 0.51s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 44.03ms  | 44.03ms  | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 411.57ms | 411.57ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> [localhost:21000] default> set num_nodes=1; set mt_dop=1; select 
> count(if(l_orderkey is NULL, 1, NULL)) from tpch10_parquet.lineitem;summary;
> NUM_NODES set to 1
> MT_DOP set to 1
> Query: select count(if(l_orderkey is NULL, 1, NULL)) from 
> tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 11:23:07 (Coordinator: 
> http://tarmstrong-box:25000)
> Query progress can be monitored at: 
> http://tarmstrong-box:25000/query_plan?query_id=8e46ab1b84c4dbff:2786ca2600000000
> +----------------------------------------+
> | count(if(l_orderkey is null, 1, null)) |
> +----------------------------------------+
> | 0                                      |
> +----------------------------------------+
> Fetched 1 row(s) in 1.01s
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak 
> Mem | Est. Peak Mem | Detail                  |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> | 01:AGGREGATE | 1      | 422.07ms | 422.07ms | 1      | 1          | 25.00 
> KB | 10.00 MB      | FINALIZE                |
> | 00:SCAN HDFS | 1      | 511.13ms | 511.13ms | 59.99M | -1         | 16.61 
> MB | 88.00 MB      | tpch10_parquet.lineitem |
> +--------------+--------+----------+----------+--------+------------+----------+---------------+-------------------------+
> {noformat}
> It turns out that this is because we don't have good codegen support for 
> ConditionalFunction, and just fall back to emitting a call to the interpreted 
> path: 
> https://github.com/apache/impala/blob/master/be/src/exprs/conditional-functions.cc#L28
> See CaseExpr for an example of much better codegen support: 
> https://github.com/apache/impala/blob/master/be/src/exprs/case-expr.cc#L178



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to