Paul Rogers created IMPALA-7831:
-----------------------------------

             Summary: Revisit expression rewriting integration with planner
                 Key: IMPALA-7831
                 URL: https://issues.apache.org/jira/browse/IMPALA-7831
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 3.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


The planner performs expression rewriting. It appears that the rewrite engine 
was added late in planner development, as an add-on step in {{AnalysisContext}} 
after we create the plan. Since that time, it appears that a number of fixes 
and patches have been applied to work around the inevitable bugs that resulted 
from this placement of the logic.

At present, the planner flow, with rewrites, is:
 * Analyze the entire query
 * Assign WHERE clause "conjuncts" to scan nodes, etc.
 * Cerate theĀ full plan
 * Rewrite the SELECT, WHERE, HAVING and GROUP BY clauses
 * Throw away the plan create above and create a new one

This ticket proposes to adjust the flow to incorporate rewrites earlier in the 
process, allowing the planner to make a single pass over the query. (Which will 
solve a number of bugs described in associated tickets.)
h4. Background

The above logic evolved because of a timing issue: once we assign conjuncts, we 
have plan nodes that point to the original WHERE clause expressions. We later 
rewrite these, but we do so by throwing away the original nodes, replacing them 
with new ones. Since the scan and other nodes still have a pointer to the old 
version, the rewrites can have no effect.

To work around this, the code throws away that original plan and replans using 
the new, rewritten nodes.

This then creates an interesting issue. We do the full analysis (and plan) 
because we need the column bindings in order to do the rewrite. Since 
plan/analysis is implemented as a single black box, rewrites can't be done 
before planning (no column binding yet) so must be done after (column bindings 
available, but so is the entire plan.)

Some expression nodes have incomplete implementations. For example, {{X BETWEEN 
Y AND Z}} does not compute a cost (because it is a "virtual" node: it does not 
exist at run time, having been rewritten to {{Y <= X AND X <= Z}}.) This means 
that, not only do we throw away the first plan, that first plan was actually 
wrong: it used incomplete information.

Thus, in order to get the semantic info needed for rewrites (column bindings), 
we end up creating an entire plan which we must then discard and rebuild after 
doing the rewrites (so the planner has the full information.)
h4. Alternative

The alternative approach is to integrate expression rewrites into the planner 
process, rather than doing them from the outside so that we make only a single 
pass through the planner. In particular:
 * Analyze expressions to create column bindings.
 * Match up SELECT and GROUP BY and other expressions (if required.) GROUP BY 
points to a SELECT clause node (so it will see rewrites) rather than each 
SELECT expression (which will be discarded.)
 * Rewrite SELECT and WHERE clause expressions. (Bound GROUP BY expressions 
will see the rewrites.)
 * Complete the plan as today.

With this approach, we plan only once, and that plan has a full set of cost 
information based on the rewritten expressions which the BE will execute.

The purpose of this ticket is to track this analysis and to later propose a 
detailed fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to