Paul Rogers created IMPALA-8019:
-----------------------------------

             Summary: Incorporate join input/output cardinality trade-offs into 
planner
                 Key: IMPALA-8019
                 URL: https://issues.apache.org/jira/browse/IMPALA-8019
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers


The issues discussed in IMPALA-8015 and IMPALA-8018 suggest we should revisit the core ideas behind Impala plan optimization. We have two parallel goals:

* Minimize join hash table size.
* Minimize join output cardinality.

h4. Heuristics and their Limits

Today, Impala uses a simple set of heuristics for output cardinality control (sketched in code after the list):

* Determine whether an M:1 (FK/PK) relationship exists between the probe and build sides. If so, the join is a candidate, since an M:1 relationship ensures that the output cardinality won't grow beyond the probe cardinality.
* Of all the M:1 candidates, pick the one with the smallest build-side cardinality.
* Repeat the above, using the output of the previous join as the probe side of the next join.
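A minimal sketch of this greedy selection, assuming hypothetical stand-in types (none of the names below are the actual planner classes):

{noformat}
import java.util.List;

// Illustrative only: a greedy version of Impala's current heuristic.
// JoinCandidate and its fields are hypothetical stand-ins.
public class GreedyJoinHeuristic {

  /** Stand-in for a candidate build-side table. */
  public static class JoinCandidate {
    final String name;
    final long buildCardinality;   // rows on the build side
    final boolean fkPkWithProbe;   // M:1 relationship to the current probe?

    JoinCandidate(String name, long buildCardinality, boolean fkPkWithProbe) {
      this.name = name;
      this.buildCardinality = buildCardinality;
      this.fkPkWithProbe = fkPkWithProbe;
    }
  }

  /** Pick the next join: the M:1 candidate with the smallest build side. */
  public static JoinCandidate chooseNextJoin(List<JoinCandidate> remaining) {
    JoinCandidate best = null;
    for (JoinCandidate c : remaining) {
      // Only M:1 (FK/PK) joins are considered: they cannot expand the
      // output beyond the probe cardinality.
      if (!c.fkPkWithProbe) continue;
      // Among the candidates, prefer the smallest hash table.
      if (best == null || c.buildCardinality < best.buildCardinality) best = c;
    }
    // The chosen join's output becomes the probe side for the next round.
    return best;
  }
}
{noformat}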

The model is based on certain assumptions:

* An M:1 relationship is always cheaper than an M:N relationship.
* A smaller build cardinality is always better than a smaller output cardinality.
* Picking the smallest build-side cardinality (from the M:1 candidates) minimizes total cost.

While these are good rules of thumb, they are not universal laws. Consider the following scenarios:

* We have two possible build-side tables, both of which need a hash table of the same size. One is an M:1 relationship that does not reduce output cardinality. The other is an M:N relationship that reduces output cardinality by 90%. Which is better?
* We have two possible M:1 build-side tables: one needs 1 MB of memory but does not reduce output cardinality; the other needs 2 MB of memory but reduces output cardinality by 90%. Which is better?
* A particular build-side table is large, but produces a 90% reduction in output from the probe side. Should we apply the smaller builds first (pick the ones with the smallest hash tables) and the bigger, more selective one last? Or should we do the most selective join (the one with the largest build side) first, and enjoy the benefit of fewer output rows through all the other joins?

A very pragmatic question to ask is this: is any of this relevant in the real 
world? Don't real data warehouses always use clear M:1 relationships with 
cardinality reduction at each step? Can't we just ignore the above issues? As 
it turns out, we've seen cases similar to the above which, when planned with 
Impala's existing heuristics, result in less-than-optimal plans.

h4. Cost-Based Dynamic Programming

A classic approach to the above optimization problem is [dynamic 
programming|https://people.eecs.berkeley.edu/~brewer/cs262/queryopt.html]: 
consider a range of options at each step, combine those with a range of options 
at the next step, and keep the cumulative best for consideration in the 
subsequent step. Here "best" is defined by a cost function as described below.

For example, suppose we have five tables. There are 5! (120) different possible 
plans -- too many to consider in detail. So, we proceed as follows:

* In the first round, pick the x best choices for the first table (the x largest input tables).
* In the next round, pick the y best choices for the second table for each of the x cases. Keep the x best.
* Repeat until the plan is complete. Pick the best of the remaining plans.

This approach allows the planner to "look ahead": to consider the effect of multiple rounds of decision making. Today, Impala effectively uses x=1, so it looks only one step ahead: it can't use the combined cost of two stages, say, to answer the trade-off questions posed above.

Dynamic programming "memoizes" subplans so that we don't have to recompute them 
if we consider them twice. (Impala currently will reconsider the same plans 
over and over as it works up the tree.)
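A rough sketch of the idea as a beam search over partial join orders, with memoization keyed by the set of joined tables. Everything here ({{PartialPlan}}, {{CostModel}}, the table names) is a hypothetical stand-in, not the planner's actual API:

{noformat}
import java.util.*;

// Illustrative only: beam-search join enumeration with memoization.
public class JoinOrderSearch {

  /** A partial plan: the set of tables joined so far plus cumulative cost. */
  static class PartialPlan {
    final Set<String> joined;
    final List<String> order;
    final double cost;

    PartialPlan(Set<String> joined, List<String> order, double cost) {
      this.joined = joined;
      this.order = order;
      this.cost = cost;
    }
  }

  /** Placeholder for the cost function discussed in the next section. */
  interface CostModel {
    double costOf(PartialPlan base, String nextTable);
  }

  static PartialPlan search(List<String> tables, CostModel model, int beamWidth) {
    // Best plan seen for each set of joined tables, so equivalent subplans
    // are computed only once.
    Map<Set<String>, PartialPlan> memo = new HashMap<>();
    List<PartialPlan> beam = new ArrayList<>();
    beam.add(new PartialPlan(new HashSet<>(), new ArrayList<>(), 0.0));

    for (int round = 0; round < tables.size(); round++) {
      for (PartialPlan p : beam) {
        for (String t : tables) {
          if (p.joined.contains(t)) continue;
          Set<String> joined = new HashSet<>(p.joined);
          joined.add(t);
          List<String> order = new ArrayList<>(p.order);
          order.add(t);
          double cost = p.cost + model.costOf(p, t);
          PartialPlan best = memo.get(joined);
          if (best == null || cost < best.cost) {
            memo.put(joined, new PartialPlan(joined, order, cost));
          }
        }
      }
      // Keep only the beamWidth cheapest partial plans of this size.
      List<PartialPlan> next = new ArrayList<>();
      for (PartialPlan p : memo.values()) {
        if (p.joined.size() == round + 1) next.add(p);
      }
      next.sort(Comparator.comparingDouble((PartialPlan pp) -> pp.cost));
      beam = new ArrayList<>(next.subList(0, Math.min(beamWidth, next.size())));
    }
    return beam.isEmpty() ? null : beam.get(0);
  }
}
{noformat}

Impala's current behavior corresponds roughly to a beam width of 1 restricted to M:1 candidates; widening the beam lets the planner weigh the trade-offs above across more than one step.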

h4. The Cost Function

Minimizing the join hash table size means (basically) minimizing the size of the build-side input: at each step, we choose the join with the smallest build-side cardinality.

On the other hand, we don't want the result to expand the join output cardinality beyond that of the probe side. So, we use heuristics to choose only M:1 (FK/PK) joins as candidates for the first objective (minimizing hash table size).

Since we wish to minimize two (related) variables, we are in the realm of 
multi-objective optimization. One way to approach it is with a joint cost 
function that allows us to compare the two objectives:

{noformat}
cost(plan) = f(|build|) + g(|join|)
{noformat}

Suppose we have two choices for a join: one has a small input cardinality but a large output cardinality; the other has a larger input cardinality but reduces the output cardinality. How do we decide? The cost function's job is to encode those tradeoffs.

A naive cost function sets {{f}} and {{g}} to the identity function:

{noformat}
cost(plan) = |build| + |join|
{noformat}

Comments in the code appear to assume (but never actually use) a cost function in which building a row into the hash table costs twice as much as probing with a row:

{noformat}
cost(plan) = 2 * |build| + |join|
{noformat}

If we think more deeply, we see that we have a very strong preference for smaller build-side cardinality. Would we be willing to double the build-side table to reduce output cardinality by 90%? That seems reasonable. For a 10% reduction? Probably not. So, the build-side weight should be larger to reflect this:

{noformat}
cost(plan) = 10 * |build| + |join|
{noformat}

Now, for every extra build-side row, the join cardinality has to shrink by ten rows to compensate. Of course, we would really prefer not to have large hash tables, though changes in the size of small tables are not much of a concern. So, perhaps the cost function should be non-linear:

{noformat}
cost(plan) = |build| * log(|build|) + |join|
{noformat}

This says that when the table is small, we're willing to grow it to get a smallish reduction in output; but as the table grows, we need a very large reduction in join cardinality to justify the cost.
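To make the trade-off concrete, here is an illustrative comparison of the candidate cost functions on two hypothetical choices: a 1M-row build with no output reduction versus a 2M-row build that cuts output by 40%. All numbers are made up for the example:

{noformat}
// Illustrative only: evaluates the candidate cost functions on two
// hypothetical join choices. The cardinalities are invented.
public class CostFunctionSketch {

  // cost = |build| + |join|
  static double naive(double build, double join) { return build + join; }

  // cost = 10 * |build| + |join|
  static double weighted(double build, double join) { return 10 * build + join; }

  // cost = |build| * log(|build|) + |join|  (natural log; the base is a tuning choice)
  static double nonLinear(double build, double join) {
    return build * Math.log(build) + join;
  }

  public static void main(String[] args) {
    // Choice A: 1M-row build, no output reduction (join output stays at 20M rows).
    // Choice B: 2M-row build, 40% output reduction (join output drops to 12M rows).
    double buildA = 1e6, joinA = 20e6;
    double buildB = 2e6, joinB = 12e6;

    System.out.printf("naive:      A=%,.0f B=%,.0f%n",
        naive(buildA, joinA), naive(buildB, joinB));
    System.out.printf("weighted:   A=%,.0f B=%,.0f%n",
        weighted(buildA, joinA), weighted(buildB, joinB));
    System.out.printf("non-linear: A=%,.0f B=%,.0f%n",
        nonLinear(buildA, joinA), nonLinear(buildB, joinB));
    // The naive sum prefers B: it trades any build growth for any output
    // reduction. The weighted and non-linear functions decide that a 40%
    // output reduction does not justify doubling the build side, and prefer A.
  }
}
{noformat}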

The exact cost function requires more analysis and probably empirical 
experiments. The key point is, we can design such a function and it can encode 
our intuition for the tradeoffs we are willing to make.

h4. Uniform Cardinality Estimation

The final step is to recognize, as shown in IMPALA-8018, that the "generic" and "FK/PK" cases are really the same: both are based on NDV values and table cardinalities, and the math for the generic case reduces to the "FK/PK" case. Further, once we move to a cost-based model, the intuition that "FK/PK is better" is encoded into the cost function, which is designed so that an M:N join has to provide a large cardinality reduction to be preferred over an M:1 relationship.
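To illustrate the reduction, here is the standard textbook NDV-based estimate (see IMPALA-8018 for the formulas Impala actually uses; the numbers below are invented):

{noformat}
// Illustrative only: the textbook join-cardinality estimate, showing how the
// generic NDV-based formula collapses to the FK/PK case.
public class JoinCardinalityEstimate {

  /** Generic estimate: |probe| * |build| / max(NDV(probe key), NDV(build key)). */
  static double genericJoinCardinality(
      double probeRows, double buildRows, double probeKeyNdv, double buildKeyNdv) {
    return probeRows * buildRows / Math.max(probeKeyNdv, buildKeyNdv);
  }

  public static void main(String[] args) {
    double probeRows = 10e6, probeKeyNdv = 100e3;   // FK column
    double buildRows = 100e3, buildKeyNdv = 100e3;  // PK column: NDV == row count

    // In the FK/PK case, NDV(build key) == |build| and dominates the max, so
    // the formula reduces to |probe| * |build| / |build| == |probe|: the join
    // cannot exceed the probe cardinality, which is exactly the M:1 intuition.
    System.out.println(
        genericJoinCardinality(probeRows, buildRows, probeKeyNdv, buildKeyNdv));
  }
}
{noformat}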

h4. Calcite?

[Apache Calcite|https://calcite.apache.org] includes a cost-based planner. 
However, converting from the home-grown planner to Calcite would be expensive 
and is not necessary to perform the join selection described here. Instead, the 
proposed change can be done as an add-on to the existing planner and would
affect only the join selection code. (Of course, if the project were to decide 
to switch to Calcite, then the join optimization would come for "free.")


