Paul Rogers created IMPALA-8019:
-----------------------------------

             Summary: Incorporate join input/output cardinality trade-offs into planner
                 Key: IMPALA-8019
                 URL: https://issues.apache.org/jira/browse/IMPALA-8019
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers
The issues discussed in IMPALA-8015 and IMPALA-8018 suggest we should revisit the core ideas behind Impala plan optimization. We have two parallel goals:

* Minimize join hash table size.
* Minimize join output cardinality.

h4. Heuristics and their Limits

Today, Impala has a simple rule for output cardinality control:

* Determine whether an M:1 (FK/PK) relationship exists between the probe and build sides. If so, the join is a candidate, since an M:1 relationship ensures that the output cardinality won't grow beyond the probe cardinality.
* Of all the M:1 candidates, pick the one with the smallest input cardinality.
* Repeat the above using the output of the previous join as the probe side of the next join.

The model is based on certain assumptions:

* An M:1 relationship is always cheaper than an M:N relationship.
* A smaller build cardinality is always better than a smaller output cardinality.
* Picking the smallest build-side cardinality (from the M:1 candidates) minimizes total cost.

While these are good rules of thumb, they are not universal laws. Consider these scenarios:

* We have two possible build-side tables, both of which need the same size hash table. One is an M:1 relationship that does not reduce output cardinality. The other is an M:N relationship that reduces output cardinality by 90%. Which is better?
* We have two possible M:1 build-side tables: one needs 1 MB of memory but does not reduce output cardinality. The other needs 2 MB of memory but reduces output cardinality by 90%. Which is better?
* A particular build-side table is large, but produces a 90% reduction in output from the probe side. Should we apply the smaller builds first (pick the ones with the smallest hash tables) and the bigger, more selective one last? Or should we do the most selective (largest build side) first, and enjoy the benefit of fewer output rows through all the other joins?

A very pragmatic question to ask is this: is any of this relevant in the real world?
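As a concrete illustration, the heuristic above can be sketched as follows. This is a hypothetical simplification, not Impala's actual planner code; the class, field names, and numbers are made up:

```python
from dataclasses import dataclass

@dataclass
class JoinCandidate:
    name: str
    build_cardinality: float   # estimated rows on the build side
    is_m_to_1: bool            # does an M:1 (FK/PK) relationship exist?
    output_cardinality: float  # estimated rows out of the join

def pick_next_join(candidates):
    """Today's greedy rule, simplified: among the M:1 candidates,
    take the one with the smallest build-side cardinality.
    Output cardinality is never consulted."""
    m_to_1 = [c for c in candidates if c.is_m_to_1]
    pool = m_to_1 or candidates
    return min(pool, key=lambda c: c.build_cardinality)

# The rule picks "dim2" (smallest M:1 build side) even though the M:N
# candidate has a smaller build side *and* a far smaller output.
cands = [
    JoinCandidate("dim1", 1000, True, 50000),
    JoinCandidate("dim2",  500, True, 60000),
    JoinCandidate("mn",    400, False, 5000),
]
```

Note that the M:N candidate is filtered out up front, which is exactly the limitation the scenarios above probe at.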
Don't real data warehouses always use clear M:1 relationships with cardinality reduction at each step? Can't we just ignore the above issues? As it turns out, we've seen cases similar to the above which, when planned with Impala's existing heuristics, result in less-than-optimal plans.

h4. Cost-Based Dynamic Programming

A classic approach to the above optimization problem is [dynamic programming|https://people.eecs.berkeley.edu/~brewer/cs262/queryopt.html]: consider a range of options at each step, combine those with a range of options at the next step, and keep the cumulative best for consideration in the subsequent step. Here "best" is defined by a cost function as described below. For example, suppose we have five tables. There are 5! (120) different possible plans -- too many to consider in detail. So, we proceed as follows:

* In the first round, pick the x best choices for the first table (the x largest input tables).
* In the next round, pick the y best choices for the second table for each of the x cases. Keep the best x.
* Repeat until the plan is complete. Pick the best of the remaining plans.

This approach allows the planner to "look ahead": to consider the effect of multiple rounds of decision making. Today, Impala uses x=1, so it looks only one step ahead: it can't use the combined cost of two stages, say, to answer the trade-off questions posed above. Dynamic programming "memoizes" subplans so that we don't have to recompute them if we consider them twice. (Impala currently reconsiders the same plans over and over as it works up the tree.)

h4. The Cost Function

Minimizing the join hash table size means (basically) minimizing the size of the build-side input: specifically, at each step, choosing the join that would have the smallest build-side cardinality. On the other hand, we don't want the result to expand the join output cardinality beyond that on the probe side.
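The beam-width-x dynamic program described above can be sketched as follows. This is a toy illustration under stated assumptions, not a proposed implementation: `plan_joins`, the `cost_of(order, t)` helper (incremental cost of joining table `t` after the tables in `order`), and the toy cost numbers are all hypothetical:

```python
def plan_joins(tables, cost_of, x=3):
    """Beam-style dynamic programming over join orders: keep the x
    cheapest partial plans at each level rather than the single
    greedy choice (today's behavior corresponds to x=1)."""
    # A partial plan: (set of joined tables, join order, cumulative cost).
    beam = [(frozenset(), (), 0.0)]
    memo = {}  # memoization: best known cost for each set of joined tables
    for _ in tables:
        nxt = []
        for joined, order, cost in beam:
            for t in tables:
                if t in joined:
                    continue
                new_cost = cost + cost_of(order, t)
                key = joined | {t}
                if key in memo and memo[key] <= new_cost:
                    continue  # a cheaper plan for this table set is known
                memo[key] = new_cost
                nxt.append((key, order + (t,), new_cost))
        # Keep the x cheapest partial plans, one per distinct table set.
        nxt.sort(key=lambda p: p[2])
        beam, seen = [], set()
        for p in nxt:
            if p[0] not in seen:
                seen.add(p[0])
                beam.append(p)
            if len(beam) == x:
                break
    return beam[0][1]  # join order of the cheapest complete plan

# Toy cost model: joining table t at position p costs weight[t] * p,
# so heavy tables are cheapest to handle early.
weights = {"a": 3, "b": 1, "c": 2}
cost_of = lambda order, t: weights[t] * (len(order) + 1)
```

With these weights, the greedy setting (x=1) commits to the locally cheapest first table and ends up with total cost 14, while x=3 looks far enough ahead to find the cost-10 order. This is the look-ahead benefit the text describes.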
So, we use heuristics to choose only M:1 (FK/PK) joins as candidates for the first optimizing step (minimizing hash table size). Since we wish to minimize two (related) variables, we are in the realm of multi-objective optimization. One way to approach it is with a joint cost function that allows us to compare the two objectives:

{noformat}
cost(plan) = f(|build|) + g(|join|)
{noformat}

Suppose we have two choices for a join: one has a small input cardinality but a large output cardinality; the other has a larger input cardinality but reduces the output cardinality. How do we decide? The cost function's job is to encode those tradeoffs. A naive cost function sets {{f}} and {{g}} to the identity function:

{noformat}
cost(plan) = |build| + |join|
{noformat}

Comments in the code appear to assume (but never actually use) a cost function in which it costs twice as much to build a table as to probe it:

{noformat}
cost(plan) = 2 * |build| + |join|
{noformat}

If we think deeper, we see that we have a very strong preference for smaller build-side cardinality. Would we be willing to double the build-side table to reduce output cardinality by 90%? Seems reasonable. For a 10% reduction? Probably not. So, the weight should be stronger to reflect this:

{noformat}
cost(plan) = 10 * |build| + |join|
{noformat}

Now, if the build side doubles, the join cardinality has to drop by ten times the added build rows to compensate. Of course, we would really prefer not to have large hash tables, though changes in the size of small tables are not much of a concern. So, perhaps the cost function should be non-linear:

{noformat}
cost(plan) = |build| * log(|build|) + |join|
{noformat}

This says that when the table is small, we're willing to grow the table to get a smallish reduction in output. But, as the table grows, we need a very large reduction in join cardinality to justify the cost. The exact cost function requires more analysis and probably empirical experiments.
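A quick numerical check of the candidate cost functions above, with purely illustrative (made-up) cardinalities: doubling a small build side pays off for a 90% output reduction, but not for a 10% one, under both the weighted-linear and the log-weighted forms:

```python
import math

def linear_cost(build, join, w=10):
    """cost(plan) = 10 * |build| + |join|"""
    return w * build + join

def log_cost(build, join):
    """cost(plan) = |build| * log(|build|) + |join|"""
    return build * math.log(build) + join

probe = 1_000_000  # probe-side cardinality (illustrative)

small  = dict(build=50_000,  join=probe)            # M:1, no output reduction
big_90 = dict(build=100_000, join=probe // 10)      # double the build, 90% cut
big_10 = dict(build=100_000, join=probe * 9 // 10)  # double the build, 10% cut
```

Under `linear_cost`, the 90% reduction wins (1.1M vs. 1.5M) and the 10% reduction loses (1.9M vs. 1.5M), matching the intuition in the text; `log_cost` ranks these candidates the same way but penalizes further build-side growth progressively harder.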
The key point is, we can design such a function, and it can encode our intuition for the tradeoffs we are willing to make.

h4. Uniform Cardinality Estimation

The final step is to recognize, as shown in IMPALA-8018, that the "generic" and "FK/PK" cases are really the same: both are based on NDV values and table cardinalities, and the math for the generic case reduces to the "FK/PK" case. Further, once we move to a cost-based model, the intuition that "FK/PK is better" is encoded into the cost function, which is designed so that an M:N join has to provide a large cardinality reduction to be considered over an M:1 relationship.

h4. Calcite?

[Apache Calcite|https://calcite.apache.org] includes a cost-based planner. However, converting from the home-grown planner to Calcite would be expensive and is not necessary to perform the join selection described here. Instead, the proposed change can be done as an add-on to the existing planner and would affect only the join selection code. (Of course, if the project were to decide to switch to Calcite, then the join optimization would come for "free.")

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org