[ 
https://issues.apache.org/jira/browse/IMPALA-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-8045:
--------------------------------
    Description: 
The work to review join cardinality has found some major issues recorded as 
JIRA tickets. This ticket records a number of smaller issues. Some of these 
issues are a bit tricky because they appear only when some of the other issues 
are resolved. Reporting them directly could be misleading.

h4. ScanNode confusion between table and scan input cardinality

The {{ScanNode}} class in the planner contains an {{inputCardinality_}} field 
used by join calculations as a proxy for the table size. However, the actual 
scan node implementations set {{inputCardinality_}} to the estimated number of 
rows *read* by the scan, which is useful for understanding the physical scan 
structure. For joins, however, we need the base table cardinality.

For example, the join may use the input cardinality to understand the reduction 
in rows due to filters in order to adjust the NDV of key columns. But, since 
the input cardinality is the scan count, not the table row count, the math does 
not work out.

The solution is to clarify the code to separate the idea of scan count vs. base 
table row count.
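
For illustration, a minimal sketch of the separation being proposed; the field 
and accessor names are assumptions, not the current {{ScanNode}} API:

{code:java}
// Sketch only: keep the two concepts in separate fields so join planning can
// ask for the one it actually needs. Names are hypothetical.
public abstract class ScanNodeSketch {
  // Rows the scan is expected to read (after partition/key-range pruning).
  protected long scanInputCardinality_ = -1;
  // Row count of the underlying table, taken from table stats.
  protected long tableRowCount_ = -1;

  // Used for costing the physical scan.
  public long getScanInputCardinality() { return scanInputCardinality_; }

  // Used by join planning as the denominator for selectivity and NDV scaling.
  public long getTableRowCount() { return tableRowCount_; }
}
{code}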

h4. Selectivity Confusion

Similarly, each node computes its selectivity. However, the selectivity is only 
for those predicates that will be applied via a projection. Predicates that can 
be applied because of partition pruning (HDFS), key range pruning (HBase) and 
so on do not "count". While this produces accurate execution estimates, it is 
not helpful for join planning.

In join planning, we need to know the number of filtered rows relative to the 
total table cardinality. This allows us to adjust the NDV of the join key 
columns in order to estimate the number of rows produced by the join.

Using the partial selectivity, or the partial input cardinality (the issue 
above), causes inaccurate key-cardinality adjustments and incorrect join 
cardinality estimates.
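
As a hedged sketch of the intended accounting (the helper and the exact formula 
are assumptions, not existing planner code), the key NDV would be scaled by the 
table-relative selectivity rather than the scan-relative one:

{code:java}
// Sketch: scale a key column's NDV by the fraction of the *table* that
// survives all predicates, not by the fraction of the rows the scan reads.
static double adjustKeyNdv(double keyNdv, double filteredRows, double tableRows) {
  if (tableRows <= 0 || keyNdv <= 0) return keyNdv;    // no usable stats
  double totalSelectivity = filteredRows / tableRows;  // relative to the whole table
  // The NDV cannot exceed the number of surviving rows.
  return Math.min(keyNdv * totalSelectivity, filteredRows);
}
{code}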

h4. Join Node Does not Apply Selectivity from Its Predicates

A join node can have "additional predicates" that are applied after a join row 
is created. Accurate estimation of join cardinality must include the 
selectivity of those predicates, but this is not currently done, perhaps 
because such predicates, in the current estimation scheme, always produce an 
estimated selectivity of 0.1. This will become more important as we add more 
realistic estimates.
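
A rough sketch of the missing step, assuming some upstream estimate of the 
combined selectivity of the join's own predicates; the method name is 
hypothetical:

{code:java}
// Sketch: after estimating the raw join cardinality from the equi-join
// predicates, discount it by the combined selectivity of the remaining
// ("other") join predicates.
static long applyOtherJoinPredicates(long joinCardinality, double otherPredsSelectivity) {
  if (joinCardinality < 0) return joinCardinality;        // unknown stays unknown
  if (otherPredsSelectivity < 0) return joinCardinality;  // no usable estimate
  return Math.round(joinCardinality * otherPredsSelectivity);
}
{code}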

h4. Use Double, not Long for Cardinality Values

In scan nodes, row counts are reasonable numbers and a Java {{long}} is fine. 
But once we start computing join cardinalities, values can grow fast, 
especially for cross joins. The code currently has special checks to cap 
products at {{Long.MAX_VALUE}}. While this solves the overflow issue, it has 
undesirable downstream effects. First, it throws off selectivity calculations, 
since the reported cardinality is not the real cardinality. Second, it requires 
special math calls whenever we multiply cardinalities.

It is much simpler to work with a {{double}}. When values get large, the extra 
precision of an integer value is completely lost in the noise of assumptions 
and estimations.
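
A small illustration of the trade-off; the saturating helper below stands in 
for the special-cased {{long}} math described above and is not the actual 
Impala utility:

{code:java}
// With long, every product needs a saturating helper to avoid overflow, and
// the capped result is no longer the "real" cardinality:
static long saturatingMultiply(long a, long b) {
  try {
    return Math.multiplyExact(a, b);
  } catch (ArithmeticException e) {
    return Long.MAX_VALUE;
  }
}

// With double, a plain product suffices; very large values merely lose
// precision, which is harmless at the scale of these estimates:
static double multiplyCardinalities(double a, double b) { return a * b; }
{code}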

h4. Revisit Cardinality Calcs for Join Nodes

The method {{JoinNode.computeStats()}} is a bit muddled. It starts by computing 
cardinality based on the major type family (semi-join, inner/outer join, cross 
join), then revises those calculations based on the specific join type. This 
makes the logic very hard to follow because we have to trace two distinct 
blocks of code. There is also redundancy: the cross join cardinality is 
calculated twice, for example.

Refactor to have a cardinality/selectivity calculation per join type.
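
One possible shape of such a refactor, sketched with a toy operator enum and 
textbook formulas rather than the real {{JoinNode}} logic:

{code:java}
// Sketch only: one estimate per join type in a single dispatch.
final class JoinCardSketch {
  enum Op { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER, LEFT_SEMI, LEFT_ANTI, CROSS }

  static long estimate(Op op, long lhs, long rhs, double joinKeyNdv) {
    switch (op) {
      case CROSS:
        return clamp((double) lhs * rhs);   // computed in exactly one place
      case LEFT_SEMI:
      case LEFT_ANTI:
        return lhs;                         // crude bound: at most the left input
      default:                              // inner/outer: |L| * |R| / max(NDV, 1)
        return clamp((double) lhs * rhs / Math.max(joinKeyNdv, 1.0));
    }
  }

  private static long clamp(double v) {
    return (long) Math.min(v, (double) Long.MAX_VALUE);
  }
}
{code}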

h4. Disallow Unknown Cardinality

Multiple nodes can produce a cardinality of -1 (unknown). Since it is 
impossible to plan based on an unknown cardinality, we must have an estimate, 
however good or bad. For cases where we have no stats, estimate cardinality 
based on other factors. If we have no column NDV, perhaps guesstimate 
something, or use an alternative join calculation that avoids the need for NDV 
(while producing much cruder estimates). However, refusing to play the game at 
all is not helpful unless we choose to fail the query for lack of stats.
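
For illustration, one possible shape of a "never return -1" policy; the 
fallback values below are placeholders, not proposed defaults:

{code:java}
// Sketch: always return some estimate instead of -1.
static long estimateOrGuess(long statsCardinality, long scanBytes, int estimatedRowWidth) {
  if (statsCardinality >= 0) return statsCardinality;   // trust stats when present
  if (scanBytes > 0 && estimatedRowWidth > 0) {
    return Math.max(1, scanBytes / estimatedRowWidth);  // size-based guess
  }
  return 1000;                                          // last-resort placeholder
}
{code}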

h4. Revisit Join Cardinality Limit

The {{JoinNode}} class has several methods that limit cardinality:

{code:java}
  public void computeStats(Analyzer analyzer) {
    ...
        cardinality_ = capCardinalityAtLimit(cardinality_);
    ...
  }

  public boolean hasLimit() { return limit_ > -1; }

  protected long capCardinalityAtLimit(long cardinality) {
    if (hasLimit()) {
      return capCardinalityAtLimit(cardinality, limit_);
    }
    return cardinality;
  }
{code}

It is not clear when or why we apply a limit. Perhaps as part of {{LIMIT x}} 
processing? Revisit whether the limit is helpful, and remove it if not. 
Imposing a limit throws off downstream joins, which is probably not what is 
wanted here.

h4. Properly Handle Duplicated Filters in Outer Joins

Consider the following query on TPC-H, which picks 1/3 (newer version) or 1/10 
(older version) of the orders and then joins them with their customers:

{code:sql}
select c.c_custkey, o.o_orderkey
from tpch.customer c
left outer join tpch.orders o on c.c_custkey = o.o_custkey
where o.o_clerk < 'foo'
{code}

The plan produced places the WHERE clause predicate on both the scan *and* join 
nodes:

{noformat}
PLAN-ROOT SINK
|
02:HASH JOIN [RIGHT OUTER JOIN]
|  hash predicates: o.o_custkey = c.c_custkey
|  other predicates: o.o_clerk < 'foo'           <== Huh?
|  row-size=51B cardinality=163.35K
|
|--00:SCAN HDFS [tpch.customer c]
|     partitions=1/1 files=1 size=23.08MB row-size=8B cardinality=150.00K
|
01:SCAN HDFS [tpch.orders o]
   partitions=1/1 files=1 size=162.56MB row-size=43B cardinality=495.00K
   predicates: o.o_clerk < 'foo'                  <== Obvious location
{noformat}

The filter must be applied twice because the outer join will produce null order 
rows. The filter won't match if the clerk field is null, so the predicate is 
applied again.

However, the meaning of the predicate in the join is "remove null rows" which 
kind of means to undo the "outer". Given that, such a query is rather 
meaningless.

The point here is to properly account for the predicate. Simply applying the 
{{sel(predicate)}} value twice is clearly wrong. We actually want to apply the 
predicate only to those rows that were generated by the outer join, to avoid 
double-counting.
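
A hedged sketch of that accounting: {{sel(predicate)}} is applied once at the 
scan, and the join-level copy only decides whether the NULL-extended rows 
survive. The names and inputs are illustrative:

{code:java}
// Sketch: avoid double-counting a predicate that appears on both the scan
// and the outer join.
static double outerJoinOutputRows(
    double matchingRows,      // join rows whose orders side survived the scan filter
    double nullExtendedRows,  // customer rows with no match, padded with NULLs
    boolean predicateMatchesNulls) {
  // sel(predicate) was already applied to matchingRows at the scan; the
  // join-level copy only decides the fate of the NULL-extended rows.
  return matchingRows + (predicateMatchesNulls ? nullExtendedRows : 0.0);
}
{code}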

This is an obscure corner case; the question is whether we need to account for 
this kind of issue in multiple places. If we do, we need a more sophisticated 
set of data models for predicates than is currently used.

See test case in {{card-joins.test}} that references this ticket number for an 
example.

h4. Better Handling of OUTER JOIN with Column Filter

Consider:

{code:sql}
SELECT c.c_name, o.o_orderkey
FROM customers c RIGHT OUTER JOIN orders o ON c.c_custkey = o.o_custkey
WHERE c.c_name = 'Bob'
{code}

The query runs the "c.c_name = 'Bob'" predicate twice: once in the scan and a 
second time in the join. Why in the join? To check whether the null left-side 
(customer) columns match the predicate (they don't). The revised code handles 
this more-or-less correctly using the NDV of "c_name", but doing so is subtly 
wrong.

We want to know the effect of applying the filter after the join. We know that 
all the rows that match have the name "Bob", so the NDV should be 1. The only 
non-"Bob" rows are the null rows inserted by the join. So we should use logic 
that is aware of this case to provide a better estimate.

On the other hand, if the predicate were "WHERE c.c_name IS NULL", the results 
would be far different. So the outer-join logic should also be aware of the 
meaning (null handling) of the predicate. In this case, all the inserted null 
rows match, so we should *not* use the null count or NDV to guess the filtering.

The point is that this area is subtle and cannot be brute-forced.
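
A sketch that makes the two cases concrete; the inputs and names are 
hypothetical and the logic is deliberately limited to the cases discussed 
above:

{code:java}
// Sketch of the case analysis above, nothing more.
static double filteredOuterJoinRows(
    double matchingRows,      // rows with a real customer match, already filtered to 'Bob'
    double nullExtendedRows,  // rows whose customer side is NULL-extended
    boolean predicateIsIsNull) {
  if (predicateIsIsNull) {
    // "c.c_name IS NULL": every NULL-extended row passes, real matches do not.
    return nullExtendedRows;
  }
  // "c.c_name = 'Bob'": NULL-extended rows never pass; the surviving rows all
  // carry the single value 'Bob', so the effective NDV is 1, not ndv(c_name).
  return matchingRows;
}
{code}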

> Rollup of Smaller Join Cardinality Issues
> -----------------------------------------
>
>                 Key: IMPALA-8045
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8045
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>


