Re: PATCH: add support for IN and @> in functional-dependency statistics use

Tomas Vondra Fri, 13 Mar 2020 09:10:06 -0700

On Fri, Mar 13, 2020 at 08:42:49AM +0000, Dean Rasheed wrote:

On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <[email protected]> wrote:


I'm sorry, but I don't see how we could do this for arbitrary clauses. I
think we could do that for clauses that have equality semantics and
reference column values as a whole. So I think it's possible to do this
for IN clauses (which is what the first part of the patch does), but I
don't think we can do it for the containment operator.

I.e. we can do that for

     WHERE a IN (...) AND b IN (...)


Hmm, the difficulty always comes back to the compatibility of the
clauses though. It's easy to come up with artificial examples for
which functional dependencies come up with bad estimates, even with
just = and IN (...) operators. For example, given a perfect
correlation like

 a | b
-------
 1 | 1
 2 | 2
 3 | 3
 : | :

you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
IN (2,4,6,8,...)" to get a very bad estimate from functional
dependencies.

However, I don't think such artificial examples are that useful. I
think you have to think in terms of real data distributions together
with real queries expected to go with them. For example:

Using the OP's original example of a multi-tenant system, you might
well have a table with columns (product_type, tenant_id) and a
functional dependency product_type => tenant_id. In that case, it
could well be very useful in optimising queries like "WHERE
product_type IN (X,Y,Z) AND tenant_id = 123".

But this isn't necessarily limited to = and IN (...). For example,
consider a table with UK-based geographic data with columns (location
point, postcode text). Then there would be a strong functional
dependency location => postcode (and possibly also the other way
round, depending on how dense the points were). That dependency could
be used to estimate much more general queries like "WHERE location <@
some_area AND postcode ~ '^CB.*'", where there may be no useful stats
on location, but a histogram on postcode might give a reasonable
estimate.

This also extends to inequalities. For example a table with columns
(weight, category) might have a strong functional dependency weight =>
category. Then a query like "WHERE weight > 10 AND weight < 20 AND
category = 'large'" could get a decent estimate from a histogram on
the weight column, and then use the functional dependency to note that
that implies the category. Note that such an example would work with
my patch from the other thread, because it groups clauses by column,
and uses clauselist_selectivity_simple() on them. So in this case, the
clauses "weight > 10 AND weight < 20" would be estimated together, and
would be able to make use of the RangeQueryClause code.

Of course, it's equally easy to come up with counter-example queries
for any of those examples, where using the functional dependency would
produce a poor estimate. Ultimately, it's up to the user to decide
whether or not to build functional dependency statistics, and that
decision needs to be based not just on the data distribution, but also
on the types of queries expected.


Well, yeah. I'm sure we can produce countless examples where applying
the functional dependencies to additional types of clauses helps a lot.
I'm somewhat hesitant to just drop any restrictions, though, because
it's equally simple to produce examples with poor results.

The main issue I have with just applying dependencies to arbitrary
clauses is it uses "degree" computed for the value as a whole, and
just uses it to estimate dependency between pieces of the values.

The IN() clause does not have this problem, the other cases like @> or
pattern matching do.

Given the timing though, perhaps it is best to limit this to IN (..)
clauses for PG13, and we can consider other possibilities later.


Yeah, I was gonna propose the same thing. I'll get the IN bit committed
shortly (both for dependencies and MCV), improved handling of OR clauses
and some additional regression tests to increase the coverage.

Then we can discuss these improvement in more detail for PG14.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: PATCH: add support for IN and @> in functional-dependency statistics use

Reply via email to