Hi,
On 07/30/2015 06:58 PM, Heikki Linnakangas wrote:
> The problem with a threshold is that around that threshold, even a
> small change in the data set can drastically change the produced
> estimates. For example, imagine that we know from the stats that zip
> code implies city. But then someone adds a single row to the table
> with an odd zip code & city combination, which pushes the estimator
> over the threshold, and the columns are no longer considered
> dependent, and the estimates are now completely different. We should
> avoid steep cliffs like that.
>
> BTW, what is the threshold in the current patch?
There's no simple threshold - the algorithm mining the functional
dependencies is a bit more complicated than that. I tried to explain it
in the comment before build_mv_dependencies (in dependencies.c), but
let me briefly summarize it here.
To mine a dependency [A => B], build_mv_dependencies does this (a
condensed code sketch follows the list):
(1) sort the sample by {A,B}
(2) split the sample into groups with the same value of A
(3) for each group, decide if it's consistent with the dependency
    (i.e. whether all rows in the group share a single value of B):
    (a) if the group is too small (less than 3 rows), ignore it
    (b) if the group is consistent, update n_supporting and
        n_supporting_rows
    (c) if the group is inconsistent, update n_contradicting and
        n_contradicting_rows
(4) decide whether the dependency is "valid" by checking
    n_supporting_rows >= n_contradicting_rows * 10
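In code, the per-group counting boils down to something like this (a
condensed sketch with made-up types - SampleGroup and
dependency_is_valid don't exist in the patch, which does all of this
while walking the sorted sample inside build_mv_dependencies):

#include <stdbool.h>

/*
 * Illustration only: one entry per group of sample rows sharing the
 * same value of A. The n_supporting / n_contradicting group counters
 * are omitted here, as step (4) only needs the row counts.
 */
typedef struct SampleGroup
{
    int     nrows;          /* number of sample rows in the group */
    bool    consistent;     /* do all rows share a single value of B? */
} SampleGroup;

static bool
dependency_is_valid(const SampleGroup *groups, int ngroups)
{
    int     n_supporting_rows = 0;
    int     n_contradicting_rows = 0;

    for (int i = 0; i < ngroups; i++)
    {
        /* (3a) groups that are too small tell us nothing - skip them */
        if (groups[i].nrows < 3)
            continue;

        if (groups[i].consistent)
            n_supporting_rows += groups[i].nrows;       /* (3b) */
        else
            n_contradicting_rows += groups[i].nrows;    /* (3c) */
    }

    /* (4) the (arbitrary) 10:1 validity condition */
    return (n_supporting_rows >= n_contradicting_rows * 10);
}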
The 10x limit is rather arbitrary, and yes - I can imagine a more
complex condition (e.g. one looking at the average number of tuples per
group), but I haven't looked into that - the point was to use something
very simple, just to illustrate the infrastructure.
I think we might come up with some elaborate way of associating a
"degree" with the functional dependency, but at that point we really
lose the simplicity, and also make it indistinguishable from the
remaining statistics types (because it would no longer be possible to
reduce the clauses before performing the regular estimation). That
reduction is exactly what makes the functional dependencies so neat and
efficient, so I'm not overly enthusiastic about doing that.
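To make the reduction concrete: with a valid dependency (zip => city),
the selectivity of (zip = z AND city = c) is estimated simply as the
selectivity of (zip = z) - the city clause is implied, so it can be
dropped before the regular per-clause estimation even starts. Attach a
fractional degree to the dependency, and the clause can no longer be
dropped; the degree would have to be folded into the estimate as yet
another multiplicative correction, just like the other statistics types.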
What seems more interesting is implementing the ndistinct coefficient
instead, as proposed by Kyotaro-san - that seems to have the nice
"smooth" behavior you desire, while keeping the simplicity.
Both statistics types (functional dependencies and the ndistinct
coefficient) have one weak point, though - they implicitly assume the
queries use "compatible" values. For example, with a query like

    WHERE city = 'New York' AND zip = 'zip for Detroit'

they can't detect that the two constants are incompatible, because
these statistics types are oblivious to the individual values. I don't
see this as a fatal flaw, though - it's simply a consequence of the
nature of the stats, and I tend to look at the functional dependencies
the same way.
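For instance, if the Detroit zip code matched, say, 0.1% of the table,
the dependency-based estimate for that query would be roughly 0.1% of
the rows (the city clause gets reduced away), while the actual result
is empty.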
If you need stats without these "issues", you'll have to use an MCV
list or a histogram. Trying to fix the simple statistics types is
futile, IMHO.
regards
Tomas
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers