On 07/30/2015 03:55 PM, Tomas Vondra wrote:
On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:
I have some doubts about the clause reduction and functional
dependencies part of this. It seems to treat functional dependency as
a boolean property, but even with the classic zipcode and city case,
it's not always an all or nothing thing. At least in some countries,
there can be zipcodes that span multiple cities. So zipcode=X does
not completely imply city=Y, although there is a strong correlation
(if that's the right term). How strong does the correlation need to
be for this patch to decide that zipcode implies city? I couldn't
actually see a clear threshold stated anywhere.
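
To put some invented numbers on it: if 99% of the rows with zipcode=X
are in city Y and the remaining 1% are in a neighbouring city, treating
the implication as perfect overestimates "zipcode=X AND city=Y" by only
about 1%, while assuming independence between the columns can
underestimate it by an order of magnitude or more. So the interesting
question is where between those extremes the patch draws the line.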

So rather than treating functional dependence as a boolean, I think
it would make more sense to attach a 0.0-1.0 number to it. That means
you can't do clause reduction like it's done in this patch, where you
actually remove clauses from the query for cost estimation purposes.
Instead, you need to calculate the selectivity of each clause
independently, but instead of just multiplying the selectivities
together, apply the "dependence factor" when combining them.
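
To illustrate what I mean, here is a minimal sketch (the function and
the exact formula are made up for illustration, not taken from any of
the patches) that interpolates between plain independence and full
functional dependency:

#include <stdio.h>

/*
 * Hypothetical helper, not from the patch: combine the selectivities of
 * two clauses (a = x) and (b = y), given a "degree" in [0.0, 1.0] that
 * says how strongly column a determines column b.
 *
 *   degree = 0.0  ->  plain independence:          s_a * s_b
 *   degree = 1.0  ->  full functional dependency:  s_a
 */
static double
combine_selectivities(double s_a, double s_b, double degree)
{
    return s_a * (degree + (1.0 - degree) * s_b);
}

int
main(void)
{
    /* e.g. zipcode=X matches 1% of rows, city=Y matches 2% */
    printf("%f\n", combine_selectivities(0.01, 0.02, 0.0));  /* 0.000200 */
    printf("%f\n", combine_selectivities(0.01, 0.02, 1.0));  /* 0.010000 */
    printf("%f\n", combine_selectivities(0.01, 0.02, 0.9));  /* 0.009020 */
    return 0;
}

The exact formula is not the point; the point is that the estimate
degrades smoothly as the dependency gets weaker, instead of jumping
when some threshold is crossed.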

Does that make sense? I haven't really looked at the MCV, histogram
and "multi-statistics estimation" patches yet. Do those patches make
the clause reduction patch obsolete? Should we forget about the
clause reduction and functional dependency patch, and focus on those
later patches instead?

Perhaps. It's true that most real-world data sets are not 100% valid
with respect to functional dependencies - either because of natural
imperfections (multiple cities with the same ZIP code) or just noise in
the data (incorrect entries ...). And it's even mentioned in the code
comments somewhere, I guess.

But there are two main reasons why I chose not to extend the functional
dependencies with the [0.0-1.0] value you propose.

Firstly, functional dependencies were meant to be the simplest possible
implementation, illustrating how the "infrastructure" is supposed to
work (which is the main topic of the first patch).

Secondly, all kinds of statistics are "simplifications" of the actual
data. So I think it's not incorrect to ignore the exceptions up to some
threshold.

The problem with a threshold is that around that threshold, even a
small change in the data set can drastically change the produced
estimates. For example, imagine that we know from the stats that
zipcode implies city, but then someone adds a single row to the table
with an odd zipcode & city combination. That pushes the estimator over
the threshold, the columns are no longer considered dependent, and the
estimates are suddenly completely different. We should avoid steep
cliffs like that.
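
To attach some invented numbers: suppose the dependency is only
recorded when, say, 99% of the rows conform to it. A table drifting
from 99.1% to 98.9% conforming rows would then flip the estimate for
zipcode=X AND city=Y from roughly the selectivity of zipcode=X alone to
the product of the two per-clause selectivities - with selectivities of
1% and 2%, that's a jump from about 1% of the table to 0.02%, a 50x
swing caused by a handful of inserted rows.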

BTW, what is the threshold in the current patch?

- Heikki

