"Nathan Boley" <npbo...@gmail.com> writes: > Isn't a selectivity estimate of x = v as ( the number of values in v's > histogram bucket ) / ( number of distinct values in v's histogram > bucket ) pretty rational? Thats currently what we do for non-mcv > values, except that we look at ndistinct over the whole table instead > of individual histogram buckets.
But the histogram buckets are (meant to be) equal-population, so it
should come out the same either way. The removal of MCVs from the
population will knock any possible variance in ndistinct down to the
point where I seriously doubt that this could offer a win. An even
bigger problem is that this approach requires estimating ndistinct
among fractions of the population, which will be proportionally less
accurate than the overall estimate. Accurate ndistinct estimation is
*hard*.

> Now, if there are 100 histogram buckets, then any values that occupy
> more than 1% of the table will be MCVs regardless - why force a value
> to be an MCV if it only occupies 0.1% of the table?

Didn't you just contradict yourself? The cutoff would be 1%, not 0.1%.
In any case, there's already a heuristic to cut off the MCV list at
some shorter length (i.e., more than 1% in this example) if it seems
not worthwhile to keep the last entries. See lines 2132ff (in CVS
HEAD) in analyze.c.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
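[For concreteness, here is a toy sketch of the two non-MCV estimates
under discussion. This is plain Python on made-up data, not
PostgreSQL internals; the function names and the equal-population
bucketing are my own illustrative assumptions. It shows why, when
ndistinct is spread evenly across buckets, the per-bucket estimate
reduces to the whole-table one.]

```python
# Toy comparison of the two non-MCV selectivity estimates discussed
# above. Hypothetical code, not taken from analyze.c/selfuncs.c.

def selectivity_whole_table(data, value):
    """Current-style estimate: 1 / ndistinct over the whole table."""
    return 1.0 / len(set(data))

def selectivity_per_bucket(data, value, num_buckets):
    """Proposed estimate: (bucket's fraction of the table) /
    (ndistinct within the bucket containing `value`)."""
    data = sorted(data)
    n = len(data)
    width = n // num_buckets  # equal-population buckets
    for i in range(num_buckets):
        bucket = data[i * width:(i + 1) * width]
        if bucket and bucket[0] <= value <= bucket[-1]:
            bucket_fraction = len(bucket) / n
            return bucket_fraction / len(set(bucket))
    return 0.0
```

With 4 distinct values spread evenly over 4 equal-population buckets,
both estimates give 0.25, matching the "should come out the same
either way" point; the per-bucket estimate only diverges when
ndistinct varies between buckets, which is exactly the variance that
removing MCVs suppresses.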