So I had a thought about how to soften the controversial hard cutoff of 100 samples for using the histogram selectivity. Instead of switching 100% one way or the other between the two heuristics, why not calculate both and combine them? The larger the sample size behind the histogram, the more weight we give the histogram calculation; the smaller it is, the more we weight the heuristic.
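Just to illustrate the shape of it, something like the sketch below. All the names here are made up for the example (this is not the actual selfuncs.c API), and the weight is just the simple linear ramp I describe next rather than anything derived from proper confidence intervals:

    #include <stdio.h>

    typedef double Selectivity;

    /*
     * How much to trust the histogram estimate, as a fraction in [0, 1].
     * Simple linear ramp: 10 histogram entries -> 0.10, 50 -> 0.50,
     * 100 or more -> 1.0 (the status quo).  Ideally this would instead
     * fall out of a confidence-interval calculation.
     */
    static double
    histogram_weight(int num_hist)
    {
        if (num_hist >= 100)
            return 1.0;
        if (num_hist <= 0)
            return 0.0;
        return num_hist / 100.0;
    }

    /* Blend the histogram-based estimate with the heuristic one. */
    static Selectivity
    blended_selectivity(Selectivity hist_sel, Selectivity heuristic_sel,
                        int num_hist)
    {
        double  w = histogram_weight(num_hist);

        return w * hist_sel + (1.0 - w) * heuristic_sel;
    }

    int
    main(void)
    {
        /* histogram says 0.02, heuristic says 0.005, 50 histogram entries */
        printf("%g\n", blended_selectivity(0.02, 0.005, 50));  /* 0.0125 */
        return 0;
    }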
My first thought was to scale it linearly, so that at the default statistics target of 10 samples we would use 10% of the histogram estimate plus 90% of the heuristic. That degenerates to the status quo for 100 samples and up.

But actually I wonder if we can't put some solid statistics behind the percentages. The smaller the sample size, the wider the 95% confidence interval, and we would want to use the heuristic to adjust the result within that interval. I think there are even ways of calculating pessimistic confidence intervals for unrepresentative samples. This would allow the results to degrade smoothly: a sample size of 50 would be expected to give results somewhere between those of 10 and 100, instead of the current behaviour where the results are exactly the same until you hit 100 samples and then suddenly jump to a different value.

I would do it by just making histogram_selectivity() never fail unless the histogram is smaller than 2 * the number of ignored values. There would be an additional double * parameter in which the function would store the weight to give its result. The caller would be responsible for combining the results, just as it already is with the MCV estimates.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!