So I had a thought about how to soften the controversial hard cutoff of 100 samples for using the histogram selectivity. Instead of switching 100% one way or the other between the two heuristics, why not calculate both and combine them? The larger the sample size behind the histogram, the more weight we give the histogram calculation; the smaller it is, the more we weight the heuristic.
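Just to illustrate the shape of it, something like the sketch below. All the names here are made up for the example (this is not the actual selfuncs.c API), and the weight is just the simple linear ramp I describe next rather than anything derived from proper confidence intervals:

    #include <stdio.h>

    typedef double Selectivity;

    /*
     * How much to trust the histogram estimate, as a fraction in [0, 1].
     * Simple linear ramp: 10 histogram entries -> 0.10, 50 -> 0.50,
     * 100 or more -> 1.0 (the status quo).  Ideally this would instead
     * fall out of a confidence-interval calculation.
     */
    static double
    histogram_weight(int num_hist)
    {
        if (num_hist >= 100)
            return 1.0;
        if (num_hist <= 0)
            return 0.0;
        return num_hist / 100.0;
    }

    /* Blend the histogram-based estimate with the heuristic one. */
    static Selectivity
    blended_selectivity(Selectivity hist_sel, Selectivity heuristic_sel,
                        int num_hist)
    {
        double  w = histogram_weight(num_hist);

        return w * hist_sel + (1.0 - w) * heuristic_sel;
    }

    int
    main(void)
    {
        /* histogram says 0.02, heuristic says 0.005, 50 histogram entries */
        printf("%g\n", blended_selectivity(0.02, 0.005, 50));  /* 0.0125 */
        return 0;
    }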
My first thought was to scale it linearly, so that at the default statistics target of 10 samples we would use 10% of the histogram estimate plus 90% of the heuristic. That degenerates to the status quo for 100 samples and up.

But actually I wonder if we can't put some solid statistics behind the percentages. The smaller the sample size, the wider the 95% confidence interval, and we would want to use the heuristic to adjust the result within that interval. I think there are even ways of calculating pessimistic confidence intervals for unrepresentative samples. This would allow the results to degrade smoothly: a sample size of 50 would be expected to give results somewhere between those of 10 and 100, instead of the current behaviour where the results are exactly the same until you hit 100 samples and then suddenly jump to a different value.

I would do it by just making histogram_selectivity() never fail unless the histogram is smaller than 2 * the number of ignored values. There would be an additional double * parameter in which the function would store the weight to give its result. The caller would be responsible for combining the results, just as it already is with the MCV estimates.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!