Currently eqsel assumes that, except for the values stored as MCVs, the number of times a value appears in a table column is independent of the value itself. Unfortunately, this often yields poor selectivity estimates and frequently leads to inappropriate plans.
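To make the flat-frequency assumption concrete, here is a simplified model of eqsel's fallback for a value not in the MCV list (a sketch of the idea, not the actual selfuncs.c code): whatever probability mass is left after the MCVs is spread evenly over the remaining distinct values.

```python
def flat_eq_selectivity(ndistinct, mcv_freqs=()):
    # Simplified model of eqsel's non-MCV fallback: the frequency mass not
    # accounted for by the MCVs is assumed to be divided evenly among the
    # remaining (ndistinct - len(mcv_freqs)) distinct values.
    remaining = 1.0 - sum(mcv_freqs)
    return remaining / (ndistinct - len(mcv_freqs))

# With no MCVs, every equality probe against a column of 1,000,000 rows and
# 73 distinct values gets the same estimate (~13,700 rows), no matter which
# value is probed.
rows_estimate = 1_000_000 * flat_eq_selectivity(73)
```

Note that the estimate depends only on ndistinct, which is exactly the problem the example below illustrates.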
As an example, consider an insurance company that keeps a record of patient heights. Assume there are 1,000,000 patient heights in this column, distributed normally with a mean of 1.7526 m and a standard deviation of 0.0762 m, and that heights are only measured to the nearest centimeter. Then we'd expect about 73 distinct heights (with an SD of 1.5).

Ignoring the effects of MCVs, the planner expects

    SELECT height FROM heights WHERE height = 1.75;

to yield roughly 13,000 rows. However, given that we know the underlying distribution, we would expect ~52,000 rows. Similarly, the planner expects 13,000 rows from

    SELECT height FROM heights WHERE height = 2.05;

while we expect to see about 2.7.

Obviously this example is not totally convincing: if I were to post it to pgsql-general looking for advice, I'm sure everyone would tell me to just increase the size of my MCV stats. However, in cases where the number of distinct values is higher, that isn't always feasible. Also, why store a list of 50 values and their frequencies when 10, plus a little extra histogram data, would provide the same plans without bloating pg_statistic?

To combat this problem, I have two different proposals.

Idea 1: Keep an array of stadistinct values, one per histogram bucket.

In the example above (again ignoring MCVs), the quantile boundaries are

    0%    10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
    1.38  1.66  1.69  1.71  1.73  1.75  1.77  1.79  1.82  1.85  2.12

with per-bucket ndistinct values of (respectively)

    29  2  2  2  2  2  2  3  3  25

For the two queries above, this new approach would yield selectivity estimates of (1000000/10)/2 = 50000 (vs. an actual expected count of ~52000) and (1000000/10)/25 = 4000 (vs. an actual expected count of ~2.7). Furthermore, this is done without MCVs; since MCVs would make the histogram more sensitive at the edges, the estimates with MCVs should be correspondingly better.

There are two potential problems that I see with this approach:

1) It assumes that = is consistent with <= and >=.
This is certainly true for real numbers, but is it true for every equality operator that eqsel predicts for?

2) It bloats the stats table.

Idea 2: Keep a correlation statistic between ndistinct and bucket width.

This addresses problem #2. In lieu of keeping an actual list of ndistinct per histogram bucket, we store the linear scaling coefficient between histogram bucket width and ndistinct/(avg ndistinct). To visualize this, it is easiest to imagine plotting bucket width against ndistinct; the scaling coefficient is the slope of the line through the origin that minimizes the squared difference between its estimate of ndistinct and the actual value.

When I apply this method to the above data, I find a coefficient of 13.63 for an average ndistinct of 72/10 = 7.2. This provides selectivity estimates, for the two queries above, of

    (1000000/10)/(13.63 * 7.2 * (1.77 - 1.75)) = 50950  (vs. an actual expected count of ~52000)
    (1000000/10)/(13.63 * 7.2 * (2.12 - 1.85)) = 3774   (vs. an actual expected count of ~2.7)

Although this yields better results than idea 1 for this particular example, it will be much more sensitive to weird distributions. Obviously there are some special cases to consider: we wouldn't want the stats to be skewed in a way that produces really bad plans. However, with some carefully designed caps, I believe we could ensure that the estimates are at least as good as they are now.

In fact, I'm not certain that a squared-error penalty is the correct loss function. Ideally, we want to minimize the extra time the database spends executing an incorrect plan. Maybe slight overestimates are better than slight underestimates? Maybe the cost of the occasional (really) bad plan is less than the cost of a bunch of kinda-bad plans?

Finally, we aren't limited to just one coefficient. We could store multiple coefficients to improve our estimates, providing a compromise between ideas 1 and 2. Food for future thought...
In addition to the previous benefits, I think this method has the potential to make the process by which MCVs are chosen (or not chosen) smarter. Currently the planner treats a value as an MCV candidate if its frequency is greater than 1.25 times the average frequency. If this improved selectivity estimate were implemented, a better rule might be to include a value as an MCV only if (a) its frequency is above a certain threshold and (b) the histogram selectivity estimator does a poor job on it.

What are people's thoughts on idea 1 vs. idea 2? Am I missing any relevant details about the planner's operation? Do people think the improved estimates would be worth the additional overhead?

-Nathan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers