On 18.12.2010 16:59, Florian Pflug wrote:
> On Dec18, 2010, at 08:10 , t...@fuzzy.cz wrote:
>
>>> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>>>> Well, not really - I haven't done any experiments with it. For two
>>>> columns the selectivity equation is
>>>>
>>>>   (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>>>
>>>> where A and B are columns, dist(X) is the number of distinct values
>>>> in column X and sel(X) is the selectivity of column X.
>>>
>>> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
>>> if A and B are independent, the formula must reduce to sel(A) * sel(B),
>>> and I cannot see how that'd work with the formula above.
>>
>> Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
>> conditional probability, as
>>
>>   P(A=a and B=b) = P(A=a|B=b) * P(B=b) = P(B=b|A=a) * P(A=a)
>>
>> and the "uniform correlation" assumption, so that it's possible to
>> replace the conditional probabilities with constants. Those constants
>> are then estimated as dist(A)/dist(A,B) or dist(B)/dist(A,B).
>
> Ok, I think I understand this now. The selectivity equation actually
> *does* reduce to sel(A) * sel(B), *if* we pick a very simple estimate
> for sel(A).
>
> Take the clause "A = x AND B = y" for example. Without knowing anything
> about x and y, reasonable guesses for sel(A=x) and sel(B=y) are
>
>   sel(A=x) = 1 / dist(A)
>   sel(B=y) = 1 / dist(B).
>
> This is also what we do currently, according to var_eq_non_const() in
> src/backend/utils/adt/selfuncs.c, if we don't have any additional
> knowledge about x (actually, we also factor the probability of A being
> NULL into this).
>
> With these estimates, your formula becomes
>
>   sel(A=x,B=y) = 1 / dist(A,B)
>
> and if A and B are uncorrelated, dist(A,B) ~= dist(A) * dist(B), thus
>
>   sel(A=x,B=y) = sel(A=x) * sel(B=y).
>
> If, however, y is a constant, then we use the MCVs to estimate sel(B=y)
> (var_eq_const() in src/backend/utils/adt/selfuncs.c).
> If
>
>   sel(B=y) ~= 0,
>
> we'd currently also conclude that
>
>   sel(A=x,B=y) ~= 0.
>
> With the "uniform correlation" approach, we'd instead estimate
>
>   sel(A=x,B=y) ~= sel(A=x) / (2 * dist(B)),
>
> assuming that dist(A,B) ~= dist(A) * dist(B), meaning A,B are
> uncorrelated. If dist(B) is small, this estimate is much worse than what
> we'd currently get, since we've effectively ignored the information that
> the restriction B=y alone guarantees that only very few rows will match.
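Just to make that degenerate case concrete before I reply, here is a quick Python sketch (the function names and the numbers are mine, purely illustrative) comparing the current independence-based estimate with the uniform-correlation formula when sel(B=y) is almost zero:

```python
def sel_independent(sel_a, sel_b):
    # current estimate: attribute value independence (AVI)
    return sel_a * sel_b

def sel_uniform_corr(sel_a, sel_b, dist_a, dist_b, dist_ab):
    # proposed estimate under the "uniform correlation" assumption
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

# uncorrelated columns, so dist(A,B) ~= dist(A) * dist(B)
dist_a, dist_b = 10, 100
dist_ab = dist_a * dist_b

# y is a very rare constant found in the MCV list: sel(B=y) ~= 0
sel_a = 1.0 / dist_a   # 1/10, the generic 1/dist(A) guess
sel_b = 1e-6           # nearly zero

print(sel_independent(sel_a, sel_b))
# ~1e-7, i.e. almost zero, which is close to the truth

print(sel_uniform_corr(sel_a, sel_b, dist_a, dist_b, dist_ab))
# ~ sel(A=x) / (2 * dist(B)) = 1/2000, a large overestimate
```

The second number shows exactly the problem described above: the near-zero sel(B=y) contributes almost nothing to the numerator, so the formula falls back to roughly sel(A=x) / (2 * dist(B)) no matter how selective B=y really is.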
Well, I guess you're right. Let's see two examples - uncorrelated and
highly correlated data (and then a few comments on what I think you're
missing).

uncorrelated
------------

Let there be 10 distinct values in A, 100 distinct values in B, and let
all possible combinations exist (10 x 100 = 1000). All the values (and
combinations) are uniformly distributed.

The current estimate is correct, i.e.

  sel(A=a) = 1/10  = 1/dist(A)
  sel(B=b) = 1/100 = 1/dist(B)

  sel(A=a,B=b) = sel(A=a) * sel(B=b) = 1/1000   /* due to AVI */

and the proposed estimate is

  sel(A=a,B=b) = (dist(A)*sel(A=a) + dist(B)*sel(B=b)) / (2*dist(A,B))
               = (10 * 1/10 + 100 * 1/100) / 2000
               = 1/1000

so it actually produces the same estimate, but that's due to the uniform
distribution assumption. Let's say the selectivities for A and B are very
different - the value a is 10x less common, b is 10x more common (and
let's say we know this). Then the current estimate is still correct, i.e.
it gives

  sel(A=a) = 1/100
  sel(B=b) = 1/10

  => sel(A=a,B=b) = 1/1000

but the new estimate is

  (10 * 1/100 + 100 * 1/10) / 2000 = (101/10) / 2000 ~ 1/200

So yes, in case of uncorrelated data this overestimates the result.

highly correlated
-----------------

Again, let dist(A)=10 and dist(B)=100, but this time let B determine A,
i.e. for each value of B there is a single value of A (say A = mod(B,10)
or something like that). So now there are dist(A,B)=100 combinations.

Assuming uniform distribution, the correct estimate is

  sel(A=a,B=b) = sel(B=b) = 1/100

while the current estimate is

  sel(A=a,B=b) = sel(A=a) * sel(B=b) = 1/10 * 1/100 = 1/1000

The proposed estimate is

  (10 * 1/10 + 100 * 1/100) / 200 = 2/200 = 1/100

which is correct. Without the uniform distribution (again, with
sel(A=a)=1/100 and sel(B=b)=1/10), we currently get (just as before)

  sel(A=a,B=b) = 1/100 * 1/10 = 1/1000

which is 100x less than the correct value (sel(B=b) = 1/10).
The new estimate yields

  (10 * 1/100 + 100 * 1/10) / 200 = (1010/100) / 200 ~ 1/20

which is not as accurate as with uniform distribution, but it's
considerably more accurate than the current estimate (1/1000).

comments
--------

I really don't think this is a perfect solution giving 100% accurate
results in all cases, but the examples above illustrate that it gives
much better estimates in case of correlated data.

It seems to me you're missing one very important thing - this was not
meant as a new default way to do estimates. It was meant as an option for
cases when the user (DBA, developer, ...) realizes the current solution
gives really bad estimates (due to correlation). In that case he could
create 'cross-column' statistics on those columns, and the optimizer
would use that info to do the estimates. This mostly eliminates the issue
with uncorrelated data, as it would not actually happen in practice. Sure,
the user might create cross-column stats on uncorrelated columns and he'd
get incorrect estimates in that case, but I really don't think we can
solve this kind of problem.

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers