On 23.12.2014 11:28, Heikki Linnakangas wrote:
> On 12/07/2014 03:54 AM, Tomas Vondra wrote:
>> The one interesting case is the 'step skew' with statistics_target=10,
>> i.e. estimates based on mere 3000 rows. In that case, the adaptive
>> estimator significantly overestimates:
>>
>>     values    current    adaptive
>>     ------------------------------
>>        106         99         107
>>        106          8     6449190
>>       1006         38     6449190
>>      10006        327       42441
>>
>> I don't know why I didn't get these errors in the previous runs, because
>> when I repeat the tests with the old patches I get similar results, with
>> a 'good' result from time to time. Apparently I had a lucky day back
>> then :-/
>>
>> I've been messing with the code for a few hours, and I haven't
>> found any significant error in the implementation, so it seems that
>> the estimator does not perform terribly well for very small samples
>> (in this case 3000 rows out of 10.000.000, i.e. ~0.03%).
>
> The paper [1] gives an equation for an upper bound of the error of this
> GEE estimator. How do the above numbers compare with that bound?
Well, that's a bit more complicated, because the "Theorem 1" you mention does not directly specify an upper bound for a single estimate. It's formulated like this:

    Assume a table with N rows, D distinct values, and a sample of r rows
    (all those values are fixed). Then there exists a dataset with those
    parameters such that the "ratio error"

        error(D, D') = max(D'/D, D/D')

    is greater than f(N, r, P) with probability at least P, where
    f(N, r, P) = sqrt((N - r) / (2 * r) * ln(1 / P)).

I.e. if you randomly choose a sample of r rows, you'll get an estimate whose ratio error exceeds that boundary with probability at least P. So it's not a hard limit on a single estimate, but a statement about the probability of such an error on some (unknown) dataset. In other words, it describes what you can achieve at best - if you think your estimator is better, there'll always be a dataset hiding in the shadows hissing "Theorem 1".

Let's say we're looking for a boundary that's crossed only in 1% (or 5%) of measurements. Applying this to the sample data I posted before, i.e. 10M rows with three sample sizes r (3000, 30.000 and 300.000 rows), the ratio error boundary per the paper is

    sample size     3.000    30.000   300.000
    ------------------------------------------
    1%                 88        28         9
    5%                 70        22         7

At least that's what I get if I compute it using this python function:

    from math import sqrt, log

    def err(N, r, p):
        # ratio error exceeded with probability p, per Theorem 1
        return sqrt((N - r) / (2.0 * r) * log(1.0 / p))

So the estimates I posted before are not terribly good, I guess, especially the ones returning 6449190. I wonder whether there really is some stupid bug in the implementation.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
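
PS: In case anyone wants to replay the boundary table, here's a trivial standalone sketch - it just repeats the err() function from above and loops over the three sample sizes from my earlier tests:

    from math import sqrt, log

    def err(N, r, p):
        # ratio error exceeded with probability p, per Theorem 1
        return sqrt((N - r) / (2.0 * r) * log(1.0 / p))

    N = 10 * 1000 * 1000                 # 10M rows
    for p in (0.01, 0.05):
        bounds = ", ".join("%.1f" % err(N, r, p) for r in (3000, 30000, 300000))
        print("%.2f: %s" % (p, bounds))

    # 0.01: 87.6, 27.7, 8.6
    # 0.05: 70.6, 22.3, 7.0

which is the 88 / 28 / 9 and 70 / 22 / 7 from the table, modulo rounding. For comparison, if I read my earlier table correctly, the adaptive estimate of 6449190 for 1006 distinct values is a ratio error of roughly 6400, way beyond the 1% boundary of 88 for the 3000-row sample.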