On 23.12.2014 11:28, Heikki Linnakangas wrote:
> On 12/07/2014 03:54 AM, Tomas Vondra wrote:
>> The one interesting case is the 'step skew' with statistics_target=10,
>> i.e. estimates based on mere 3000 rows. In that case, the adaptive
>> estimator significantly overestimates:
>>
>>      values   current    adaptive
>>      ------------------------------
>>      106           99         107
>>      106            8     6449190
>>      1006          38     6449190
>>      10006        327       42441
>>
>> I don't know why I didn't get these errors in the previous runs, because
>> when I repeat the tests with the old patches I get similar results with
>> a 'good' result from time to time. Apparently I had a lucky day back
>> then :-/
>>
>> I've been messing with the code for a few hours, and I haven't
>> found any significant error in the implementation, so it seems that
>> the estimator does not perform terribly well for very small samples
>> (in this case it's 3000 rows out of 10.000.000 (i.e. ~0.03%).
> 
> The paper [1] gives an equation for an upper bound of the error of this
> GEE estimator. How do the above numbers compare with that bound?

Well, that's a bit more complicated, because the "Theorem 1" you mention
does not directly give an upper bound for a single estimate. It's
formulated like this:

  Assume a table with N rows, D distinct values, and a sample of r
  rows (all of those values fixed). Then there exists a dataset with
  those parameters such that the ratio error

      error(D, D') = max(D'/D, D/D')

  is greater than f(N, r, P) with probability at least P. I.e. if you
  randomly choose a sample of r rows, you'll get an error exceeding
  that bound with probability at least P.

So it's not a hard limit, but a statement about the probability of the
estimation error on some (unknown) dataset. It describes the best you can
achieve - if you think your estimator is better, there'll always be a
dataset hiding in the shadows hissing "Theorem 1".


Let's say we're looking for the bound that's exceeded in only 1% (or 5%)
of the cases. Applying this to the sample data I posted before, i.e.
10M rows with three sample sizes r (3,000, 30,000 and 300,000 rows),
the ratio error bound per the paper is

            3,000     30,000     300,000
  ----------------------------------------
    1%         88         28           9
    5%         70         22           7


At least that's what I get if I compute it using this Python function:

    from math import log, sqrt
    def err(N, r, p):
        # ratio error bound f(N, r, p) from Theorem 1
        return sqrt((N - r) / (2.0 * r) * log(1.0 / p))
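
For reference, a quick way to reproduce the table above (a rough sketch;
it just calls the err() function for each combination):

    N = 10 * 1000 * 1000                    # 10M rows, as in the tests
    for p in (0.01, 0.05):
        bounds = [err(N, r, p) for r in (3000, 30000, 300000)]
        print("%2d%%  %s" % (p * 100, "  ".join("%8.1f" % b for b in bounds)))

which prints ~87.6, ~27.7 and ~8.6 for the 1% case (and ~70.6, ~22.3 and
~7.0 for 5%), i.e. roughly the values in the table.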


So the estimates I posted before are not terribly good, I guess,
especially the ones returning 6449190. I wonder whether there really is
some stupid bug in the implementation.
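
Just to quantify that (assuming the 'values' column in the earlier table
is the true number of distinct values), the worst case can be checked
directly:

    # worst case from the statistics_target=10 run: D = 106, D' = 6449190
    D, D_est = 106, 6449190
    ratio = max(D_est / float(D), D / float(D_est))
    print(ratio, ratio / err(10 * 1000 * 1000, 3000, 0.01))

which gives a ratio error of ~60841, i.e. about 700x over the 88 boundary
for the 3,000-row sample.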

regards

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

