Hi,

On 04/03/15 15:46, Greg Stark wrote:
>> The simple workaround for this was adding a fallback to GEE when f[1]
>> or f[2] is 0. GEE is another estimator described in the paper, behaving
>> much better in those cases.
>
> For completeness, what's the downside in just always using GEE?

That's a good question.

GEE is the estimator with minimal average error, as defined in Theorem 1 of the paper. The exact formulation of the theorem is a bit complex, but it essentially says that knowing just the sizes of the data set and the sample, there's a limit on the accuracy any estimator can guarantee.
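For reference, the GEE formula itself is very simple: scale the number of values seen exactly once in the sample by sqrt(n/r), and add the counts of values seen more than once. A minimal sketch (function and parameter names are mine, not from the patch):

```python
import math

def gee_estimate(n, r, f):
    """GEE estimate of the number of distinct values.

    n -- total number of rows in the table
    r -- number of rows in the sample
    f -- "frequency of frequencies": f[i] is the number of distinct
         values appearing exactly i times in the sample
    """
    # Values seen once are scaled up by sqrt(n/r); values seen
    # repeatedly are each counted once, as observed.
    return math.sqrt(n / r) * f.get(1, 0) + sum(c for i, c in f.items() if i >= 2)
```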

Or, put another way, it's possible to construct a data set such that the estimator produces estimates with error exceeding some limit (with a certain probability).

Knowing how much the data set is skewed gives us an opportunity to improve the estimates by choosing an estimator that performs better on such data sets. The problem is that we don't know the skew - we can only estimate it from the sample, which is exactly what the hybrid estimators do.

The AE estimator (presented in the paper and implemented in the patch) is an example of such a hybrid estimator, but based on my experiments it does not work terribly well with one particular type of skew that I'd expect to be relatively common (a few very common values, many very rare values).

Luckily, GEE performs pretty well in that case, so we can fall back to it there and use AE otherwise (ISTM AE gives really good estimates).
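The fallback described above is just a dispatch on whether f[1] or f[2] is zero. A sketch, with the AE estimator passed in as a callable since its formula is more involved (again, names are mine):

```python
import math

def hybrid_estimate(n, r, f, ae_estimator):
    """Use AE unless f[1] or f[2] is zero (where AE breaks down);
    in that case fall back to GEE, which behaves well there."""
    def gee():
        return math.sqrt(n / r) * f.get(1, 0) + sum(c for i, c in f.items() if i >= 2)
    if f.get(1, 0) == 0 or f.get(2, 0) == 0:
        return gee()
    return ae_estimator(n, r, f)
```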

But of course, there will always be data sets that are estimated poorly (pretty much as Theorem 1 in the paper says). It'd be nice to do more testing on real-world data sets, to see whether this performs better or worse than our current estimator.

regards
Tomas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
