Hi,

On 04/03/15 15:46, Greg Stark wrote:
>> The simple workaround for this was adding a fallback to GEE when f[1]
>> or f[2] is 0. GEE is another estimator described in the paper, behaving
>> much better in those cases.
>
> For completeness, what's the downside in just always using GEE?

That's a good question.

GEE is the estimator with minimal average error, as defined in Theorem 1 of the paper. The exact formulation of the theorem is a bit complex, but it essentially says that knowing just the sizes of the data set and the sample, there's a limit on the accuracy any estimator can guarantee.
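For reference, the GEE formula itself is very simple: scale the number of values seen exactly once in the sample by sqrt(n/r), and add the counts of values seen more than once. A minimal sketch (function and parameter names are mine, not from the patch):

```python
import math

def gee_estimate(n, r, f):
    """GEE estimate of the number of distinct values.

    n -- total number of rows in the table
    r -- number of rows in the sample
    f -- "frequency of frequencies": f[i] is the number of distinct
         values appearing exactly i times in the sample
    """
    # Values seen once are scaled up by sqrt(n/r); values seen
    # repeatedly are each counted once, as observed.
    return math.sqrt(n / r) * f.get(1, 0) + sum(c for i, c in f.items() if i >= 2)
```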

Or, put another way, it's possible to construct a data set such that the estimator produces estimates with error exceeding some limit (with a certain probability).

Knowing how much the data set is skewed gives us an opportunity to improve the estimates by choosing an estimator that performs better on such data sets. The problem is that we don't know the skew - we can only estimate it from the sample, which is exactly what the hybrid estimators do.

The AE estimator (presented in the paper and implemented in the patch) is an example of such a hybrid estimator, but based on my experiments it does not work terribly well with one particular type of skew that I'd expect to be relatively common (a few very common values, many very rare values).

Luckily, GEE performs pretty well in that case, so we can fall back to it there and use AE otherwise (ISTM AE gives really good estimates).
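The fallback described above is just a dispatch on whether f[1] or f[2] is zero. A sketch, with the AE estimator passed in as a callable since its formula is more involved (again, names are mine):

```python
import math

def hybrid_estimate(n, r, f, ae_estimator):
    """Use AE unless f[1] or f[2] is zero (where AE breaks down);
    in that case fall back to GEE, which behaves well there."""
    def gee():
        return math.sqrt(n / r) * f.get(1, 0) + sum(c for i, c in f.items() if i >= 2)
    if f.get(1, 0) == 0 or f.get(2, 0) == 0:
        return gee()
    return ae_estimator(n, r, f)
```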

But of course, there will always be data sets that are estimated poorly (pretty much as Theorem 1 in the paper says). It'd be nice to do more testing on real-world data sets, to see whether this performs better or worse than our current estimator.

regards
Tomas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
