For what it's worth.

I'll quote the first line of the abstract from Chaudhuri et al. about block sampling: "Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics."

And after briefly glancing through the paper, my opinion is that the reason it works is that after building one version of the statistics they cross-validate, see how well it does, and then collect more samples if the cross-validation error is large (for example because the data is clustered). Without this bit, as far as I can tell, a simple block-based sampler is bound to make catastrophic mistakes, depending on the distribution.
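
Just to make that feedback loop concrete, here is a rough Python sketch of my reading of the idea (NOT the paper's actual algorithm; all names, thresholds and the clustered toy table below are made up for illustration): take a block-level sample, build a histogram, compare it against held-out blocks, and keep enlarging the sample while the two disagree.

import numpy as np

rng = np.random.default_rng(0)

def make_clustered_table(n_blocks=10_000, rows_per_block=100):
    # Worst case for block sampling: all rows within a block share one value.
    block_values = rng.lognormal(mean=0.0, sigma=2.0, size=n_blocks)
    return np.repeat(block_values, rows_per_block).reshape(n_blocks, rows_per_block)

def histogram_estimate(values, bins):
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()

def cross_validated_block_sample(table, start_blocks=50, max_blocks=5000,
                                 error_threshold=0.05):
    # Grow the block-level sample until the cross-validation error is small.
    n_blocks = table.shape[0]
    bins = np.quantile(table, np.linspace(0.0, 1.0, 21))   # fixed bin edges
    k = start_blocks
    while True:
        chosen = rng.choice(n_blocks, size=k, replace=False)
        half = k // 2
        train = table[chosen[:half]].ravel()    # build the histogram on one half
        test = table[chosen[half:]].ravel()     # validate on the held-out half
        # total-variation distance between the two histograms
        err = 0.5 * np.abs(histogram_estimate(train, bins)
                           - histogram_estimate(test, bins)).sum()
        if err < error_threshold or k >= max_blocks:
            return histogram_estimate(table[chosen].ravel(), bins), k, err
        k = min(2 * k, max_blocks)   # clustered data -> keep asking for more blocks

table = make_clustered_table()
hist, blocks_used, err = cross_validated_block_sample(table)
print("blocks used:", blocks_used, " cross-validation error: %.3f" % err)

On the clustered toy table the sample keeps growing until the threshold (or the cap) is hit; on well-mixed data it stops early.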

Also, just another point about targets (e.g. X%) for estimating stuff from the samples (as was discussed in the thread). Basically, there is a point in talking about sampling a fixed target (e.g. 5%) of the data ONLY if you fix the actual distribution of your data in the table and decide which statistic you are trying to estimate, e.g. average, std. dev., a 90th percentile, ndistinct, a histogram and so forth. There won't be a general answer, as the percentages will be distribution-dependent and statistic-dependent.
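
A toy simulation to illustrate (made-up distributions, plain uniform row sampling, nothing more): the same fixed 5% sample of the same-sized table gives very different relative errors depending on which statistic you ask for and what the data looks like.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
frac = 0.05                      # the fixed 5% target discussed above

datasets = {
    "uniform":    rng.uniform(0.0, 1.0, n),
    "lognormal":  rng.lognormal(0.0, 3.0, n),               # heavy-tailed
    "few-values": rng.integers(0, 20, n).astype(float),     # small ndistinct
}

statistics = {
    "mean":      np.mean,
    "p99":       lambda x: np.quantile(x, 0.99),
    "ndistinct": lambda x: float(len(np.unique(x))),
}

for dname, data in datasets.items():
    sample = rng.choice(data, size=int(frac * n), replace=False)
    for sname, stat in statistics.items():
        true_val, est = stat(data), stat(sample)
        rel_err = abs(est - true_val) / abs(true_val)
        print("%-10s %-9s relative error: %.3f" % (dname, sname, rel_err))

The mean of the uniform column comes out fine, the tail quantile of the heavy-tailed column is noticeably worse, and ndistinct of a continuous column is hopeless from a sample alone, which is the whole point: no single percentage works for everything.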

Cheers,
        Sergey

PS I'm not a statistician, but I use statistics a lot

*******************************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/

