On Thu, May 10, 2012 at 6:33 PM, Robert Haas <robertmh...@gmail.com> wrote: > I'm worried this project is getting so complicated that it will be > beyond the ability of a new hacker to get anything useful done. Can > we simplify the requirements here to something that is reasonable for > a beginner?
It seems to me that the simplest thing to do would be to lift the sampling done in analyze.c (acquire_sample_rows) and use that to implement the SYSTEM sampling method. The language in the standard leads me to believe that Vitter's algorithm used in analyze.c is exactly what was intended by the authors. The difference between Vitter's algorithm and a pure Bernoulli process is precisely that Vitter's method increases the chances to see multiple rows picked from a single page. One tricky issue is that tablesample is defined in terms of a percentage of the underlying table while Vitter's algorithm needs a fixed number of rows. The standard does state that the result needs to contain "approximately" the stated percentage of rows. I'm not sure if calculating the amount of rows to return from reltuples would fit that definition of approximate. If not, would re-estimating the amount of reltuples after sampling and taking appropriate corrective action make it better or would an accurate number be necessary. Getting an accurate number efficiently would require solving of the COUNT(*) issue. For the Bernoulli case I can't think of anything simple that would better than just scanning the table or poking it with random TIDs. (the latter has the same problem of estimating the desired result set size) It seems to me that Vitter's approach could be amended to produce independent samples by selecting slightly more pages than result tuples and tweaking the acceptance levels to cancel out the bias. But that definitely isn't in the territory of simple and would require rigorous statistical analysis. And as for the monetary unit sampling, I agree that this is better left as an optional extra. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers