On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> But I think you might be on to something, because I manually collected a
> random sample with 30k rows (by explicitly generating 30k random TIDs), and
> I get this:
>
> tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt from
> lineitem_sample group by 1) foo group by 1;
>
>  cnt | count
> -----+-------
>    1 | 29998
>    2 |     1
> (2 rows)
>
>
> That's quite different compared to what analyze gets, which effectively
> looks something like this (this is derived from the logs, so not perfectly
> accurate - I only have f1, ndistinct, nmultiple):
>
>  cnt | count
> -----+-------
>    1 | 27976
>    2 |   976
>    3 |    24
>
> Am I wrong or is the sample not that random?

The sample is not truly random.  The two-stage sampling method causes too
few blocks to have exactly one row chosen from them, and too many to have
either 0 or 2+ rows chosen from them.

When values in the same block are likely to be equal, it finds too many
duplicates because it too often picks two rows from a single block.

See analysis here:

http://www.postgresql.org/message-id/CAMkU=1wrh_jopycayukbdqy4dwhsx1-1e2s0vvgfrryfxde...@mail.gmail.com

If we assume all the blocks have the same tuple density, then it is easy
to correct this.  But without that assumption of constant tuple density, I
don't know how to get a truly random sample while still skipping most of
the table.

Cheers,

Jeff
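P.S. A toy simulation may make the clustering effect easier to see. The
sketch below is pure Python and is not the actual analyze.c sampler; the
block count, tuple density, and rows-per-key figures are made up. It lays
out a few consecutive rows per key (so rows that share a block usually
share a key, like l_orderkey in lineitem), then compares a plain uniform
row sample against a simplified two-stage sample that first picks blocks
and then samples rows only from those blocks. The two-stage histogram
should show noticeably more duplicate keys, in the same direction as the
numbers quoted above.

    # Toy model, not the real analyze.c code: constant tuple density,
    # made-up sizes, and keys laid out so rows in one block share keys.
    import random
    from collections import Counter

    N_BLOCKS = 100_000     # hypothetical table size, in blocks
    ROWS_PER_BLOCK = 60    # assumed constant tuple density
    ROWS_PER_KEY = 4       # consecutive rows sharing one key
    SAMPLE_SIZE = 30_000

    def key_of(block, row):
        # Consecutive rows share a key; a key never spans two blocks here
        # because ROWS_PER_BLOCK is a multiple of ROWS_PER_KEY.
        return (block * ROWS_PER_BLOCK + row) // ROWS_PER_KEY

    def uniform_sample():
        # Truly random sample of rows from the whole table.
        rows = random.sample(range(N_BLOCKS * ROWS_PER_BLOCK), SAMPLE_SIZE)
        return [key_of(r // ROWS_PER_BLOCK, r % ROWS_PER_BLOCK) for r in rows]

    def two_stage_sample():
        # Stage 1: read only SAMPLE_SIZE of the blocks.
        blocks = random.sample(range(N_BLOCKS), SAMPLE_SIZE)
        # Stage 2: simple random sample of rows from just those blocks
        # (a stand-in for the reservoir-sampling step).
        rows = [(b, r) for b in blocks for r in range(ROWS_PER_BLOCK)]
        return [key_of(b, r) for b, r in random.sample(rows, SAMPLE_SIZE)]

    def cnt_histogram(keys):
        # Equivalent of the "cnt | count" queries above: how many keys
        # were sampled once, twice, three times, ...
        return sorted(Counter(Counter(keys).values()).items())

    print("uniform  :", cnt_histogram(uniform_sample()))
    print("two-stage:", cnt_histogram(two_stage_sample()))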