On Fri, Jun 19, 2015 at 12:27 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:


> But I think you might be on to something, because I manually collected a
> random sample with 30k rows (by explicitly generating 30k random TIDs), and
> I get this:
>
> tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt from
> lineitem_sample group by 1) foo group by 1;
>
>  cnt | count
> -----+-------
>    1 | 29998
>    2 |     1
> (2 rows)
>
>
> That's quite different compared to what analyze gets, which effectively
> looks something like this (this is derived from the logs, so not perfectly
> accurate - I only have f1, ndistinct, nmultiple):
>
>  cnt | count
> -----+-------
>    1 | 27976
>    2 |   976
>    3 |    24
>
> Am I wrong or is the sample not that random?


The sample is not truly random.  The two-stage sampling method causes too
few blocks to have exactly one row chosen from them, and too many to have
either 0 or 2+ rows chosen from them.

When values in the same block are likely to be equal, the sampler finds too
many duplicates, because it too often picks two rows from the same block.

See analysis here:

http://www.postgresql.org/message-id/CAMkU=1wrh_jopycayukbdqy4dwhsx1-1e2s0vvgfrryfxde...@mail.gmail.com
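
To make the effect concrete, here is a small simulation one could run in
psql. The table and column names (synth, blk, r, val), the 10000-blocks-by-
60-rows geometry, and the sample sizes are all made up for illustration, and
the fixed "pick 3 rows from each chosen block" step is a deliberately
exaggerated stand-in for the two-stage sampler, not the Vitter-based
algorithm ANALYZE actually uses. It builds a table in which every row of a
block shares one value (perfectly clustered data) and then compares a truly
random row sample against a block-then-row sample:

-- synthetic table: 10000 "blocks" of 60 rows each; all rows in a block
-- carry the block number as their value
CREATE TEMP TABLE synth AS
SELECT blk, r, blk AS val
FROM generate_series(0, 9999) AS blk,
     generate_series(0, 59)   AS r;

-- truly random sample of 3000 rows
WITH sample AS (
    SELECT * FROM synth ORDER BY random() LIMIT 3000
)
SELECT cnt, count(*)
FROM (SELECT val, count(*) AS cnt FROM sample GROUP BY val) s
GROUP BY cnt ORDER BY cnt;

-- crude two-stage sample: 1000 random blocks, then 3 random rows from
-- each chosen block
WITH blocks AS (
    SELECT blk FROM generate_series(0, 9999) AS blk
    ORDER BY random() LIMIT 1000
),
sample AS (
    SELECT s.*
    FROM blocks b,
         LATERAL (SELECT * FROM synth t
                  WHERE t.blk = b.blk
                  ORDER BY random() LIMIT 3) s
)
SELECT cnt, count(*)
FROM (SELECT val, count(*) AS cnt FROM sample GROUP BY val) s
GROUP BY cnt ORDER BY cnt;

The first query comes back mostly with cnt = 1, while the second reports
cnt = 3 for every sampled value -- the same shape of skew as the ANALYZE
numbers above, only exaggerated.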

If we assume all the blocks have the same tuple density, then it is easy to
correct this.  But without that assumption of constant tuple density, I
don't know how to get a truly random sample while still skipping most of
the table.
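
For what it's worth, here is a minimal sketch of that correction, run against
the synthetic table from the previous example (again just an illustration
under the stated assumption, not a patch against ANALYZE). If every block
holds the same number of rows, then picking a block uniformly at random and
then one row uniformly within that block gives every row the same inclusion
probability, so repeating the pick draws a uniform sample (with replacement):

WITH picks AS (
    -- 3000 independent block picks, uniform over 0..9999, with replacement
    SELECT floor(random() * 10000)::int AS blk
    FROM generate_series(1, 3000)
),
sample AS (
    -- exactly one uniformly chosen row from each picked block
    SELECT s.*
    FROM picks p,
         LATERAL (SELECT * FROM synth t
                  WHERE t.blk = p.blk
                  ORDER BY random() LIMIT 1) s
)
SELECT cnt, count(*)
FROM (SELECT val, count(*) AS cnt FROM sample GROUP BY val) s
GROUP BY cnt ORDER BY cnt;

The cnt distribution this produces matches the truly random sample, because
each pick selects any given row with probability 1/(blocks * rows-per-block).
As soon as tuple density varies across blocks, that equality breaks down,
which is exactly the difficulty above.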

Cheers,

Jeff
