On 06/22/2015 07:21 AM, Jeff Janes wrote:
On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com <mailto:tomas.von...@2ndquadrant.com>> wrote:

    Hi,

    On 06/20/2015 03:01 AM, Jeff Janes wrote:



             I don't think we need to really assume the density to be
        constant,
             maybe we can verify that while collecting the sample? I
        mean, we're
             already reading the pages, so we can track the density, and
        either
             do the correction or not.


        Maybe.  I don't know how that would work.


    I was thinking about something like

       (1) compute average tuples per page

       (2) generate random sample of 'samplesize' blocks

       (3) for each block, see what is the actual number of tuples on the
           page and perform correction (e.g. if there's only 1/2 the average
           tuples, toss a coin and decide whether to really sample the
    block)

       (4) repeat until you have sample of the requested size


Are you producing a block sample or a row sample?  If a block is
sampled, then what do you feed it to?

I'm not sure I understand the question. The ultimate goal is to get a random row sample, but we need to choose which blocks to sample first. So you can just generate K random block numbers (where K is the requested sample size). If a block number comes up once, sample one row from that block; if it comes up twice, sample two rows, and so on.

Of course, this is based on the assumption of uniform tuple density, which allows you to choose a block first and then a row from it independently. Hence the correction idea I outlined in the previous message.
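
To make that a bit more concrete, here is a minimal Python sketch of the two-stage idea (purely illustrative, assuming perfectly uniform density - nblocks and tuples_per_block are just stand-ins for what we'd read from the relation):

    import random

    def sample_rows(nblocks, tuples_per_block, sample_size):
        # Stage 1: K random block numbers, drawn with repetition - a block
        # drawn twice contributes two rows, and so on.
        blocks = [random.randrange(nblocks) for _ in range(sample_size)]

        # Stage 2: one random row per drawn block number. With uniform
        # density this yields a uniform row sample; with skewed density
        # it does not, hence the correction discussed below.
        sample = []
        for blkno in blocks:
            rowno = random.randrange(tuples_per_block)
            sample.append((blkno, rowno))
        return sample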

Are you just picking one row out of each block that survives step 3?
If so, that would be similar to my idea of counting a given value
only once per block it was in, except I was thinking of applying that
only to n_distinct, not to the entire stats collection process.

No, I'm not doing that, and I think it's a bad idea in general. The problem with the current sampling is that it does not produce a random sample of rows, but something skewed, which causes serious trouble in the estimators processing the sample. I think this 'count only once per block' approach would result in similar issues (e.g. for low-cardinality columns).


My answer was to take out the block sampling entirely and read the
whole table. That is what probably won't work for you. (Come to think
of it, I was hesitant to deploy custom code to production, so I never
actually deployed that. Instead I cranked up the stats target and let
the chips fall where they may)

Yeah, reading the whole table is not going to fly, I guess. But the estimates would be very accurate! ;-)


I think you want to go back to the table to find new blocks to
replace ones which were included in the original block sample but
ended up having no tuples in the end tuple sample, either because
they were low density blocks, or just through the luck of the draw.
But I don't see how that fixes the problem, unless you also prevent a
block from contributing more than 1 tuple. And at that point, I worry
about the uneven density again. If a block has 2 rows selected by
pure random chance, I think it would be fine to keep only one (or, go
find another random block to pull one row from as a replacement). But
if it has 2 rows selected because it has more rows than average, then
we would have to pull the replacement from a random block *of similar
density* to avoid swapping one bias for another one.

Hmmm, maybe limiting the sample to just 1 tuple per block is a good idea, but I have to think about that a bit more.

The basic idea behind the density correction is that when you get a block with density significantly different from the average (let's say by more than 10%?), you can either treat it as a partial block (density below average) or as a collection of multiple blocks (density above average), and then decide whether the sampled row really gets kept.

Handling the 'partial block' case seems rather simple - if the block has, say, half the average density, do a coin toss and keep the tuple with probability 1/2 (i.e. proportional to the density ratio), otherwise discard it.

With high-density blocks it's probably more complicated - I think it really means treating the block as multiple 'virtual' blocks and only keeping samples that fall into the first one.

And of course, this needs to be repeated if some of the rows get evicted by the correction mechanism, which will lead to slightly larger samples. But I don't see that as a major problem (and the difference is not that big, at least not in the case discussed earlier in this thread).
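
Just to illustrate the mechanics (not claiming the probabilities are exactly right yet), a rough Python sketch - get_block_density() is a made-up helper standing in for the tuple count we'd observe while actually reading the page:

    import random

    def corrected_sample(nblocks, avg_density, get_block_density, sample_size):
        sample = []
        while len(sample) < sample_size:
            blkno = random.randrange(nblocks)
            density = get_block_density(blkno)   # tuples on this page (hypothetical helper)
            if density == 0:
                continue                         # empty page, redraw
            if density < avg_density:
                # partial block: keep the sampled tuple with probability
                # proportional to the density ratio (the 'coin toss')
                if random.random() < density / avg_density:
                    sample.append((blkno, random.randrange(density)))
            else:
                # dense block: treat it as multiple 'virtual' blocks of
                # average size, keep only rows landing in the first one
                rowno = random.randrange(density)
                if rowno < avg_density:
                    sample.append((blkno, rowno))
        return sample

The outer while loop is the 'repeat until we have the requested sample size' part - rejected picks simply get redrawn.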

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

