On 06/22/2015 07:21 AM, Jeff Janes wrote:
On Sat, Jun 20, 2015 at 9:55 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com <mailto:tomas.von...@2ndquadrant.com>> wrote:

    Hi,

    On 06/20/2015 03:01 AM, Jeff Janes wrote:



             I don't think we need to really assume the density to be
        constant,
             maybe we can verify that while collecting the sample? I
        mean, we're
             already reading the pages, so we can track the density, and
        either
             do the correction or not.


        Maybe.  I don't know how that would work.


    I was thinking about something like

       (1) compute average tuples per page

       (2) generate random sample of 'samplesize' blocks

       (3) for each block, see what is the actual number of tuples on the
           page and perform correction (e.g. if there's only 1/2 the average
           tuples, toss a coin and decide whether to really sample the
    block)

       (4) repeat until you have sample of the requested size


Are you producing a block sample or a row sample?  If a block is
sampled, then what do you feed it to?

I'm not sure I understand the question. The ultimate goal is to get a random row sample, but we need to choose which blocks to sample first. So you can just generate K random block numbers (where K is the requested sample size). If a block number comes up once, sample one row from that block; if it comes up twice, sample two rows, and so on.

Of course, this is based on the assumption of uniform tuple density, which allows you to choose a block first and then a row from it independently. Hence the correction idea I outlined in the previous message.
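
To make that a bit more concrete, here is a minimal Python sketch of the two-stage idea (purely illustrative, assuming perfectly uniform density - nblocks and tuples_per_block are just stand-ins for what we'd read from the relation):

    import random

    def sample_rows(nblocks, tuples_per_block, sample_size):
        # Stage 1: K random block numbers, drawn with repetition - a block
        # drawn twice contributes two rows, and so on.
        blocks = [random.randrange(nblocks) for _ in range(sample_size)]

        # Stage 2: one random row per drawn block number. With uniform
        # density this yields a uniform row sample; with skewed density
        # it does not, hence the correction discussed below.
        sample = []
        for blkno in blocks:
            rowno = random.randrange(tuples_per_block)
            sample.append((blkno, rowno))
        return sample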

Are you just picking one row out of each block that survives step 3?
If so, that would be similar to my idea of counting a given value
only once per block it was in, except I was thinking of applying that
only to n_distinct, not to the entire stats collection process.

No, I'm not doing that, and I think it's a bad idea in general. The problem with the current sampling is that it does not produce a random sample of rows, but something skewed, which causes serious trouble in the estimators processing the sample. I think this 'count only once per block' approach would result in similar issues (e.g. for low-cardinality columns).


My answer was to take out the block sampling entirely and read the
whole table. That is what probably won't work for you. (Come to think
of it, I was hesitant to deploy custom code to production, so I never
actually deployed that. Instead I cranked up the stats target and let
the chips fall where they may)

Yeah, reading the whole table is not going to fly, I guess. But the estimates would be very accurate! ;-)


I think you want to go back to the table to find new blocks to
replace ones which were included in the original block sample but
ended up having no tuples in the end tuple sample, either because
they were low density blocks, or just through the luck of the draw.
But I don't see how that fixes the problem, unless you also prevent a
block from contributing more than 1 tuple. And at that point, I worry
about the uneven density again. If a block has 2 rows selected by
pure random chance, I think it would be fine to keep only one (or, go
find another random block to pull one row from as a replacement). But
if it has 2 rows selected because it has more rows than average, then
we would have to pull the replacement from a random block *of similar
density* to avoid swapping one bias for another one.

Hmmm, maybe limiting the sample to just 1 tuple per block is a good idea, but I have to think about that a bit more.

The basic idea behind the density correction is that when you get a block with density significantly different from the average (let's say by more than 10%?), you can either treat it as a partial block (density below average) or as a collection of multiple blocks (density above average), and then decide whether the sampled row really gets kept.

Handling the 'partial block' case seems rather simple - if the block has, say, half the average density, do a coin toss and keep the tuple with probability 1/2 (i.e. proportional to the density ratio), otherwise discard it.

With high-density blocks it's probably more complicated - I think it really means treating the block as multiple 'virtual' blocks and only keeping samples that fall into the first one.

And of course, this needs to be repeated if some of the rows get evicted by the correction mechanism, which will lead to slightly larger samples. But I don't see that as a major problem (and the difference is not that big, at least not in the case discussed earlier in this thread).
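
Just to illustrate the mechanics (not claiming the probabilities are exactly right yet), a rough Python sketch - get_block_density() is a made-up helper standing in for the tuple count we'd observe while actually reading the page:

    import random

    def corrected_sample(nblocks, avg_density, get_block_density, sample_size):
        sample = []
        while len(sample) < sample_size:
            blkno = random.randrange(nblocks)
            density = get_block_density(blkno)   # tuples on this page (hypothetical helper)
            if density == 0:
                continue                         # empty page, redraw
            if density < avg_density:
                # partial block: keep the sampled tuple with probability
                # proportional to the density ratio (the 'coin toss')
                if random.random() < density / avg_density:
                    sample.append((blkno, random.randrange(density)))
            else:
                # dense block: treat it as multiple 'virtual' blocks of
                # average size, keep only rows landing in the first one
                rowno = random.randrange(density)
                if rowno < avg_density:
                    sample.append((blkno, rowno))
        return sample

The outer while loop is the 'repeat until we have the requested sample size' part - rejected picks simply get redrawn.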

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

