Simon Riggs <si...@2ndquadrant.com> writes:
> On 6 January 2013 18:58, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> IIRC, one of my very first attempts to deal with this was to charge
>> random_page_cost per level of index descended.  This was such a horrid
>> overestimate that it never went anywhere.  I think that reflects that in
>> practical applications, the upper levels of the index tend to stay in
>> cache.  We could ignore I/O on that assumption and still try to model
>> CPU costs of the descent, which is basically what Jeff is proposing.
>> My objection to his formula is mainly that it ignores physical index
>> size, which I think is important to include somehow for the reasons
>> I explained in my other message.

> Having a well principled approach will help bring us towards a
> realistic estimate.

I thought about this some more and came up with what might be a
reasonably principled compromise.  Assume that we know there are N
leaf entries in the index (from VACUUM stats) and that we know the
root page height is H (from looking at the btree metapage).  (Note:
H starts at zero for a single-page index.)  If we assume that the
number of tuples per page, P, is more or less uniform across leaf
and upper pages (ie P is the fanout for upper pages), then we have
        N/P = number of leaf pages
        N/P/P = number of level 1 pages
        N/P^3 = number of level 2 pages
        N/P^(h+1) = number of level h pages
Solving for the minimum P that makes N/P^(H+1) <= 1, we get
        P = ceil(exp(ln(N)/(H+1)))
as an estimate of P given the known N and H values.

Now, if we consider only CPU costs of index descent, we expect
about log2(P) comparisons to be needed on each of the H upper pages
to be descended through, that is we have total descent cost
        cpu_index_tuple_cost * H * log2(P)

If we ignore the ceil() step as being a second-order correction, this
can be simplified to

        cpu_index_tuple_cost * H * log2(N)/(H+1)

I propose this, rather than Jeff's formula of cpu_index_tuple_cost *
log2(N), as our fudge factor.  The reason I like this better is that
the additional factor of H/(H+1) provides the correction I want for
bloated indexes: if an index is bloated, the way that reflects into
the cost of any particular search is that the number of pages to be
descended through is larger than otherwise.  The correction is fairly
small, particularly for large indexes, but that seems to be what's
expected given the rest of our discussion.

We could further extend this by adding some I/O charge when the index is
sufficiently large, as per Simon's comments, but frankly I think that's
unnecessary.  Unless the fan-out factor is really awful, practical-sized
indexes probably have all their upper pages in memory.  What's more, per
my earlier comment, when you start to think about tables so huge that
that's not true it really doesn't matter if we charge another
random_page_cost or two for an indexscan --- it'll still be peanuts
compared to the seqscan alternative.

To illustrate the behavior of this function, I've replotted my previous
graph, still taking the assumed fanout to be 256 tuples/page.  I limited
the range of the functions to 0.0001 to 100 to keep the log-scale graph
readable, but actually the H/(H+1) formulation would charge zero for
indexes of less than 256 tuples.  I think it's significant (and a good
thing) that this curve is nowhere significantly more than the historical
pre-9.2 fudge factor.

Thoughts?

                        regards, tom lane

<<image/png>>

set terminal png small color
set output 'new_fudge.png'
set logscale x
set logscale y
h(x) = (x <= 256) ? 0.0001/0.005 : (x <= 256*256) ? (1./2)*log(x)/log(2) : (x 
<= 256^3) ? (2./3)*log(x)/log(2) : (x <= 256^4) ? (3./4)*log(x)/log(2) : (x <= 
256^5) ? (4./5)*log(x)/log(2) : (5./6)*log(x)/log(2)
historical(x) = (4 * x/100000) < 100 ? 4 * x/100000 : 1/0
ninepoint2(x) = (4 * x/10000) < 100 ? 4 * x/10000 : 1/0
head(x) = 4*log(1 + x/10000)
plot [10:1e9] h(x)*0.005, 0.005 * log(x)/log(2), head(x), historical(x), 
ninepoint2(x)
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to