Tom Lane wrote:

Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:

    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index is
     * a metadata page, and we should figure on fetching that plus a pro-rated
     * fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;        /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to
btrees.  It's not counting descent of the btree (ie, touches of the root
page and intermediate-level pages).  On the other hand it's surely
overcharging for metapage touches.  As of CVS tip we cache the metapage in
the relcache, so it's probably fair to disregard that cost altogether.
And on the third hand, when we need to retrieve multiple leaf pages it's
over-optimistic to assume that those'll be purely sequential fetches.
(That would be approximately true in a freshly-built btree, but not in one
that's suffered any amount of page splitting or recycling.)


I am inclined to think that a reasonable model is to continue to estimate
the number of index leaf pages touched as above (pro-rating against the
total number of index entries), to charge random_page_cost per leaf page
touched, and not to count anything for metapage or tree descent.  I
justify the latter on the grounds that upper tree levels will tend to stay
in cache because they're touched so often.  random_page_cost per leaf page
may be an overestimate, since sometimes logically adjacent leaf pages will
be physically adjacent too, but not charging for tree descent should help
to cancel that out.  With this model, the disk cost to fetch a single
index entry will be estimated as random_page_cost (default 4.0) rather
than the current fixed 2.0.  This shouldn't hurt things too much for
simple indexscans --- especially since anyone running with a reduced
random_page_cost won't see as much change.  And it will increase the costs
for bitmap scans that cross many index pages, which is what we need in
light of Philippe's numbers.


This sounds good to me, although the 2.0 -> 4.0 cost jump may cause (more) cases of people seeing their index scans in pre-8.2 versions becoming some other type of access in 8.2. I guess a comment about testing existing applications could be placed in the 8.2 release notes?

Now we have seen a lot of cases in which indexscans were being drastically
overestimated, so increasing the cost estimate even more may seem like a
horrid idea.  But I believe that most or all of these cases were ones
where the same index was being visited many times, and so the real
estimation failure is not accounting for cache effects across multiple
indexscans.  Rather than being afraid to raise the cost estimate for an
indexscan in isolation, I think it's time to bite the bullet and do
something about that.

The big difficulty in modeling cache effects from multiple scans is that
we usually don't know how many times the index will be scanned.  If we did
have that number, I think it'd be reasonable to use the Mackert and Lohman
approximation already used in cost_index to estimate the total number of
index leaf pages fetched over N scans, and then divide by N to get our
per-scan estimated cost.  But because the planner builds cost estimates
bottom-up, in most scenarios we don't have any idea whether an indexscan
plan node will be iterated once or many times in the finished plan.
However there is one case where we do have some information: when
computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
to use the estimated size of the outer relation as the loop count.
And this is exactly the case where we are falling down in practice:
the complaints we've seen are mostly about overestimating the cost of a
nestloop-with-inner-index-scan.  So I think handling just this case would
be a big step forward, even if it's not a theoretically pure general
solution.



If I understand correctly, this is the case were we normally see folks needing add several 'set enable_xxx=off' statements to get the nested loop plan that *actually* works best :-). So, also looks good!

Cheers

Mark


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match

Reply via email to