Re: [HACKERS] Correlation in cost_index()

Manfred Koizar Sun, 10 Aug 2003 12:44:04 -0700

On Fri, 8 Aug 2003 15:10:06 -0700, Sean Chittenden
<[EMAIL PROTECTED]> wrote:
>> Yes, we knew that already.  Oliver had suggested simply dropping the
>> division by nKeys, thus pretending that the first-column correlation
>> is close enough.  That seems to me to be going too far in the other
>> direction,
>
>But is it really?


Yes.  I once posted two use cases showing that one side is as wrong as
the other.  Wait a momemt ...  Here it is:
http://archives.postgresql.org/pgsql-hackers/2002-10/msg00229.php.

>indexCorrelation is 1.0 for the 1st key in a multi-column index. [...]

Unfortunately there are cases, where the correlation of the first
column doesn't tell us anything about the real index correlation.

>  An indexCorrelation
>of 0.0 doesn't mean that the index is useless either, just that we
>take the full hit of a completely random page read as opposed to some
>fraction of a random page cost.

... and that we cannot expect the tuples we are looking for to be on
the same page or on pages near to each other.  So we have to read a
higher number of pages.

>        indexCorrelation: 0.183380
> Seq Scan on report_user_cat_count rucc  (cost=0.00..21472.69 rows=31893 width=64) 
> (actual time=444.25..2489.27 rows=514 loops=1)
>
> Index Scan using report_user_cat_count_html_bytes_idx on report_user_cat_count rucc  
> (cost=0.00..112165.07 rows=31893 width=64) (actual time=68.64..85.75 rows=514 
> loops=1)

The main problem here is the bad estimation for number of rows: 31893
vs. 514.  The seq scan cost depends on the size of the heap relation
and is always the same (~ 21000) for different search conditions.
With low correlation, index scan cost is roughly proportional to the
number of tuples fetched (ignoring the effects of effective_cache_size
for now).  So for a correct guess of number of rows we'd get

             seq    idx
estimated   21000  1800
actual       2500    86
ratio         8.4  20.4

As estimated and actual are not given in the same units, we look at
the estimated/actual ratio and see that index scan cost is
over-estimated by a factor 2.5.  index_cost_algorithm 4 (geometric
interpolation) should get this right.  Can you ANALYSE with a higher
statistics_target and then test with different index_cost_algorithms? 

Servus
 Manfred

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match

Re: [HACKERS] Correlation in cost_index()

Reply via email to