Recent discussion of index cost estimation ("[HACKERS] Correlation in cost_index()" ca. two weeks ago) has lead to the conclusion that the column correlation calculated by VACUUM does not always help when we want to find out how well index access order corresponds to physical tuple position. Most problems arise when dealing with multi-column or functional indices. Another case is when there are many equal values in the first index column.
What we really want is not column correlation, but index correlation which turns out to be surprisingly easy to calculate. There's no need to look up comparison functions and handle different datatypes; there's no need to look at the key values at all: Just read the index items in index order, sort them by heap page, and compute the Spearman Rho function. This even works for non-btree indices. Try it yourself: . download http://www.pivot.at/pg/contrib_icorrel.tgz . unpack . make . make install . psql . \i path/to/share/postgresql/contrib/icorrel.sql . SELECT icorrel('myindex'::regclass); This should work with 7.3.x and 7.4Beta. How could the planner make use of index correlation? Here (http://www.pivot.at/pg/22-IndexCorrel_74b1.diff) is an experimental patch that introduces a new system table pg_indexstat. If the GUC variable enable_indexstat is set to on, genericcostestimate tries to get index correlation from pg_indexstat; if there is none, btcostestimate falls back to the old method. Compatibility: Although there is a new catalog table, initdb is not required. A patched postmaster still works with an old cluster, even without creating pg_catalog.pg_indexstat. Before you can make use of pg_indexstat you have to create it via a standalone backend: $ bin/postgres -D data -O template1 backend> CREATE TABLE pg_catalog.pg_indexstat( \ istindex oid NOT NULL, \ istcorrel float4 NOT NULL) WITHOUT OIDS; backend> CREATE UNIQUE INDEX pg_indexstat_index_index \ ON pg_indexstat(istindex); Repeat this for each database. Usage example (using Sean's data): psql testdb testdb=# \d rucc Table "public.rucc" Column | Type | Modifiers ---------------+--------------------------+----------- user_id | integer | not null category_id | integer | not null img_bytes | bigint | not null img_hits | integer | not null html_bytes | bigint | not null html_hits | integer | not null unknown_bytes | bigint | not null unknown_hits | integer | not null utc_date | timestamp with time zone | not null time_interval | interval | not null Indexes: "rucc_htmlbytes_idx" btree (html_bytes), "rucc_id_date_idx" btree (user_id, utc_date) testdb=# SELECT 'rucc_id_date_idx'::regclass::oid; oid --------- 1281422 (1 row) testdb=# set enable_seqscan = off; SET testdb=# set enable_indexstat = on; SET testdb=# INSERT INTO pg_indexstat VALUES (1281422, 0.0001); INSERT 0 1 testdb=# EXPLAIN SELECT * FROM rucc WHERE user_id < 1000; QUERY PLAN -------------------------------------------------------------------------------- Index Scan using rucc_id_date_idx on rucc (cost=0.00..634342.50 rows=139802 width=64) Index Cond: (user_id < 1000) (2 rows) testdb=# UPDATE pg_indexstat SET istcorrel=0.1 WHERE istindex=1281422; testdb=# EXPLAIN SELECT ... istcorrel | cost ----------+---------- 0.0001 | 634342.50 0.1 | 514678.48 0.2 | 407497.85 0.5 | 161612.89 0.9 | 10299.07 1.0 | 3994.32 Actually the table is clustered on rucc_id_date_idx, so index correlation is 1.0, but there is no way to know that, when we only have the column correlations for user_id (1.0) and utc_date (0.59). The current code guesses the index correlation to be 0.5 which gives a cost estimate that is far too high. For comparison: seq scan estimated cost ~ 21000, actual ~ 11500, index scan actual ~ 4000 If you are going to test this patch, please be aware that I created it on top of another one of my experimental patches (http://www.pivot.at/pg/16d-correlation_74b1.diff). If you don't want to apply this one, one hunk of the IndexCorrel patch will fail in selfuncs.c. Should be no problem to apply it manually. And those who are still experimenting with 7.3.4 performance, can use http://www.pivot.at/pg/16d-correlation_734.diff and http://www.pivot.at/pg/22-IndexCorrel_74b1.diff. ToDo: .. Move get_index_correlation from contrib into the backend. .. ANALYZE table computes index correlation for all indexes. .. New command ANALYZE index? .. System cache invalidation? syscache.c: reloidattr = Anum_pg_indexstat_istindex ? .. Dependency? .. Remove GUC variable enable_indexstat .. Remove old method in btcostestimate() .. Automatic catalog upgrade Servus Manfred ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html