Short summary: * It looks to me like the planner vastly overestimates the # of pages read by index scan in quite a few of my tables even though stats collected by ANALYZE are correct.
* The problem happens any time you have multiple columns that have a number of repeated values in them, and you CLUSTER the table by a sort using both columns (like "city,state,zip,phone#" or "firstname,lastname"). * I think this is the problem that Mark Kirkwood is seeing in his threads Query optimizer 8.0.1 and "One Big trend vs multiple smaller trends" in hackers. * A test script demonstrating the issue also follows. * I think keeping one more stat per attribute in pg_stastic that could describe this behavior. Longer: If I understand the optimizer correctly, correlation is used to both guess how much random disk access will be required in a query; as well as estimate how many pages will be read. Unfortunately, many tables in my larger databases have columns with values that are tightly packed on a few pages; even though there is no total-ordering across the whole table. Stephan Szabo described this as a "clumping effect": http://archives.postgresql.org/pgsql-performance/2003-01/msg00286.php The test script below shows a table with 6 columns and 1 million rows. Note that ANALYZE correctly observes that the correlation for all the columns except "A" is near zero. However, also note that columns A,B,C,and even D will have extremely few UNIQUE values on any given page of data. And conversely, an index scan on any of those columns be a good choice (as seen by EXPLAIN ANALYZE output below). Instead of just storing "correlation", I would also like to store a variation of "correlation" that "pretends" that the disk-pages for a particular column were sorted by the min value for that column. Wherever the optimizer would use the existing 'correlation' to estimate the number of pages that would be accesses; it could use this "correlation discounting the order of blocks" value instead. All the math/stastics/theory that suggests correlation is good at estimating # of pages would remain intact. Anyway... I've talked too much... The test script showing 1) the tables 2) the values in pg_stats 3) EXPLAIN ANALYZE of columns showing the problem follows: ******************************************************************************** ******************************************************************************** ******************************************************************************** ******************************************************************************** ******************************************************************************** fli=# create temporary table tmp1mil as select * from (select generate_series as a from generate_series(0,9)) as a, (select generate_series as b from generate_series(0,9)) as b, (select generate_series as c from generate_series(0,9)) as c, (select generate_series as d from generate_series(0,9)) as d, (select generate_series as e from generate_series(0,9)) as e, (select generate_series as f from generate_series(0,9)) as f order by a,b,c,d,e,f; fli-# fli-# fli-# fli-# fli-# fli-# fli-# fli-# SELECT fli=# fli=# vacuum analyze tmp1mil; VACUUM fli=# select * from pg_stats where tablename='tmp1mil'; schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation ------------+-----------+---------+-----------+-----------+------------+-----------------------+------------------------------------------------------------------------------------------------+------------------+------------- pg_temp_4 | tmp1mil | a | 0 | 4 | 10 | {7,8,5,0,6,3,4,1,2,9} | {0.113,0.106667,0.105667,0.104333,0.101333,0.097,0.095,0.0936667,0.092,0.0913333} | | 1 pg_temp_4 | tmp1mil | b | 0 | 4 | 10 | {1,4,0,5,6,2,9,7,3,8} | {0.119333,0.112667,0.107,0.101,0.099,0.0976667,0.0946667,0.092,0.0886667,0.088} | | 0.229754 pg_temp_4 | tmp1mil | c | 0 | 4 | 10 | {9,5,0,1,8,6,4,7,3,2} | {0.114667,0.107,0.103,0.101667,0.101667,0.0996667,0.0956667,0.094,0.0933333,0.0893333} | | 0.142119 pg_temp_4 | tmp1mil | d | 0 | 4 | 10 | {4,3,2,8,7,9,5,6,1,0} | {0.114667,0.11,0.108,0.104,0.102667,0.099,0.098,0.0903333,0.0893333,0.084} | | 0.0930835 pg_temp_4 | tmp1mil | e | 0 | 4 | 10 | {0,5,1,9,4,7,8,2,3,6} | {0.109333,0.109333,0.108333,0.100667,0.100333,0.0983333,0.0966667,0.0936667,0.092,0.0913333} | | 0.138575 pg_temp_4 | tmp1mil | f | 0 | 4 | 10 | {2,8,6,0,3,5,9,1,7,4} | {0.104667,0.103667,0.102667,0.102333,0.101667,0.100667,0.100667,0.0986667,0.0943333,0.0906667} | | 0.139991 (6 rows) fli=# create index tmp1mil__c on tmp1mil(c); CREATE INDEX fli=# explain analyze select * from tmp1mil where c=5; QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Seq Scan on tmp1mil (cost=0.00..19853.00 rows=107001 width=24) (actual time=2.164..480.863 rows=100000 loops=1) Filter: (c = 5) Total runtime: 531.345 ms (3 rows) fli=# explain analyze select * from tmp1mil where c=5; QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Seq Scan on tmp1mil (cost=0.00..19853.00 rows=107001 width=24) (actual time=2.191..481.849 rows=100000 loops=1) Filter: (c = 5) Total runtime: 531.858 ms (3 rows) fli=# set enable_seqscan=false; SET fli=# explain analyze select * from tmp1mil where c=5; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Index Scan using tmp1mil__c on tmp1mil (cost=0.00..364309.19 rows=107001 width=24) (actual time=0.102..101.460 rows=100000 loops=1) Index Cond: (c = 5) Total runtime: 152.103 ms (3 rows) fli=# explain analyze select * from tmp1mil where c=5; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Index Scan using tmp1mil__c on tmp1mil (cost=0.00..364309.19 rows=107001 width=24) (actual time=0.082..101.298 rows=100000 loops=1) Index Cond: (c = 5) Total runtime: 151.914 ms (3 rows) fli=# ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly