I am trying to figure out how the distribution of data affects index usage by the query because I am seeing some behavior that does not seem optimal to my uneducated eye.
I am on PG 8.1.8. I have two tables foo and foo_detail, both have been vacuum analyzed recently. Both have a property_id column, both have an index on it. The foo table has a state_code, also indexed, and the relative share of rows for the two state_codes used in the example below is: PA 2842 2.80% MN 2858 2.81% The distribution of distinct property_ids is fairly similar: PA 719 2.90% MN 765 3.09% A simple query filtered by PA vs. MN produces different results (see below). The PA query does a Seq Scan, the MN query uses the index and is >20 times faster. Both return about the same number of rows. I tried it with all state_codes that have rows in foo and it seems that the cutoff is somewhere around 3%, but there isn't a direct correlation (there are state_codes that are < 3% that trigger a Seq Scan and there are ones above 3% that result in an Index scan). I am curious what could make the PA query to ignore the index. What are the specific stats that are being used to make this decision? Would it perform better if it were to use the index? Anything I can do to "nudge" it towards using the index, which seems like a rather beneficial thing? The actual queries: explain analyze select f.property_id from foo f inner join foo_detail fd using (property_id) where f.state_code = 'PA' Merge Join (cost=17842.71..18436.30 rows=3347 width=4) (actual time=594.538..972.032 rows=2842 loops=1) Merge Cond: ("outer".property_id = "inner".property_id) -> Sort (cost=4381.72..4390.09 rows=3347 width=4) (actual time=14.092..18.497 rows=2842 loops=1) Sort Key: f.property_id -> Bitmap Heap Scan on foo f (cost=22.71..4185.78 rows=3347 width=4) (actual time=0.826..7.008 rows=2842 loops=1) Recheck Cond: (state_code = 'PA'::bpchar) -> Bitmap Index Scan on mv_search_state (cost=0.00..22.71 rows=3347 width=0) (actual time=0.734..0.734 rows=2842 loops=1) Index Cond: (state_code = 'PA'::bpchar) -> Sort (cost=13460.99..13732.84 rows=108742 width=4) (actual time=580.312..754.012 rows=110731 loops=1) Sort Key: fd.property_id -> Seq Scan on foo_detail fd (cost=0.00..4364.42 rows=108742 width=4) (actual time=0.006..210.846 rows=108742 loops=1) Total runtime: 991.852 ms explain analyze select f.property_id from foo f inner join foo_detail fd using (property_id) where f.state_code = 'MN' Nested Loop (cost=7.62..8545.85 rows=1036 width=4) (actual time=0.877..44.196 rows=2858 loops=1) -> Bitmap Heap Scan on foo f (cost=7.62..2404.44 rows=1036 width=4) (actual time=0.852..6.579 rows=2858 loops=1) Recheck Cond: (state_code = 'MN'::bpchar) -> Bitmap Index Scan on mv_search_state (cost=0.00..7.62 rows=1036 width=0) (actual time=0.744..0.744 rows=2858 loops=1) Index Cond: (state_code = 'MN'::bpchar) -> Index Scan using ix_fd on foo_detail fd (cost=0.00..5.92 rows=1 width=4) (actual time=0.005..0.007 rows=1 loops=2858) Index Cond: ("outer".property_id = fd.property_id) Total runtime: 48.439 ms ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster