I am trying to figure out how the distribution of data affects index
usage by the query because I am seeing some behavior that does not seem
optimal to my uneducated eye. 

I am on PG 8.1.8. I have two tables foo and foo_detail, both have been
vacuum analyzed recently. Both have a property_id column, both have an
index on it. The foo table has a state_code, also indexed, and the
relative share of rows for the two state_codes used in the example below
is:
  PA    2842    2.80%
  MN    2858    2.81%

The distribution of distinct property_ids is fairly similar:
  PA    719     2.90%
  MN    765     3.09%

A simple query filtered by PA vs. MN produces different results (see
below). The PA query does a Seq Scan, the MN query uses the index and is
>20 times faster. Both return about the same number of rows. I tried it
with all state_codes that have rows in foo and it seems that the cutoff
is somewhere around 3%, but there isn't a direct correlation (there are
state_codes that are < 3% that trigger a Seq Scan and there are ones
above 3% that result in an Index scan).

I am curious what could make the PA query to ignore the index. What are
the specific stats that are being used to make this decision? Would it
perform better if it were to use the index? Anything I can do to "nudge"
it towards using the index, which seems like a rather beneficial thing?

The actual queries:

explain analyze
select
  f.property_id
from foo f
  inner join foo_detail fd using (property_id)
where f.state_code = 'PA'

Merge Join  (cost=17842.71..18436.30 rows=3347 width=4) (actual
time=594.538..972.032 rows=2842 loops=1)
  Merge Cond: ("outer".property_id = "inner".property_id)
  ->  Sort  (cost=4381.72..4390.09 rows=3347 width=4) (actual
time=14.092..18.497 rows=2842 loops=1)
        Sort Key: f.property_id
        ->  Bitmap Heap Scan on foo f  (cost=22.71..4185.78 rows=3347
width=4) (actual time=0.826..7.008 rows=2842 loops=1)
              Recheck Cond: (state_code = 'PA'::bpchar)
              ->  Bitmap Index Scan on mv_search_state
(cost=0.00..22.71 rows=3347 width=0) (actual time=0.734..0.734 rows=2842
loops=1)
                    Index Cond: (state_code = 'PA'::bpchar)
  ->  Sort  (cost=13460.99..13732.84 rows=108742 width=4) (actual
time=580.312..754.012 rows=110731 loops=1)
        Sort Key: fd.property_id
        ->  Seq Scan on foo_detail fd  (cost=0.00..4364.42 rows=108742
width=4) (actual time=0.006..210.846 rows=108742 loops=1)
Total runtime: 991.852 ms

explain analyze
select
  f.property_id
from foo f
  inner join foo_detail fd using (property_id)
where f.state_code = 'MN'

Nested Loop  (cost=7.62..8545.85 rows=1036 width=4) (actual
time=0.877..44.196 rows=2858 loops=1)
  ->  Bitmap Heap Scan on foo f  (cost=7.62..2404.44 rows=1036 width=4)
(actual time=0.852..6.579 rows=2858 loops=1)
        Recheck Cond: (state_code = 'MN'::bpchar)
        ->  Bitmap Index Scan on mv_search_state  (cost=0.00..7.62
rows=1036 width=0) (actual time=0.744..0.744 rows=2858 loops=1)
              Index Cond: (state_code = 'MN'::bpchar)
  ->  Index Scan using ix_fd on foo_detail fd  (cost=0.00..5.92 rows=1
width=4) (actual time=0.005..0.007 rows=1 loops=2858)
        Index Cond: ("outer".property_id = fd.property_id)
Total runtime: 48.439 ms

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Reply via email to