Tom, Just to clarify, is the lefthand input customers or balances?
And turning off merge joins "fixes" everything, including the runtime: production=> set enable_mergejoin = off; SET production=> explain analyze SELECT ac.* FROM balances ac JOIN customers o ON (o.id = ac.customer_id AND o.group_id = 45); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=20288.24..20722.35 rows=7 width=80) (actual time=133.202..140.876 rows=7318 loops=1) Hash Cond: (ac.customer_id = o.id) -> Seq Scan on balances ac (cost=0.00..370.76 rows=16876 width=80) (actual time=0.015..5.853 rows=16876 loops=1) -> Hash (cost=20285.94..20285.94 rows=184 width=8) (actual time=126.768..126.768 rows=430 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 12kB -> Seq Scan on customers o (cost=0.00..20285.94 rows=184 width=8) (actual time=16.901..126.606 rows=430 loops=1) Filter: (group_id = 45) Rows Removed by Filter: 476221 Total runtime: 142.089 ms (9 rows) Sean On Tue, Oct 13, 2015 at 11:20 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Sean Rhea <sean.c.r...@gmail.com> writes: > > No, the customers table is not 100% the same. This is a live production > > system, so the data is (unfortunately) changing under us a bit here. That > > said, there are still some strange things going on. I just reran > > everything. The query plan time hasn't changed, but as Jeremy, Igor, and > > David all pointed out, there's something funky going on with the apparent > > size of the customers table. These queries were all run within 5 minutes > of > > each other: > > > production=> explain analyze SELECT ac.* FROM balances ac JOIN customers > o > > ON (o.id= ac.customer_id AND o.group_id = 45); > > QUERY > > PLAN > > > ----------------------------------------------------------------------------------------------------------------------------------------------------------- > > Merge Join (cost=2475.89..20223.08 rows=7 width=80) (actual > > time=157.437..243670.853 rows=7318 loops=1) > > Merge Cond: (ac.customer_id = o.id) -> Index Scan using > > balances_customer_id_index on balances ac (cost=0.00..727.42 rows=16876 > > width=80) (actual time=0.489..30.573 rows=16876 loops=1) > > -> Index Scan using customers_pkey on customers o > (cost=0.00..65080.01 > > rows=184 width=8) (actual time=127.266..243582.767 rows=*7672* loops=1) > > Filter: (group_id = 45) > > Rows Removed by Filter: *212699113* > > Total runtime: 243674.288 ms > > (7 rows) > > > production=> select count(*) from customers where group_id = 45; > > count > > ------- > > 430 > > (1 row) > > What you're looking at there is rows being read repeatedly as a > consequence of the mergejoin applying mark/restore operations to rescan > portions of its righthand input. This will happen whenever there are > duplicate keys in the lefthand input. > > I think the planner does take the possibility of rescans into account > in its cost estimates, but perhaps it's not weighing it heavily > enough. It would be interesting to see what you get as a second-choice > plan if you set enable_mergejoin = off. > > regards, tom lane >