The problem was described here earlier, but unfortunately there were no answers: http://www.postgresql.org/message-id/1387780366.910542...@f394.i.mail.ru It looks like a defect.
CREATE TABLE t1 (c1 INTEGER, id INTEGER PRIMARY KEY);
INSERT INTO t1 (c1, id) SELECT b, b FROM generate_series(1, 1000000) a (b);
REINDEX TABLE t1;
ANALYZE t1;

CREATE TABLE t2 (c1 INTEGER, id INTEGER PRIMARY KEY);
INSERT INTO t2 (c1, id) SELECT b, b FROM generate_series(1, 1000000) a (b);
REINDEX TABLE t2;
ANALYZE t2;

CREATE TEMP TABLE t3 (ref_id INTEGER);
INSERT INTO t3 (ref_id) VALUES (333333), (666666);
ANALYZE t3;

test=> EXPLAIN (ANALYZE) SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t INNER JOIN t3 ON t3.ref_id = t.id;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.42..34.84 rows=20000 width=12) (actual time=0.046..0.104 rows=4 loops=1)
   ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4) (actual time=0.008..0.009 rows=2 loops=1)
   ->  Append  (cost=0.42..16.89 rows=2 width=8) (actual time=0.023..0.042 rows=2 loops=2)
         ->  Index Scan using t1_pkey on t1  (cost=0.42..8.45 rows=1 width=8) (actual time=0.020..0.022 rows=1 loops=2)
               Index Cond: (id = t3.ref_id)
         ->  Index Scan using t2_pkey on t2  (cost=0.42..8.45 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=2)
               Index Cond: (id = t3.ref_id)
 Total runtime: 0.184 ms
(8 rows)

This plan is perfect, but the row estimate is not: 20000 vs. 4. As far as I can see, Pg is able to make the correct row estimates for the inputs: the inner Append has rows=2 and the outer Seq Scan has rows=2, so the Nested Loop should know it can produce at most 2 * 2 = 4 rows. Moreover, the cost estimate is _correct_: it corresponds to rows=4. Why does the correct row estimate matter? Because it matters in a more complex query.
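The arithmetic I would expect from the planner can be sketched in a few lines (an illustration only, not PostgreSQL's actual estimator code; the numbers are taken from the EXPLAIN output above):

```python
# Row counts from the EXPLAIN (ANALYZE) output above.
outer_rows = 2           # Seq Scan on t3: rows=2 (estimated and actual)
inner_rows_per_loop = 2  # Append: rows=2 per outer row (1 from t1 + 1 from t2)

# A nested loop can return at most outer_rows * inner_rows_per_loop rows,
# so a tight estimate for this join would be:
expected_estimate = outer_rows * inner_rows_per_loop
print(expected_estimate)  # 4 -- matches the actual rows=4

# What the planner actually reported for the join node:
planner_estimate = 20000
print(planner_estimate // expected_estimate)  # 5000 -- the overestimate factor
```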
Let's join another big table to that query:

CREATE TABLE links (c1 INTEGER PRIMARY KEY, descr TEXT);
INSERT INTO links (c1, descr) SELECT b, '2' FROM generate_series(1, 100000) a (b);
REINDEX TABLE links;
ANALYZE links;

test=> EXPLAIN (ANALYZE) SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t INNER JOIN t3 ON t3.ref_id = t.id INNER JOIN links l ON (t.c1 = l.c1);
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2693.43..3127.84 rows=20000 width=18) (actual time=33.619..33.619 rows=0 loops=1)
   Hash Cond: (t1.c1 = l.c1)
   ->  Nested Loop  (cost=0.42..34.84 rows=20000 width=12) (actual time=0.038..0.078 rows=4 loops=1)
         ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4) (actual time=0.006..0.007 rows=2 loops=1)
         ->  Append  (cost=0.42..16.89 rows=2 width=8) (actual time=0.017..0.029 rows=2 loops=2)
               ->  Index Scan using t1_pkey on t1  (cost=0.42..8.45 rows=1 width=8) (actual time=0.015..0.017 rows=1 loops=2)
                     Index Cond: (id = t3.ref_id)
               ->  Index Scan using t2_pkey on t2  (cost=0.42..8.45 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=2)
                     Index Cond: (id = t3.ref_id)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=6) (actual time=33.479..33.479 rows=100000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 3711kB
         ->  Seq Scan on links l  (cost=0.00..1443.00 rows=100000 width=6) (actual time=0.017..14.853 rows=100000 loops=1)
 Total runtime: 33.716 ms
(13 rows)

The planner thinks there will be 20000 rows from the join between "t" and "t3", and that is why it decides to use a hash join against the "links" table.
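A back-of-the-envelope comparison (my own sketch, not PostgreSQL's real cost model) shows why the inflated estimate pushes the planner toward the hash join. The per-probe cost of roughly 8.31 is taken from the Index Scan on links_pkey shown later in this post, and 1443.00 is the Seq Scan on "links" that feeds the Hash node:

```python
# Numbers taken from the EXPLAIN output in this post; rough illustration only.
index_probe_cost = 8.31    # one Index Scan using links_pkey
hash_build_cost = 1443.00  # Seq Scan on links feeding the Hash node

rows_estimated = 20000     # planner's estimate for the t/t3 join
rows_actual = 4            # what that join really produces

# With the inflated estimate, 20000 index probes look far costlier than
# building the hash table once, so the planner picks the hash join:
print(rows_estimated * index_probe_cost > hash_build_cost)  # True

# With the true row count, probing the index four times is trivially cheap,
# and a nested loop over links_pkey would win:
print(rows_actual * index_probe_cost < hash_build_cost)     # True
```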
Let's prove it:

CREATE OR REPLACE FUNCTION public.f1()
 RETURNS SETOF integer
 LANGUAGE plpgsql
 ROWS 20000
AS $function$
BEGIN
  RETURN QUERY EXECUTE 'SELECT t.c1 FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t INNER JOIN t3 ON t3.ref_id = t.id';
END;
$function$

test=> explain select * from f1() t(c1) INNER JOIN links l ON (t.c1 = l.c1);
                                QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=2693.25..3293.25 rows=20000 width=10)
   Hash Cond: (t.c1 = l.c1)
   ->  Function Scan on f1 t  (cost=0.25..200.25 rows=20000 width=4)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=6)
         ->  Seq Scan on links l  (cost=0.00..1443.00 rows=100000 width=6)
(5 rows)

The same "defective" plan.

test=> ALTER FUNCTION f1() ROWS 4;
ALTER FUNCTION
test=> explain select * from f1() t(c1) INNER JOIN links l ON (t.c1 = l.c1);
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.54..33.58 rows=4 width=10)
   ->  Function Scan on f1 t  (cost=0.25..0.29 rows=4 width=4)
   ->  Index Scan using links_pkey on links l  (cost=0.29..8.31 rows=1 width=6)
         Index Cond: (c1 = t.c1)
(4 rows)

The correct, perfect plan. In real life I have a bigger "links" table, and the wrong plan slows execution down significantly. I found several workarounds, so it is no longer a problem for me; I just want to report this strange behavior. I tried to look into the source code and found some interesting places there, but I think that is of little use: the Pg developers know the situation much better than I do.