>The most obvious thing to do about this is to consider that one path can >dominate another on cost only if it is both cheaper *and* produces no >more rows. But I'm concerned about the cost of inserting yet another >test condition into add_path, which is slow enough already. Has anyone >got an idea for a different formulation that would avoid that?
Will it discard the seq. scan path if the number of rows is more and cost is less or will it keep both paths and then later based on cost estimation for join it discards one of them? Can it be like if seq. scan has low cost and number of rows is only greater by certain thresh-hold, only then seq. scan will be kept else index scan will dominate. -----Original Message----- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane Sent: Wednesday, February 29, 2012 4:05 AM To: pgsql-hackers@postgreSQL.org Subject: [HACKERS] Parameterized-path cost comparisons need some work I was experimenting today with a test case sent to me by somebody at Red Hat. The details aren't too important, except that it involves an inner join on the inside of a left join (where it cannot commute with the left join). I can reproduce the behavior with a standard regression test table, if I crank random_page_cost up a bit: regression=# set random_page_cost TO 5; SET regression=# explain select * from tenk1 t1 left join (tenk1 t2 join tenk1 t3 on t2.thousand = t3.unique2) on t1.hundred = t2.hundred where t1.unique1 = 1; QUERY PLAN ---------------------------------------------------------------------------- -------------------- Nested Loop Left Join (cost=0.00..507.16 rows=100 width=732) -> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..10.28 rows=1 width=244) Index Cond: (unique1 = 1) -> Nested Loop (cost=0.00..495.63 rows=100 width=488) -> Index Scan using tenk1_hundred on tenk1 t2 (cost=0.00..446.98 rows=100 width=244) Index Cond: (t1.hundred = hundred) -> Index Scan using tenk1_unique2 on tenk1 t3 (cost=0.00..0.48 rows=1 width=244) Index Cond: (unique2 = t2.thousand) (8 rows) regression=# set random_page_cost TO 6; SET regression=# explain select * from tenk1 t1 left join (tenk1 t2 join tenk1 t3 on t2.thousand = t3.unique2) on t1.hundred = t2.hundred where t1.unique1 = 1; QUERY PLAN ---------------------------------------------------------------------------- ----------------- Hash Right Join (cost=928.30..2573.80 rows=100 width=732) Hash Cond: (t2.hundred = t1.hundred) -> Hash Join (cost=916.00..2523.00 rows=10000 width=488) Hash Cond: (t2.thousand = t3.unique2) -> Seq Scan on tenk1 t2 (cost=0.00..458.00 rows=10000 width=244) -> Hash (cost=458.00..458.00 rows=10000 width=244) -> Seq Scan on tenk1 t3 (cost=0.00..458.00 rows=10000 width=244) -> Hash (cost=12.29..12.29 rows=1 width=244) -> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..12.29 rows=1 width=244) Index Cond: (unique1 = 1) (10 rows) WTF? How did it suddenly fail to find the double-nested-loop plan, and have to settle for a much worse plan? The reason seems to be that for large enough random_page_cost, the seqscan on t2 is costed as cheaper than an indexscan that selects a significant fraction of the table. (Notice the t2 scans are nearly the same cost in these two examples.) That causes add_path to decide that the unparameterized seqscan is unconditionally better than the parameterized indexscan, and throw out the latter. Now it is impossible to form the parameterized nestloop subplan. The flaw in this logic, of course, is that the seqscan might be cheaper than the parameterized indexscan, but *it produces a whole lot more rows*, meaning that any subsequent join will be a lot more expensive. Previously add_path didn't have to worry about that, because all ordinary paths for a given relation produce the same number of rows (and we studiously kept things like inner indexscan paths out of add_path's view of the world). The most obvious thing to do about this is to consider that one path can dominate another on cost only if it is both cheaper *and* produces no more rows. But I'm concerned about the cost of inserting yet another test condition into add_path, which is slow enough already. Has anyone got an idea for a different formulation that would avoid that? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers