I looked into the problem reported in bug #7604 (Bill MacArthur was kind enough to send me a reproducer off-list). The error can be demonstrated by this example in the regression test database:
select f1, unique2, case when unique2 is null then f1 else 0 end from int4_tbl a left join tenk1 b on f1 = unique2 where (case when unique2 is null then f1 else 0 end) = 0; If you look at the output of the basic join: regression=# select f1, unique2, case when unique2 is null then f1 else 0 end regression-# from int4_tbl a left join tenk1 b on f1 = unique2; f1 | unique2 | case -------------+---------+------------- 0 | 0 | 0 123456 | | 123456 -123456 | | -123456 2147483647 | | 2147483647 -2147483647 | | -2147483647 (5 rows) it's obvious that only the first of these rows passes the added WHERE condition, so the query should produce just that row; but what you actually get in 9.2 is all five rows! EXPLAIN gives a clue what's going wrong: Nested Loop Left Join (cost=0.00..22.51 rows=1 width=8) -> Seq Scan on int4_tbl a (cost=0.00..1.05 rows=5 width=4) -> Index Only Scan using tenk1_unique2 on tenk1 b (cost=0.00..4.28 rows=1 w idth=4) Index Cond: (unique2 = a.f1) Filter: (CASE WHEN (unique2 IS NULL) THEN a.f1 ELSE 0 END = 0) (5 rows) The filter condition has been dropped down to the scan of tenk1 (relying on a.f1 being passed in as a parameter), which is wrong since there it is unable to filter null-extended rows produced by the left join operator. At first I didn't see how this could be happening: the code for placing quals looks at RestrictInfo.nullable_relids, and this qual should have tenk1 in its nullable_relids since it's above a left join to tenk1. But gdb soon told me that that wasn't what join_clause_is_movable_to was seeing. Eventually I figured out that the equivalence-class machinery was eating the clause (forming an EquivalenceClass consisting of the CASE expression and the constant 0) and reconstructing a new clause that didn't have nullable_relids set. So this is essentially an oversight in the patch that added tracking of nullable_relids. I got confused about the difference between outerjoin_delayed (this clause, as a whole, is not outerjoin_delayed because its natural semantic level would be at the join anyway) and having nonempty nullable_relids, and thought that equivalence-class processing would never see such clauses because it doesn't accept outerjoin_delayed clauses. So there's no code in equivclass.c to compute nullable_relids sets for constructed clauses. At some point it might be worth adding same, but for now I'm just going to tweak distribute_qual_to_rels to not believe that clauses with nonempty nullable_relids can be equivalence clauses. It gets worse though. Tracking clauses' nullable_relids is new in 9.2, but variants of this example can be demonstrated to fail as far back as 7.4. Here's one: select q1, unique2, thousand, hundred from int8_tbl a left join tenk1 b on q1 = unique2 where coalesce(thousand,123) = q1 and q1 = coalesce(hundred,123); If you look at just the bare join output, there are no rows that should pass the WHERE clause, but actually what you get in 7.4-9.1 is q1 | unique2 | thousand | hundred -----+---------+----------+--------- 123 | | | 123 | | | (2 rows) which is completely broken because it's not even a subset of the bare join output :-(. 9.1 shows this EXPLAIN output: Nested Loop Left Join (cost=0.00..42.48 rows=1 width=20) Filter: (a.q1 = COALESCE(b.thousand, 123)) -> Seq Scan on int8_tbl a (cost=0.00..1.05 rows=5 width=8) -> Index Scan using tenk1_unique2 on tenk1 b (cost=0.00..8.27 rows=1 width= 12) Index Cond: (a.q1 = unique2) Filter: (COALESCE(thousand, 123) = COALESCE(hundred, 123)) What's happening here is that because the two WHERE clauses are not outerjoin_delayed, they're being deconstructed as equivalence clauses, and then we synthesize an equivalence constraint on the two COALESCE expressions that can be applied at the tenk1 scan level. That gets rid of the row that should join to q1=123, allowing the bogus null-extended join rows to be formed. So this shows that, although 9.2 has a more extensive form of the disease, the use of outerjoin_delayed to decide whether a clause can be an equivalence clause is really quite wrong. I believe what I need to do to fix this is back-port the 9.2 logic that computes a nullable_relids set for each clause. We won't need to store it, just check whether it's empty before letting the clause be treated as an equivalence clause. Is anybody concerned about the compatibility implications of fixing this bug in the back branches? I'm worried about people complaining that we broke their application in a minor release. Maybe they were depending on incorrect behavior, but they might complain anyway. On the other hand, the fact that this hasn't been reported from the field in nine years suggests that not many people write queries like this. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers