Wrong results due to missing quals

2023-05-24 Thread Richard Guo
Testing with SQLancer reports a wrong results issue on master and I
reduced it to the repro query below.

create table t (a int, b int);

explain (costs off)
select * from t t1 left join
(t t2 left join t t3 full join t t4 on false on false)
left join t t5 on t2.a = t5.a
on t2.b = 1;
QUERY PLAN
--
 Nested Loop Left Join
   ->  Seq Scan on t t1
   ->  Materialize
 ->  Nested Loop Left Join
   ->  Nested Loop Left Join
 Join Filter: false
 ->  Seq Scan on t t2
   Filter: (b = 1)
 ->  Result
   One-Time Filter: false
   ->  Materialize
 ->  Seq Scan on t t5
(12 rows)

So the qual 't2.a = t5.a' is missing.

I looked into it and found that both clones of this joinqual are
rejected by clause_is_computable_at, because their required_relids do
not include the outer join of t2/(t3/t4), and meanwhile include nullable
rels of this outer join.

I think the root cause is that, as Tom pointed out in [1], we're not
maintaining required_relids very accurately.  In b9c755a2, we make
clause_is_computable_at test required_relids for clone clauses.  I think
this is how this issue sneaks in.

To fix it, it seems to me that the ideal way would be to always compute
accurate required_relids.  But I'm not sure how difficult it is.

[1] https://www.postgresql.org/message-id/395264.1684698283%40sss.pgh.pa.us

Thanks
Richard


Re: Wrong results due to missing quals

2023-05-24 Thread Tom Lane
Richard Guo  writes:
> So the qual 't2.a = t5.a' is missing.

Ugh.

> I looked into it and found that both clones of this joinqual are
> rejected by clause_is_computable_at, because their required_relids do
> not include the outer join of t2/(t3/t4), and meanwhile include nullable
> rels of this outer join.
> I think the root cause is that, as Tom pointed out in [1], we're not
> maintaining required_relids very accurately.  In b9c755a2, we make
> clause_is_computable_at test required_relids for clone clauses.  I think
> this is how this issue sneaks in.

Yeah.  I'm starting to think that b9c755a2 took the wrong approach.
Really, required_relids is about making sure that a qual isn't
evaluated "too low", before all necessary joins have been formed.  But
clause_is_computable_at is charged with making sure we don't evaluate
it "too high", after some incompatible join has been formed.  There's
no really good reason to suppose that required_relids can serve both
purposes, even if it were computed perfectly accurately (and what is
perfect, anyway?).

So right now I'm playing with the idea of reverting the change in
clause_is_computable_at and seeing how else we can fix the previous
bug.  Don't have anything to show yet, but one thought is that maybe
deconstruct_distribute_oj_quals needs to set up clause_relids for
clone clauses differently.  Another idea is that maybe we need another
RestrictInfo field that's directly a set of OJ relids that this clause
can't be applied above.  That'd reduce clause_is_computable_at to
basically a bms_intersect test which would be nice speed-wise.  The
space consumption could be annoying, but I'm thinking that we might
only have to populate the field in clone clauses, which would
alleviate that issue.

regards, tom lane




Re: Wrong results due to missing quals

2023-05-24 Thread Tom Lane
I wrote:
> ... Another idea is that maybe we need another
> RestrictInfo field that's directly a set of OJ relids that this clause
> can't be applied above.  That'd reduce clause_is_computable_at to
> basically a bms_intersect test which would be nice speed-wise.  The
> space consumption could be annoying, but I'm thinking that we might
> only have to populate the field in clone clauses, which would
> alleviate that issue.

I tried this and it seems to work all right: it fixes the example
you showed while not causing any new failures.  (Doesn't address
the broken join-removal logic you showed in the other thread,
though.)

While at it, I also changed make_restrictinfo to treat has_clone
and is_clone as first-class citizens, to fix the dubious coding in
equivclass.c that I mentioned at [1].

regards, tom lane

[1] https://www.postgresql.org/message-id/395264.1684698283%40sss.pgh.pa.us

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 428ea3810f..c5cada55fb 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6521,8 +6521,11 @@ foreign_grouping_ok(PlannerInfo *root, RelOptInfo *grouped_rel,
 	  expr,
 	  true,
 	  false,
+	  false,
+	  false,
 	  root->qual_security_level,
 	  grouped_rel->relids,
+	  NULL,
 	  NULL);
 			if (is_foreign_expr(root, grouped_rel, expr))
 fpinfo->remote_conds = lappend(fpinfo->remote_conds, rinfo);
diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index f1a96b8584..2ab4f3dbf3 100644
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
@@ -539,14 +539,13 @@ set includes the A/B join relid which is not in the input.  However,
 if we form B/C after A/B, then both forms of the clause are applicable
 so far as that test can tell.  We have to look more closely to notice
 that the "Pbc" clause form refers to relation B which is no longer
-directly accessible.  While this check is straightforward, it's not
-especially cheap (see clause_is_computable_at()).  To avoid doing it
-unnecessarily, we mark the variant versions of a redundant clause as
-either "has_clone" or "is_clone".  When considering a clone clause,
-we must check clause_is_computable_at() to disentangle which version
-to apply at the current join level.  (In debug builds, we also Assert
-that non-clone clauses are validly computable at the current level;
-but that seems too expensive for production usage.)
+directly accessible.  While such a check could be performed using the
+per-relation RelOptInfo.nulling_relids data, it would be annoyingly
+expensive to do over and over as we consider different join paths.
+To make this simple and reliable, we compute an "incompatible_relids"
+set for each variant version (clone) of a redundant clause.  A clone
+clause should not be applied if any of the outer-join relids listed in
+incompatible_relids has already been computed below the current join.
 
 
 Optimizer Functions
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 2db1bf6448..7fa502d6e2 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -197,9 +197,12 @@ process_equivalence(PlannerInfo *root,
 make_restrictinfo(root,
   (Expr *) ntest,
   restrictinfo->is_pushed_down,
+  restrictinfo->has_clone,
+  restrictinfo->is_clone,
   restrictinfo->pseudoconstant,
   restrictinfo->security_level,
   NULL,
+  restrictinfo->incompatible_relids,
   restrictinfo->outer_relids);
 		}
 		return false;
@@ -1972,7 +1975,8 @@ create_join_clause(PlannerInfo *root,
  * clause into the regular processing, because otherwise the join will be
  * seen as a clauseless join and avoided during join order searching.
  * We handle this by generating a constant-TRUE clause that is marked with
- * required_relids that make it a join between the correct relations.
+ * the same required_relids etc as the removed outer-join clause, thus
+ * making it a join clause between the correct relations.
  */
 void
 reconsider_outer_join_clauses(PlannerInfo *root)
@@ -2001,10 +2005,13 @@ reconsider_outer_join_clauses(PlannerInfo *root)
 /* throw back a dummy replacement clause (see notes above) */
 rinfo = make_restrictinfo(root,
 		  (Expr *) makeBoolConst(true, false),
-		  true, /* is_pushed_down */
+		  rinfo->is_pushed_down,
+		  rinfo->has_clone,
+		  rinfo->is_clone,
 		  false,	/* pseudoconstant */
 		  0,	/* security_level */
 		  rinfo->required_relids,
+		  rinfo->incompatible_relids,
 		  rinfo->outer_relids);
 distribute_restrictinfo_to_rels(root, rinfo);
 			}
@@ -2026,10 +2033,13 @@ reconsider_outer_join_clauses(PlannerInfo *root)
 /* throw back a dummy replacement

Re: Wrong results due to missing quals

2023-05-24 Thread Richard Guo
On Thu, May 25, 2023 at 5:28 AM Tom Lane  wrote:

> I tried this and it seems to work all right: it fixes the example
> you showed while not causing any new failures.  (Doesn't address
> the broken join-removal logic you showed in the other thread,
> though.)
>
> While at it, I also changed make_restrictinfo to treat has_clone
> and is_clone as first-class citizens, to fix the dubious coding in
> equivclass.c that I mentioned at [1].


The "incompatible_relids" idea is a stroke of genius.  I reviewed the
patch and did not find any problem.  So big +1 to the patch.

Thanks
Richard


Re: Wrong results due to missing quals

2023-05-25 Thread Tom Lane
Richard Guo  writes:
> The "incompatible_relids" idea is a stroke of genius.  I reviewed the
> patch and did not find any problem.  So big +1 to the patch.

Pushed, thanks for the report and the review.

regards, tom lane