Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

Alena Rybakina Thu, 06 Jul 2023 06:52:09 -0700

Hi, all!

On 26.06.2023 12:22, Andrey Lepikhov wrote:

On 24/6/2023 17:23, Tomas Vondra wrote:
I really hope what I just wrote makes at least a little bit of sense.
Throw in one more example:
SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;

Here you can see the same kind of underestimation:
Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)
So the eqjoinsel_unmatch_left() function should be modified for thecase where nd1<nd2.
Unfortunately, this patch could not fix the cardinality calculation inthis request, I'll try to look and figure out what is missing here.


I tried to fix the cardinality score in the query above by changing:

diff --git a/src/backend/utils/adt/selfuncs.cb/src/backend/utils/adt/selfuncs.c

index 8e18aa1dd2b..40901836146 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,

* if we're calculating fraction of NULLs or fractionof unmatched rows.

                 */
                // unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
-               if (nd1 > nd2)
+               if (nd1 != nd2)
                {
-                       selec /= nd1;
-                       *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
+                       selec /= Max(nd1, nd2);

+ *unmatched_frac = abs(nd1 - nd2) * 1.0 /Max(nd1, nd2);

                }
+               /*if (nd1 > nd2)
+               {
+                       selec /= nd1;
+                       *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
+               }*/
                else
                {
                        selec /= nd2;

and it worked:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
ERROR:  relation "l" already exists
ERROR:  relation "r" already exists
INSERT 0 2
ANALYZE
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------

Hash Left Join (cost=1.09..1944.13 rows=99998 width=14) (actualtime=0.152..84.184 rows=99999 loops=1)

   Hash Cond: (l.id = r.id)
   Filter: (r.v IS NULL)
   Rows Removed by Filter: 2

-> Seq Scan on l (cost=0.00..1443.00 rows=100000 width=4) (actualtime=0.040..27.635 rows=100000 loops=1) -> Hash (cost=1.04..1.04 rows=4 width=10) (actualtime=0.020..0.022 rows=4 loops=1)

         Buckets: 1024  Batches: 1  Memory Usage: 9kB

-> Seq Scan on r (cost=0.00..1.04 rows=4 width=10) (actualtime=0.009..0.011 rows=4 loops=1)

 Planning Time: 0.954 ms
 Execution Time: 92.309 ms
(10 rows)

It looks too simple and I suspect that I might have missed somethingsomewhere, but so far I haven't found any examples of queries where itdoesn't work.

I didn't see it breaking anything in the examples from my previousletter [1].

1.https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru



Unfortunately, I can't understand your idea from point 4, please explain it?

The good thing is this helps even for IS NULL checks on non-join-key
columns (where we don't switch to an antijoin), but there's a couple
things that I dislike ...

1) It's not restricted to outer joins or anything like that (this is
mostly just my laziness / interest in one particular query, but also
something the outer-join-aware patch might help with).

2) We probably don't want to pass this kind of information through
sjinfo. That was the simplest thing for an experimental patch, but I
suspect it's not the only piece of information we may need to pass to
the lower levels of estimation code.

3) I kinda doubt we actually want to move this responsibility (to
consider fraction of unmatched rows) to the low-level estimation
routines (e.g. nulltestsel and various others). AFAICS this just
"introduces NULLs" into the relation, so maybe we could "adjust" the
attribute statistics (in examine_variable?) by inflating null_frac and
modifying the other frequencies in MCV/histogram.

4) But I'm not sure we actually want to do that in these low-level
selectivity functions. The outer join essentially produces output with
two subsets - one with matches on the outer side, one without them. But
the side without matches has NULLs in all columns. In a way, we know
exactly how are these columns correlated - if we do the usual estimation
(even with the null_frac adjusted), we just throw this information away.
And when there's a lot of rows without a match, that seems bad.

--
Regards,
Alena Rybakina
Postgres Professional

From 54a24390f1137a77c9755a875774a75ae8cf2424 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@enterprisedb.com>
Date: Mon, 26 Jun 2023 15:39:11 +0300
Subject: [PATCH] Fixed the case of calculating underestimated cardinality for
 an LEFT JOIN with the restriction "IS NULL" in the clause. This error is
 caused by an incorrect calculation of selectivity in the "IS NULL" clause,
 since it took into account only table-level statistics without zero values in
 the results of the join operation. This patch fixes this by calculating the
 fraction of zero values on the right side of the join without of matching row
 on left.

Co-authored-by: Alena Rybakina <a.rybak...@postgrespro.ru>
---
 src/backend/utils/adt/selfuncs.c | 39 ++++++++++++++++++++++++++++----
 src/include/nodes/pathnodes.h    |  3 +++
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..a0e3834453b 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -153,7 +153,7 @@ static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
 							  bool isdefault1, bool isdefault2,
 							  AttStatsSlot *sslot1, AttStatsSlot *sslot2,
 							  Form_pg_statistic stats1, Form_pg_statistic stats2,
-							  bool have_mcvs1, bool have_mcvs2);
+							  bool have_mcvs1, bool have_mcvs2, double *unmatched_frac);
 static double eqjoinsel_semi(Oid opfuncoid, Oid collation,
 							 VariableStatData *vardata1, VariableStatData *vardata2,
 							 double nd1, double nd2,
@@ -1710,6 +1710,9 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
 		stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
 		freq_null = stats->stanullfrac;
 
+		if (sjinfo)
+			freq_null = freq_null + sjinfo->unmatched_frac - freq_null * sjinfo->unmatched_frac;
+
 		switch (nulltesttype)
 		{
 			case IS_NULL:
@@ -2313,13 +2316,24 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	}
 
 	/* We need to compute the inner-join selectivity in all cases */
+	/*
+	 * calculate fraction of right without of matching row on left
+	 *
+	 * FIXME Should be restricted to JOIN_LEFT, we should have similar logic
+	 * for JOIN_FULL.
+	 *
+	 * XXX Probably should calculate unmatched as fraction of the join result,
+	 * not of the relation on the right (because the matched part can have more
+	 * matches per row and thus grow). Not sure. Depends on how it's used later.
+	 */
 	selec_inner = eqjoinsel_inner(opfuncoid, collation,
 								  &vardata1, &vardata2,
 								  nd1, nd2,
 								  isdefault1, isdefault2,
 								  &sslot1, &sslot2,
 								  stats1, stats2,
-								  have_mcvs1, have_mcvs2);
+								  have_mcvs1, have_mcvs2,
+								  &sjinfo->unmatched_frac);
 
 	switch (sjinfo->jointype)
 	{
@@ -2407,7 +2421,7 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
 				bool isdefault1, bool isdefault2,
 				AttStatsSlot *sslot1, AttStatsSlot *sslot2,
 				Form_pg_statistic stats1, Form_pg_statistic stats2,
-				bool have_mcvs1, bool have_mcvs2)
+				bool have_mcvs1, bool have_mcvs2, double *unmatched_frac)
 {
 	double		selec;
 
@@ -2503,7 +2517,10 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
 		}
 		CLAMP_PROBABILITY(matchfreq1);
 		CLAMP_PROBABILITY(unmatchfreq1);
+
+		*unmatched_frac = unmatchfreq1;
 		matchfreq2 = unmatchfreq2 = 0.0;
+
 		for (i = 0; i < sslot2->nvalues; i++)
 		{
 			if (hasmatch2[i])
@@ -2581,10 +2598,22 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
 		double		nullfrac2 = stats2 ? stats2->stanullfrac : 0.0;
 
 		selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
-		if (nd1 > nd2)
-			selec /= nd1;
+
+		/*
+		 * XXX Should this look at nullfrac on either side? Probably depends on
+		 * if we're calculating fraction of NULLs or fraction of unmatched rows.
+		 */
+		// unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
+		if (nd1 != nd2)
+		{
+			selec /= Max(nd1, nd2);
+			*unmatched_frac = abs(nd1 - nd2) * 1.0 / Max(nd1, nd2);
+		}
 		else
+		{
 			selec /= nd2;
+			*unmatched_frac = 0.0;
+		}
 	}
 
 	return selec;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c17b53f7adb..6bc63e648e6 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2853,6 +2853,9 @@ struct SpecialJoinInfo
 	bool		semi_can_hash;	/* true if semi_operators are all hash */
 	List	   *semi_operators; /* OIDs of equality join operators */
 	List	   *semi_rhs_exprs; /* righthand-side expressions of these ops */
+
+	/* For outer join, fraction of rows without a match. */
+	Selectivity	unmatched_frac;
 };
 
 /*
-- 
2.34.1

Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

Reply via email to