Well, I don't have a detailed plan either. In principle it shouldn't be
that hard, I think - examine_variable is loading the statistics, so it
could apply the same null_frac correction, just like nulltestsel would
do a bit later.

The main question is how to pass the information to examine_variable. It
doesn't get the SpecialJoinInfo (which is what nulltestsel used), so
we'd need to invent something new ... add a new argument?

Sorry I didn't answer right away. I could adapt the last version of the patch [2] to the current idea, but so far I have implemented it only for the case where we save the fraction of unmatched (NULL-extended) rows in a SpecialJoinInfo variable.
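Concretely, the attached patch adds the extra argument to examine_variable, plus a SpecialJoinInfo field to hold that fraction:

extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,
                             SpecialJoinInfo *sjinfo, VariableStatData *vardata);

/* For outer join, fraction of rows without a match. */
Selectivity unmatched_frac;

eqjoinsel computes the fraction and stores it there; nulltestsel and booltestsel then use it instead of the plain stanullfrac.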

I'm starting to look at other callers, such as scalararraysel_containment and boolvarsel, and trying to find bad cases for the current problem that could be fixed in a similar way in those functions. But I'm not sure about the remaining callers (mergejoinscansel, estimate_num_groups, estimate_hash_bucket_stats, get_restriction_variable, get_join_variables).

The examine_variable function is also called in them, but since no outer join is involved there (and hence no SpecialJoinInfo variable is available), I can't think of similar cases where this problem would arise for the same reasons.
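In those callers the patch simply passes NULL for the new argument, e.g. in mergejoinscansel:

examine_variable(root, left, 0, NULL, &leftvar);
examine_variable(root, right, 0, NULL, &rightvar);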


The code passes all regression tests and I found no deterioration in cardinality prediction for queries from [1], except for one:

EXPLAIN ANALYZE
  SELECT * FROM large FULL JOIN small ON (large.id = small.id)
WHERE (large.a IS NULL);

MASTER:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Merge Full Join  (cost=127921.69..299941.59 rows=56503 width=16) (actual time=795.092..795.094 rows=0 loops=1)
   Merge Cond: (small.id = large.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 1000000
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual time=0.038..0.046 rows=100 loops=1)
         Sort Key: small.id
         Sort Method: quicksort  Memory: 29kB
         ->  Seq Scan on small  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.013..0.022 rows=100 loops=1)
   ->  Materialize  (cost=127763.19..132763.44 rows=1000050 width=8) (actual time=363.016..649.103 rows=1000000 loops=1)
         ->  Sort  (cost=127763.19..130263.31 rows=1000050 width=8) (actual time=363.012..481.480 rows=1000000 loops=1)
               Sort Key: large.id
               Sort Method: external merge  Disk: 17664kB
               ->  Seq Scan on large  (cost=0.00..14425.50 rows=1000050 width=8) (actual time=0.009..111.166 rows=1000000 loops=1)
 Planning Time: 0.124 ms
 Execution Time: 797.139 ms
(15 rows)

With patch:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Full Join  (cost=3.25..18179.25 rows=999900 width=16) (actual time=261.480..261.482 rows=0 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 1000000
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.006..92.827 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.032..0.034 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) (actual time=0.008..0.015 rows=100 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 261.529 ms
(10 rows)

[1] https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg146044.html

[2] https://www.postgresql.org/message-id/148ff8f1-067b-1409-c754-af6117de9b7d%40yandex.ru


Unfortunately, I found that my previous change leads to a large selectivity estimation error in one case. I will investigate it in more detail and try to find a solution (the relevant code and query are below).

// unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
if (nd1 != nd2)
{
        selec /= Max(nd1, nd2);
        /* note: fabs(), not abs() -- nd1 and nd2 are doubles */
        *unmatched_frac = fabs(nd1 - nd2) / Max(nd1, nd2);
}
else
{
        selec /= nd2;
        *unmatched_frac = 0.0;
}
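To illustrate the no-MCV branch (a rough sketch, assuming ndistinct on each side is estimated as the raw row counts from the full-join example above, nd1 = 1000050 and nd2 = 100):

unmatched_frac = fabs(nd1 - nd2) / Max(nd1, nd2)
               = (1000050 - 100) / 1000050
               ≈ 0.9999

so nearly the whole larger side is treated as unmatched. For a self-join like the query below, nd1 == nd2 and this branch sets unmatched_frac to 0.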


postgres=# explain analyze select * from large l1 inner join large l2 on l1.a is null where l1.a <100;

MASTER:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1000.00..35058.43 rows=1000000 width=16) (actual time=91.846..93.622 rows=0 loops=1)
   ->  Gather  (cost=1000.00..10633.43 rows=1 width=8) (actual time=91.844..93.619 rows=0 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on large l1  (cost=0.00..9633.33 rows=1 width=8) (actual time=86.153..86.154 rows=0 loops=3)
               Filter: ((a IS NULL) AND (a < 100))
               Rows Removed by Filter: 333333
   ->  Seq Scan on large l2  (cost=0.00..14425.00 rows=1000000 width=8) (never executed)
 Planning Time: 0.299 ms
 Execution Time: 93.689 ms
(10 rows)


With patch:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1000.00..20863771.84 rows=1667083350 width=16) (actual time=290.255..290.323 rows=0 loops=1)
   ->  Seq Scan on large l2  (cost=0.00..14425.50 rows=1000050 width=8) (actual time=0.104..94.037 rows=1000000 loops=1)
   ->  Materialize  (cost=1000.00..10808.63 rows=1667 width=8) (actual time=0.000..0.000 rows=0 loops=1000000)
         ->  Gather  (cost=1000.00..10800.29 rows=1667 width=8) (actual time=79.472..79.539 rows=0 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Seq Scan on large l1  (cost=0.00..9633.59 rows=695 width=8) (actual time=75.337..75.338 rows=0 loops=3)
                     Filter: ((a IS NULL) AND (a < 100))
                     Rows Removed by Filter: 333333
 Planning Time: 0.721 ms
 Execution Time: 290.425 ms
(11 rows)

As I remember, it could fix this one:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.09..1944.13 rows=99998 width=14) (actual time=0.152..84.184 rows=99999 loops=1)
   Hash Cond: (l.id = r.id)
   Filter: (r.v IS NULL)
   Rows Removed by Filter: 2
   ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.040..27.635 rows=100000 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on r  (cost=0.00..1.04 rows=4 width=10) (actual time=0.009..0.011 rows=4 loops=1)
 Planning Time: 0.954 ms
 Execution Time: 92.309 ms
(10 rows)

Do you think it's worth trying to apply the fourth method now?
As far as I remember, here we will face the problem of estimating multiple conditions, and the fourth approach gives us a chance to avoid it.

I couldn't come up with such a case. I found only one that I also find interesting, but so far I haven't been able to figure out well enough what influenced the cardinality estimate and, accordingly, the choice of the better plan.

CREATE TABLE large (id INT, a INT);
  INSERT INTO large SELECT i, 1 FROM generate_series(1,50000) s(i);
CREATE TABLE small (id INT, b INT);
  INSERT INTO small SELECT i, 1 FROM generate_series(1,25000) s(i);
INSERT INTO large SELECT i+50000, i+2 FROM generate_series(1,50000) s(i);
INSERT INTO small SELECT i+25000, i+2 FROM generate_series(1,25000) s(i);

explain analyze SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
          WHERE ((large.a=1) OR (small.b=1));
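
(Working through the data by hand: the left join keeps all 100000 rows of large; large.a = 1 holds for ids 1..50000, and small.b = 1 holds for the matched ids 1..25000, a subset of those, so exactly 50000 rows should survive the filter.)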

With patch:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1347.00..3911.91 rows=99775 width=16) (actual time=36.864..82.634 rows=50000 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: ((large.a = 1) OR (small.b = 1))
   Rows Removed by Filter: 50000
   ->  Seq Scan on large  (cost=0.00..1440.75 rows=99775 width=8) (actual time=0.034..12.536 rows=100000 loops=1)
   ->  Hash  (cost=722.00..722.00 rows=50000 width=8) (actual time=36.752..36.754 rows=50000 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2466kB
         ->  Seq Scan on small  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.028..13.337 rows=50000 loops=1)
 Planning Time: 2.363 ms
 Execution Time: 84.790 ms
(10 rows)

original:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=14400.45..516963.33 rows=250528 width=16) (actual time=85.498..126.799 rows=50000 loops=1)
   Merge Cond: (small.id = large.id)
   Filter: ((large.a = 1) OR (small.b = 1))
   Rows Removed by Filter: 50000
   ->  Sort  (cost=4640.80..4766.23 rows=50172 width=8) (actual time=41.538..44.204 rows=50000 loops=1)
         Sort Key: small.id
         Sort Method: quicksort  Memory: 3710kB
         ->  Seq Scan on small  (cost=0.00..723.72 rows=50172 width=8) (actual time=0.068..15.182 rows=50000 loops=1)
   ->  Sort  (cost=9759.65..10009.95 rows=100118 width=8) (actual time=43.939..53.697 rows=100000 loops=1)
         Sort Key: large.id
         Sort Method: external sort  Disk: 2160kB
         ->  Seq Scan on large  (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.046..10.378 rows=100000 loops=1)
 Planning Time: 0.406 ms
 Execution Time: 130.109 ms
(14 rows)

If you disable merge join with the original code (e.g. SET enable_mergejoin = off), the plans coincide and the error magnitude is about the same; only in one case the cardinality is overestimated, and in the other it is underestimated.

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1347.00..3915.00 rows=75082 width=16) (actual time=43.625..68.901 rows=50000 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: ((large.a = 1) OR (small.b = 1))
   Rows Removed by Filter: 50000
   ->  Seq Scan on large  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.008..9.895 rows=100000 loops=1)
   ->  Hash  (cost=722.00..722.00 rows=50000 width=8) (actual time=22.546..22.548 rows=50000 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2466kB
         ->  Seq Scan on small  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.006..7.218 rows=50000 loops=1)
 Planning Time: 0.239 ms
 Execution Time: 70.860 ms
(10 rows)

--
Regards,
Alena Rybakina
Postgres Professional
From 47daf6945403425f3a09a2df450a2cd2e96ae23c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@enterprisedb.com>
Date: Mon, 10 Jul 2023 17:29:41 +0300
Subject: [PATCH] Fix underestimated cardinality for LEFT/RIGHT/FULL OUTER
 JOIN, where NULL-extended rows produced by the join were not taken into
 account. This patch fixes that by calculating the fraction of rows on one
 side of the join without a matching row on the other side.

Co-authored-by: Alena Rybakina <a.rybak...@postgrespro.ru>
---
 src/backend/utils/adt/array_selfuncs.c |   2 +-
 src/backend/utils/adt/selfuncs.c       | 314 ++++++++++++++-----------
 src/include/nodes/pathnodes.h          |   3 +
 src/include/utils/selfuncs.h           |   2 +-
 4 files changed, 178 insertions(+), 143 deletions(-)

diff --git a/src/backend/utils/adt/array_selfuncs.c b/src/backend/utils/adt/array_selfuncs.c
index 9207a5ed193..56ce460cf05 100644
--- a/src/backend/utils/adt/array_selfuncs.c
+++ b/src/backend/utils/adt/array_selfuncs.c
@@ -93,7 +93,7 @@ scalararraysel_containment(PlannerInfo *root,
 	/*
 	 * rightop must be a variable, else punt.
 	 */
-	examine_variable(root, rightop, varRelid, &vardata);
+	examine_variable(root, rightop, varRelid, NULL, &vardata);
 	if (!vardata.rel)
 	{
 		ReleaseVariableStats(vardata);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..072579dbfd5 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -153,7 +153,7 @@ static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
 							  bool isdefault1, bool isdefault2,
 							  AttStatsSlot *sslot1, AttStatsSlot *sslot2,
 							  Form_pg_statistic stats1, Form_pg_statistic stats2,
-							  bool have_mcvs1, bool have_mcvs2);
+							  bool have_mcvs1, bool have_mcvs2, double *unmatched_frac);
 static double eqjoinsel_semi(Oid opfuncoid, Oid collation,
 							 VariableStatData *vardata1, VariableStatData *vardata2,
 							 double nd1, double nd2,
@@ -1513,7 +1513,7 @@ boolvarsel(PlannerInfo *root, Node *arg, int varRelid)
 	VariableStatData vardata;
 	double		selec;
 
-	examine_variable(root, arg, varRelid, &vardata);
+	examine_variable(root, arg, varRelid, NULL, &vardata);
 	if (HeapTupleIsValid(vardata.statsTuple))
 	{
 		/*
@@ -1541,110 +1541,20 @@ booltestsel(PlannerInfo *root, BoolTestType booltesttype, Node *arg,
 {
 	VariableStatData vardata;
 	double		selec;
+	double		freq_null;
+	AttStatsSlot sslot;
 
-	examine_variable(root, arg, varRelid, &vardata);
+	examine_variable(root, arg, varRelid, sjinfo, &vardata);
 
-	if (HeapTupleIsValid(vardata.statsTuple))
+	if (sjinfo)
+	{
+		freq_null = sjinfo->unmatched_frac;
+	}
+	else if (HeapTupleIsValid(vardata.statsTuple))
 	{
 		Form_pg_statistic stats;
-		double		freq_null;
-		AttStatsSlot sslot;
-
 		stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
 		freq_null = stats->stanullfrac;
-
-		if (get_attstatsslot(&sslot, vardata.statsTuple,
-							 STATISTIC_KIND_MCV, InvalidOid,
-							 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS)
-			&& sslot.nnumbers > 0)
-		{
-			double		freq_true;
-			double		freq_false;
-
-			/*
-			 * Get first MCV frequency and derive frequency for true.
-			 */
-			if (DatumGetBool(sslot.values[0]))
-				freq_true = sslot.numbers[0];
-			else
-				freq_true = 1.0 - sslot.numbers[0] - freq_null;
-
-			/*
-			 * Next derive frequency for false. Then use these as appropriate
-			 * to derive frequency for each case.
-			 */
-			freq_false = 1.0 - freq_true - freq_null;
-
-			switch (booltesttype)
-			{
-				case IS_UNKNOWN:
-					/* select only NULL values */
-					selec = freq_null;
-					break;
-				case IS_NOT_UNKNOWN:
-					/* select non-NULL values */
-					selec = 1.0 - freq_null;
-					break;
-				case IS_TRUE:
-					/* select only TRUE values */
-					selec = freq_true;
-					break;
-				case IS_NOT_TRUE:
-					/* select non-TRUE values */
-					selec = 1.0 - freq_true;
-					break;
-				case IS_FALSE:
-					/* select only FALSE values */
-					selec = freq_false;
-					break;
-				case IS_NOT_FALSE:
-					/* select non-FALSE values */
-					selec = 1.0 - freq_false;
-					break;
-				default:
-					elog(ERROR, "unrecognized booltesttype: %d",
-						 (int) booltesttype);
-					selec = 0.0;	/* Keep compiler quiet */
-					break;
-			}
-
-			free_attstatsslot(&sslot);
-		}
-		else
-		{
-			/*
-			 * No most-common-value info available. Still have null fraction
-			 * information, so use it for IS [NOT] UNKNOWN. Otherwise adjust
-			 * for null fraction and assume a 50-50 split of TRUE and FALSE.
-			 */
-			switch (booltesttype)
-			{
-				case IS_UNKNOWN:
-					/* select only NULL values */
-					selec = freq_null;
-					break;
-				case IS_NOT_UNKNOWN:
-					/* select non-NULL values */
-					selec = 1.0 - freq_null;
-					break;
-				case IS_TRUE:
-				case IS_FALSE:
-					/* Assume we select half of the non-NULL values */
-					selec = (1.0 - freq_null) / 2.0;
-					break;
-				case IS_NOT_TRUE:
-				case IS_NOT_FALSE:
-					/* Assume we select NULLs plus half of the non-NULLs */
-					/* equiv. to freq_null + (1.0 - freq_null) / 2.0 */
-					selec = (freq_null + 1.0) / 2.0;
-					break;
-				default:
-					elog(ERROR, "unrecognized booltesttype: %d",
-						 (int) booltesttype);
-					selec = 0.0;	/* Keep compiler quiet */
-					break;
-			}
-		}
 	}
 	else
 	{
@@ -1680,8 +1590,103 @@ booltestsel(PlannerInfo *root, BoolTestType booltesttype, Node *arg,
 				selec = 0.0;	/* Keep compiler quiet */
 				break;
 		}
+		goto end;
+	}
+
+	if (vardata.statsTuple && get_attstatsslot(&sslot, vardata.statsTuple,
+							STATISTIC_KIND_MCV, InvalidOid,
+							ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS)
+		&& sslot.nnumbers > 0)
+	{
+		double		freq_true;
+		double		freq_false;
+
+		/*
+		 * Get first MCV frequency and derive frequency for true.
+		 */
+		if (DatumGetBool(sslot.values[0]))
+			freq_true = sslot.numbers[0];
+		else
+			freq_true = 1.0 - sslot.numbers[0] - freq_null;
+
+		/*
+		 * Next derive frequency for false. Then use these as appropriate
+		 * to derive frequency for each case.
+		 */
+		freq_false = 1.0 - freq_true - freq_null;
+
+		switch (booltesttype)
+		{
+			case IS_UNKNOWN:
+				/* select only NULL values */
+				selec = freq_null;
+				break;
+			case IS_NOT_UNKNOWN:
+				/* select non-NULL values */
+				selec = 1.0 - freq_null;
+				break;
+			case IS_TRUE:
+				/* select only TRUE values */
+				selec = freq_true;
+				break;
+			case IS_NOT_TRUE:
+				/* select non-TRUE values */
+				selec = 1.0 - freq_true;
+				break;
+			case IS_FALSE:
+				/* select only FALSE values */
+				selec = freq_false;
+				break;
+			case IS_NOT_FALSE:
+				/* select non-FALSE values */
+				selec = 1.0 - freq_false;
+				break;
+			default:
+				elog(ERROR, "unrecognized booltesttype: %d",
+						(int) booltesttype);
+				selec = 0.0;	/* Keep compiler quiet */
+				break;
+		}
+
+		free_attstatsslot(&sslot);
+	}
+	else
+	{
+		/*
+		 * No most-common-value info available. Still have null fraction
+		 * information, so use it for IS [NOT] UNKNOWN. Otherwise adjust
+		 * for null fraction and assume a 50-50 split of TRUE and FALSE.
+		 */
+		switch (booltesttype)
+		{
+			case IS_UNKNOWN:
+				/* select only NULL values */
+				selec = freq_null;
+				break;
+			case IS_NOT_UNKNOWN:
+				/* select non-NULL values */
+				selec = 1.0 - freq_null;
+				break;
+			case IS_TRUE:
+			case IS_FALSE:
+				/* Assume we select half of the non-NULL values */
+				selec = (1.0 - freq_null) / 2.0;
+				break;
+			case IS_NOT_TRUE:
+			case IS_NOT_FALSE:
+				/* Assume we select NULLs plus half of the non-NULLs */
+				/* equiv. to freq_null + (1.0 - freq_null) / 2.0 */
+				selec = (freq_null + 1.0) / 2.0;
+				break;
+			default:
+				elog(ERROR, "unrecognized booltesttype: %d",
+						(int) booltesttype);
+				selec = 0.0;	/* Keep compiler quiet */
+				break;
+		}
 	}
 
+	end:
 	ReleaseVariableStats(vardata);
 
 	/* result should be in range, but make sure... */
@@ -1699,39 +1704,19 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
 {
 	VariableStatData vardata;
 	double		selec;
+	double		freq_null;
 
-	examine_variable(root, arg, varRelid, &vardata);
+	examine_variable(root, arg, varRelid, sjinfo, &vardata);
 
-	if (HeapTupleIsValid(vardata.statsTuple))
+	if (sjinfo)
+	{
+		freq_null = sjinfo->unmatched_frac;
+	}
+	else if (HeapTupleIsValid(vardata.statsTuple))
 	{
 		Form_pg_statistic stats;
-		double		freq_null;
-
 		stats = (Form_pg_statistic) GETSTRUCT(vardata.statsTuple);
 		freq_null = stats->stanullfrac;
-
-		switch (nulltesttype)
-		{
-			case IS_NULL:
-
-				/*
-				 * Use freq_null directly.
-				 */
-				selec = freq_null;
-				break;
-			case IS_NOT_NULL:
-
-				/*
-				 * Select not unknown (not null) values. Calculate from
-				 * freq_null.
-				 */
-				selec = 1.0 - freq_null;
-				break;
-			default:
-				elog(ERROR, "unrecognized nulltesttype: %d",
-					 (int) nulltesttype);
-				return (Selectivity) 0; /* keep compiler quiet */
-		}
 	}
 	else if (vardata.var && IsA(vardata.var, Var) &&
 			 ((Var *) vardata.var)->varattno < 0)
@@ -1741,6 +1726,7 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
 		 * NULL.
 		 */
 		selec = (nulltesttype == IS_NULL) ? 0.0 : 1.0;
+		goto end;
 	}
 	else
 	{
@@ -1760,8 +1746,33 @@ nulltestsel(PlannerInfo *root, NullTestType nulltesttype, Node *arg,
 					 (int) nulltesttype);
 				return (Selectivity) 0; /* keep compiler quiet */
 		}
+		goto end;
+	}
+
+	switch (nulltesttype)
+	{
+		case IS_NULL:
+
+			/*
+			 * Use freq_null directly.
+			 */
+			selec = freq_null;
+			break;
+		case IS_NOT_NULL:
+
+			/*
+			 * Select not unknown (not null) values. Calculate from
+			 * freq_null.
+			 */
+			selec = 1.0 - freq_null;
+			break;
+		default:
+			elog(ERROR, "unrecognized nulltesttype: %d",
+					(int) nulltesttype);
+			return (Selectivity) 0; /* keep compiler quiet */
 	}
 
+	end:
 	ReleaseVariableStats(vardata);
 
 	/* result should be in range, but make sure... */
@@ -2319,7 +2330,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 								  isdefault1, isdefault2,
 								  &sslot1, &sslot2,
 								  stats1, stats2,
-								  have_mcvs1, have_mcvs2);
+								  have_mcvs1, have_mcvs2, &sjinfo->unmatched_frac);
 
 	switch (sjinfo->jointype)
 	{
@@ -2407,7 +2418,8 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
 				bool isdefault1, bool isdefault2,
 				AttStatsSlot *sslot1, AttStatsSlot *sslot2,
 				Form_pg_statistic stats1, Form_pg_statistic stats2,
-				bool have_mcvs1, bool have_mcvs2)
+				bool have_mcvs1, bool have_mcvs2,
+				double *unmatched_frac)
 {
 	double		selec;
 
@@ -2503,6 +2515,9 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
 		}
 		CLAMP_PROBABILITY(matchfreq1);
 		CLAMP_PROBABILITY(unmatchfreq1);
+
+		*unmatched_frac = unmatchfreq1;
+
 		matchfreq2 = unmatchfreq2 = 0.0;
 		for (i = 0; i < sslot2->nvalues; i++)
 		{
@@ -2581,10 +2596,21 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
 		double		nullfrac2 = stats2 ? stats2->stanullfrac : 0.0;
 
 		selec = (1.0 - nullfrac1) * (1.0 - nullfrac2);
-		if (nd1 > nd2)
-			selec /= nd1;
+		/*
+		 * XXX Should this look at nullfrac on either side? Probably depends on
+		 * if we're calculating fraction of NULLs or fraction of unmatched rows.
+		 */
+		// unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
+		if (nd1 != nd2)
+		{
+			selec /= Max(nd1, nd2);
+		*unmatched_frac = fabs(nd1 - nd2) / Max(nd1, nd2);
+		}
 		else
+		{
 			selec /= nd2;
+			*unmatched_frac = 0.0;
+		}
 	}
 
 	return selec;
@@ -2964,8 +2990,8 @@ mergejoinscansel(PlannerInfo *root, Node *clause,
 		return;					/* shouldn't happen */
 
 	/* Look for stats for the inputs */
-	examine_variable(root, left, 0, &leftvar);
-	examine_variable(root, right, 0, &rightvar);
+	examine_variable(root, left, 0, NULL, &leftvar);
+	examine_variable(root, right, 0, NULL, &rightvar);
 
 	/* Extract the operator's declared left/right datatypes */
 	get_op_opfamily_properties(opno, opfamily, false,
@@ -3469,7 +3495,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		 * independent, than to combine estimates for the extracted variables
 		 * when we don't know how that relates to the expressions.
 		 */
-		examine_variable(root, groupexpr, 0, &vardata);
+		examine_variable(root, groupexpr, 0, NULL, &vardata);
 		if (HeapTupleIsValid(vardata.statsTuple) || vardata.isunique)
 		{
 			varinfos = add_unique_group_var(root, varinfos,
@@ -3510,7 +3536,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		{
 			Node	   *var = (Node *) lfirst(l2);
 
-			examine_variable(root, var, 0, &vardata);
+			examine_variable(root, var, 0, NULL, &vardata);
 			varinfos = add_unique_group_var(root, varinfos, var, &vardata);
 			ReleaseVariableStats(vardata);
 		}
@@ -3777,7 +3803,7 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
 	bool		isdefault;
 	AttStatsSlot sslot;
 
-	examine_variable(root, hashkey, 0, &vardata);
+	examine_variable(root, hashkey, 0, NULL, &vardata);
 
 	/* Look up the frequency of the most common value, if available */
 	*mcv_freq = 0.0;
@@ -4865,8 +4891,8 @@ get_restriction_variable(PlannerInfo *root, List *args, int varRelid,
 	 * Examine both sides.  Note that when varRelid is nonzero, Vars of other
 	 * relations will be treated as pseudoconstants.
 	 */
-	examine_variable(root, left, varRelid, vardata);
-	examine_variable(root, right, varRelid, &rdata);
+	examine_variable(root, left, varRelid, NULL, vardata);
+	examine_variable(root, right, varRelid, NULL, &rdata);
 
 	/*
 	 * If one side is a variable and the other not, we win.
@@ -4919,8 +4945,8 @@ get_join_variables(PlannerInfo *root, List *args, SpecialJoinInfo *sjinfo,
 	left = (Node *) linitial(args);
 	right = (Node *) lsecond(args);
 
-	examine_variable(root, left, 0, vardata1);
-	examine_variable(root, right, 0, vardata2);
+	examine_variable(root, left, 0, NULL, vardata1);
+	examine_variable(root, right, 0, NULL, vardata2);
 
 	if (vardata1->rel &&
 		bms_is_subset(vardata1->rel->relids, sjinfo->syn_righthand))
@@ -4975,12 +5001,13 @@ ReleaseDummy(HeapTuple tuple)
  * Caller is responsible for doing ReleaseVariableStats() before exiting.
  */
 void
-examine_variable(PlannerInfo *root, Node *node, int varRelid,
+examine_variable(PlannerInfo *root, Node *node, int varRelid, SpecialJoinInfo *sjinfo,
 				 VariableStatData *vardata)
 {
 	Node	   *basenode;
 	Relids		varnos;
 	RelOptInfo *onerel;
+	Form_pg_statistic stats;
 
 	/* Make sure we don't return dangling pointers in vardata */
 	MemSet(vardata, 0, sizeof(VariableStatData));
@@ -5240,6 +5267,11 @@ examine_variable(PlannerInfo *root, Node *node, int varRelid,
 				break;
 		}
 
+		if (HeapTupleIsValid(vardata->statsTuple) && sjinfo)
+		{
+			stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
+			sjinfo->unmatched_frac = stats->stanullfrac + sjinfo->unmatched_frac - stats->stanullfrac * sjinfo->unmatched_frac;
+		}
 		/*
 		 * Search extended statistics for one with a matching expression.
 		 * There might be multiple ones, so just grab the first one. In the
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c17b53f7adb..6bc63e648e6 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2853,6 +2853,9 @@ struct SpecialJoinInfo
 	bool		semi_can_hash;	/* true if semi_operators are all hash */
 	List	   *semi_operators; /* OIDs of equality join operators */
 	List	   *semi_rhs_exprs; /* righthand-side expressions of these ops */
+
+	/* For outer join, fraction of rows without a match. */
+	Selectivity	unmatched_frac;
 };
 
 /*
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 2f76c473db4..d9418f7541c 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -148,7 +148,7 @@ extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 /* Functions in selfuncs.c */
 
 extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,
-							 VariableStatData *vardata);
+							 SpecialJoinInfo *sjinfo, VariableStatData *vardata);
 extern bool statistic_proc_security_check(VariableStatData *vardata, Oid func_oid);
 extern bool get_restriction_variable(PlannerInfo *root, List *args,
 									 int varRelid,
-- 
2.34.1
