Re: Hash-based MCV matching for large IN-lists

Ilia Evdokimov Tue, 27 Jan 2026 07:43:29 -0800

Hi,

On 19.01.2026 17:01, David Geier wrote:

Does that mean that we get a different estimation result, depending on
if the IN list is smaller or not? I think we should avoid that because
estimation quality might flip for the user unexpectedly.


I think you're right.

To address this, I changed the hash-table entry to track an additional'count' filed, representing how many times a particular value appears onthe hashed side. When inserting into the hash table, if the value isalready present, I increment 'count', otherwise, I create a new entrywith count = 1

The code in master currently calls an operator-specific selectivity
estimation function. For equality this is typically eqsel() but the
function can be specified during CREATE OPERATOR.

Can be safely special-case the behavior of eqsel() for all possible
operators for the ScalarArrayOpExpr case?


Unfortunately there is no safe way to make this optimization generic for
arbitrary restrict functions, because a custom RESTRICT function does
not have to use MCVs at all. IMO, in practice the vast majority of
ScalarArrayOpExpr uses with = or <> rely on the built-in equality
operators whose selectivity is computed by eqsel()/neqsel(), so I
limited this optimization to those cases.

How did you do that? I cannot find the code that checks for that.

In scalararraysel(), before attempting the hash-based path, we determinewhether the operator behaves like equality or inequality based on itsselectivity function:


if (oprsel == F_EQSEL || oprsel == F_EQJOINSEL)
    isEquality = true;
else if (oprsel == F_NEQSEL || oprsel == F_NEQJOINSEL)
    isInequality = true;

Then the hash-based MCV matching is only attempted under:

if ((isEquality || isInequality) && !is_join_clause)

So effectively this restricts the optimization to operators whoseselectivity is computed by eqsel()/neqsel() on restriction clauses. Joinclauses (which would use eqjoinsel/neqjoinsel) are excluded via!is_join_clause

For the MCVs, can't we reuse some code from the eqjoinsel() optimization
we did? The entry and context structs look similar enough to only need one.

I considered reusing pieces from the eqjoinsel() , but in practice itturned out to be difficult to share code cleanly. Also, when looking atthis file more broadly, we already have multiple places that reimplementsimilar pattern.

Making the code more compact would ease reviewing a lot.

Agreed — I also think making the code more compact would significantlyease reviewing. I’ve found a way to unify the Const-array and ArrayExprcases: in the ArrayExpr path, we can first construct the same arrays asin the Const-array case (elem_values, elem_nulls), and additionallybuild a boolean array elem_const[] indicating whether each element is aConst. Then the hash-based MCV matching function can:

- Ignore NULL and non-Const elements when building and probing the hashtable.

- Count how many non-Const elements are present.

- After MCV and non-MCV constant handling, account for non-Constelements separately using var_eq_non_const() and fold theirprobabilities into the same ANY/ALL accumulation logic.


I've attached v3 patch with it.

To validate the same estimation results, I temporarily kept bothimplementations (hash-based and nested-loop) and compared theirresulting selectivity values. Whenever they differed, I logged it. I ranregression tests and some local workload testing with this checkenabled, and did not observe any mismatches. I attached patch with thislogging.


--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com/

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index f6d99773b70..f4c7fb605ce 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -1966,7 +1966,7 @@ scalararraysel(PlannerInfo *root,
 	TypeCacheEntry *typentry;
 	RegProcedure oprsel;
 	FmgrInfo	oprselproc;
-	Selectivity s1;
+	Selectivity s1, hash_selec = -1.0;
 	Selectivity s1disjoint;
 
 	/* First, deconstruct the expression */
@@ -2106,7 +2106,7 @@ scalararraysel(PlannerInfo *root,
 				ReleaseVariableStats(vardata);
 
 				if (s1 >= 0.0)
-					return s1;
+					hash_selec = s1;
 			}
 		}
 
@@ -2173,6 +2173,9 @@ scalararraysel(PlannerInfo *root,
 		if ((useOr ? isEquality : isInequality) &&
 			s1disjoint >= 0.0 && s1disjoint <= 1.0)
 			s1 = s1disjoint;
+
+		if (hash_selec >= 0 && fabs(s1 - hash_selec) > 0.000000000001)
+			elog(LOG, "nested_loop selectivity = %f, hash_loop selectivity = %f", s1, hash_selec);
 	}
 	else if (rightop && IsA(rightop, ArrayExpr) &&
 			 !((ArrayExpr *) rightop)->multidims)
@@ -2251,7 +2254,7 @@ scalararraysel(PlannerInfo *root,
 				ReleaseVariableStats(vardata);
 
 				if (s1 >= 0.0)
-					return s1;
+					hash_selec = s1;
 			}
 		}
 
@@ -2310,6 +2313,9 @@ scalararraysel(PlannerInfo *root,
 		if ((useOr ? isEquality : isInequality) &&
 			s1disjoint >= 0.0 && s1disjoint <= 1.0)
 			s1 = s1disjoint;
+
+		if (hash_selec >= 0 && fabs(s1 - hash_selec) > 0.000000000001)
+			elog(LOG, "nested_loop selectivity = %f, hash_loop selectivity = %f", s1, hash_selec);
 	}
 	else
 	{

From d439674b3752dd7862bcbb132db2dc91996b1a3c Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <[email protected]>
Date: Tue, 27 Jan 2026 18:24:50 +0300
Subject: [PATCH v3] Use hash-based MCV matching for ScalarArrayOpExpr
 selectivity

When estimating selectivity for ScalarArrayOpExpr (IN / ANY / ALL) with
available MCV statistics, the planner currently matches IN-list elements
against the MCV array using nested loops. For large IN-lists and/or large
MCV lists this leads to O(N*M) planning-time behavior.

This patch adds a hash-based matching strategy, similar to the one used
in join selectivity estimation. When MCV statistics are available and the
operator supports hashing, the smaller of the two inputs (MCV list or
IN-list constant elements) is chosen as the hash table build side, and
the other side is scanned once, reducing complexity to O(N+M).

The hash-based path is restricted to equality and inequality operators
that use eqsel()/neqsel(), and is applied only when suitable hash
functions and MCV statistics are available.
---
 src/backend/utils/adt/selfuncs.c | 588 ++++++++++++++++++++++++++++++-
 src/tools/pgindent/typedefs.list |   3 +
 2 files changed, 590 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 29fec655593..f6d99773b70 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -146,7 +146,7 @@
 /*
  * In production builds, switch to hash-based MCV matching when the lists are
  * large enough to amortize hash setup cost.  (This threshold is compared to
- * the sum of the lengths of the two MCV lists.  This is simplistic but seems
+ * the sum of the lengths of the two lists.  This is simplistic but seems
  * to work well enough.)  In debug builds, we use a smaller threshold so that
  * the regression tests cover both paths well.
  */
@@ -156,6 +156,12 @@
 #define EQJOINSEL_MCV_HASH_THRESHOLD 20
 #endif
 
+#ifndef USE_ASSERT_CHECKING
+#define SCALARARRAY_MCV_HASH_THRESHOLD 200
+#else
+#define SCALARARRAY_MCV_HASH_THRESHOLD 20
+#endif
+
 /* Entries in the simplehash hash table used by eqjoinsel_find_matches */
 typedef struct MCVHashEntry
 {
@@ -176,14 +182,46 @@ typedef struct MCVHashContext
 	int16		hash_typlen;	/* typlen of hashed data type */
 } MCVHashContext;
 
+/* Entries in the simplehash hash table used by scalararray_mcv_hash_match */
+typedef struct MCVInHashEntry
+{
+	Datum		value;			/* the value represented by this entry */
+	int			index;			/* its index in the relevant AttStatsSlot */
+	uint32		hash;			/* hash code for the Datum */
+	char		status;			/* status code used by simplehash.h */
+	int			count;			/* number of occurrences od current value in
+								 * the IN-list */
+} MCVInHashEntry;
+
+/* private_data for the simplehash hash table */
+typedef struct MCVInHashContext
+{
+	FunctionCallInfo equal_fcinfo;	/* the equality join operator */
+	FunctionCallInfo hash_fcinfo;	/* the hash function to use */
+	bool		insert_mode;	/* doing inserts or lookups? */
+	bool		hash_typbyval;	/* typbyval of hashed data type */
+	int16		hash_typlen;	/* typlen of hashed data type */
+} MCVInHashContext;
+
 /* forward reference */
 typedef struct MCVHashTable_hash MCVHashTable_hash;
+typedef struct MCVInHashTable_hash MCVInHashTable_hash;
 
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
+static double scalararray_mcv_hash_match(VariableStatData *vardata, Oid operator, Oid collation,
+										 Node *other_op, bool var_on_left, Datum *elem_values,
+										 bool *elem_nulls, int num_elems, bool *elem_const,
+										 Oid nominal_element_type, bool useOr, bool isEquality,
+										 bool isInequality);
+static void accum_scalararray_prob(Selectivity s1, bool useOr, bool isEquality,
+								   bool isInequality, double nullfrac,
+								   double *selec, double *s1disjoint);
+static uint32 hash_mcv_in(MCVInHashTable_hash *tab, Datum key);
+static bool mcvs_in_equal(MCVInHashTable_hash *tab, Datum key0, Datum key1);
 static double eqjoinsel_inner(FmgrInfo *eqproc, Oid collation,
 							  Oid hashLeft, Oid hashRight,
 							  VariableStatData *vardata1, VariableStatData *vardata2,
@@ -287,6 +325,19 @@ static double btcost_correlation(IndexOptInfo *index,
 #define SH_DECLARE
 #include "lib/simplehash.h"
 
+#define SH_PREFIX				MCVInHashTable
+#define SH_ELEMENT_TYPE			MCVInHashEntry
+#define SH_KEY_TYPE				Datum
+#define SH_KEY					value
+#define SH_HASH_KEY(tab,key)	hash_mcv_in(tab, key)
+#define SH_EQUAL(tab,key0,key1)	mcvs_in_equal(tab, key0, key1)
+#define SH_SCOPE				static inline
+#define SH_STORE_HASH
+#define SH_GET_HASH(tab,ent)	(ent)->hash
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 
 /*
  *		eqsel			- Selectivity of "=" for any data types.
@@ -2025,6 +2076,40 @@ scalararraysel(PlannerInfo *root,
 						  elmlen, elmbyval, elmalign,
 						  &elem_values, &elem_nulls, &num_elems);
 
+		/*
+		 * Try to calculate selectivity by hash-search O(N) instead of O(N^2)
+		 * in case of MCV matching.  We use hash-search only for eqsel() and
+		 * neqsel().
+		 */
+		if ((isEquality || isInequality) && !is_join_clause)
+		{
+			VariableStatData vardata;
+			Node	   *other_op = NULL;
+			bool		var_on_left;
+
+			/*
+			 * If expression is not variable = something or something =
+			 * variable, then punt and return a default estimate.
+			 */
+			if (get_restriction_variable(root, clause->args, varRelid,
+										 &vardata, &other_op, &var_on_left))
+			{
+				bool	   *elem_const = (bool *) palloc(sizeof(bool) * num_elems);
+
+				/* all array elements are Const nodes */
+				memset(elem_const, true, sizeof(bool) * num_elems);
+
+				s1 = scalararray_mcv_hash_match(&vardata, operator, clause->inputcollid, other_op, var_on_left,
+												elem_values, elem_nulls, num_elems, elem_const,
+												nominal_element_type, useOr, isEquality, isInequality);
+				pfree(elem_const);
+				ReleaseVariableStats(vardata);
+
+				if (s1 >= 0.0)
+					return s1;
+			}
+		}
+
 		/*
 		 * For generic operators, we assume the probability of success is
 		 * independent for each array element.  But for "= ANY" or "<> ALL",
@@ -2100,6 +2185,76 @@ scalararraysel(PlannerInfo *root,
 		get_typlenbyval(arrayexpr->element_typeid,
 						&elmlen, &elmbyval);
 
+		/*
+		 * Try to calculate selectivity by hash-search O(N) instead of O(N^2)
+		 * in case of MCV matching.  We use hash-search only for eqsel() and
+		 * neqsel().
+		 */
+		if ((isEquality || isInequality) && !is_join_clause)
+		{
+			VariableStatData vardata;
+			Node	   *other_op = NULL;
+			bool		var_on_left;
+
+			/*
+			 * If expression is not variable = something or something =
+			 * variable, then punt and return a default estimate.
+			 */
+			if (get_restriction_variable(root, clause->args, varRelid,
+										 &vardata, &other_op, &var_on_left))
+			{
+				int			num_elems;
+				Datum	   *elem_values;
+				bool	   *elem_nulls;
+				bool	   *elem_const;
+				ListCell   *lc;
+				int			i = 0;
+
+				num_elems = list_length(arrayexpr->elements);
+				elem_values = (Datum *) palloc0(sizeof(Datum) * num_elems);
+				elem_nulls = (bool *) palloc0(sizeof(bool) * num_elems);
+				elem_const = (bool *) palloc0(sizeof(bool) * num_elems);
+
+				/*
+				 * Build arrays describing ARRAY[] elements: - elem_values:
+				 * Datum value for Const elements - elem_nulls: whether
+				 * element is NULL - elem_const: whether element is a Const
+				 * node
+				 */
+				foreach(lc, arrayexpr->elements)
+				{
+					Node	   *elem_value = (Node *) lfirst(lc);
+
+					if (IsA(elem_value, Const))
+					{
+						elem_values[i] = ((Const *) elem_value)->constvalue;
+						elem_nulls[i] = ((Const *) elem_value)->constisnull;
+						elem_const[i] = true;
+					}
+					else
+					{
+						elem_nulls[i] = false;
+						elem_const[i] = false;
+					}
+
+					i++;
+				}
+
+				s1 = scalararray_mcv_hash_match(&vardata, operator, clause->inputcollid, other_op, var_on_left,
+												elem_values, elem_nulls, num_elems, elem_const,
+												nominal_element_type, useOr, isEquality, isInequality);
+
+				pfree(elem_values);
+				pfree(elem_nulls);
+				pfree(elem_const);
+
+				ReleaseVariableStats(vardata);
+
+				if (s1 >= 0.0)
+					return s1;
+			}
+		}
+
 		/*
 		 * We use the assumption of disjoint probabilities here too, although
 		 * the odds of equal array elements are rather higher if the elements
@@ -2210,6 +2365,437 @@ scalararraysel(PlannerInfo *root,
 	return s1;
 }
 
+
+/*
+ * Estimate selectivity of a ScalarArrayOpExpr (ANY/ALL) using MCV statistics
+ * with hash-based matching.
+ *
+ * This function follows the same probability model as the generic
+ * ScalarArrayOpExpr selectivity code (independent or disjoint probabilities
+ * for OR/AND combinations), but attempts to speed up matching between
+ * IN-list elements and the column's most-common-values (MCV) statistics by
+ * using hashing instead of nested loops.
+ *
+ * MCV statistics are used only to obtain per-value selectivities for
+ * constants that match MCV entries.  All probabilities are combined using
+ * the standard ANY/ALL formulas, exactly as in the generic estimator.
+ *
+ * The function may return -1.0 to indicate that hash-based MCV estimation
+ * is not applicable (for example, missing statistics, unsupported operator,
+ * or unavailable hash functions), in which case the caller should fall back
+ * to the generic ScalarArrayOpExpr selectivity estimation.
+ *
+ * Inputs:
+ *	vardata: statistics and metadata for the variable being estimated
+ *	operator: equality or inequality operator to apply
+ *	collation: OID of collation to use
+ *	other_op: expression for the non-variable side of the comparison
+ *	var_on_left: true if the variable is on the left side of the operator
+ *	elem_values: array of IN-list element values
+ *	elem_nulls: array indicating which IN-list elements are NULL
+ *	elem_const: array indicating which IN-list elements are Const nodes
+ *	num_elems: number of IN-list elements
+ *	nominal_element_type: type of IN-list elements
+ *	useOr: true if elements are combined using OR semantics, false for AND
+ *	isEquality: true if the operator behaves like equality
+ *	isInequality: true if the operator behaves like inequality
+ *
+ * Result:
+ *	Selectivity estimate in the range [0.0, 1.0], or -1.0 if no estimate
+ *	could be produced by this function.
+ *
+ * Note:
+ *	This function assumes that the operator’s selectivity behavior matches
+ *	eqsel()/neqsel semantics.  It must not be used for operators with custom
+ *	or non-standard selectivity behavior.
+ */
+static double
+scalararray_mcv_hash_match(VariableStatData *vardata, Oid operator, Oid collation,
+						   Node *other_op, bool var_on_left,
+						   Datum *elem_values, bool *elem_nulls, int num_elems, bool *elem_const,
+						   Oid nominal_element_type, bool useOr, bool isEquality,
+						   bool isInequality)
+{
+	Form_pg_statistic stats;
+	AttStatsSlot sslot;
+	FmgrInfo	eqproc;
+	double		selec = -1.0,
+				s1disjoint,
+				nullfrac = 0.0;
+	Oid			hashLeft = InvalidOid,
+				hashRight = InvalidOid,
+				opfuncoid;
+	bool		have_mcvs = false;
+
+	/*
+	 * If the variable is known to be unique, MCV statistics do not represent
+	 * a meaningful frequency distribution, so skip MCV-based estimation.
+	 */
+	if (vardata->isunique && vardata->rel && vardata->rel->tuples >= 1.0)
+		return -1.0;
+
+	/*
+	 * For inequality (<>, ALL), we compute probabilities using the negated
+	 * equality operator and later transform them as
+	 *
+	 * p(x <> c) = 1 - p(x = c) - nullfrac
+	 */
+	if (isInequality)
+	{
+		operator = get_negator(operator);
+		if (!OidIsValid(operator))
+			return -1.0;
+	}
+
+	opfuncoid = get_opcode(operator);
+	memset(&sslot, 0, sizeof(sslot));
+
+	if (HeapTupleIsValid(vardata->statsTuple))
+	{
+		if (statistic_proc_security_check(vardata, opfuncoid))
+			have_mcvs = get_attstatsslot(&sslot, vardata->statsTuple,
+										 STATISTIC_KIND_MCV, InvalidOid,
+										 ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
+	}
+
+	if (have_mcvs)
+	{
+		fmgr_info(opfuncoid, &eqproc);
+
+		/*
+		 * f the MCV list and IN-list are large enough, and the operator
+		 * supports hashing, attempt to use hash functions so that MCV–IN
+		 * matching can be done in O(N+M) instead of O(N×M).
+		 */
+		if (sslot.nvalues + num_elems >= SCALARARRAY_MCV_HASH_THRESHOLD)
+			(void) get_op_hash_functions(operator, &hashLeft, &hashRight);
+	}
+
+	if (have_mcvs && OidIsValid(hashLeft) && OidIsValid(hashRight))
+	{
+		/* Use a hash table to speed up the matching */
+		LOCAL_FCINFO(fcinfo, 2);
+		LOCAL_FCINFO(hash_fcinfo, 1);
+		MCVInHashTable_hash *hashTable;
+		FmgrInfo	hash_proc;
+		MCVInHashContext hashContext;
+		double		sumallcommon = 0.0,
+					nonmcv_selec = 0.0;
+		bool		isdefault;
+		bool		hash_mcv;
+		double		otherdistinct;
+		Datum	   *arrayHash;
+		Datum	   *arrayProbe;
+		int			nvaluesHash;
+		int			nvaluesProbe;
+		int			nvaluesnonmcv = num_elems;
+		int			nvaluesnonconst = 0;
+
+		/* Grab the nullfrac for use below. */
+		stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
+		nullfrac = stats->stanullfrac;
+
+		selec = s1disjoint = (useOr ? 0.0 : 1.0);
+
+		InitFunctionCallInfoData(*fcinfo, &eqproc, 2, collation,
+								 NULL, NULL);
+		fcinfo->args[0].isnull = false;
+		fcinfo->args[1].isnull = false;
+
+		for (int i = 0; i < sslot.nvalues; i++)
+			sumallcommon += sslot.numbers[i];
+
+		/*
+		 * Compute the total probability mass of all non-MCV values. This is
+		 * the part of the column distribution not covered by MCVs.
+		 */
+		nonmcv_selec = 1.0 - sumallcommon - nullfrac;
+		CLAMP_PROBABILITY(nonmcv_selec);
+
+		/*
+		 * Approximate the per-value probability of a non-MCV constant by
+		 * dividing the remaining probability mass by the number of other
+		 * distinct values.
+		 */
+		otherdistinct = get_variable_numdistinct(vardata, &isdefault) - sslot.nnumbers;
+		if (otherdistinct > 1)
+			nonmcv_selec /= otherdistinct;
+
+		if (sslot.nnumbers > 0 && nonmcv_selec > sslot.numbers[sslot.nnumbers - 1])
+			nonmcv_selec = sslot.numbers[sslot.nnumbers - 1];
+
+		/* Make sure we build the hash table on the smaller array. */
+		if (sslot.nvalues <= num_elems)
+		{
+			hash_mcv = true;
+			nvaluesHash = sslot.nvalues;
+			nvaluesProbe = num_elems;
+			arrayHash = sslot.values;
+			arrayProbe = elem_values;
+		}
+		else
+		{
+			hash_mcv = false;
+			nvaluesHash = num_elems;
+			nvaluesProbe = sslot.nvalues;
+			arrayHash = elem_values;
+			arrayProbe = sslot.values;
+		}
+
+		fmgr_info(hash_mcv ? hashLeft : hashRight, &hash_proc);
+		InitFunctionCallInfoData(*hash_fcinfo, &hash_proc, 1, collation,
+								 NULL, NULL);
+		hash_fcinfo->args[0].isnull = false;
+
+		hashContext.equal_fcinfo = fcinfo;
+		hashContext.hash_fcinfo = hash_fcinfo;
+		hashContext.insert_mode = true;
+
+		get_typlenbyval(hash_mcv ? sslot.valuetype : nominal_element_type,
+						&hashContext.hash_typlen,
+						&hashContext.hash_typbyval);
+
+		hashTable = MCVInHashTable_create(CurrentMemoryContext,
+										  nvaluesHash,
+										  &hashContext);
+
+		/* Build a hash table over the smaller input side. */
+		for (int i = 0; i < nvaluesHash; i++)
+		{
+			bool		found = false;
+			MCVInHashEntry *entry;
+
+			/*
+			 * When hashing IN-list values (hash_mcv == false), we only insert
+			 * constant, non-NULL elements.  NULL and non-Const elements are
+			 * counted separately, because they cannot participate in MCV
+			 * matching and must be handled later using generic selectivity
+			 * estimation.
+			 */
+			if (!hash_mcv)
+			{
+				if (elem_nulls[i])
+				{
+					nvaluesnonmcv--;
+					continue;
+				}
+
+				if (!elem_const[i])
+				{
+					nvaluesnonmcv--;
+					nvaluesnonconst++;
+					continue;
+				}
+			}
+
+			entry = MCVInHashTable_insert(hashTable, arrayHash[i], &found);
+
+			/*
+			 * entry->count tracks how many times the same value appears, so
+			 * that duplicate IN-list elements can be folded into the
+			 * probability calculation.
+			 */
+			if (likely(!found))
+			{
+				entry->index = i;
+				entry->count = 1;
+			}
+			else
+				entry->count++;
+		}
+
+		hashContext.insert_mode = false;
+		if (hashLeft != hashRight)
+		{
+			fmgr_info(hash_mcv ? hashRight : hashLeft, &hash_proc);
+			/* Resetting hash_fcinfo is probably unnecessary, but be safe */
+			InitFunctionCallInfoData(*hash_fcinfo, &hash_proc, 1, collation,
+									 NULL, NULL);
+			hash_fcinfo->args[0].isnull = false;
+		}
+
+		for (int i = 0; i < nvaluesProbe; i++)
+		{
+			MCVInHashEntry *entry;
+			Selectivity s1;
+			int			nvaluesmcv;
+
+			/*
+			 * When probing with IN-list elements, ignore NULLs and non-Const
+			 * expressions: they cannot be matched against MCVs and will be
+			 * accounted for later by generic estimation.
+			 */
+			if (hash_mcv)
+			{
+				if (elem_nulls[i])
+				{
+					nvaluesnonmcv--;
+					continue;
+				}
+
+				if (!elem_const[i])
+				{
+					nvaluesnonmcv--;
+					nvaluesnonconst++;
+					continue;
+				}
+			}
+
+			entry = MCVInHashTable_lookup(hashTable, arrayProbe[i]);
+
+			/*
+			 * If found, obtain its MCV frequency and remember how many values
+			 * on the hashed side map to this entry.
+			 */
+			if (entry != NULL)
+			{
+				s1 = hash_mcv ? sslot.numbers[entry->index]
+					: sslot.numbers[i];
+
+				nvaluesmcv = entry->count;
+
+				/* Matched values are no longer considered non-MCV */
+				nvaluesnonmcv -= nvaluesmcv;
+			}
+			else
+			{
+				/* No MCV match for this value */
+				continue;
+			}
+
+			/*
+			 * Fold this value's probability into the running ANY/ALL
+			 * selectivity estimate once for each occurrence.
+			 */
+			for (int j = 0; j < nvaluesmcv; j++)
+				accum_scalararray_prob(s1, useOr, isEquality, isInequality,
+									   nullfrac, &selec, &s1disjoint);
+		}
+
+		/*
+		 * Account for constant IN-list values that did not match any MCV.
+		 *
+		 * Each such value is assumed to have probability = nonmcv_selec,
+		 * derived from the remaining (non-MCV) probability mass.
+		 */
+		for (int i = 0; i < nvaluesnonmcv; i++)
+			accum_scalararray_prob(nonmcv_selec, useOr, isEquality, isInequality,
+								   nullfrac, &selec, &s1disjoint);
+
+		/*
+		 * Account for non-Const IN-list elements.
+		 *
+		 * These values cannot be matched against MCVs, so we rely on the
+		 * operator's generic selectivity estimator for each of them.
+		 */
+		for (int i = 0; i < nvaluesnonconst; i++)
+		{
+			Selectivity s1 = var_eq_non_const(vardata, operator, collation,
+											  other_op, var_on_left, isInequality);
+
+			accum_scalararray_prob(s1, useOr, isEquality, isInequality,
+								   nullfrac, &selec, &s1disjoint);
+		}
+
+		/*
+		 * For = ANY or <> ALL, if the IN-list elements are assumed distinct,
+		 * the events are disjoint and the total probability is the sum of
+		 * individual probabilities.  Use that estimate if it lies in [0,1].
+		 */
+		if ((useOr ? isEquality : isInequality) &&
+			s1disjoint >= 0.0 && s1disjoint <= 1.0)
+			selec = s1disjoint;
+
+		CLAMP_PROBABILITY(selec);
+
+		MCVInHashTable_destroy(hashTable);
+		free_attstatsslot(&sslot);
+	}
+
+	return selec;
+}
+
+/*
+ * Accumulate the selectivity contribution of a single array element
+ * into the running ScalarArrayOpExpr selectivity estimate.
+ */
+static void
+accum_scalararray_prob(Selectivity s1,
+					   bool useOr,
+					   bool isEquality,
+					   bool isInequality,
+					   double nullfrac,
+					   double *selec,
+					   double *s1disjoint)
+{
+	Selectivity s2 = s1;
+
+	if (isInequality)
+		s2 = 1.0 - s2 - nullfrac;
+
+	CLAMP_PROBABILITY(s2);
+
+	if (useOr)
+	{
+		*selec = *selec + s2 - (*selec) * s2;
+		if (isEquality)
+			*s1disjoint += s2;
+	}
+	else
+	{
+		*selec = (*selec) * s2;
+		if (isInequality)
+			*s1disjoint += s2 - 1.0;
+	}
+}
+
+/*
+ * Support functions for the hash tables used by eqjoinsel_find_matches
+ */
+static uint32
+hash_mcv_in(MCVInHashTable_hash *tab, Datum key)
+{
+	MCVInHashContext *context = (MCVInHashContext *) tab->private_data;
+	FunctionCallInfo fcinfo = context->hash_fcinfo;
+	Datum		fresult;
+
+	fcinfo->args[0].value = key;
+	fcinfo->isnull = false;
+	fresult = FunctionCallInvoke(fcinfo);
+	Assert(!fcinfo->isnull);
+	return DatumGetUInt32(fresult);
+}
+
+static bool
+mcvs_in_equal(MCVInHashTable_hash *tab, Datum key0, Datum key1)
+{
+	MCVInHashContext *context = (MCVInHashContext *) tab->private_data;
+
+	if (context->insert_mode)
+	{
+		/*
+		 * During the insertion step, any comparisons will be between two
+		 * Datums of the hash table's data type, so if the given operator is
+		 * cross-type it will be the wrong thing to use.  Fortunately, we can
+		 * use datum_image_eq instead.  The MCV values should all be distinct
+		 * anyway, so it's mostly pro-forma to compare them at all.
+		 */
+		return datum_image_eq(key0, key1,
+							  context->hash_typbyval, context->hash_typlen);
+	}
+	else
+	{
+		FunctionCallInfo fcinfo = context->equal_fcinfo;
+		Datum		fresult;
+
+		fcinfo->args[0].value = key0;
+		fcinfo->args[1].value = key1;
+		fcinfo->isnull = false;
+		fresult = FunctionCallInvoke(fcinfo);
+		return (!fcinfo->isnull && DatumGetBool(fresult));
+	}
+}
+
 /*
  * Estimate number of elements in the array yielded by an expression.
  *
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ddbe4c64971..48b886c2ae4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1671,6 +1671,9 @@ MBuf
 MCVHashContext
 MCVHashEntry
 MCVHashTable_hash
+MCVInHashContext
+MCVInHashEntry
+MCVInHashTable_hash
 MCVItem
 MCVList
 MEMORY_BASIC_INFORMATION
-- 
2.34.1

Re: Hash-based MCV matching for large IN-lists

Reply via email to