Re: Additional improvements to extended statistics

Dean Rasheed Wed, 11 Mar 2020 07:22:12 -0700

On Mon, 9 Mar 2020 at 18:19, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:>
> On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:
> >
> >  P(a,b) = P(a) * [f + (1-f)*P(b)]
> >
> >because it might return a value that is larger that P(b), which
> >obviously should not be possible.
>
> Hmmm, yeah. It took me a while to reproduce this - the trick is in
> "inverting" the dependency on a subset of data, e.g. like this:
>
>    create table t (a int, b int);
>    insert into t select mod(i,50), mod(i,25)
>      from generate_series(1,5000) s(i);
>    update t set a = 0 where a < 3;
>    create statistics s (dependencies) on a,b from t;
>
> which then does this:
>
>    test=# explain select * from t where a = 0;
>                         QUERY PLAN
>    ----------------------------------------------------
>     Seq Scan on t  (cost=0.00..86.50 rows=300 width=8)
>       Filter: (a = 0)
>    (2 rows)
>
>    test=# explain select * from t where b = 0;
>                         QUERY PLAN
>    ----------------------------------------------------
>     Seq Scan on t  (cost=0.00..86.50 rows=200 width=8)
>       Filter: (b = 0)
>    (2 rows)
>
>    test=# explain select * from t where a = 0 and b = 0;
>                         QUERY PLAN
>    ----------------------------------------------------
>     Seq Scan on t  (cost=0.00..99.00 rows=283 width=8)
>       Filter: ((a = 0) AND (b = 0))
>    (2 rows)
>
> Which I think is the issue you've described ...
>


I think this is also related to the problem that functional dependency
stats don't take into account the fact that the user clauses may not
be compatible with one another. For example:

CREATE TABLE t (a int, b int);
INSERT INTO t
  SELECT x/10,x/10 FROM (SELECT generate_series(1,x)
                     FROM generate_series(1,100) g(x)) AS t(x);
CREATE STATISTICS s (dependencies) ON a,b FROM t;
ANALYSE t;

EXPLAIN SELECT * FROM t WHERE a = 10;

                    QUERY PLAN
--------------------------------------------------
 Seq Scan on t  (cost=0.00..86.12 rows=1 width=8)
   Filter: (a = 10)
(2 rows)

EXPLAIN SELECT * FROM t WHERE b = 1;

                     QUERY PLAN
----------------------------------------------------
 Seq Scan on t  (cost=0.00..86.12 rows=865 width=8)
   Filter: (b = 1)
(2 rows)

EXPLAIN SELECT * FROM t WHERE a = 10 AND b = 1;

                     QUERY PLAN
----------------------------------------------------
 Seq Scan on t  (cost=0.00..98.75 rows=865 width=8)
   Filter: ((a = 10) AND (b = 1))
(2 rows)

whereas without stats it would estimate 1 row. That kind of
over-estimate could get very bad, so it would be good to find a way to
fix it.


> >We should amend that formula to prevent a result larger than P(b). The
> >obvious way to do that would be to use:
> >
> >  P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))
> >
> >but actually I think it would be better and more principled to use:
> >
> >  P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)
> >
> >I.e., for those rows believed to be functionally dependent, we use the
> >minimum probability, and for the rows believed to be independent, we
> >use the product.
> >
>
> Hmmm, yeah. The trouble is that we currently don't really have both
> selectivities in dependencies_clauselist_selectivity :-(
>

I hacked on this a bit, and I think it's possible to apply dependency
stats in a more general way (not necessarily assuming equality
clauses), something like the attached very rough patch.

This approach guarantees that the result of combining a pair of
selectivities with a functional dependency between them gives a
combined selectivity that is never greater than either individual
selectivity.

One regression test fails, but looking at it, that's to be expected --
the test alters the type of a column, causing its univariate stats to
be dropped, so the single-column estimate is reduced, and the new code
refuses to give a higher estimate than the single clause's new
estimate.


> It's also not clear to me how would this work for more than two clauses,
> that are all functionally dependent. Like (a => b => c), for example.
> But I haven't thought about this very much yet.
>

I attempted to solve that by computing a chain of conditional
probabilities. The maths needs checking over (as I said, this is a
very rough patch). In particular, I think it's wrong for cases like (
a->b, a->c ), but I think it's along the right lines.


> >I think that would solve the problem with the example you gave at the
> >end of [2], but I'm not sure if it helps with the general case.
> >
>
> I don't follow. I think the issue with [2] is that we can't really apply
> stats about the array values to queries on individual array elements.
> Can you explain how would the proposed changes deal with this?
>

With this patch, the original estimate of ~900 rows in that example is
restored with functional dependencies, because of the way it utilises
the minimum selectivity of the 2 clauses.

I've not fully thought this through, but I think it might allow
functional dependencies to be applied to a wider range of operators.

Regards,
Dean

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
new file mode 100644
index e2f6c5b..5bd7bb5
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -30,6 +30,7 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 #include "utils/typcache.h"
 
@@ -73,8 +74,6 @@ static double dependency_degree(int numr
 								AttrNumber *dependency, VacAttrStats **stats, Bitmapset *attrs);
 static bool dependency_is_fully_matched(MVDependency *dependency,
 										Bitmapset *attnums);
-static bool dependency_implies_attribute(MVDependency *dependency,
-										 AttrNumber attnum);
 static bool dependency_is_compatible_clause(Node *clause, Index relid,
 											AttrNumber *attnum);
 static MVDependency *find_strongest_dependency(MVDependencies **dependencies,
@@ -614,19 +613,6 @@ dependency_is_fully_matched(MVDependency
 }
 
 /*
- * dependency_implies_attribute
- *		check that the attnum matches is implied by the functional dependency
- */
-static bool
-dependency_implies_attribute(MVDependency *dependency, AttrNumber attnum)
-{
-	if (attnum == dependency->attributes[dependency->nattributes - 1])
-		return true;
-
-	return false;
-}
-
-/*
  * statext_dependencies_load
  *		Load the functional dependencies for the indicated pg_statistic_ext tuple
  */
@@ -949,20 +935,27 @@ dependencies_clauselist_selectivity(Plan
 									RelOptInfo *rel,
 									Bitmapset **estimatedclauses)
 {
-	Selectivity s1 = 1.0;
+	Selectivity s1;
 	ListCell   *l;
-	Bitmapset  *clauses_attnums = NULL;
-	Bitmapset **list_attnums;
+	Bitmapset  *clauses_attnums;
+	AttrNumber *list_attnums;
 	int			listidx;
-	MVDependencies    **dependencies = NULL;
-	int					ndependencies = 0;
+	MVDependencies **dependencies;
+	int			ndependencies;
+	int			max_deps;
+	MVDependency **dependencies_used;
+	int			ndependencies_used;
+	Bitmapset  *attrs_used;
+	int			nattrs_used;
+	Selectivity *attr_sel;
 	int			i;
+	int			attidx;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
 		return 1.0;
 
-	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
+	list_attnums = (AttrNumber *) palloc(sizeof(AttrNumber) *
 										 list_length(clauses));
 
 	/*
@@ -976,6 +969,7 @@ dependencies_clauselist_selectivity(Plan
 	 * We also skip clauses that we already estimated using different types of
 	 * statistics (we treat them as incompatible).
 	 */
+	clauses_attnums = NULL;
 	listidx = 0;
 	foreach(l, clauses)
 	{
@@ -985,11 +979,11 @@ dependencies_clauselist_selectivity(Plan
 		if (!bms_is_member(listidx, *estimatedclauses) &&
 			dependency_is_compatible_clause(clause, rel->relid, &attnum))
 		{
-			list_attnums[listidx] = bms_make_singleton(attnum);
+			list_attnums[listidx] = attnum;
 			clauses_attnums = bms_add_member(clauses_attnums, attnum);
 		}
 		else
-			list_attnums[listidx] = NULL;
+			list_attnums[listidx] = 0;
 
 		listidx++;
 	}
@@ -1001,6 +995,7 @@ dependencies_clauselist_selectivity(Plan
 	 */
 	if (bms_num_members(clauses_attnums) < 2)
 	{
+		bms_free(clauses_attnums);
 		pfree(list_attnums);
 		return 1.0;
 	}
@@ -1017,6 +1012,7 @@ dependencies_clauselist_selectivity(Plan
 	 * to moving the freed chunks to freelists etc.
 	 */
 	ndependencies = 0;
+	max_deps = 0;
 	dependencies = (MVDependencies **) palloc(sizeof(MVDependencies *) *
 											  list_length(rel->statlist));
 
@@ -1038,104 +1034,183 @@ dependencies_clauselist_selectivity(Plan
 		if (num_matched < 2)
 			continue;
 
-		dependencies[ndependencies++]
+		dependencies[ndependencies]
 			= statext_dependencies_load(stat->statOid);
+
+		max_deps += dependencies[ndependencies]->ndeps;
+		ndependencies++;
 	}
 
 	/* if no matching stats could be found then we've nothing to do */
 	if (!ndependencies)
 	{
+		pfree(dependencies);
+		bms_free(clauses_attnums);
 		pfree(list_attnums);
 		return 1.0;
 	}
 
 	/*
-	 * Apply the dependencies recursively, starting with the widest/strongest
-	 * ones, and proceeding to the smaller/weaker ones. At the end of each
-	 * round we factor in the selectivity of clauses on the implied attribute,
-	 * and remove the clauses from the list.
+	 * Work out which dependencies we can apply, starting with the
+	 * widest/stongest ones, and proceeding to smaller/weaker ones.
 	 */
+	ndependencies_used = 0;
+	dependencies_used = (MVDependency **) palloc(sizeof(MVDependency *) *
+												 max_deps);
+
+	attrs_used = NULL;
+
 	while (true)
 	{
-		Selectivity s2 = 1.0;
 		MVDependency *dependency;
+		AttrNumber	attnum;
 
-		/* the widest/strongest dependency, fully matched by clauses */
+		/* The widest/strongest dependency, fully matched by clauses */
 		dependency = find_strongest_dependency(dependencies, ndependencies,
 											   clauses_attnums);
-
-		/* if no suitable dependency was found, we're done */
 		if (!dependency)
 			break;
 
-		/*
-		 * We found an applicable dependency, so find all the clauses on the
-		 * implied attribute - with dependency (a,b => c) we look for clauses
-		 * on 'c'.
-		 */
-		listidx = -1;
-		foreach(l, clauses)
+		/* Record all attributes used by this dependency */
+		for (i = 0; i < dependency->nattributes; i++)
 		{
-			Node	   *clause;
-			AttrNumber	attnum;
+			attnum = dependency->attributes[i];
+			attrs_used = bms_add_member(attrs_used, attnum);
+		}
 
-			listidx++;
+		/* Ignore any other dependencies with the same implied attribute */
+		clauses_attnums = bms_del_member(clauses_attnums, attnum);
 
-			/*
-			 * Skip incompatible clauses, and ones we've already estimated on.
-			 */
-			if (!list_attnums[listidx])
-				continue;
+		dependencies_used[ndependencies_used++] = dependency;
+	}
 
-			/*
-			 * We expect the bitmaps ton contain a single attribute number.
-			 */
-			attnum = bms_singleton_member(list_attnums[listidx]);
+	if (!ndependencies_used)
+	{
+		pfree(dependencies_used);
+		pfree(dependencies);
+		bms_free(clauses_attnums);
+		pfree(list_attnums);
+		return 1.0;
+	}
 
-			/*
-			 * Technically we could find more than one clause for a given
-			 * attnum. Since these clauses must be equality clauses, we choose
-			 * to only take the selectivity estimate from the final clause in
-			 * the list for this attnum. If the attnum happens to be compared
-			 * to a different Const in another clause then no rows will match
-			 * anyway. If it happens to be compared to the same Const, then
-			 * ignoring the additional clause is just the thing to do.
-			 */
-			if (dependency_implies_attribute(dependency, attnum))
-			{
-				clause = (Node *) lfirst(l);
+	/*
+	 * For each attribute used in the dependencies to be applied, compute the
+	 * selectivity of all the clauses using that attribute, and mark all those
+	 * clauses as having been estimated.
+	 */
+	nattrs_used = bms_num_members(attrs_used);
+	attr_sel = (Selectivity *) palloc(sizeof(Selectivity) * nattrs_used);
 
-				s2 = clause_selectivity(root, clause, varRelid, jointype,
-										sjinfo);
+	attidx = 0;
+	i = -1;
+	while ((i = bms_next_member(attrs_used, i)) >= 0)
+	{
+		List	   *attr_clauses = NIL;
+		Selectivity	simple_sel;
 
-				/* mark this one as done, so we don't touch it again. */
-				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+		listidx = -1;
+		foreach(l, clauses)
+		{
+			Node	   *clause = (Node *) lfirst(l);
 
-				/*
-				 * Mark that we've got and used the dependency on this clause.
-				 * We'll want to ignore this when looking for the next
-				 * strongest dependency above.
-				 */
-				clauses_attnums = bms_del_member(clauses_attnums, attnum);
+			listidx++;
+			if (list_attnums[listidx] == i)
+			{
+				attr_clauses = lappend(attr_clauses, clause);
+				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
 			}
 		}
 
+		simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid,
+												   jointype, sjinfo, NULL);
+		attr_sel[attidx++] = simple_sel;
+	}
+
+	/*
+	 * Now combine these selectivities using the dependency information.  We
+	 * traverse the dependencies in the opposite order to which we found them,
+	 * since dependencies near the start of the list may depend on attributes
+	 * implied by dependencies later in the list (but the opposite cannot
+	 * happen).
+	 *
+	 * Pairs of selectivities for attributes with dependencies are combined
+	 * using the formula
+	 *
+	 * P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
+	 *
+	 * where 'f' is the degree of validity of the dependency.  This ensures
+	 * that the combined selectivity is never greater than either individual
+	 * selectivity.
+	 *
+	 * Where multiple dependencies apply (e.g., a -> b -> c), we use
+	 * conditional probabilities to compute the overall result as follows:
+	 *
+	 * P(a,b,c) = P(c|a,b) * P(a,b) = P(c|a,b) * P(a|b) * P(b)
+	 *
+	 * so we replace the probabilities of all implied attributes with
+	 * conditional probabilities, conditional on all their implying
+	 * attributes.  The probabilities of all other non-implied attributes are
+	 * left as they are.
+	 */
+	for (i = ndependencies_used-1; i >= 0; i--)
+	{
+		MVDependency *dependency = dependencies_used[i];
+		int			j;
+		AttrNumber	attnum;
+		Selectivity	s2;
+		double		f;
+
+		/* Selectivity of all the implying attributes */
+		s1 = 1.0;
+		for (j = 0; j < dependency->nattributes - 1; j++)
+		{
+			attnum = dependency->attributes[j];
+			attidx = bms_member_index(attrs_used, attnum);
+			s1 *= attr_sel[attidx];
+		}
+
+		/* Base selectivity of the implied attribute */
+		attnum = dependency->attributes[j];
+		attidx = bms_member_index(attrs_used, attnum);
+		s2 = attr_sel[attidx];
+
 		/*
-		 * Now factor in the selectivity for all the "implied" clauses into
-		 * the final one, using this formula:
+		 * Replace the probability s2 with the conditional probability s2
+		 * given s1, computed using the formula P(b|a) = P(a,b) / P(a), which
+		 * simplifies to
 		 *
-		 * P(a,b) = P(a) * (f + (1-f) * P(b))
+		 * P(b|a) = f * Min(P(a), P(b)) / P(a) + (1-f) * P(b)
 		 *
-		 * where 'f' is the degree of validity of the dependency.
+		 * where P(a) = s1, the selectivity of the implying attributes, and
+		 * P(b) = s2, the selectivity of the implied attribute.
 		 */
-		s1 *= (dependency->degree + (1 - dependency->degree) * s2);
+		f = dependency->degree;
+
+		if (s1 <= s2)
+			attr_sel[attidx] = f + (1 - f) * s2;
+		else
+			attr_sel[attidx] = f * s2 / s1 + (1 -f) * s2;
 	}
 
+	/*
+	 * The overall selectivity of all the clauses on all these attributes is
+	 * then the product of all these conditional/base probabilities.
+	 */
+	s1 = 1.0;
+	for (i = 0; i < nattrs_used; i++)
+		s1 *= attr_sel[i];
+
+	CLAMP_PROBABILITY(s1);
+
 	/* free deserialized functional dependencies (and then the array) */
 	for (i = 0; i < ndependencies; i++)
 		pfree(dependencies[i]);
 
+	pfree(attr_sel);
+	bms_free(attrs_used);
+	pfree(dependencies_used);
 	pfree(dependencies);
+	bms_free(clauses_attnums);
 	pfree(list_attnums);
 
 	return s1;

Re: Additional improvements to extended statistics

Reply via email to