Hi!

This was initially posted to pgsql-performance in this thread:

  http://www.postgresql.org/message-id/5472416c.3080...@fuzzy.cz

but pgsql-hackers seems like a more appropriate place for further
discussion.

Anyway, attached is v3 of the patch implementing the adaptive ndistinct
estimator. Just like in the previous version, the original estimate is
the one that gets stored/used, and the alternative one is only printed
(via elog WARNING), to make it possible to compare the results.

Changes in this version:

1) implementing compute_minimal_stats

   - So far only the (more common) 'scalar' case was handled.

   - The algorithm requires more detailed input data than the
     MCV-based stats can provide, so the code hashes the values and
     then determines the f1, f2, ..., fN coefficients by sorting and
     walking the array of hashes (a standalone sketch of this
     counting step follows this list).

2) handling wide values properly (they are now counted into f1)

3) compensating for NULL values when calling optimize_estimate

   - The estimator has no notion of NULL values, so it's necessary to
     remove them from both the total row count and the sampled row count.

4) some minor fixes and refactorings
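
To illustrate the counting step mentioned in (1), here's a minimal
standalone sketch (plain C, simplified - no palloc, toy data; the
names are mine, not from the patch) of how the f1, f2, ... counts
are derived from a sorted array of hashes:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int
    hash_cmp(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *) a;
        uint32_t y = *(const uint32_t *) b;

        return (x > y) - (x < y);
    }

    int
    main(void)
    {
        /* hashes of the sampled values (toy data with duplicates) */
        uint32_t hashes[] = {7, 3, 7, 1, 3, 7, 9};
        int      n = sizeof(hashes) / sizeof(hashes[0]);

        /* f[k] = number of values repeated exactly k times */
        int     *f = calloc(n + 1, sizeof(int));
        int      first = 0;     /* index where the current group starts */
        int      i;

        qsort(hashes, n, sizeof(uint32_t), hash_cmp);

        /* equal hashes are now adjacent, so group boundaries give the counts */
        for (i = 1; i < n; i++)
        {
            if (hashes[i] != hashes[i - 1])
            {
                f[i - first] += 1;
                first = i;
            }
        }
        f[n - first] += 1;      /* close the last group */

        /* prints f1 = 2, f2 = 1, f3 = 1 for the toy data */
        for (i = 1; i <= n; i++)
            if (f[i] > 0)
                printf("f%d = %d\n", i, f[i]);

        free(f);
        return 0;
    }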


I also repeated the tests comparing the results to the current estimator
- full results are at the end of the post.

The one interesting case is the 'step skew' with statistics_target=10,
i.e. estimates based on a mere 3000 rows. In that case, the adaptive
estimator significantly overestimates:

    values   current    adaptive
    ------------------------------
    106           99         107
    106            8     6449190
    1006          38     6449190
    10006        327       42441

I don't know why I didn't get these errors in the previous runs - when
I repeat the tests with the old patches, I now get similarly bad
results, with a 'good' result only from time to time. Apparently I had
a lucky day back then :-/

I've been messing with the code for a few hours and I haven't found any
significant error in the implementation, so it seems the estimator
simply does not perform well for very small samples (in this case 3000
rows out of 10,000,000, i.e. ~0.03%).

However, I've been able to come up with a simple way to limit such
errors, because the number of distinct values is naturally bounded by

    (totalrows / samplerows) * ndistinct

where ndistinct is the number of distinct values in the sample. The
reasoning is simple: if you slice the table into (totalrows /
samplerows) chunks of samplerows rows each, then even if every chunk
contained as many distinct values as the sample, all disjoint from the
other chunks, the total couldn't exceed this bound.
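
For illustration, here's the cap as a tiny standalone function (the
name 'ndistinct_cap' is mine, not from the patch; in the patch the
clamping happens at the end of optimize_estimate):

    /*
     * Upper bound on the number of distinct values in the whole table,
     * given the number of distinct values seen in the sample. E.g. with
     * totalrows = 10,000,000, samplerows = 3,000 and sample_ndistinct =
     * 106, the cap is ~353,333 - far below the 6,449,190 overestimate
     * above.
     */
    double
    ndistinct_cap(double totalrows, double samplerows,
                  double sample_ndistinct)
    {
        return (totalrows / samplerows) * sample_ndistinct;
    }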

BTW, this also fixes the issue reported by Jeff Janes on 21/11.

With this additional sanity check, the results look like this:

    values   current    adaptive
    ------------------------------
    106           99         116
    106            8       23331
    1006          38       96657
    10006        327       12400

Which is much better, but clearly still a bit on the high side.

So either the estimator really is a bit unstable for such small samples
(it tends to overestimate a bit in all the tests), or there's a bug in
the implementation - I'd be grateful if someone could peek at the code
and maybe compare it to the paper describing the estimator. I've spent a
fair amount of time analyzing it, but found nothing.

But maybe the estimator really is unstable for such small samples - in
that case we could probably use the current estimator as a fallback.
After all, this only happens when someone explicitly decreases the
statistics target to 10 - with the default statistics target it's damn
accurate.

kind regards
Tomas


statistics_target = 10
======================

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10,001 values, different skew ('k')

    k    current  adaptive
    -----------------------
    1      10231     11259
    2       6327      8543
    3       4364      7707
    4       3436      7052
    5       2725      5868
    6       2223      5071
    7       1979      5011
    8       1802      5017
    9       1581      4546

c) step skew (different numbers of values)

    values   current    adaptive
    ------------------------------
    106           99         107
    106            8     6449190
    1006          38     6449190
    10006        327       42441

   patched:

    values   current    adaptive
    ------------------------------
    106           99         116
    106            8       23331
    1006          38       96657
    10006        327       12400


statistics_target = 100
=======================

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10,001 values, different skew ('k')

    k      current     adaptive
    -----------------------------
    1        10011        10655
    2         9641        10944
    3         8837        10846
    4         8315        10992
    5         7654        10760
    6         7162        10524
    7         6650        10375
    8         6268        10275
    9         5871         9783

c) step skew (different numbers of values)

    values   current    adaptive
    ------------------------------
    106           30          70
    1006         271        1181
    10006       2804       10312


statistics_target = 1000
========================

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10,001 values, different skew ('k')

    k    current     adaptive
    ---------------------------
    3      10001        10002
    4      10000        10003
    5       9996        10008
    6       9985        10013
    7       9973        10047
    8       9954        10082
    9       9932        10100

c) step skew (different numbers of values)

    values   current    adaptive
    ------------------------------
    106          105         113
    1006         958        1077
    10006       9592       10840

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 732ab22..5cf3cf0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -16,6 +16,7 @@
 
 #include <math.h>
 
+#include "access/hash.h"
 #include "access/multixact.h"
 #include "access/transam.h"
 #include "access/tupconvert.h"
@@ -110,6 +111,9 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+static double optimize_estimate(int total_rows, int sample_rows,
+								int *f, int f_max);
+static int hash_comparator(const void *a, const void *b);
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -1849,6 +1853,7 @@ static void compute_scalar_stats(VacAttrStatsP stats,
 					 int samplerows,
 					 double totalrows);
 static int	compare_scalars(const void *a, const void *b, void *arg);
+static int	compare_scalars_simple(const void *a, const void *b, void *arg);
 static int	compare_mcvs(const void *a, const void *b);
 
 
@@ -1967,6 +1972,23 @@ compute_minimal_stats(VacAttrStatsP stats,
 	StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
 
 	/*
+	 * The adaptive ndistinct estimator requires the full set of
+	 * repetition counts - we can't sort the values directly (this code
+	 * path handles data types with just an '=' operator), and the
+	 * MCV-based counting seems insufficient. So instead we compute
+	 * hash values and sort those. We're using just 32-bit hashes,
+	 * which may result in a few collisions - for 30k sampled rows
+	 * (default_statistics_target=100) there's a ~1:10 chance of at
+	 * least one collision (assuming all values are distinct). But that
+	 * seems like a small error compared to the other factors involved
+	 * (sampling, ...) or to the MCV-based counting.
+	 */
+	uint32	   *hashes = (uint32 *) palloc0(samplerows * sizeof(uint32));
+
+	/* number of computed hashes (technically equal to nonnull_cnt) */
+	int			nhashes = 0;
+
+	/*
 	 * We track up to 2*n values for an n-element MCV list; but at least 10
 	 */
 	track_max = 2 * num_mcv;
@@ -2027,6 +2049,36 @@ compute_minimal_stats(VacAttrStatsP stats,
 			total_width += strlen(DatumGetCString(value)) + 1;
 		}
 
+		/* compute the hash value, depending on the data type kind */
+		if (stats->attrtype->typbyval)
+		{
+			/*
+			 * simple pass-by-value type with 'typlen' bytes - hashing
+			 * the first typlen bytes of the Datum assumes they hold the
+			 * value (true on little-endian platforms)
+			 */
+			hashes[nhashes++]
+				= DatumGetUInt32(hash_any((unsigned char *) &value,
+										  stats->attrtype->typlen));
+		}
+		else if (is_varlena)
+		{
+			/* regular varlena data type */
+			hashes[nhashes++]
+				= DatumGetUInt32(hash_any((unsigned char *) VARDATA_ANY(DatumGetPointer(value)),
+										  VARSIZE_ANY_EXHDR(DatumGetPointer(value))));
+		}
+		else if (is_varwidth)
+		{
+			/* pass-by-reference with a variable length (e.g. cstring) */
+			hashes[nhashes++]
+				= DatumGetUInt32(hash_any((unsigned char *) DatumGetCString(value),
+										  strlen(DatumGetCString(value))));
+		}
+		else
+		{
+			/* pass-by-reference with fixed length (e.g. name) */
+			hashes[nhashes++]
+				= DatumGetUInt32(hash_any((unsigned char *) DatumGetCString(value),
+										  stats->attrtype->typlen));
+		}
+
 		/*
 		 * See if the value matches anything we're already tracking.
 		 */
@@ -2082,6 +2134,43 @@ compute_minimal_stats(VacAttrStatsP stats,
 		int			nmultiple,
 					summultiple;
 
+		/* values needed by the adaptive ndistinct estimator */
+		int			f_max = 0;
+		int		   *f_count = (int *) palloc0(sizeof(int) * (nhashes + 1));
+		int			prev_index;
+
+		/* sort the hashes and then count the repetitions */
+		qsort(hashes, nhashes, sizeof(uint32), hash_comparator);
+
+		/*
+		 * Counting repetitions - walk through the sorted array, compare
+		 * each value to the previous one, and whenever it changes we can
+		 * compute the size of the preceding group from the array indexes.
+		 */
+		prev_index = 0;
+		for (i = 1; i < nhashes; i++)
+		{
+			/* the hashes are different - store the repetition count */
+			if (hashes[i] != hashes[i-1])
+			{
+				f_count[i - prev_index] += 1;
+
+				if (f_max < (i - prev_index))
+					f_max = (i - prev_index);
+
+				prev_index = i;
+			}
+		}
+
+		/* the last element is not updated in the loop */
+		f_count[nhashes - prev_index] += 1;
+
+		if (f_max < (nhashes - prev_index))
+			f_max = (nhashes - prev_index);
+
+		/* wide values are assumed to be distinct */
+		f_count[1] += toowide_cnt;
+
 		stats->stats_valid = true;
 		/* Do the simple null-frac and width stats */
 		stats->stanullfrac = (double) null_cnt / (double) samplerows;
@@ -2139,6 +2228,7 @@ compute_minimal_stats(VacAttrStatsP stats,
 			double		numer,
 						denom,
 						stadistinct;
+			double		adaptdistinct;
 
 			numer = (double) samplerows *(double) d;
 
@@ -2152,6 +2242,30 @@ compute_minimal_stats(VacAttrStatsP stats,
 			if (stadistinct > totalrows)
 				stadistinct = totalrows;
 			stats->stadistinct = floor(stadistinct + 0.5);
+
+			/*
+			 * The adaptive estimator only considers non-null values, so we
+			 * need to correct the total / sample row counts to reflect
+			 * this - otherwise the coefficients (f_count / f_max) would be
+			 * out of sync with them. We could probably do the inverse
+			 * thing (include NULL values in f_count) with the same effect.
+			 */
+			adaptdistinct
+				= optimize_estimate(totalrows * (nonnull_cnt / (double)samplerows),
+									nonnull_cnt, f_count, f_max);
+
+			elog(WARNING, "ndistinct estimate current=%.2f adaptive=%.2f",
+						  stadistinct, adaptdistinct);
+
+			/* if we've sampled 'almost' all rows, average d and d/0.95 instead */
+			if (samplerows >= 0.95 * totalrows)
+			{
+				adaptdistinct = (d + d/0.95)/2;
+				elog(WARNING, "corrected ndistinct estimate current=%.2f adaptive=%.2f",
+					 stadistinct, adaptdistinct);
+			}
+
 		}
 
 		/*
@@ -2369,6 +2483,12 @@ compute_scalar_stats(VacAttrStatsP stats,
 		int			slot_idx = 0;
 		CompareScalarsContext cxt;
 
+		/*
+		 * f values for the estimator - we could likely get away with
+		 * much less memory, but keep it simple for now
+		 */
+		int			f_max = 0;	/* max number of duplicates */
+		int		   *f_count = (int *) palloc0(sizeof(int) * (values_cnt + 1));
+		int			first_index = 0;	/* first index of a group */
+
 		/* Sort the collected values */
 		cxt.ssup = &ssup;
 		cxt.tupnoLink = tupnoLink;
@@ -2439,6 +2559,35 @@ compute_scalar_stats(VacAttrStatsP stats,
 			}
 		}
 
+		/*
+		 * Counting repetitions - walk through the sorted array, compare
+		 * each value to the previous one, and whenever it changes we can
+		 * compute the size of the preceding group from the array indexes.
+		 */
+		for (i = 1; i < values_cnt; i++)
+		{
+			/* the values are different - store the repetition count */
+			if (compare_scalars_simple(&values[i], &values[first_index], &cxt) != 0)
+			{
+				/* first element of the next group, so (i - first_index) is the count */
+				f_count[i - first_index] += 1;
+
+				if (f_max < (i - first_index))
+					f_max = (i - first_index);
+
+				first_index = i;
+			}
+		}
+
+		/* the last element is not updated in the loop */
+		f_count[values_cnt - first_index] += 1;
+
+		if (f_max < (values_cnt - first_index))
+			f_max = (values_cnt - first_index);
+
+		/* compensate for wide values (assumed to be distinct) */
+		f_count[1] += toowide_cnt;
+
 		stats->stats_valid = true;
 		/* Do the simple null-frac and width stats */
 		stats->stanullfrac = (double) null_cnt / (double) samplerows;
@@ -2481,6 +2630,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 			double		numer,
 						denom,
 						stadistinct;
+			double		adaptdistinct;	/* adaptive estimate */
 
 			numer = (double) samplerows *(double) d;
 
@@ -2494,6 +2644,29 @@ compute_scalar_stats(VacAttrStatsP stats,
 			if (stadistinct > totalrows)
 				stadistinct = totalrows;
 			stats->stadistinct = floor(stadistinct + 0.5);
+
+			/*
+			 * The adaptive estimator only considers non-null values, so we
+			 * need to correct the total / sample row counts to reflect
+			 * this - otherwise the coefficients (f_count / f_max) would be
+			 * out of sync with them. We could probably do the inverse
+			 * thing (include NULL values in f_count) with the same effect.
+			 */
+			adaptdistinct
+				= optimize_estimate(totalrows * (nonnull_cnt / (double)samplerows),
+									nonnull_cnt, f_count, f_max);
+
+			elog(WARNING, "ndistinct estimate current=%.2f adaptive=%.2f",
+						  stadistinct, adaptdistinct);
+
+			/* if we've sampled 'almost' all rows, average d and d/0.95 instead */
+			if (samplerows >= 0.95 * totalrows)
+			{
+				adaptdistinct = (d + d/0.95)/2;
+				elog(WARNING, "corrected ndistinct estimate current=%.2f adaptive=%.2f",
+					 stadistinct, adaptdistinct);
+			}
 		}
 
 		/*
@@ -2809,6 +2982,21 @@ compare_scalars(const void *a, const void *b, void *arg)
 }
 
 /*
+ * qsort_arg comparator for sorting ScalarItems by value only (unlike
+ * compare_scalars, this does no tupno tie-breaking)
+ */
+static int
+compare_scalars_simple(const void *a, const void *b, void *arg)
+{
+	Datum		da = ((const ScalarItem *) a)->value;
+	Datum		db = ((const ScalarItem *) b)->value;
+
+	CompareScalarsContext *cxt = (CompareScalarsContext *) arg;
+
+	return ApplySortComparator(da, false, db, false, cxt->ssup);
+}
+
+/*
  * qsort comparator for sorting ScalarMCVItems by position
  */
 static int
@@ -2819,3 +3007,100 @@ compare_mcvs(const void *a, const void *b)
 
 	return da - db;
 }
+
+/*
+ * We need to find the value "m" solving the equation
+ *
+ *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions of m. This is equivalent to finding the root of
+ *
+ *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
+ *
+ * Instead of looking for the exact solution (which might be
+ * fractional), we look for a natural number minimizing the absolute
+ * difference between the two sides. The number of distinct elements is
+ * a natural number, and we don't mind if the result is slightly off.
+ * It's just an estimate, after all, and the error from sampling will
+ * be much worse in most cases.
+ *
+ * We know the acceptable values of 'm' lie in [d,N], where 'd' is the
+ * number of distinct elements in the sample and N is the number of
+ * rows in the table (not just the sample). For large tables (billions
+ * of rows) walking that whole range would be quite time-consuming, so
+ * we approximate the solution by gradually increasing the step to ~1%
+ * of the current value of 'm'. This makes the search much faster while
+ * keeping it very accurate.
+ *
+ * All this of course assumes the function behaves reasonably (no
+ * oscillation etc.), but that's a safe assumption, as the estimator
+ * would perform terribly otherwise.
+ */
+static double
+optimize_estimate(int total_rows, int sample_rows, int *f, int f_max)
+{
+	int		i, m;
+	double	A = 0.0, B = 0.0;
+	int		opt_m = 0;
+	double	opt_diff = total_rows;
+	int		step = 1;
+	double	ndistinct;
+	int		d = f[1] + f[2];
+
+	/* compute the 'constant' parts of the equality (A, B) */
+	for (i = 3; i <= f_max; i++)
+	{
+		double p = i / (double)sample_rows;
+		A +=     pow((1.0 - p), sample_rows  ) * f[i];
+		B += i * pow((1.0 - p), sample_rows-1) * f[i];
+		d += f[i];
+	}
+
+	/* find the 'm' value minimizing the difference */
+	for (m = 1; m <= total_rows; m += step)
+	{
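+		/* evaluate both sides of the equation for this value of m */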
+		double k = (f[1] + 2*f[2]);
+		double q = k / (sample_rows * m);
+
+		double A_m = A + m * pow((1 - q), sample_rows  );
+		double B_m = B + k * pow((1 - q), sample_rows-1);
+
+		double diff = fabs(f[1] * (A_m / B_m) - (m - f[1] - f[2]));
+
+		/* if this is a better solution */
+		if (diff < opt_diff)
+		{
+			opt_diff = diff;
+			opt_m = m;
+		}
+
+		/* tweak the step to 1% to make it faster */
+		step = ((int)(0.01 * m) > step) ? (int)(0.01 * m) : step;
+	}
+
+	/* compute the final estimate */
+	ndistinct = d + opt_m - f[1] - f[2];
+
+	/*
+	 * Sanity checks: clamp the estimate to [d, total_rows] and also to
+	 * the (total_rows / sample_rows) * d cap. As d <= sample_rows, the
+	 * cap is the tighter of the two upper bounds; the division is done
+	 * in double to avoid integer truncation.
+	 */
+	if (ndistinct > (double) total_rows / sample_rows * d)
+		ndistinct = (double) total_rows / sample_rows * d;
+	if (ndistinct > total_rows)
+		ndistinct = total_rows;
+	if (ndistinct < d)
+		ndistinct = d;
+
+	return ndistinct;
+}
+
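+/*
+ * plain qsort comparator for the uint32 hash values
+ */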
+static int
+hash_comparator(const void *a, const void *b)
+{
+	uint32 ai = *(uint32*)a;
+	uint32 bi = *(uint32*)b;
+	if (ai < bi)
+		return -1;
+	else if (ai > bi)
+		return 1;
+	else
+		return 0;
+}