On 1 February 2018 at 13:16, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 25 January 2018 at 22:19, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> In any case, since it looks like the next step is for someone to come
>> up with a new proposal, I'm going to set this to Waiting on Author.
>
> Dean and John's results show that different algorithms work better for
> different cases.
>
> How about we make ANALYZE's MCV algorithm pluggable? And then include,
> say, 2 additional algorithms.
>

I don't think we've yet proved that that's needed. I'd rather attempt
to come up with a decent general-purpose algorithm, if possible.

The main problem I have with the originally proposed patch is the lack
of any kind of rigorous justification for the approach of changing the
algorithm to include values that are "significantly more common than
average frequency for values which will not be represented in the MCV
list". So there's no guarantee that the MCVs produced will be backed
by sufficient evidence, and it risks making the too-many-MCVs case
worse.

Of course the current code suffers from both the too-many-MCVs and
too-few-MCVs problems, depending on the data distribution:

For a reasonably uniform distribution with quite a large number of
distinct values, it is possible to generate MCVs on the basis of
having seen only a few instances of the values. Those few occurrences
are not statistically significant, and extrapolating from them to the
whole table produces wildly inaccurate estimates.

For a highly skewed distribution, it is possible for there to be
hardly any values (maybe only one) that appear more than 1.25 times
the average frequency, and so lots of otherwise perfectly good common
values get discarded. For such a distribution, I don't think that the
average frequency is particularly interesting, and it shouldn't be
used to filter the MCV list.
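
To illustrate with made-up numbers: in a 30000-row sample with one
value occurring 10000 times and 999 other values each occurring around
20 times, the average frequency is about 30, so the 1.25x threshold is
37.5 and only the single dominant value qualifies, even though the
other values are seen often enough to make perfectly good MCVs.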

There is also another variant of the too-many-MCVs problem that I
believe is also possible -- if the sample contains a large number of
NULLs or too-wide values, values_cnt could be reduced to the point
where maxmincount is quite small, and again MCVs could get included on
the basis of a very small number of occurrences.

I think it would be better to try to come up with an alternative
algorithm that has a better theoretical basis, and then test that to
see how it holds up in practice.

With that in mind, attached is a patch based on the idea of setting a
bound on the relative standard error on the sample frequency -- so we
include values in the MCV list if and only if they are seen enough
times in the sample that the standard error on the sample frequency is
small compared to the sample frequency itself, and thus we expect that
the relative error resulting from extrapolating that to the whole
table is also small. In theory then, it should never generate too many
MCVs, although tuning of the relative error threshold might be
required to prevent it from generating too few.
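
To give a feel for how that threshold behaves, here is a small
standalone sketch (not part of the patch; the sample and table sizes
are just illustrative assumptions) that evaluates the same mincount
formula the patch uses:

#include <stdio.h>

/*
 * Minimum number of occurrences needed in the sample for the relative
 * standard error of the sample frequency to stay below 10%, for a
 * sample of n rows from a table of N rows (the formula from the patch).
 */
static double
rse_mincount(double n, double N)
{
	return n * (N - n) / (N - n + 0.01 * n * (N - 1));
}

int
main(void)
{
	double		n = 30000;	/* sample size for stats target 100 */

	printf("N =    100000: mincount = %.1f\n", rse_mincount(n, 100000));
	printf("N =  10000000: mincount = %.1f\n", rse_mincount(n, 10000000));
	printf("N =     31000: mincount = %.1f\n", rse_mincount(n, 31000));
	return 0;
}

For a large table this gives a threshold close to 100 occurrences
(about 0.33% of the 30000-row sample), and it falls smoothly towards
zero as the sample covers most of the table, matching the transition
to the complete-MCV-list case.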

Note, this is not a finished patch (e.g., I haven't touched
compute_distinct_stats()). It's merely a possible starting point from
which a lot more testing will be required.

Testing it with the example from [1], it does indeed appear to solve
the too-many-MCVs problem in that particular case (actually generating
no MCVs).

Testing it with the highly skewed example at the start of this thread,
it typically generates a couple more MCVs, producing somewhat better
estimates, but to get really good estimates for who=17, you need to
crank up the stats target. At least it no longer appears to be the
case that cranking up the stats target has only a weak effect.

Regards,
Dean


[1] 
https://www.postgresql.org/message-id/flat/20170522132017.29944.48...@wrigleys.postgresql.org
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
new file mode 100644
index 5f21fcb..af92835
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2557,25 +2557,21 @@ compute_scalar_stats(VacAttrStatsP stats
 		 * Decide how many values are worth storing as most-common values. If
 		 * we are able to generate a complete MCV list (all the values in the
 		 * sample will fit, and we think these are all the ones in the table),
-		 * then do so.  Otherwise, store only those values that are
-		 * significantly more common than the (estimated) average. We set the
-		 * threshold rather arbitrarily at 25% more than average, with at
-		 * least 2 instances in the sample.  Also, we won't suppress values
-		 * that have a frequency of at least 1/K where K is the intended
-		 * number of histogram bins; such values might otherwise cause us to
-		 * emit duplicate histogram bin boundaries.  (We might end up with
-		 * duplicate histogram entries anyway, if the distribution is skewed;
-		 * but we prefer to treat such values as MCVs if at all possible.)
+		 * then do so.  Otherwise, keep only those values that appear
+		 * sufficiently often in the sample that it is reasonable to
+		 * extrapolate their sample frequencies to the entire table. We do
+		 * this by placing an upper bound on the relative standard error of
+		 * the sample frequency, so that any estimates the planner generates
+		 * from these statistics can be expected to be reasonably accurate.
 		 *
 		 * Note: the first of these cases is meant to address columns with
 		 * small, fixed sets of possible values, such as boolean or enum
 		 * columns.  If we can *completely* represent the column population by
 		 * an MCV list that will fit into the stats target, then we should do
 		 * so and thus provide the planner with complete information.  But if
-		 * the MCV list is not complete, it's generally worth being more
-		 * selective, and not just filling it all the way up to the stats
-		 * target.  So for an incomplete list, we try to take only MCVs that
-		 * are significantly more common than average.
+		 * the MCV list is not complete, then we need to be more selective, to
+		 * avoid including values that aren't common enough in the sample to
+		 * generate accurate statistics for the population.
 		 */
 		if (track_cnt == ndistinct && toowide_cnt == 0 &&
 			stats->stadistinct > 0 &&
@@ -2586,24 +2582,39 @@ compute_scalar_stats(VacAttrStatsP stats
 		}
 		else
 		{
-			double		ndistinct_table = stats->stadistinct;
-			double		avgcount,
-						mincount,
-						maxmincount;
+			/*----------
+			 * Discard values whose relative standard error is too high. A
+			 * common rule of thumb when estimating errors in this situation
+			 * is to require at least 10 instances in the sample. This is
+			 * sufficient to allow the distribution of the sample frequency (a
+			 * hypergeometric distribution, since we are doing sampling
+			 * without replacement) to be approximated by a normal
+			 * distribution, and standard error analysis techniques can be
+			 * applied. Then, if the sample size is n, the population size is
+			 * N, and the sample frequency is p=cnt/n, the standard error on p
+			 * is given by
+			 *		SE = sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
+			 * where the second term is the finite population correction. We
+			 * impose an (arbitrarily chosen) upper bound on the relative
+			 * standard error of 10% -- i.e., SE/p < 0.1. This gives a lower
+			 * bound on the number of instances of the value seen:
+			 *		cnt > n*(N-n) / (N-n+0.01*n*(N-1))
+			 * This bound is at most 100, and approaches 0 as n approaches 0
+			 * or N. The case where n approaches 0 isn't actually possible,
+			 * since the sample size is at least 300. The case where n
+			 * approaches N corresponds to sampling most of the table, in
+			 * which case it is reasonable to keep the whole MCV list, as we
+			 * do above. Thus it is reasonable to apply this bound for all
+			 * inputs (even though the formula is technically only valid when
+			 * the right hand side is at least around 10), giving a smooth
+			 * transition from this code branch to the all-values-seen branch
+			 * above.
+			 *----------
+			 */
+			double		n = samplerows;
+			double		N = totalrows;
+			double		mincount = n*(N-n) / (N-n+0.01*n*(N-1));
 
-			/* Re-extract estimate of # distinct nonnull values in table */
-			if (ndistinct_table < 0)
-				ndistinct_table = -ndistinct_table * totalrows;
-			/* estimate # occurrences in sample of a typical nonnull value */
-			avgcount = (double) nonnull_cnt / ndistinct_table;
-			/* set minimum threshold count to store a value */
-			mincount = avgcount * 1.25;
-			if (mincount < 2)
-				mincount = 2;
-			/* don't let threshold exceed 1/K, however */
-			maxmincount = (double) values_cnt / (double) num_bins;
-			if (mincount > maxmincount)
-				mincount = maxmincount;
 			if (num_mcv > track_cnt)
 				num_mcv = track_cnt;
 			for (i = 0; i < num_mcv; i++)
