On Wed, Mar 2, 2016 at 5:46 PM, David Steele <da...@pgmasters.net> wrote:

> On 3/2/16 11:10 AM, Shulgin, Oleksandr wrote:
> > On Wed, Feb 24, 2016 at 12:30 AM, Tomas Vondra
> > <tomas.von...@2ndquadrant.com <mailto:tomas.von...@2ndquadrant.com>>
> wrote:
> >
> >     I think it'd be useful not to have all the changes in one lump, but
> >     structure this as a patch series with related changes in separate
> >     chunks. I doubt we'd like to mix the changes in a single commit, and
> >     it makes the reviews and reasoning easier. So those NULL-handling
> >     fixes should be in one patch, the MCV patches in another one.
> >
> >
> > OK, such a split would make sense to me.  Though, I'm a bit late as the
> > commitfest is already closed to new patches, I guess asking the CF
> > manager to split this might work (assuming I produce the patch files)?
>
> If the patch is broken into two files that gives the review/committer
> more options but I don't think it requires another CF entry.
>

Alright.  I'm attaching the latest version of this patch split in two
parts: the first one is NULLs-related bugfix and the second is the
"improvement" part, which applies on top of the first one.

--
Alex
From a6b1cb866f9374cdc893e9a318959eccaa5bfbc9 Mon Sep 17 00:00:00 2001
From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
Date: Wed, 2 Mar 2016 18:18:36 +0100
Subject: [PATCH 1/2] Account for NULLs in ANALYZE more strictly

Previously the ndistinct and avgcount calculation (for MCV list) could
be affected greatly by high fraction of NULLs in the sample.  Account
for that by subtracting the number of NULLs we've seen from the total
sample size explicitly.

At the same time, values that are considered "too wide" are accounted
for in ndistinct, but removed from sample size for MCV list
calculation.  In compute_distinct_stats() we need to do that manually,
in compute_scalar_stats() the value_cnt is already holding the number
of non-null, not too-wide values.
---
 src/backend/commands/analyze.c | 42 ++++++++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a5f07c..f05b496 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2085,17 +2085,21 @@ compute_distinct_stats(VacAttrStatsP stats,
 						denom,
 						stadistinct;
 
-			numer = (double) samplerows *(double) d;
+			double		samplerows_nonnull = samplerows - null_cnt;
+			double		totalrows_nonnull
+							= totalrows * (1.0 - stats->stanullfrac);
 
-			denom = (double) (samplerows - f1) +
-				(double) f1 *(double) samplerows / totalrows;
+			numer = samplerows_nonnull * (double) d;
+
+			denom = (samplerows_nonnull - f1) +
+				(double) f1 * samplerows_nonnull / totalrows_nonnull;
 
 			stadistinct = numer / denom;
 			/* Clamp to sane range in case of roundoff error */
 			if (stadistinct < (double) d)
 				stadistinct = (double) d;
-			if (stadistinct > totalrows)
-				stadistinct = totalrows;
+			if (stadistinct > totalrows_nonnull)
+				stadistinct = totalrows_nonnull;
 			stats->stadistinct = floor(stadistinct + 0.5);
 		}
 
@@ -2124,16 +2128,17 @@ compute_distinct_stats(VacAttrStatsP stats,
 			/* Track list includes all values seen, and all will fit */
 			num_mcv = track_cnt;
 		}
-		else
+		else if (track_cnt != 0)
 		{
+			int			sample_cnt = nonnull_cnt - toowide_cnt;
 			double		ndistinct = stats->stadistinct;
 			double		avgcount,
 						mincount;
 
 			if (ndistinct < 0)
-				ndistinct = -ndistinct * totalrows;
+				ndistinct = -ndistinct * sample_cnt;
 			/* estimate # of occurrences in sample of a typical value */
-			avgcount = (double) samplerows / ndistinct;
+			avgcount = (double) sample_cnt / ndistinct;
 			/* set minimum threshold count to store a value */
 			mincount = avgcount * 1.25;
 			if (mincount < 2)
@@ -2434,17 +2439,21 @@ compute_scalar_stats(VacAttrStatsP stats,
 						denom,
 						stadistinct;
 
-			numer = (double) samplerows *(double) d;
+			double		samplerows_nonnull = samplerows - null_cnt;
+			double		totalrows_nonnull
+							= totalrows * (1.0 - stats->stanullfrac);
+
+			numer = samplerows_nonnull * (double) d;
 
-			denom = (double) (samplerows - f1) +
-				(double) f1 *(double) samplerows / totalrows;
+			denom = (samplerows_nonnull - f1) +
+				(double) f1 * samplerows_nonnull / totalrows_nonnull;
 
 			stadistinct = numer / denom;
 			/* Clamp to sane range in case of roundoff error */
 			if (stadistinct < (double) d)
 				stadistinct = (double) d;
-			if (stadistinct > totalrows)
-				stadistinct = totalrows;
+			if (stadistinct > totalrows_nonnull)
+				stadistinct = totalrows_nonnull;
 			stats->stadistinct = floor(stadistinct + 0.5);
 		}
 
@@ -2480,21 +2489,18 @@ compute_scalar_stats(VacAttrStatsP stats,
 		}
 		else
 		{
-			double		ndistinct = stats->stadistinct;
 			double		avgcount,
 						mincount,
 						maxmincount;
 
-			if (ndistinct < 0)
-				ndistinct = -ndistinct * totalrows;
 			/* estimate # of occurrences in sample of a typical value */
-			avgcount = (double) samplerows / ndistinct;
+			avgcount = (double) values_cnt / (double) ndistinct;
 			/* set minimum threshold count to store a value */
 			mincount = avgcount * 1.25;
 			if (mincount < 2)
 				mincount = 2;
 			/* don't let threshold exceed 1/K, however */
-			maxmincount = (double) samplerows / (double) num_bins;
+			maxmincount = (double) values_cnt / (double) num_bins;
 			if (mincount > maxmincount)
 				mincount = maxmincount;
 			if (num_mcv > track_cnt)
-- 
2.5.0

From f17117fd0aa3a362294ba51475abc576668ab95d Mon Sep 17 00:00:00 2001
From: Oleksandr Shulgin <oleksandr.shul...@zalando.de>
Date: Wed, 2 Mar 2016 18:54:09 +0100
Subject: [PATCH 2/2] Try to account for skewed distributions in ANALYZE

This is aimed to produce better (more unique) compressed histograms
where appropriate by allowing slightly more values in the MCV lists
than previously.

Another observed effect is that in face of highly skewed distribution
of value frequencies in the sample, the MCV list is more predictable
between the runs and is less dependent on the factors of pure luck.
---
 src/backend/commands/analyze.c | 90 +++++++++++++++++++++++++++++-------------
 1 file changed, 62 insertions(+), 28 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f05b496..4661f7f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2132,26 +2132,35 @@ compute_distinct_stats(VacAttrStatsP stats,
 		{
 			int			sample_cnt = nonnull_cnt - toowide_cnt;
 			double		ndistinct = stats->stadistinct;
-			double		avgcount,
-						mincount;
 
 			if (ndistinct < 0)
 				ndistinct = -ndistinct * sample_cnt;
-			/* estimate # of occurrences in sample of a typical value */
-			avgcount = (double) sample_cnt / ndistinct;
-			/* set minimum threshold count to store a value */
-			mincount = avgcount * 1.25;
-			if (mincount < 2)
-				mincount = 2;
+
+			/* sanity check in case of a very wrong estimate above */
+			if (ndistinct < track_cnt)
+				ndistinct = track_cnt;
+
 			if (num_mcv > track_cnt)
 				num_mcv = track_cnt;
 			for (i = 0; i < num_mcv; i++)
 			{
+				double		avgcount,
+							mincount;
+
+				/* estimate # of occurrences in sample of a typical value */
+				avgcount = (double) sample_cnt / ndistinct;
+
+				/* set minimum threshold count to store a value */
+				mincount = avgcount * 1.25;
+
 				if (track[i].count < mincount)
 				{
 					num_mcv = i;
 					break;
 				}
+
+				sample_cnt -= track[i].count;
+				ndistinct -= 1.0;
 			}
 		}
 
@@ -2479,6 +2488,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * emit duplicate histogram bin boundaries.  (We might end up with
 		 * duplicate histogram entries anyway, if the distribution is skewed;
 		 * but we prefer to treat such values as MCVs if at all possible.)
+		 * We also decrease ndistinct in the process such that going forward
+		 * it refers to the number of distinct values left for the histogram.
 		 */
 		if (track_cnt == ndistinct && toowide_cnt == 0 &&
 			stats->stadistinct > 0 &&
@@ -2486,32 +2497,58 @@ compute_scalar_stats(VacAttrStatsP stats,
 		{
 			/* Track list includes all values seen, and all will fit */
 			num_mcv = track_cnt;
+
+			/* Nothing left for the histogram */
+			num_hist = 0;
+			ndistinct = 0;
 		}
 		else
 		{
-			double		avgcount,
-						mincount,
-						maxmincount;
-
-			/* estimate # of occurrences in sample of a typical value */
-			avgcount = (double) values_cnt / (double) ndistinct;
-			/* set minimum threshold count to store a value */
-			mincount = avgcount * 1.25;
-			if (mincount < 2)
-				mincount = 2;
-			/* don't let threshold exceed 1/K, however */
-			maxmincount = (double) values_cnt / (double) num_bins;
-			if (mincount > maxmincount)
-				mincount = maxmincount;
+			/*
+			 * Starting number of values left for the histogram: samplerows
+			 * sans nulls and too wide ones.
+			 */
+			int			sample_cnt = values_cnt;
+
+			num_hist = ndistinct;
+			if (num_hist > num_bins)
+				num_hist = num_bins + 1;
+
 			if (num_mcv > track_cnt)
 				num_mcv = track_cnt;
 			for (i = 0; i < num_mcv; i++)
 			{
-				if (track[i].count < mincount)
+				if (num_hist >= 2)
 				{
-					num_mcv = i;
-					break;
+					double		avgcount,
+								mincount,
+								maxmincount;
+
+					/* estimate # of occurrences in sample of a typical value */
+					avgcount = (double) sample_cnt / (double) ndistinct;
+
+					/* set minimum threshold count to store a value */
+					mincount = 1.25 * avgcount;
+
+					/* don't let threshold exceed 1/K, however */
+					maxmincount = (sample_cnt - 1) / (double) (num_hist - 1);
+					if (mincount > maxmincount)
+						mincount = maxmincount;
+					if (track[i].count < mincount)
+					{
+						num_mcv = i;
+						break;
+					}
 				}
+
+				/* Narrow our view of samples left for the histogram */
+				sample_cnt -= track[i].count;
+				ndistinct--;
+
+				/* Recalculate histogram size due to lower ndistinct */
+				num_hist = ndistinct;
+				if (num_hist > num_bins)
+					num_hist = num_bins + 1;
 			}
 		}
 
@@ -2554,9 +2591,6 @@ compute_scalar_stats(VacAttrStatsP stats,
 		 * values not accounted for in the MCV list.  (This ensures the
 		 * histogram won't collapse to empty or a singleton.)
 		 */
-		num_hist = ndistinct - num_mcv;
-		if (num_hist > num_bins)
-			num_hist = num_bins + 1;
 		if (num_hist >= 2)
 		{
 			MemoryContext old_context;
-- 
2.5.0

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to