I wrote:
> Well, we don't have a most common element in this scenario --- the
> whole point is that the occurrence counts resulting from the lossy
> counting algorithm are too low to be trustworthy. However, what we
> do have is the cutoff frequency, and it seems to me that we could use
> that as the estimate of the maximum frequency of the non-MCEs.
Here's a less crude patch for that. The array_typanalyze.c changes
are the same as before, but I reconsidered what to do about this
stumbling block:
> Assuming we like this direction, the main thing that makes this a hack
> not a polished patch is that I had to strongarm the code into storing
> a zero-length values array. What update_attstats would normally do
> is leave the values column of the MCELEM stats slot NULL, which then
> causes get_attstatsslot to throw a that-shouldn't-be-null error.
> An alternative answer is to change get_attstatsslot to allow a null,
> but I'm not sure that that's any cleaner. Either way it seems like
> there's a hazard of breaking some code that isn't expecting the case.
I concluded that letting get_attstatsslot accept nulls is too risky;
there is probably code that assumes that get_attstatsslot would've
rejected that. Instead, this version changes update_attstats' rule
for when to store an array rather than a null. Now it will do so
if the passed-in stavalues pointer isn't null, even if numvalues
is zero. I think that this doesn't change the outcome for any
existing analyze routines because they wouldn't pass a data pointer
if they have no values; and even if it does, storing an empty array
not a null seems like it should be pretty harmless.
> An alternative that feels cleaner but a good bit more invasive
> is to get rid of the convention of storing the min/max/nulls
> frequencies as extra entries in the MCELEM numbers entry ---
> which surely is a hack too --- and put them into some new slot
> type. I'm not sure if that's enough nicer to be worth the
> conversion pain.
A year ago I would have seriously considered doing it that way.
But now that we have code to dump-n-restore stats, that code would
have to be adjusted to convert the old representation. It's not
worth it for this case.
Hence, v1 attached, now with a commit message.
regards, tom lane
From ac1f22903594ce3613d615fe5fba09c77f216769 Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Fri, 18 Jul 2025 17:43:21 -0400
Subject: [PATCH v1] Track the maximum possible frequency of non-MCE array
elements.
The lossy-counting algorithm that ANALYZE uses to identify
most-common array elements may decide that none of the values
are common enough to be called MCEs. In such cases we stored
nothing in the STATISTIC_KIND_MCELEM pg_statistic slot, which
resulted in array_selfuncs.c falling back to default estimates.
However, we do in fact have valuable information to convey:
we know that none of the array elements are as common as the
cutoff frequency that lossy counting reported. If we report
that as the minimum MCE frequency, array_selfuncs.c will
arrive at significantly-better-than-default estimates.
So we want to construct a STATISTIC_KIND_MCELEM entry that contains
no "values" but does have "numbers", to wit the three extra numbers
that the MCELEM entry type defines. A small obstacle is that
update_attstats() has traditionally stored a null, not an empty array,
when passed zero "values" for a slot. That gives rise to an MCELEM
entry that get_attstatsslot() will spit up on. The least risky
solution seems to be to adjust update_attstats() so that it will emit
a non-null (but possibly empty) array when the passed stavalues array
pointer isn't NULL, rather than conditioning that on numvalues > 0.
In other existing cases I don't believe that that changes anything.
For consistency, handle the stanumbers array the same way.
Reported-by: Mark Frost
Author: Tom Lane
Discussion: https://postgr.es/m/ph3ppf1c905d6e6f24a5c1a1a1d8345b593e1...@ph3ppf1c905d6e6.namprd15.prod.outlook.com
---
src/backend/commands/analyze.c | 7 +++
src/backend/utils/adt/array_typanalyze.c | 15 ++-
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7111d5d5334..a47e23eaa8c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1712,10 +1712,9 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
i = Anum_pg_statistic_stanumbers1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
- int nnum = stats->numnumbers[k];
-
- if (nnum > 0)
+ if (stats->stanumbers[k] != NULL)
{
+int nnum = stats->numnumbers[k];
Datum *numdatums = (Datum *) palloc(nnum * sizeof(Datum));
ArrayType *arry;
@@ -1733,7 +1732,7 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
i = Anum_pg_statistic_stavalues1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{