Re: Statistics Import and Export

Stephen Frost Fri, 08 Mar 2024 04:05:23 -0800

Greetings,

* Corey Huinker ([email protected]) wrote:
> > Having some discussion around that would be useful.  Is it better to
> > have a situation where there are stats for some columns but no stats for
> > other columns?  There would be a good chance that this would lead to a
> > set of queries that were properly planned out and a set which end up
> > with unexpected and likely poor query plans due to lack of stats.
> > Arguably that's better overall, but either way an ANALYZE needs to be
> > done to address the lack of stats for those columns and then that
> > ANALYZE is going to blow away whatever stats got loaded previously
> > anyway and all we did with a partial stats load was maybe have a subset
> > of queries have better plans in the interim, after having expended the
> > cost to try and individually load the stats and dealing with the case of
> > some of them succeeding and some failing.
> 
> It is my (incomplete and entirely second-hand) understanding is that
> pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value
> and then resets it on completion, presumably because analyzing a table
> before its data is loaded and indexes are created would just be a waste of
> time.


No, pg_upgrade starts the postmaster with -b (undocumented
binary-upgrade mode) which prevents autovacuum (and logical replication
workers) from starting, so we don't need to worry about autovacuum
coming in and causing issues during binary upgrade.  That doesn't
entirely eliminate the concerns around pg_dump vs. autovacuum because we
may restore a dump into a non-binary-upgrade-in-progress system though,
of course.

> > Overall, I'd suggest we wait to see what Corey comes up with in terms of
> > doing the stats load for all attributes in a single function call,
> > perhaps using the VALUES construct as you suggested up-thread, and then
> > we can contemplate if that's clean enough to work or if it's so grotty
> > that the better plan would be to do per-attribute function calls.  If it
> > ends up being the latter, then we can revisit this discussion and try to
> > answer some of the questions raised above.
> 
> In the patch below, I ended up doing per-attribute function calls, mostly
> because it allowed me to avoid creating a custom data type for the portable
> version of pg_statistic. This comes at the cost of a very high number of
> parameters, but that's the breaks.

Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

> This raises questions about whether a failure in one attribute update
> statement should cause the others in that relation to roll back or not, and
> I can see situations where both would be desirable.
> 
> I'm putting this out there ahead of the pg_dump / fe_utils work, mostly
> because what I do there heavily depends on how this is received.
> 
> Also, I'm still seeking confirmation that I can create a pg_dump TOC entry
> with a chain of commands (e.g. BEGIN; ...  COMMIT; ) or if I have to fan
> them out into multiple entries.

If we do go with this approach, we'd certainly want to make sure to do
it in a manner which would allow pg_restore's single-transaction mode to
still work, which definitely complicates this some.

Given that and the other general feeling that the locking won't be a big
issue, I'd suggest the simple approach on the pg_dump side of just
dumping the stats out without the BEGIN/COMMIT.

> Anyway, here's v7. Eagerly awaiting feedback.

> Subject: [PATCH v7] Create pg_set_relation_stats, pg_set_attribute_stats.

> diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
> index 291ed876fc..d12b6e3ca3 100644
> --- a/src/include/catalog/pg_proc.dat
> +++ b/src/include/catalog/pg_proc.dat
> @@ -8818,7 +8818,6 @@
>  { oid => '3813', descr => 'generate XML text node',
>    proname => 'xmltext', proisstrict => 't', prorettype => 'xml',
>    proargtypes => 'text', prosrc => 'xmltext' },
> -
>  { oid => '2923', descr => 'map table contents to XML',
>    proname => 'table_to_xml', procost => '100', provolatile => 's',
>    proparallel => 'r', prorettype => 'xml',
> @@ -12163,8 +12162,24 @@
>  
>  # GiST stratnum implementations
>  { oid => '8047', descr => 'GiST support',
> -  proname => 'gist_stratnum_identity', prorettype => 'int2',
> +  proname => 'gist_stratnum_identity',prorettype => 'int2',
>    proargtypes => 'int2',
>    prosrc => 'gist_stratnum_identity' },

Random whitespace hunks shouldn't be included 

> diff --git a/src/backend/statistics/statistics.c 
> b/src/backend/statistics/statistics.c
> new file mode 100644
> index 0000000000..999aebdfa9
> --- /dev/null
> +++ b/src/backend/statistics/statistics.c
> @@ -0,0 +1,360 @@
> +/*------------------------------------------------------------------------- 
> * * statistics.c *
> + * IDENTIFICATION
> + *       src/backend/statistics/statistics.c
> + *
> + *-------------------------------------------------------------------------
> + */

Top-of-file comment should be cleaned up.

> +/*
> + * Set statistics for a given pg_class entry.
> + *
> + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> + *
> + * This does an in-place (i.e. non-transactional) update of pg_class, just as
> + * is done in ANALYZE.
> + *
> + */
> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)
> +{
> +     const char *param_names[] = {
> +             "relation",
> +             "reltuples",
> +             "relpages",
> +     };
> +
> +     Oid                             relid;
> +     Relation                rel;
> +     HeapTuple               ctup;
> +     Form_pg_class   pgcform;
> +
> +     for (int i = 0; i <= 2; i++)
> +             if (PG_ARGISNULL(i))
> +                     ereport(ERROR,
> +                                     
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                      errmsg("%s cannot be NULL", 
> param_names[i])));

Why not just mark this function as strict..?  Or perhaps we should allow
NULLs to be passed in and just not update the current value in that
case?  Also, in some cases we allow the function to be called with a
NULL but then make it a no-op rather than throwing an ERROR (eg, if the
OID ends up being NULL).  Not sure if that makes sense here or not
offhand but figured I'd mention it as something to consider.

> +     pgcform = (Form_pg_class) GETSTRUCT(ctup);
> +     pgcform->reltuples = PG_GETARG_FLOAT4(1);
> +     pgcform->relpages = PG_GETARG_INT32(2);

Shouldn't we include relallvisible?

Also, perhaps we should use the approach that we have in ANALYZE, and
only actually do something if the values are different rather than just
always doing an update.

> +/*
> + * Import statistics for a given relation attribute
> + *
> + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> + *                        stanullfrac float4, stawidth int, stadistinct 
> float4,
> + *                        stakind1 int2, stakind2 int2, stakind3 int3,
> + *                        stakind4 int2, stakind5 int2, stanumbers1 float4[],
> + *                        stanumbers2 float4[], stanumbers3 float4[],
> + *                        stanumbers4 float4[], stanumbers5 float4[],
> + *                        stanumbers1 float4[], stanumbers2 float4[],
> + *                        stanumbers3 float4[], stanumbers4 float4[],
> + *                        stanumbers5 float4[], stavalues1 text,
> + *                        stavalues2 text, stavalues3 text,
> + *                        stavalues4 text, stavalues5 text);
> + *
> + *
> + */

Don't know that it makes sense to just repeat the function declaration
inside a comment like this- it'll just end up out of date.

> +Datum
> +pg_set_attribute_stats(PG_FUNCTION_ARGS)

> +     /* names of columns that cannot be null */
> +     const char *required_param_names[] = {
> +             "relation",
> +             "attname",
> +             "stainherit",
> +             "stanullfrac",
> +             "stawidth",
> +             "stadistinct",
> +             "stakind1",
> +             "stakind2",
> +             "stakind3",
> +             "stakind4",
> +             "stakind5",
> +     };

Same comment here as above wrt NULL being passed in.

> +     for (int k = 0; k < 5; k++)

Shouldn't we use STATISTIC_NUM_SLOTS here?

Thanks!

Stephen

signature.asc
Description: PGP signature

Re: Statistics Import and Export

Reply via email to