>
>
> I concur with the plan of extracting data from pg_stats not
> pg_statistics, and with emitting a single "set statistics"
> call per attribute.  (I think at one point I'd suggested a call
> per stakind slot, but that would lead to a bunch of UPDATEs on
> existing pg_attribute tuples and hence a bunch of dead tuples
> at the end of an import, so it's not the way to go.  A series
> of UPDATEs would likely also play poorly with a background
> auto-ANALYZE happening concurrently.)
>

That was my reasoning as well.



> I do not like the current design for pg_set_attribute_stats' API
> though: I don't think it's at all future-proof.  What happens when
> somebody adds a new stakind (and hence new pg_stats column)?
> You could try to add an overloaded pg_set_attribute_stats
> version with more parameters, but I'm pretty sure that would
> lead to "ambiguous function call" failures when trying to load
> old dump files containing only the original parameters.


I don't think we'd overload; we'd just add new parameters to the function
signature.
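
To sketch what I mean (a toy stand-in, not the real signature), a
defaulted parameter keeps old dump files loading without any overload:

    -- hypothetical demo function; n_distinct stands in for a stat added
    -- in a later release
    CREATE FUNCTION set_stats_demo(null_frac real,
                                   avg_width integer,
                                   n_distinct real DEFAULT NULL)
    RETURNS void LANGUAGE plpgsql AS $$ BEGIN NULL; END $$;

    SELECT set_stats_demo(0.4, 42);        -- old-style call still resolves
    SELECT set_stats_demo(0.4, 42, -1.0);  -- newer call passes the added stat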


>   The
> present design is also fragile in that an unrecognized parameter
> will lead to a parse-time failure and no function call happening,
> which is less robust than I'd like.


There was a lot of back-and-forth about what sorts of failures were
error-worthy, and which were warn-worthy. I'll discuss further below.


>   As lesser points,
> the relation argument ought to be declared regclass not oid for
> convenience of use,


+1


> and I really think that we need to provide
> the source server's major version number --- maybe we will never
> need that, but if we do and we don't have it we will be sad.
>

The JSON format had it, and I never did use it. I'm not against having it again.


>
> So this leads me to suggest that we'd be best off with a VARIADIC
> ANY signature, where the variadic part consists of alternating
> parameter labels and values:
>
> pg_set_attribute_stats(table regclass, attribute name,
>                        inherited bool, source_version int,
>                        variadic "any") returns void
>
> where a call might look like
>
> SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
>                               false, -- not inherited
>                               16, -- source server major version
>                               -- pairs of labels and values follow
>                               'null_frac', 0.4,
>                               'avg_width', 42,
>                               'histogram_bounds',
>                               array['a', 'b', 'c']::text[],
>                               ...);
>
> Note a couple of useful things here:
>
> * AFAICS we could label the function strict and remove all those ad-hoc
> null checks.  If you don't have a value for a particular stat, you
> just leave that pair of arguments out (exactly as the existing code
> in 0002 does, just using a different notation).  This also means that
> we don't need any default arguments and so no need for hackery in
> system_functions.sql.
>

I'm not sure how strict interacts with variadic parameters. Would the
absence of any variadic arguments trigger it?

Also, going with strict means that an inadvertent explicit NULL in one
parameter would cause the entire attribute import to fail silently. I'd
rather fail loudly.
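
To sketch the failure mode (demo_strict is a made-up function):

    -- a strict function silently returns NULL on any NULL input
    CREATE FUNCTION demo_strict(x int) RETURNS int
    STRICT LANGUAGE sql AS 'SELECT x + 1';

    SELECT demo_strict(NULL);  -- NULL result, no warning, no error

With a strict VARIADIC "any" signature each label/value pair element is
its own argument as far as strictness is concerned (and, as best I can
tell, a call with zero variadic arguments has no NULLs to trip on), so
one stray NULL value would silently skip that attribute's entire import.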



> * If we don't recognize a parameter label at runtime, we can treat
> that as a warning rather than a hard error, and press on.  This case
> would mostly be useful in major version downgrades I suppose, but
> that will be something people will want eventually.
>

Interesting.


> * We can require the calling statement to cast arguments, particularly
> arrays, to the proper type, removing the need for conversions within
> the stats-setting function.  (But instead, it'd need to check that the
> next "any" argument is the type it ought to be based on the type of
> the target column.)
>

So, that's tricky. The type of the values is not always the attribute type.
For expression indexes we call exprType() and exprCollation(), in which
case we always use the expression type over the attribute type, but only
use the expression's collation if the attribute has no collation of its
own. This mimics the behavior of ANALYZE.

Then, for the MCELEM and DECHIST stakinds we have to find the type's
element type, and that has special logic for tsvectors, ranges, and other
non-scalars, borrowing from the various *_typanalyze() functions. For that
matter, the existing typanalyze functions don't grab the < operator, which
I need for later data validations, so using examine_attribute() was
simultaneously overkill and insufficient.

None of this functionality is accessible from a client program, so we'd
have to pull in a lot of backend stuff to pg_dump to make it resolve the
typecasts correctly. Text and array_in() was just easier.
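
To illustrate in terms of the proposed variadic signature:

    -- what the proposal asks the dump to emit: a client-side cast to the
    -- resolved type of the column (or index expression)
    SELECT pg_set_attribute_stats('public.mytable', 'mycolumn', false, 16,
                                  'histogram_bounds',
                                  array['a', 'b', 'c']::text[]);

    -- what the current patch effectively does: accept plain text and
    -- convert server-side via array_in() once the element type is known
    SELECT pg_set_attribute_stats('public.mytable', 'mycolumn', false, 16,
                                  'histogram_bounds', '{a,b,c}');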


> pg_set_relation_stats is simpler in that the set of stats values
> to be set will probably remain fairly static, and there seems little
> reason to allow only part of them to be supplied (so personally I'd
> drop the business about accepting nulls there too).  If we do grow
> another value or values for it to set there shouldn't be much problem
> with overloading it with another version with more arguments.
> Still needs to take regclass not oid though ...
>

I'm still iffy about the silent failures of strict.

I looked it up, and the only change needed to switch from oid to regclass
is in pg_proc.dat (and the docs, of course), so I'm already on board.


> * why is check_relation_permissions looking up the pg_class row?
> There's already a copy of that in the Relation struct.  Likewise
> for the other caller of can_modify_relation (but why is that
> caller not using check_relation_permissions?)  That all looks
> overly complicated and duplicative.  I think you don't need two
> layers of function there.
>

To prove that the caller is the owner (or better) of the table.


>
> * The array manipulations seem to me to be mostly not well chosen.
> There's no reason to use expanded arrays here, since you won't be
> modifying the arrays in-place; all that's doing is wasting memory.
> I'm also noting a lack of defenses against nulls in the arrays.
>

Easily remedied in light of the deconstruct_array() suggestion below, but I
do want to add that value_not_null_array_len() does check for nulls, and
that function is used to generate all but one of the arrays (for that one
we only verify that its length matches the length of the other array).
There's even a regression test that checks it (search for:
"elem_count_histogram null element").


> I'd suggest using deconstruct_array to disassemble the arrays,
> if indeed they need disassembled at all.  (Maybe they don't, see
> next item.)
>

+1


>
> * I'm dubious that we can fully vet the contents of these arrays,
> and even a little dubious that we need to try.  As an example,
> what's the worst that's going to happen if a histogram array isn't
> sorted precisely?  You might get bogus selectivity estimates
> from the planner, but that's no worse than you would've got with
> no stats at all.  (It used to be that selfuncs.c would use a
> histogram even if its contents didn't match the query's collation.
> The comments justifying that seem to be gone, but I think it's
> still the case that the code isn't *really* dependent on the sort
> order being exactly so.)  The amount of hastily-written code in the
> patch for checking this seems a bit scary, and it's well within the
> realm of possibility that it introduces more bugs than it prevents.
> We do need to verify data types, lack of nulls, and maybe
> 1-dimensional-ness, which could break the accessing code at a fairly
> low level; but I'm not sure that we need more than that.
>

A lot of the feedback I got on this patch over the months concerned giving
inaccurate, nonsensical, or malicious data to the planner. Surely the
planner does *some* defensive programming when fetching these values, but
this is the first time those values could be set by a user rather than by
our own internal code. We can try to match types, collations, etc. from
source to destination, but even that would fall victim to another
glibc-level collation change. Verifying that a list the source system
claimed was sorted is still sorted on the destination system is the truest
test we're ever going to get, albeit only over the sampled elements.
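
For example, the sortedness test amounts to something like this, sketched
in SQL (the patch does it in C with the type's "<" operator):

    -- is the claimed-sorted histogram actually sorted under the
    -- destination's comparison semantics?
    SELECT bool_and(h[i] <= h[i + 1]) AS is_sorted
    FROM (SELECT '{a,b,c}'::text[] AS h) s,
         LATERAL generate_series(1, array_length(h, 1) - 1) AS i;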


>
> * There's a lot of ERROR cases that maybe we ought to downgrade
> to WARN-and-press-on, in the service of not breaking the restore
> completely in case of trouble.
>

All cases were made errors precisely to spark debate about which cases we'd
want to continue from and which we'd want to error on. Also, I was under
the impression it was bad form to follow a NOTICE/WARNING with an ERROR in
the same function call.



> * 0002 is confused about whether the tag for these new TOC
> entries is "STATISTICS" or "STATISTICS DATA".  I also think
> they need to be in SECTION_DATA not SECTION_NONE, and I'd be
> inclined to make them dependent on the table data objects
> not the table declarations.  We don't really want a parallel
> restore to load them before the data is loaded: that just
> increases the risk of bad interactions with concurrent
> auto-analyze.
>

SECTION_NONE works the best, but we're getting some situations where
relpages/reltuples/relallvisible get reset to zero in pg_class. Hence the
temporary --no-statistics in the pg_upgrade TAP test.

SECTION_POST_DATA (a previous suggestion) causes something weird: certain
GRANT/REVOKEs end up outside of their expected section.

In work I've done since v15, I tried giving the table stats archive entry a
dependency on every index (and index constraint) as well as the table
itself, thinking that would get us past all resets of pg_class, but it
hasn't worked.


> * It'd definitely not be OK to put BEGIN/COMMIT into the commands
> in these TOC entries.  But I don't think we need to.
>

Agreed. We don't need to; each function call now sinks or swims on its own.


>
> * dumpRelationStats seems to be dumping the relation-level
> stats twice.
>

+1


> * Why exactly are you suppressing testing of statistics upgrade
> in 002_pg_upgrade??
>

Temporary. Related to the pg_class overwrite issue above.
