Re: Statistics Import and Export

Tomas Vondra Wed, 07 Feb 2024 13:47:12 -0800

Hi,

I took a quick look at the v4 patches. I haven't done much testing yet,
so only some basic review.


0001

- The SGML docs for pg_import_rel_stats may need some changes. It starts
with description of what gets overwritten (non-)transactionally (which
gets repeated twice), but that seems more like an implementation detail.
But it does not really say which pg_class fields get updated. Then it
speculates about the possible use case (pg_upgrade). I think it'd be
better to focus on the overall goal of updating statistics, explain what
gets updated/how, and only then maybe mention the pg_upgrade use case.

Also, it says "statistics are replaced" but it's quite clear if that
applies only to matching statistics or if all stats are deleted first
and then the new stuff is inserted. (FWIW remove_pg_statistics clearly
deletes all pre-existing stats).


- import_pg_statistics: I somewhat dislike that we're passing arguments
as datum[] array - it's hard to say what the elements are expected to
be, etc. Maybe we should expand this, to make it clear. How do we even
know the array is large enough?

- I don't quite understand why we need examine_rel_attribute. It sets a
lot of fields in the VacAttrStats struct, but then we only use attrtypid
and attrtypmod from it - so why bother and not simply load just these
two fields? Or maybe I miss something.

- examine_rel_attribute can return NULL, but get_attrinfo does not check
for NULL and just dereferences the pointer. Surely that can lead to
segfaults?

- validate_no_duplicates and the other validate functions would deserve
a better docs, explaining what exactly is checked (it took me a while to
realize we check just for duplicates), what the parameters do etc.

- Do we want to make the validate_ functions part of the public API? I
realize we want to use them from multiple places (regular and extended
stats), but maybe it'd be better to have an "internal" header file, just
like we have extended_stats_internal?

- I'm not sure we do "\set debug f" elsewhere. It took me a while to
realize why the query outputs are empty ...


0002

- I'd rename create_stat_ext_entry to statext_create_entry.

- Do we even want to include OIDs from the source server? Why not to
just have object names and resolve those? Seems safer - if the target
server has the OID allocated to a different object, that could lead to
confusing / hard to detect issues.

- What happens if we import statistics which includes data for extended
statistics object which does not exist on the target machine?

- pg_import_ext_stats seems to not use require_match_oids - bug?


0003

- no SGML docs for the new tools?

- The help() seems to be wrong / copied from "clusterdb" or something
like that, right?


On 2/2/24 09:37, Corey Huinker wrote:
> (hit send before attaching patches, reposting message as well)
> 
> Attached is v4 of the statistics export/import patch.
> 
> This version has been refactored to match the design feedback received
> previously.
> 
> The system views are gone. These were mostly there to serve as a baseline
> for what an export query would look like. That role is temporarily
> reassigned to pg_export_stats.c, but hopefully they will be integrated into
> pg_dump in the next version. The regression test also contains the version
> of each query suitable for the current server version.
> 

OK

> The export format is far closer to the raw format of pg_statistic and
> pg_statistic_ext_data, respectively. This format involves exporting oid
> values for types, collations, operators, and attributes - values which are
> specific to the server they were created on. To make sense of those values,
> a subset of the columns of pg_type, pg_attribute, pg_collation, and
> pg_operator are exported as well, which allows pg_import_rel_stats() and
> pg_import_ext_stats() to reconstitute the data structure as it existed on
> the old server, and adapt it to the modern structure and local schema
> objects.

I have no opinion on the proposed format - still JSON, but closer to the
original data. Works for me, but I wonder what Tom thinks about it,
considering he suggested making it closer to the raw data.

> 
> pg_import_rel_stats matches up local columns with the exported stats by
> column name, not attnum. This allows for stats to be imported when columns
> have been dropped, added, or reordered.
> 

Makes sense. What will happen if we try to import data for extended
statistics (or index) that does not exist on the target server?

> pg_import_ext_stats can also handle column reordering, though it currently
> would get confused by changes in expressions that maintain the same result
> data type. I'm not yet brave enough to handle importing nodetrees, nor do I
> think it's wise to try. I think we'd be better off validating that the
> destination extended stats object is identical in structure, and to fail
> the import of that one object if it isn't perfect.
> 

Yeah, column reordering is something we probably need to handle. The
stats order them by attnum, so if we want to allow import on a system
where the attributes were dropped/created in a different way, this is
necessary. I haven't tested this - is there a regression test for this?

I agree expressions are hard. I don't think it's feasible to import
nodetree from other server versions, but why don't we simply deparse the
expression on the source, and either parse it on the target (and then
compare the two nodetrees), or deparse the target too and compare the
two deparsed expressions? I suspect the deparsing may produce slightly
different results on the two versions (causing false mismatches), but
perhaps the deparse on source + parse on target + compare nodetrees
would work? Haven't tried, though.

> Export formats go back to v10.
> 

Do we even want/need to go beyond 12? All earlier versions are EOL.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Statistics Import and Export

Reply via email to