On Mon, Dec 25, 2023 at 8:18 PM Tomas Vondra <tomas.von...@enterprisedb.com>
wrote:

> Hi,
>
> I finally had time to look at the last version of the patch, so here's a
> couple thoughts and questions in somewhat random order. Please take this
> as a bit of a brainstorming and push back if you disagree with some of my
> comments.
>
> In general, I like the goal of this patch - not having statistics is a
> common issue after an upgrade, and people sometimes don't even realize
> they need to run analyze. So, it's definitely worth improving.
>
> I'm not entirely sure about the other use case - allowing people to
> tweak optimizer statistics on a running cluster, to see what would be
> the plan in that case. Or more precisely - I agree that would be an
> interesting and useful feature, but maybe the interface should not be
> the same as for the binary upgrade use case?
>
>
> interfaces
> ----------
>
> When I thought about the ability to dump/load statistics in the past, I
> usually envisioned some sort of DDL that would do the export and import.
> So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
> or something like that, and that'd do all the work. This would mean
> stats are "first-class citizens" and it'd be fairly straightforward to
> add this into pg_dump, for example. Or at least I think so ...
>
> Alternatively we could have the usual "functional" interface, with
> functions to export/import statistics, replacing the DDL commands.
>
> Unfortunately, none of this works for the pg_upgrade use case, because
> existing cluster versions would not support this new interface, of
> course. That's a significant flaw, as it'd make this useful only for
> upgrades of future versions.
>

This was the reason I settled on the interface that I did: while we can
create whatever interface we want for importing the statistics, we need to
be able to extract stats from existing databases using only the facilities
available in those databases, and then store them in a medium that can be
conveyed across databases, either as text files or by saving them off in a
side table prior to upgrade. JSONB met those criteria.
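
To make that concrete, here is roughly the shape of the idea, using nothing
but plain SQL that already works on existing releases. The table name and
the choice of pg_stats columns are only for illustration, not the actual
export format in the patch:

  -- Stash per-column stats as JSONB in a side table that survives until
  -- after the upgrade (or gets dumped to a text file).
  CREATE TABLE stats_stash AS
  SELECT schemaname, tablename, attname,
         to_jsonb(s) - 'schemaname' - 'tablename' - 'attname' AS stats
  FROM pg_stats s
  WHERE schemaname NOT IN ('pg_catalog', 'information_schema');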


>
> So I think for the pg_upgrade use case, we don't have much choice other
> than using "custom" export through a view, which is what the patch does.
>
> However, for the other use case (tweaking optimizer stats) this is not
> really an issue - that always happens on the same instance, so no issue
> with not having the "export" function and so on. I'd bet there are more
> convenient ways to do this than using the export view. I'm sure it could
> share a lot of the infrastructure, ofc.
>

So, there is a third use case - foreign data wrappers. When analyzing a
foreign table, at least one on a foreign server in the postgres_fdw family,
we should be able to send a query specific to the version and dialect of
that server, get back the JSONB, and import those results. That use case
may be more tangible to you than the tweak/tuning case.
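
To make it concrete, this is the flavor of round trip I mean, with dblink
standing in for whatever postgres_fdw would actually do internally; the
remote query and the JSONB shape are placeholders:

  -- Requires the dblink extension; connection string and table names are
  -- made up for the example.
  SELECT stats
  FROM dblink('host=remotehost dbname=remotedb',
              $$SELECT to_jsonb(s) FROM pg_stats s
                WHERE schemaname = 'public' AND tablename = 'measurements'$$)
       AS t(stats jsonb);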


>
>
>
> JSON format
> -----------
>
> As for the JSON format, I wonder if we need that at all? Isn't that an
> unnecessary layer of indirection? Couldn't we simply dump pg_statistic
> and pg_statistic_ext_data in CSV, or something like that? The amount of
> new JSONB code seems to be very small, so it's OK I guess.
>

I see a few problems with dumping pg_statistic[_ext_data]. The first is
that the importer would then have to understand all of the past formats of
those two tables. The next is that the tables are chock full of OIDs that
don't necessarily carry forward. I could see us having a text-ified version
of those two tables, but we'd need that for all previous iterations of
those table formats. Instead, I put the burden on the stats export to
de-OID the data and make it *_in() function friendly.
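
As a sketch of what I mean by de-oiding: resolve the catalog OIDs to their
textual forms, so the receiving side can re-parse them with the reg* input
functions. Column names and casts here assume a recent release (stacollN is
v12+, regcollation is v13+):

  SELECT s.starelid::regclass::text           AS relation,
         a.attname,
         format_type(a.atttypid, a.atttypmod) AS column_type,
         s.staop1::regoperator::text          AS op1,  -- e.g. <(integer,integer)
         s.stacoll1::regcollation::text       AS coll1
  FROM pg_statistic s
  JOIN pg_attribute a ON a.attrelid = s.starelid AND a.attnum = s.staattnum;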


> That's pretty much why I envisioned a format "grouping" the arrays for a
> particular type of statistics (MCV, histogram) into the same object, as
> for example in
>
>  {
>    "mcv" : {"values" : [...], "frequencies" : [...]}
>    "histogram" : {"bounds" : [...]}
>  }
>

I agree that would be a lot more readable, and probably a lot more
debuggable. But I went into this unsure if there could be more than one
stats slot of a given kind per table. Knowing that they must be unique
helps.


> But that's probably much harder to generate from plain SQL (at least I
> think so, I haven't tried).
>

I think it would be harder, but far from impossible.
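
Something roughly like this, assuming for the sake of the sketch that slot 1
holds the MCV list and slot 2 the histogram; real code would have to inspect
each stakindN rather than hard-code the positions:

  SELECT jsonb_build_object(
           'mcv',       jsonb_build_object('values',      stavalues1::text,
                                           'frequencies', to_jsonb(stanumbers1)),
           'histogram', jsonb_build_object('bounds',      stavalues2::text)
         ) AS stats
  FROM pg_statistic
  WHERE stakind1 = 1 AND stakind2 = 2;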



> data missing in the export
> --------------------------
>
> I think the data needs to include more information. Maybe not for the
> pg_upgrade use case, where it's mostly guaranteed not to change, but for
> the "manual tweak" use case it can change. And I don't think we want two
> different formats - we want one, working for everything.
>

I"m not against this at all, and I started out doing that, but the
qualified names of operators got _ugly_, and I quickly realized that what I
was generating wouldn't matter, either the input data would make sense for
the attribute's stats or it would fail trying.


> Consider for example about the staopN and stacollN fields - if we clone
> the stats from one table to the other, and the table uses different
> collations, will that still work? Similarly, I think we should include
> the type of each column, because it's absolutely not guaranteed the
> import function will fail if the type changes. For example, if the type
> changes from integer to text, that will work, but the ordering will
> absolutely not be the same. And so on.
>

I can see including the type of the column; that's a lot cleaner than the
operator names, for sure. And I can see us rejecting stats, or sections of
stats, in certain situations. Like in your example, if the collation
changed, then reject all "<" operator stats but keep the "=" ones.
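
For the column-type part, the importer's check could be as simple as
comparing the exported type name against the target column (the table and
column names here are made up):

  SELECT format_type(a.atttypid, a.atttypmod) = 'integer' AS type_matches
  FROM pg_attribute a
  WHERE a.attrelid = 'public.measurements'::regclass
    AND a.attname = 'sensor_id';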


> For the extended statistics export, I think we need to include also the
> attribute names and expressions, because these can be different between
> the statistics. And not only that - the statistics values reference the
> attributes by positions, but if the two tables have the attributes in a
> different order (when ordered by attnum), that will break stuff.
>

Correct me if I'm wrong, but I thought expression parse trees change _a
lot_ from version to version?

Attribute reordering is a definite vulnerability of the current
implementation, so an attribute name export might be a way to mitigate that.
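
Something like this on the export side would probably be enough for the
importer to remap by name. This is just an illustrative query, not the
patch's view: stxexprs only exists in v14+, and the attribute ordering
really ought to follow stxkeys rather than attnum:

  SELECT e.stxrelid::regclass::text AS relation,
         e.stxname,
         (SELECT array_agg(a.attname ORDER BY a.attnum)
          FROM pg_attribute a
          WHERE a.attrelid = e.stxrelid
            AND a.attnum = ANY (e.stxkeys)) AS attnames,
         pg_get_expr(e.stxexprs, e.stxrelid) AS expressions
  FROM pg_statistic_ext e;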


>
> * making sure the frequencies in MCV lists are not obviously wrong
>   (outside [0,1], sum exceeding 1.0, etc.)
>

+1
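
That is, effectively this check, with the literal array standing in for the
frequencies array pulled out of the JSONB:

  SELECT bool_and(f BETWEEN 0 AND 1) AND sum(f) <= 1.0 AS mcv_freqs_ok
  FROM unnest(ARRAY[0.31, 0.12, 0.05]::float4[]) AS f;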


>
> * cross-checking that stanumbers/stavalues make sense (e.g. that MCV has
>   both arrays while histogram has only stavalues, that the arrays have
>   the same length for MCV, etc.)
>

To this end, there's an edge-case hack in the code where I have to derive
the array elemtype. I had thought that examine_attribute() or
std_typanalyze() would do that for me, but they didn't. I'd very much like
your input there.


>
> * checking there are no duplicate stakind values (e.g. two MCV lists)
>

Per previous comment, it's good to learn these restrictions.


> Not sure if all the checks need to be regular elog(ERROR), perhaps some
> could/should be just asserts.
>

For this first pass, the errors were one-size-fits-all, save for the
WARNING vs. ERROR distinction.


>
>
> minor questions
> ---------------
>
> 1) Should the views be called pg_statistic_export or pg_stats_export?
> Perhaps pg_stats_export is better, because the format is meant to be
> human-readable (rather than 100% internal).
>

I have no opinion on what the best name would be, and will go with
consensus.


>
> 2) It's not very clear what "non-transactional update" of pg_class
> fields actually means. Does that mean we update the fields in-place,
> can't be rolled back, is not subject to MVCC or what? I suspect users
> won't know unless the docs say that explicitly.
>

Correct. Cannot be rolled back, not subject to MVCC.
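
To spell out what users would observe (the function name and JSON keys below
are placeholders, not the patch's actual signature):

  BEGIN;
  -- hypothetical import call that also sets relpages/reltuples in pg_class
  SELECT pg_import_rel_stats('public.measurements'::regclass,
                             '{"relpages": 1000, "reltuples": 50000}'::jsonb);
  ROLLBACK;
  -- the pg_statistic changes are rolled back, but the in-place pg_class
  -- update is not
  SELECT relpages, reltuples
  FROM pg_class
  WHERE oid = 'public.measurements'::regclass;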



> 3) The "statistics.c" code should really document the JSON structure. Or
> maybe if we plan to use this for other purposes, it should be documented
> in the SGML?
>

I agree, but I also didn't expect the format to survive first contact with
reviewers, so I held back.


>
> 4) Why do we need the separate "replaced" flags in import_stakinds? Can
> it happen that collreplaces/opreplaces differ from kindreplaces?
>

That was initially done to maximize the amount of code that could be copied
from do_analyze(). In retrospect, I like how extended statistics just
deletes all the pg_statistic_ext_data rows and replaces them, and I would
like to do the same for pg_statistic before this is all done.


>
> 5) What happens if we import statistics for a table that already has
> some statistics? Will this discard the existing statistics, or will this
> merge them somehow? (I think we should always discard the existing
> stats, and keep only the new version.)
>

In the case of pg_statistic_ext_data, the stats are thrown out and replaced
by the imported ones.

In the case of pg_statistic, it's basically an upsert, and any values that
were missing in the JSON are not updated on the existing row. That's
appealing in a tweak situation where you want to alter only one or two bits
of a stat, but it's not really useful in other situations. Per previous
comment, I'd prefer a clean slate and to force tweaking use cases to fill
in all the blanks.
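
To make the current upsert behavior concrete, a call like this (again with a
hypothetical function name and JSON keys) would overwrite just null_frac and
n_distinct and leave the rest of the existing pg_statistic row alone:

  SELECT pg_import_attribute_stats(
           'public.measurements'::regclass, 'sensor_id',
           '{"null_frac": 0.02, "n_distinct": 1500}'::jsonb);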


>
> 6) What happens if we import extended stats with mismatching definition?
> For example, what if the "new" statistics object does not have "mcv"
> enabled, but the imported data do include MCV? What if the statistics do
> have the same number of "dimensions" but not the same number of columns
> and expressions?
>

The importer is currently driven by the types of stats to be expected for
that pg_attribute/pg_statistic_ext. It only looks for things that are
possible for that stat type, and any extra JSON values are ignored.
