On Thu, Jan 29, 2026 at 01:20:10AM -0500, Corey Huinker wrote: > However, pg_stats_ext_exprs is the only means we have of fetching this > data, as it provides the security barrier that pg_statistic_ext_data does > not. Prior to my awareness of the security requirement, I had assumed that > I could just serialize stxdexpr to text and then unpack it > with deconstruct_array, but even that ran into problems with the stavalues > ANYARRAY at the end, wherein the type needed for the each row of > the stxdexpr array is different based on the object definition...so that > wasn't going anywhere even without the security requirement.
I was thinking about the interface a bit more, and using
pg_restore_extended_stats() with one new argument for expressions is
good by me. Let's not introduce a new function.
> I could envision serializing pg_stats_ext_exprs to a kwargs-ish array
> {"null_frac","0.4","avg_width","4","most_common_vals",...} which would at
> least remove ambiguity about which string was meant for a given type of
> stat unpacking, at a cost of some additional overhead. For that matter we
> could collect then into a 1-level json of param_name:textified_value pairs,
> but json inputs were not well received before.
Hmm. This may be a good idea for this input, here, actually. At
least one knows in input what kind of format they are dealing in a
blink. One issue that I do have with the patch as presented in its
current shape is its overall opaqueness. It is assumed that the input
in the argument of the restore function has a specific ordering. This
is also translated in opaqueness when the data is retrieved from
pg_stats_ext_exprs in pg_dump, with a specific ordering of the fields.
This creates a stronger dependency in pg_dump and the backend in terms
of the order of the items. Without docs, it would be much harder to
navigate through it. Dependencies and ndistinct don't really have
this issue: their format is enforced in the input function of their
data type. MCV values have somewhat this opaqueness, but the dump
part is trivial: we query them directly from the catalogs and past
them in the input of the function.
Between a text[] argument like the one you are quoting upthread and a
JSON blob, a JSON blob may have my preference, especially as in this
case the values of each key would be made of individual text values
retrieved from the catalogs. That makes the input more flexible by
design so as we don't depend on the order of the items. It's true
that we don't really need a dependency with the order of the items in
a text[], as long as we have a pair number of items. JSON is easier
to edit IMO, though. This choice comes to personal taste, then. A
text[] or a JSON would be both OK. JSON has my preference also due to
the key/value dependency for expressions.
> If we went to a new function pg_restore_extended_expr_stats() which
> specified all the fields from pg_stats_ext_exprs plus an array index value,
> we'd still have some of the ANYARRAY issues necessitating casting the mcv,
> histogram, and mcelems arrays to a scalar text value, but we'd also have to
> address issues of the stxdexpr array elements being restored out of order
> or missing a leading element.
>
> All of that is a long way of explaining how the fairly opaque array of
> text-casted values won out.
Yeah, that's a bit my worry here. For the automated contents of dumps
and restores, the input is not an issue because well, that's
automated. What I am concerned with is custom data pushed for an
individual object. Contrary to MCV, dependencies and ndistinct, this
stuff is more opaque, meaning that it also makes the code harder to
think about, and somewhat harder to maintain.
Saying that, it's been already a long thread and we have done a lot
for dependencies, ndistinct and MCV. Could you post the remaining
part of the patch about expressions on a new thread, explaining the
problem at hand?
--
Michael
signature.asc
Description: PGP signature
