Hi,

Thanks for taking the time to respond. My reading of your comments is
that we are in overall agreement on the design, with the possible
exception of persisting data across restarts. I will write more about
that topic below; but I think if that's the only design disagreement
we have, it makes sense to go forward with committing the patch that I
have, and we can continue to discuss whether we want to add something
related to persistence. The only reason not to do that would be if
there were a consensus that the lack of a persistence framework was
such a critical defect that we shouldn't ship this at all without
that, but I don't agree with that idea and I think it would be a
pretty strong position for someone to take.

On Sun, Mar 29, 2026 at 2:59 PM Lukas Fittl <[email protected]> wrote:
> I think a simple disk file is the way to go, similar to how
> autoprewarm works with its "autoprewarm.blocks" file. It's a bit
> awkward that the file just sits in the main data directory, but since
> pg_prewarm already does it today, I think it's okay to have another
> contrib module do the same. As noted, I'm mainly worried about
> restarts that the user didn't control causing advice that was set to
> be lost.
>
> I've attached a patch showing how that could look on top of your
> v23, copying the modified stash information to a
> "pg_stash_advice.entries" file and loading it after restarts.

I'll be honest: I don't like this design much at all, but I do see the
practical advantages of it, and we have done similar things elsewhere,
in particular in autoprewarm. Before I get to the specifics of your
patch, let me complain about some things that I don't like at the
design level. We lose a lot by directing data through a bespoke
mechanism rather than handling it as table data. There are no
checksums, so we have less protection against corruption. There is no
write-ahead logging, so data does not make it to standbys, which is
more of a potential issue for pg_stash_advice than it is for
autoprewarm. All the code to read and write the file is specific to
this contrib module, so it can have its own bugs separate from every
similar module's bugs. The data can't easily be examined and
manipulated from SQL as table data can. It's just a messy one-off that
solves a practical problem but is otherwise not very nice. Of course,
sometimes such messy one-offs are the right answer.

In terms of the patch itself, the concurrency situation here seems
noticeably worse than with autoprewarm. In that case, there's only one
authorized writer at a time, tracked via pid_using_dumpfile. But in
this case, it seems like multiple backends could be writing to the
temporary dump file at the same time, which could result in a
corrupted file that doesn't reload properly. Your code also has a race
condition when reloading the data: the first arriving backend tries to
reload the flat file, but any other backends that arrive while that's
in progress see no stashed advice, and if the load fails for some
reason, it's never retried, and the first modification to the
in-memory state will clobber the file. autoprewarm has this issue to
some extent as well, but it matters less there, because recreating the
contents of shared buffers is only ever an approximation anyway;
people probably don't want their stashed advice to disappear out from
under them if it was billed as persistent. That said, I'm not entirely
opposed to a design where there's a small window where the advice
stash is empty after a restart, because avoiding that means that it
has to be safe to do the reload of saved advice from the middle of a
query planning cycle, which is probably true with a flat-file design
but wouldn't be true with a table. Still, I don't know whether the
current behavior is deliberate or accidental.

I also feel a bit uncomfortable with the idea of rewriting the entire
file on every single change. If the hypothesis that this is only for
adjusting the behavior of a small number of critical queries is
correct, then it won't matter, but if people start using this for lots
of queries, it's potentially painful. Neither autoprewarm nor
pg_stat_statements does that. pg_stat_statements reads data only at
postmaster startup and writes data only at postmaster shutdown, so it
simply accepts loss of incremental changes in case of a crash, but
that also means it doesn't read and write the file repeatedly.
autoprewarm writes the file periodically from a background worker so
that the on-disk state doesn't drift too far out of sync with what's
in memory, without promising perfect durability. Both of those
placements have the further advantage that the reading and writing of
the file is not being done "in medias res," which does seem to have
certain advantages from a robustness perspective. For example, without
necessarily endorsing this design, suppose you added a background
worker and there are GUCs to configure the database that it connects
to and the query it executes to restore advice stashes. Or,
alternatively, a background worker that still uses a flat file. Either
way, that opens up design ideas like: when you see that the in-memory
stashes are not yet reloaded, you can decide to wait up to X seconds
for that to happen and then proceed anyway if it hasn't happened by
then. I'm not saying that is the right idea here necessarily, but it's
an option, whereas what you've done doesn't lend itself to that sort
of idea.

One other note is that an fscanf() format string ending in a newline
will eat up whitespace at the start of the following line. Since a
stash name can begin with whitespace, that could be an issue.

> Because the number of entries here is controlled by the user (i.e.
> it's not a function of the workload, but a function of how much
> advice you as a user have set), I'm much less worried about memory
> usage, as long as we clearly document how to measure the amount of
> memory used.

The module doesn't have a built-in way to do that right now. Are you
thinking we would document that pg_get_dsm_registry_allocations() can
be used?

> In practice, for a good amount of our user base these days, the
> question will be "Does my cloud provider give me access to create
> stash entries", so it's maybe worth thinking about whether we could
> also allow pg_maintain to manage entries by default?

Wouldn't it make more sense for the cloud provider to grant execute
permissions on these functions to pg_maintain if they are so inclined?
This is a brand-new facility, so I think we had better be conservative
in terms of default permissions.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
