Hi,

Thanks for taking the time to respond. My reading of your comments is that we are in overall agreement on the design, with the possible exception of persisting data across restarts. I will write more about that topic below; but if that's the only design disagreement we have, I think it makes sense to go forward with committing the patch that I have, and we can continue to discuss whether we want to add something related to persistence. The only reason not to do that would be a consensus that the lack of a persistence framework is such a critical defect that we shouldn't ship this at all without one, but I don't agree with that, and I think it would be a pretty strong position for someone to take.
On Sun, Mar 29, 2026 at 2:59 PM Lukas Fittl <[email protected]> wrote:
> I think a simple disk file is the way to go, similar to how
> autoprewarm works with its "autoprewarm.blocks" file. Its a bit
> awkward that that just sits in the main data directory, but since
> pg_prewarm already does it today, I think its okay to have another
> contrib module do the same. As noted I'm mainly worried about restarts
> that the user didn't control, causing advice that was set to be lost.
>
> I've attached a patch of how that could look like on top of your v23,
> that copies the modified stash information to a
> "pg_stash_advice.entries" file, and loads it after restarts.

I'll be honest: I don't like this design much at all, but I do see the practical advantages of it, and we have done similar things elsewhere, in particular in autoprewarm. Before I get to the specifics of your patch, let me complain about some things that I don't like at the design level.

We lose a lot by directing data through a bespoke mechanism rather than handling it as table data. There are no checksums, so we have less protection against corruption. There is no write-ahead logging, so the data does not make it to standbys, which is more of a potential issue for pg_stash_advice than it is for autoprewarm. All the code to read and write the file is specific to this contrib module, so it can have its own bugs separate from every similar module's bugs. And the data can't easily be examined and manipulated from SQL as table data can. It's just a messy one-off that solves a practical problem but is otherwise not very nice. Of course, sometimes such messy one-offs are the right answer.

In terms of the patch itself, the concurrency situation here seems noticeably worse than with autoprewarm. In that case, there's only one authorized writer at a time, tracked via pid_using_dumpfile.
But in this case, it seems like multiple backends could be writing to the temporary dump file at the same time, which could result in a corrupted file that doesn't reload properly.

Your code also has a race condition when reloading the data: the first arriving backend tries to reload the flat file, but any other backends that arrive while that's in progress see no stashed advice; and if the load fails for some reason, it's never retried, and the first modification to the in-memory state will clobber the file. autoprewarm has this issue to some extent as well, but that's more OK there, because recreating the contents of shared buffers is only an approximate good. People probably don't want their stashed advice to disappear out from under them if it was billed as persistent. That said, I'm not entirely opposed to a design where there's a small window after a restart during which the advice stash is empty, because avoiding that means it has to be safe to do the reload of saved advice from the middle of a query planning cycle -- which is probably true with a flat-file design but wouldn't be true with a table. Still, I don't know whether the current behavior is deliberate or accidental.

I also feel a bit uncomfortable with the idea of rewriting the entire file on every single change. If the hypothesis that this is only for adjusting the behavior of a small number of critical queries is correct, then it won't matter; but if people start using this for lots of queries, it's potentially painful. Neither autoprewarm nor pg_stat_statements does that. pg_stat_statements reads data only at postmaster startup and writes data only at postmaster shutdown, so it simply accepts the loss of incremental changes in case of a crash, but that also means it doesn't read and write the file repeatedly. autoprewarm writes the file periodically from a background worker so that the on-disk state doesn't drift too far out of sync with what's in memory, without promising perfect durability.
Both of those placements have the further advantage that the reading and writing of the file is not being done "in medias res," which does seem to have certain advantages from a robustness perspective. For example, without necessarily endorsing this design, suppose you added a background worker, with GUCs to configure the database that it connects to and the query it executes to restore advice stashes. Or, alternatively, a background worker that still uses a flat file. Either way, that opens up design ideas like: when you see that the in-memory stashes are not yet reloaded, you can decide to wait up to X seconds for that to happen and then proceed anyway if it hasn't happened by then. I'm not saying that is necessarily the right idea here, but it's an option, whereas what you've done doesn't lend itself to that sort of idea.

One other note: a fscanf() format string that ends in a newline will consume any whitespace at the start of the following line, not just the newline itself. Since a stash name can begin with whitespace, that could be an issue.

> Because the number of entries here is controlled by the user (i.e. its
> not a function of the workload, but a function of how much advice you
> as a user have set), I'm much less worried about memory usage, as long
> as we document it clearly how to measure the amount of memory used.

The module doesn't have a built-in way to do that right now. Are you thinking we would document that pg_get_dsm_registry_allocations() can be used?

> In practice for a good amount of our user base these days the question
> will be "Does my cloud provider give me access to create stash
> entries", so its maybe worth thinking about if we could also allow
> pg_maintain to manage entries by default?

Wouldn't it make more sense for the cloud provider to grant execute permission on these functions to pg_maintain if they are so inclined? This is a brand-new facility, so I think we had better be conservative in terms of default permissions.
--
Robert Haas
EDB: http://www.enterprisedb.com
