On Thu, Nov 20, 2025 at 11:12 AM Nathan Bossart <[email protected]> wrote:
> ISTM that besides pure coding errors, the most likely reasons for an ERROR
> during DSM segment initialization are things like out-of-memory, too many
> LWLock tranches registered, etc. In those cases, do we want every single
> backend that tries to attach to the segment to retry and most likely fail
> again? And what if the initialization callback completed halfway before
> failing? Do we expect every extension author to carefully craft their
> callbacks to handle that?
Well, I don't want to pretend like I know all the answers here. I do
think it's pretty clear that making it the initialization callback's
job to deal with a partially-failed initialization would be crazy.
What I'd be inclined to do after a failure is tear down the entire
segment. Even if there's no mechanism to retry, you're saving the
memory that the segment would have consumed. And maybe there is a
mechanism to retry. If these initialization callbacks are cheap
enough, maybe having every backend try is not really a big deal --
maybe it's good, insofar as an interactive session would be more
likely to see an error indicating the real cause of failure than a
fake error that basically tells you that at some point in the past
something went wrong. If repeated failures are too painful, another
idea could be to just mark the entry as being in a failed state
(maybe with a timestamp) and let the extension decide when it wants
to press the reset button. Or maybe that's all too complicated for
too little utility. I'm not sure. My perception is that this is
setting a significantly lower standard for error tolerance than what
we typically seek to achieve, but sometimes my perceptions are wrong,
and sometimes better options are hard to come by.

> On the other hand, if we don't handle ERRORs and someone does run into an
> OOM or whatnot, do we want other backends to attach and use the uninitialized
> segment (leading to things like seg-faults)? Tracking whether an entry was
> initialized seems like cheap insurance against these problems.

That's fair.

> Finally, the only report from the field about any of this involves SQLsmith
> allocating tons of LWLock tranches via test_dsa, which has been fixed
> independently. So I have no evidence to suggest that we need to make it
> more robust. In fact, without this one arguably contrived report, I
> probably wouldn't have done anything at all.

Yeah, I'm not clear who is using this or how, so that makes it hard
to judge.

> I'll grant you that this is a judgment call. I tried to find a good
> middle ground, but I would be unsurprised to learn that other committers
> would have done things differently.

Such is life!

--
Robert Haas
EDB: http://www.enterprisedb.com
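P.S. For anyone who wants a concrete picture of the "tear the whole
segment down" idea, here is a rough sketch. It is not the actual
dsm_registry.c code: the entry struct and its fields are invented for
illustration, and it assumes a PG_TRY/PG_CATCH guard around the
extension's init callback.

#include "postgres.h"
#include "storage/dsm.h"

/*
 * Illustrative only: a cut-down stand-in for what a registry entry
 * might track.  The real DSMRegistryEntry looks different.
 */
typedef struct SketchRegistryEntry
{
	dsm_handle	handle;			/* segment handle, or DSM_HANDLE_INVALID */
	bool		initialized;	/* did init_callback complete? */
} SketchRegistryEntry;

/*
 * Run the extension's init_callback.  If it throws, unpin and detach the
 * segment and reset the entry, so a later attach can start over from
 * scratch instead of mapping half-initialized memory.
 */
static void
run_init_callback_guarded(SketchRegistryEntry *entry, dsm_segment *seg,
						  void (*init_callback) (void *ptr))
{
	PG_TRY();
	{
		init_callback(dsm_segment_address(seg));
		entry->initialized = true;
	}
	PG_CATCH();
	{
		/* Tear the whole segment down rather than leave it half-built. */
		dsm_unpin_segment(dsm_segment_handle(seg));
		dsm_detach(seg);

		/* Forget it ever existed, so the next caller recreates it. */
		entry->handle = DSM_HANDLE_INVALID;
		entry->initialized = false;

		PG_RE_THROW();
	}
	PG_END_TRY();
}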
