On Mon, Nov 13, 2023 at 5:14 PM yuansong <yyuans...@126.com> wrote:
>
> Enhancing the overall fault tolerance of the entire system for this feature
> is quite important. No one can avoid bugs, and I don't believe that this
> approach is a more advanced one. It might be worth considering adding it to
> the roadmap so that interested parties can conduct relevant research.
>
> The current major issue is that when one process crashes, resetting all
> connections has a significant impact on other connections. Is it possible to
> only disconnect the crashed connection and have the other connections simply
> roll back the current transaction without reconnecting? Perhaps this problem
> can be mitigated through the use of a connection pool.

It's not about the other connections; it's that the crashed connection has no
way to roll back.

>
> If we want to implement this feature, would it be sufficient to clean up or
> restore the shared memory and disk changes caused by the crashed backend?
> Besides clearing various known locks, what else needs to be changed? Does
> anyone have any insights? Please help me. Thank you.
>
> At 2023-11-13 13:53:29, "Laurenz Albe" <laurenz.a...@cybertec.at> wrote:
> >On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
> >> yuansong <yyuans...@126.com> writes:
> >> > In PostgreSQL, when a backend process crashes, it can cause other backend
> >> > processes to also require a restart, primarily to ensure data
> >> > consistency.
> >> > I understand that the correct approach is to analyze and identify the
> >> > cause of the crash and resolve it. However, it is also important to be
> >> > able to handle a backend process crash without affecting the operation of
> >> > other processes, thus minimizing the scope of negative impact and
> >> > improving availability. To achieve this goal, could we mimic the Oracle
> >> > process by introducing a "pmon" process dedicated to rolling back crashed
> >> > process transactions and performing resource cleanup? I wonder if anyone
> >> > has attempted such a strategy or if there have been previous discussions
> >> > on this topic.
> >>
> >> The reason we force a database-wide restart is that there's no way to
> >> be certain that the crashed process didn't corrupt anything in shared
> >> memory. (Even with the forced restart, there's a window where bad
> >> data could reach disk before we kill off the other processes that
> >> might write it. But at least it's a short window.) "Corruption"
> >> here doesn't just involve bad data placed into disk buffers; more
> >> often it's things like unreleased locks, which would block other
> >> processes indefinitely.
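
As an aside, the unreleased-lock case described above can be shown with a toy
sketch. This is not PostgreSQL code: the function names are made up for
illustration, a Python multiprocessing lock stands in for a lock kept in shared
memory, and the "fork" start method is assumed, so it is Unix-only.

```python
import multiprocessing as mp
import os


def crash_while_holding(lock):
    """Child process: take the lock, then die without releasing it."""
    lock.acquire()
    os._exit(1)  # abrupt exit: no cleanup code runs, so the lock stays held


def lock_survives_crash(timeout=0.5):
    """Return True if the lock is still held after the child has died."""
    # Assumption: "fork" is available (Unix); the child inherits the lock.
    ctx = mp.get_context("fork")
    lock = ctx.Lock()  # stand-in for a lock living in shared memory
    child = ctx.Process(target=crash_while_holding, args=(lock,))
    child.start()
    child.join()  # wait until the "crashed" child is gone
    # Nobody is left to release the lock, so this wait can only time out.
    acquired = lock.acquire(timeout=timeout)
    if acquired:
        lock.release()
    return not acquired


if __name__ == "__main__":
    print("lock still held after crash:", lock_survives_crash())
```

Without the timeout, the surviving process would simply wait forever, and
nothing in the lock itself says whether the data it protected was left half
updated.
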
> >>
> >> I seriously doubt that anything like what you're describing
> >> could be made reliable enough to be acceptable. "Oracle does
> >> it like this" isn't a counter-argument: they have a much different
> >> (and non-extensible) architecture, and they also have an army of
> >> programmers to deal with minutiae like undoing resource acquisition.
> >> Even with that, you'd have to wonder about the number of bugs
> >> existing in such necessarily-poorly-tested code paths.
> >
> >Yes.
> >I think that PostgreSQL's approach is superior: rather than investing in
> >code to mitigate the impact of data corruption caused by a crash, invest
> >in quality code that doesn't crash in the first place.
> >
> >Euphemistically naming a crash "ORA-600 error" seems to be part of
> >their strategy.
> >
> >Yours,
> >Laurenz Albe
>

-- 
Regards
Junwang Zhao