On Mon, Nov 13, 2023 at 5:14 PM yuansong <yyuans...@126.com> wrote:
> Enhancing the overall fault tolerance of the entire system for this feature 
> is quite important. No one can avoid bugs, and I don't believe that this 
> approach is a more advanced one. It might be worth considering adding it to 
> the roadmap so that interested parties can conduct relevant research.
> The current major issue is that when one process crashes, resetting all 
> connections has a significant impact on other connections. Is it possible to 
> only disconnect the crashed connection and have the other connections simply 
> roll back the current transaction without reconnecting? Perhaps this problem 
> can be mitigated through the use of a connection pool.

It's not about the other connections; the problem is that the crashed
connection has no way to roll back.
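
To make that concrete, here is a toy sketch in plain Python. It is not
PostgreSQL code: `SharedState`, `transfer`, and `demo` are made-up names,
there are no real processes or shared memory, and an exception stands in
for the backend dying. It only illustrates that a dead backend cannot
release its own lock or undo its half-finished change:

```python
class SharedState:
    """Hypothetical stand-in for shared memory: one lock and two counters."""

    def __init__(self):
        self.lock_owner = None  # id of the backend holding the lock, or None
        self.a = 100
        self.b = 0


class BackendCrashed(Exception):
    """Models abrupt process death (e.g. SIGSEGV): no cleanup code runs."""


def transfer(shared, backend_id, amount, crash_midway=False):
    """Move `amount` from a to b under the lock; optionally die halfway."""
    if shared.lock_owner is not None:
        # A real backend would wait here, possibly forever.
        raise RuntimeError("would block: lock held by backend %s" % shared.lock_owner)
    shared.lock_owner = backend_id
    shared.a -= amount
    if crash_midway:
        # Deliberately no try/finally: a dead process cannot release its
        # lock or undo its half-finished update.
        raise BackendCrashed(backend_id)
    shared.b += amount
    shared.lock_owner = None


def demo():
    shared = SharedState()
    try:
        transfer(shared, backend_id=1, amount=30, crash_midway=True)
    except BackendCrashed:
        pass  # the supervisor sees the death, but only from the outside

    # Backend 2 is healthy, yet it cannot make progress.
    try:
        transfer(shared, backend_id=2, amount=10)
        blocked = False
    except RuntimeError:
        blocked = True

    # The lock is still held and the invariant a + b == 100 is broken.
    return shared.lock_owner, shared.a + shared.b, blocked
```

In this toy the leftover damage is easy to spot. In a real server, an
outside observer cannot tell which parts of shared memory were mid-update
when the process died, which is why throwing away all shared state and
running crash recovery is the safe choice.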

> If we want to implement this feature, would it be sufficient to clean up or 
> restore the shared memory and disk changes caused by the crashed backend? 
> Besides clearing various known locks, what else needs to be changed? Does 
> anyone have any insights? Please help me. Thank you.
>
> At 2023-11-13 13:53:29, "Laurenz Albe" <laurenz.a...@cybertec.at> wrote:
> >On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
> >> yuansong <yyuans...@126.com> writes:
> >> > In PostgreSQL, when a backend process crashes, it can cause other backend
> >> > processes to also require a restart, primarily to ensure data 
> >> > consistency.
> >> > I understand that the correct approach is to analyze and identify the
> >> > cause of the crash and resolve it. However, it is also important to be
> >> > able to handle a backend process crash without affecting the operation of
> >> > other processes, thus minimizing the scope of negative impact and
> >> > improving availability. To achieve this goal, could we mimic the Oracle
> >> > process by introducing a "pmon" process dedicated to rolling back crashed
> >> > process transactions and performing resource cleanup? I wonder if anyone
> >> > has attempted such a strategy or if there have been previous discussions
> >> > on this topic.
> >>
> >> The reason we force a database-wide restart is that there's no way to
> >> be certain that the crashed process didn't corrupt anything in shared
> >> memory.  (Even with the forced restart, there's a window where bad
> >> data could reach disk before we kill off the other processes that
> >> might write it.  But at least it's a short window.)  "Corruption"
> >> here doesn't just involve bad data placed into disk buffers; more
> >> often it's things like unreleased locks, which would block other
> >> processes indefinitely.
> >>
> >> I seriously doubt that anything like what you're describing
> >> could be made reliable enough to be acceptable.  "Oracle does
> >> it like this" isn't a counter-argument: they have a much different
> >> (and non-extensible) architecture, and they also have an army of
> >> programmers to deal with minutiae like undoing resource acquisition.
> >> Even with that, you'd have to wonder about the number of bugs
> >> existing in such necessarily-poorly-tested code paths.
> >
> >Yes.
> >I think that PostgreSQL's approach is superior: rather than investing in
> >code to mitigate the impact of data corruption caused by a crash, invest
> >in quality code that doesn't crash in the first place.
> >
> >Euphemistically naming a crash "ORA-600 error" seems to be part of
> >their strategy.
> >
> >Yours,
> >Laurenz Albe
> >

Junwang Zhao
