Hi,

I am working together with Harald on this issue. Below are some thoughts
on why we think it should be possible to disable the postmaster-internal
recovery attempt and instead have faults in the processes started by the
postmaster escalate to a postmaster exit.



[Our typical "embedded" situation]

* Database is small, 0.1 to 1 GB (e.g. we consider it the safest
  strategy to copy the whole database from the active to the standby
  before reconnecting the standby after a switchover or failover).

* Few clients only (10-100)

* There is no shared storage between the two instances (this means no 
  concurrent access to shared resources, no isolation problems for 
  shared resources)

* Switchover is fast, less than a few seconds

* Disk I/O is slow (no RAID, possibly slow flash-based storage)

* The same nodes that run the database also run lots of other
  functionality (some dependent on the DB, most not)



[Keep recovery decision and recovery action in cluster-HA-middleware]

Actually, the problem we're trying to solve is keeping the decision
about the best recovery strategy outside of the DB. In our use case this
logic is expressed in the cluster-HA-middleware, and recovery actions
are initiated by this middleware rather than by each individual piece of
software started by it; software is generally expected to "fail fast and
safe" in case of errors. As long as you trust the hardware and OS
kernel, a process exit is usually such a fail-fast-and-safe operation.
It's "safe" because process exit causes the kernel to release the
resources the process holds. It's also fast, though "fast" is a bit more
debatable, as a simple signal from the postmaster to the cluster
middleware would probably be faster. However, lacking such a signal, a
SIGCHLD is the next best thing.
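
To make this concrete, here is a minimal sketch (not our actual
middleware code; the binary and data-directory paths are placeholders,
and escalate() is a hypothetical hook) of how a supervisor can treat
postmaster exit as the failure notification: start the postmaster, then
block in waitpid() until the SIGCHLD/exit is reported:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* hand the decision to the HA policy logic (placeholder) */
    static void escalate(int status)
    {
        fprintf(stderr, "postmaster exited (status %d), escalating\n", status);
        /* here: trigger switchover / restart / operator alarm per policy */
    }

    int main(void)
    {
        pid_t pm = fork();

        if (pm < 0)
            return 1;                   /* fork failed */
        if (pm == 0)
        {
            /* child: start the postmaster; paths are assumptions */
            execl("/usr/lib/postgresql/bin/postmaster", "postmaster",
                  "-D", "/var/lib/postgresql/data", (char *) NULL);
            _exit(127);                 /* exec failed */
        }

        int status;
        while (waitpid(pm, &status, 0) < 0 && errno == EINTR)
            ;                           /* retry if interrupted by a signal */

        /* process exit is "fail fast and safe": the kernel has already
         * released the postmaster's resources, we only decide what next */
        escalate(status);
        return 0;
    }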

The middleware can make decisions such as the following (all of this is
configurable, and postmaster health is _just_one_input_ of many to reach
a decision on the correct behavior); a rough sketch of such decision
logic follows the list:

 Policy 1: By default try to restart the active instance N times, after
           that do a switchover
 Policy 2: If the active Postgres fails and the standby is available and
           up-to-date, do an immediate switchover. If the standby is not
           available, restart.
 Policy 3: If the active Postgres fails, escalate the problem to node
           level, isolate the active node and do the switchover to the
           standby.
 Policy 4: In single-node systems, restart the db instance N times. If
           it fails more often than N times in X seconds, stop it and
           give an indication to the operator (SNMP trap to a management
           system, text message, ...) that something is seriously wrong
           and manual intervention is needed.
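
As mentioned above, here is a sketch of the kind of decision logic the
middleware applies. It implements Policy 2 with a fallback to local
restart and an operator alarm; standby_available(), standby_up_to_date()
and recent_restart_count() are assumed, hypothetical middleware queries:

    typedef enum { ACTION_RESTART, ACTION_SWITCHOVER, ACTION_ALARM } ha_action;

    /* assumed middleware queries, not real APIs */
    extern int standby_available(void);
    extern int standby_up_to_date(void);
    extern int recent_restart_count(void);

    #define MAX_RESTARTS 3              /* the "N" in the policies above */

    static ha_action
    decide_on_postmaster_failure(void)
    {
        /* Policy 2: prefer an immediate switchover to a current standby */
        if (standby_available() && standby_up_to_date())
            return ACTION_SWITCHOVER;

        /* otherwise fall back to a local restart (Policy 1/4 style) */
        if (recent_restart_count() < MAX_RESTARTS)
            return ACTION_RESTART;

        /* restarting too often: stop and alert the operator (Policy 4) */
        return ACTION_ALARM;
    }

The point is only that the postmaster's health is one input among many;
the middleware, not the postmaster, owns the recovery decision.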

In the current setup we want to go for Policy 2. In earlier unrelated 
products (not using PostgreSQL) we actually had policies 1, 3 and 4.

Another typical situation is that recovery behavior during upgrades
differs from the behavior during normal operation. E.g. when the (new)
database instance fails during an automatic schema conversion as part of
an upgrade, we would want to automatically fall back to the previous
version.



[STONITH is not always the best strategy if failures can be attributed
to a user-space software problem only; limit STONITH to HW/OS failures]

The isolation of the failing Postgres instance does not require a
STONITH, mainly because there's also other software running on the same
node that we'd not want to switch over automatically (e.g. because it
takes longer to do, or the functionality is more or less critical).
Also, we generally trust the HW, OS kernel and cluster middleware to
behave correctly. These functions also follow the principle of
fail-fast-and-safe. This trust might be an assumption that not everybody
agrees with, though. So, if the failure originated from
HW/OS/clusterware it clearly is a STONITH situation, but if it's a
user-space problem, the default assumption is that isolation can be
implemented at the OS level; that's a guarantee the clusterware gives
(using a separate quorum mechanism to avoid split-brain situations).




[Example of user-space software failures]

So, what kind of failures would cause a user-space switchover rather
than node-level isolation? This gets a bit philosophical. If you assume
that many software failures are caused by concurrency issues, switching
over to the standby is actually a good strategy, as it's unlikely that
the same concurrency issue happens again on the standby. Another source
of software failures is exceptional situations, such as the disk getting
full, overload on the node (caused by some other process), a backup
being taken, an upgrade conversion, etc. So here the idea is that
failover to a standby instance helps as long as there's some hope that
the situation is different on the standby side. If we just had an
internal Postgres restart in such situations, we'd have flapping db
connectivity, without the operator even being aware of it (awareness of
problem situations is also something that the cluster HA middleware
takes care of).



[Possible implementation options]

I see only two solutions to allow an external cluster-HA-middleware to
make recovery decisions:

   (1) the postmaster process exits if it detects any unpredicted
       failure, or
   (2) the postmaster provides an interface to notify about software
       failures (i.e. the case where it goes into postmaster
       re-initialization).

In case (2) it would be the cluster-HA-middleware that isolates the
postmaster process, e.g. by SIGKILL-ing all related processes and
forcefully releasing all shared resources that it uses. However, I favor
case (1), as long as we keep the logic that runs within the postmaster
when it detects a backend process failure as simple as possible,
meaning: force-stop all postgres processes (SIGKILL), wait for SIGCHLD
from them and exit (should only take a few milliseconds). A rough sketch
of that sequence is below.
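
For illustration only, here is that minimal fail-fast path expressed as
plain POSIX code rather than actual postmaster internals; child_pids[]
and n_children stand in for whatever bookkeeping the postmaster keeps of
its children:

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    extern pid_t child_pids[];          /* assumed list of child processes */
    extern int   n_children;

    static void
    crash_exit(void)
    {
        /* force-stop all postgres processes */
        for (int i = 0; i < n_children; i++)
            kill(child_pids[i], SIGKILL);

        /* wait for their SIGCHLDs, i.e. reap the children */
        for (int i = 0; i < n_children; i++)
            waitpid(child_pids[i], NULL, 0);

        /* exit so the cluster-HA-middleware sees our own SIGCHLD
         * and can apply its recovery policy */
        exit(2);
    }

Nothing in this path tries to recover; recovery is left entirely to the
middleware.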


[Question]

So the question remains: is this behavior, and the most likely addition
of a postgresql.conf "automatic_restart_after_crash = on" setting,
something that completely goes against the Postgres philosophy, or is it
something that, once implemented, would be acceptable in the main
Postgres code base?


Thoralf
