On Fri, Mar 15, 2019 at 1:49 AM Michael Paquier <mich...@paquier.xyz> wrote:
> On Thu, Mar 14, 2019 at 03:23:59PM +0100, Magnus Hagander wrote:
> > Are you suggesting we should support running with a master with checksums
> > on and a standby with checksums off in the same cluster? That seems..
> > Very fragile.
>
> Well, saying that it is supported is a too big term for that. What I
> am saying is that the problems you are pointing out are not as bad as
> you seem to mean they are as long as an operator does not copy on-disk
> pages from one node to the other one. Knowing that checksums apply
> only to pages flushed on disk on a local node, everything going
> through WAL for availability is actually able to work fine:
> - PITR
> - archive recovery.
> - streaming replication.
> Reading the code I understand that. I have as well done some tests
> with a primary/standby configuration to convince myself, using pgbench
> on both nodes (read-write for the primary, read-only on the standby),
> with checkpoint (or restart point) triggered on each node every 20s.
> If one node has checksum enabled and the other checksum disabled, then
> I am not seeing any inconsistency.
>
> However, anything which does a physical copy of pages could get things
> easily messed up if one node has checksum disabled and the other
> enabled. One such tool is pg_rewind. If the promoted standby has
> checksums disabled (becoming the source), and the old master to rewind
> has checksums enabled, then the rewind could likely copy pages which
> have not their checksums set correctly, resulting in incorrect
> checksums on the old master.
>
> So yes, it is easy to mess up things, however this does not apply to
> all configurations.
> The suggestion from Christoph to enable checksums
> on both nodes separately would work, and personally I find the
> suggestion to update the system ID after enabling or disabling
> checksums an over-engineered design because of the reasons in the
> first part of this email (it is technically doable to enable checksums
> with a minimum downtime and a failover), so my recommendation would be
> to document that when enabling checksums on one instance in a cluster,
> it should be applied to all instances as it could cause problems with
> any tools performing a physical copy of relation files or blocks.
>

As I said, that's a big hammer. I'm all for having a better solution. But I
don't think it's acceptable not to have *any* defense against it, given how
bad corruption it can lead to.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
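
For illustration only, and not part of any patch proposed in this thread: one
lightweight defense of the kind discussed above could be a pre-flight check
that a tool runs before physically copying pages between nodes. The sketch
below assumes the tool has the text printed by pg_controldata for each node
and compares the "Data page checksum version" field (0 means checksums are
disabled). The function names and the sample control data, including the
system identifier, are hypothetical and exist only for this example.

```python
import re


def checksum_version(controldata_text: str) -> int:
    """Extract the 'Data page checksum version' value from pg_controldata output.

    A value of 0 means data checksums are disabled on that node.
    """
    match = re.search(
        r"^Data page checksum version:\s*(\d+)\s*$",
        controldata_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Data page checksum version' line found")
    return int(match.group(1))


def safe_for_physical_copy(source_text: str, target_text: str) -> bool:
    """Return True only if both nodes report the same checksum version.

    Hypothetical policy: refuse any block-level copy between nodes whose
    checksum settings differ, since copied pages may carry checksums the
    target does not expect.
    """
    return checksum_version(source_text) == checksum_version(target_text)


# Made-up sample output fragments (the identifier is not from a real cluster).
PROMOTED_STANDBY = """\
Database system identifier:           6668012345678901234
Data page checksum version:           0
"""

OLD_MASTER = """\
Database system identifier:           6668012345678901234
Data page checksum version:           1
"""

if __name__ == "__main__":
    if not safe_for_physical_copy(PROMOTED_STANDBY, OLD_MASTER):
        print("refusing physical copy: checksum settings differ")
```

With the sample data, which mirrors the pg_rewind scenario quoted above
(source without checksums, target with checksums), the check fails and the
copy would be refused. WAL-based paths such as streaming replication would
not need this check under the reasoning in the quoted message.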