On Fri, Mar 15, 2019 at 1:49 AM Michael Paquier <mich...@paquier.xyz> wrote:

> On Thu, Mar 14, 2019 at 03:23:59PM +0100, Magnus Hagander wrote:
> > Are you suggesting we should support running with a master with checksums
> > on and a standby with checksums off in the same cluster? That seems..
> > Very fragile.
>
> Well, "supported" is too strong a term for that.  What I
> am saying is that the problems you are pointing out are not as bad as
> you seem to mean they are as long as an operator does not copy on-disk
> pages from one node to the other one.  Knowing that checksums apply
> only to pages flushed on disk on a local node, everything going
> through WAL for availability is actually able to work fine:
> - PITR
> - archive recovery.
> - streaming replication.
> Reading the code I understand that.  I have as well done some tests
> with a primary/standby configuration to convince myself, using pgbench
> on both nodes (read-write for the primary, read-only on the standby),
> with checkpoint (or restart point) triggered on each node every 20s.
> If one node has checksum enabled and the other checksum disabled, then
> I am not seeing any inconsistency.
>
> However, anything which does a physical copy of pages could get things
> easily messed up if one node has checksum disabled and the other
> enabled.  One such tool is pg_rewind.  If the promoted standby has
> checksums disabled (becoming the source), and the old master to rewind
> has checksums enabled, then the rewind could well copy pages which
> do not have their checksums set correctly, resulting in incorrect
> checksums on the old master.
>
> So yes, it is easy to mess up things, however this does not apply to
> all configurations.  The suggestion from Christoph to enable checksums
> on both nodes separately would work, and personally I find the
> suggestion to update the system ID after enabling or disabling
> checksums an over-engineered design because of the reasons in the
> first part of this email (it is technically doable to enable checksums
> with a minimum downtime and a failover), so my recommendation would be
> to document that when enabling checksums on one instance in a cluster,
> it should be applied to all instances as it could cause problems with
> any tools performing a physical copy of relation files or blocks.
>

As I said, that's a big hammer. I'm all for having a better solution. But I
don't think it's acceptable not to have *any* defense against it, given how
bad the resulting corruption can be.
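
To make the kind of defense under discussion concrete, here is a minimal
sketch of a pre-flight check that an operator-side wrapper could run before
any physical copy of pages between nodes (pg_rewind being the example named
above). It is only an illustration, not something proposed in this thread: the
function names are hypothetical, and it assumes the plain-text output of
pg_controldata from each node, relying on its "Data page checksum version"
line (0 meaning checksums are disabled).

```python
import re


def checksum_version(controldata_text: str) -> int:
    """Extract the data page checksum version from pg_controldata output.

    Hypothetical helper: expects the usual text output, where 0 means
    checksums are disabled and a non-zero value means they are enabled.
    """
    match = re.search(
        r"^Data page checksum version:\s*(\d+)\s*$",
        controldata_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Data page checksum version' line found")
    return int(match.group(1))


def safe_for_physical_copy(source_text: str, target_text: str) -> bool:
    """Return True only if both nodes report the same checksum version.

    A mismatch means pages copied from the source may carry checksums
    that the target does not expect (or none at all).
    """
    return checksum_version(source_text) == checksum_version(target_text)
```

A wrapper would collect the pg_controldata output of both data directories
and refuse to start the copy when safe_for_physical_copy() returns False.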

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
