On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicag...@gmail.com> wrote:
> >
> > Hello Postgres Hackers -
> >
> > We are having a recurring issue on 2 of our replicas where
> > replication stops due to this message:
> > "incorrect resource manager data checksum in record at ..."
> > This has been occurring on average once every 1 to 2 weeks during
> > large data imports (100s of GBs being written) on one of two replicas.
> > Fixing the issue has been relatively straightforward: shut down the
> > replica, remove the bad WAL file, restart the replica, and the good
> > WAL file is retrieved from the master.
> > We are doing streaming replication using replication slots.
> > However, twice now the master had already removed the WAL file, so
> > the file had to be retrieved from the WAL archive.
> >
> > The WAL directories on the master and the replicas are on ZFS file
> > systems.
> > All servers are running RHEL 7.7 (Maipo)
> > PostgreSQL 10.11
> > ZFS v0.7.13-1
> >
> > The issue seems similar to
> > https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
> > and to https://github.com/timescale/timescaledb/issues/1443
> >
> > One quirk in our ZFS setup is that ZFS is not handling our RAID
> > array, so ZFS sees the array as a single device.
> >
> > Right before the issue started we did some upgrades and altered some
> > Postgres configs and ZFS settings.
> > We have been slowly rolling back changes, but so far the issue
> > continues.
> >
> > Some interesting data points while debugging:
> > We had lowered the ZFS recordsize from 128K to 32K, and for that week
> > the issue started happening every other day.
> > Using xxd and diff we compared "good" and "bad" WAL files, and the
> > differences were not random bad bytes.
> > The bad file either had a block of zeros that was not in the good
> > file at that position, or other data. Occasionally the bad data has
> > contained legible strings not in the good file at that position. At
> > least one of those exact strings has existed elsewhere in the files.
> > However, I am not sure if that is the case for all of them.
> >
> > This made me think that maybe there was an issue with WAL file
> > recycling and ZFS under heavy load, so we tried lowering min_wal_size
> > in order to "discourage" WAL file recycling. But my understanding is
> > that a low value only discourages recycling; it will still happen
> > (unless wal_recycle is turned off, which requires PostgreSQL 12).
> >
>
> We do print a message "recycled write-ahead log file .." in DEBUG2
> mode.  You either want to run the server with DEBUG2 or maybe change
> the code to make it LOG and see if that is printed.  If you do that,
> you can verify if the corrupted WAL is the same as a recycled one.

Are you suggesting having the master, the replicas or all in debug
mode? How much extra logging would this generate? A replica typically
consumes over 1 TB of WAL files before a bad WAL file is encountered.
(If we try this, I assume it amounts to the one-line config change
sketched below, after the quoted text.)

> > There is a third replica where this bug has not (yet?) surfaced.
> > This leads me to guess the bad data does not originate on the master.
> > This replica is older than the other replicas: slower CPUs, less RAM,
> > and the WAL disk array is spinning disks.
> > The OS, version of Postgres, and version of ZFS are the same as on
> > the other replicas.
> > This replica is not using a replication slot.
> > This replica does not serve users, so load/contention is much lower
> > than on the others.
> > The other replicas often have 100% utilization of the disk array that
> > houses the (non-WAL) data.
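For concreteness, here is what I assume the non-invasive version of
that suggestion looks like; this is a sketch on our end, not yet
tested, and debug2 is verbose enough that we would probably only enable
it for a bounded debugging window:

    # postgresql.conf on a standby (assuming standby-side recycling is
    # the interesting case).  This surfaces the DEBUG2 "recycled
    # write-ahead log file ..." message Amit mentioned, at the cost of
    # very chatty logs.
    log_min_messages = debug2

The other option Amit mentions, raising that one message from DEBUG2 to
LOG in xlog.c and rebuilding, would avoid the extra log volume.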
> >
> > Any insight into the source of this bug or how to address it?
> >
> > Since the master has a good copy of the WAL file, can the replica
> > re-request the file from the master? Or from the archive?
> >
>
> I think we do check in the archive if we get the error during
> streaming, but the archive might also have the same data due to which
> this problem happens.  Have you checked whether the archive's WAL file
> is different from the bad WAL?  See the relevant bits of code in
> WaitForWALToBecomeAvailable, especially the code near the below
> comment:
>
> "Failure while streaming. Most likely, we got here because streaming
> replication was terminated, or promotion was triggered. But we also
> get here if we find an invalid record in the WAL streamed from master,
> in which case something is seriously wrong. There's little chance that
> the problem will just go away, but PANIC is not good for availability
> either, especially in hot standby mode. So, we treat that the same as
> disconnection, and retry from archive/pg_wal again. The WAL in the
> archive should be identical to what was streamed, so it's unlikely
> that it helps, but one can hope..."

Typically the master, the archive and the other replicas all have a
good copy of the WAL file.

Thank you for this comment! It made me realize that on the replicas I
had mentioned, we had removed the restore_command. The replica we
thought was not having the issue was actually also getting/producing
bad WAL files, but it was eventually recovering by getting a good WAL
file from the archive because it had restore_command defined.
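For anyone hitting the same thing, the fix on our end is simply putting
restore_command back on the affected standbys. A sketch of the standby
recovery.conf (PostgreSQL 10; in v12+ these settings move to
postgresql.conf) -- the host, slot name and archive path below are
made-up placeholders:

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com user=replicator'
    primary_slot_name = 'replica_slot'
    # Lets the standby fall back to the WAL archive when a streamed
    # segment turns out to be bad, per the code comment quoted above.
    restore_command = 'cp /mnt/wal_archive/%f "%p"'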