On Thu, Apr 2, 2020 at 2:10 PM Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2020-02-19 16:35:53 -0500, Alex Malek wrote:
> > We are having a reoccurring issue on 2 of our replicas where replication
> > stops due to this message:
> > "incorrect resource manager data checksum in record at ..."
>
> Could you show the *exact* log output please? Because this could
> temporarily occur without signalling anything bad, if e.g. the
> replication connection goes down.
>

Feb 23 00:02:02 wrds-pgdata10-2-w postgres[68329]: [12491-1] 5e4aac44.10ae9 (@) LOG: incorrect resource manager data checksum in record at 39002/57AC0338

When it occurred, replication stopped. The only way to resume replication
was to stop the server and remove the bad WAL file.

> > Right before the issue started we did some upgrades and altered some
> > postgres configs and ZFS settings.
> > We have been slowly rolling back changes but so far the issue continues.
> >
> > Some interesting data points while debugging:
> > We had lowered the ZFS recordsize from 128K to 32K and for that week the
> > issue started happening every other day.
> > Using xxd and diff we compared "good" and "bad" wal files and the
> > differences were not random bad bytes.
> >
> > The bad file either had a block of zeros that were not in the good file at
> > that position or other data. Occasionally the bad data has contained
> > legible strings not in the good file at that position. At least one of
> > those exact strings has existed elsewhere in the files.
> > However I am not sure if that is the case for all of them.
> >
> > This made me think that maybe there was an issue w/ wal file recycling and
> > ZFS under heavy load, so we tried lowering
> > min_wal_size in order to "discourage" wal file recycling but my
> > understanding is a low value discourages recycling but it will still
> > happen (unless setting wal_recycle in psql 12).
>
> This sounds a lot more like a broken filesystem than anything on the PG
> level.
>

Probably.
As noted in my recent update, turning off ZFS compression on the master
seems to have fixed the issue. However, I will note that the WAL file stored
on the master was always fine upon inspection.

> > When using replication slots, what circumstances would cause the master to
> > not save the WAL file?
>
> What do you mean by "save the WAL file"?
>

Typically, when using replication slots, when replication stops the master
will save the next needed WAL file. However, once or twice when this error
occurred the master recycled/removed the WAL file that was needed.
I suspect that, perhaps b/c the replica had started to read the WAL file, it
sent some signal to the master that the WAL file was already consumed.
I am guessing, not knowing exactly what is happening, and w/ the caveat that
this situation was rare and not the norm. It is also possible that it was
caused by a different error.

Thanks.
Alex
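
For anyone who wants to repeat the "good" vs. "bad" WAL file comparison described above without reading xxd/diff output by eye, here is a minimal Python sketch. It is only an illustration, not the procedure that was actually used: the function names (diff_ranges, classify, compare_segments, compare_files) are hypothetical, and it assumes both copies of the segment fit in memory (a default segment is 16 MB). It reports each byte range where the two copies differ and whether the "bad" side of that range is all zeros or some other data.

```python
from typing import Iterator, List, Tuple


def diff_ranges(good: bytes, bad: bytes) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) byte ranges where the two buffers differ.

    Only the common prefix length is compared. A zeroed block may be reported
    as several ranges if the good copy happens to contain zero bytes at some
    of the same offsets, because those bytes compare equal.
    """
    n = min(len(good), len(bad))
    start = None
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i          # a new differing run begins here
        elif start is not None:
            yield (start, i)       # the run ended at the previous byte
            start = None
    if start is not None:
        yield (start, n)           # run extends to the end of the data


def classify(bad: bytes, start: int, end: int) -> str:
    """Label a differing range by what the bad copy contains there."""
    chunk = bad[start:end]
    return "zeros" if chunk.count(0) == len(chunk) else "other data"


def compare_segments(good: bytes, bad: bytes) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) for every differing range."""
    return [(s, e, classify(bad, s, e)) for s, e in diff_ranges(good, bad)]


def compare_files(good_path: str, bad_path: str) -> List[Tuple[int, int, str]]:
    """Convenience wrapper: read both segment files and compare them."""
    with open(good_path, "rb") as g, open(bad_path, "rb") as b:
        return compare_segments(g.read(), b.read())
```

Calling compare_files() with the path of the master's copy and the path of the replica's copy of the same segment returns a list of offsets that can then be inspected more closely with xxd. Ranges labelled "zeros" correspond to the zero-filled blocks mentioned above; ranges labelled "other data" would be the candidates for the stray legible strings.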