On Tue, Oct 25, 2016 at 12:57 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>
>> After last night, I rebuilt the cluster, turning on checksums, turning
>> on synchronous commit (it was off) and added a standby replica. This
>> should help narrow the problem down should it re-occur; if storage is
>> bad (note, other database on same machine is doing 10x write activity
>> and is fine) or something is scribbling on shared memory (my guess
>> here) then checksums should be popped, right?
>
> Not really sure about that. As I recall we compute the CRC on the
> buffer's way out, based on the then-current contents, so if something
> scribbles on the buffer while it's waiting to be evicted, the CRC
> computation would include the new (corrupted) bytes rather than the
> original ones -- see FlushBuffer.

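To make the quoted point concrete, here is a rough sketch of why a checksum computed at write-out time cannot catch a stray write that lands while the page is still in memory. It is a toy model, not PostgreSQL code: flush_page and verify_page are hypothetical helpers, the page contents are made up, and zlib.crc32 merely stands in for the real page checksum algorithm.

```python
import zlib
from typing import Tuple

PAGE_SIZE = 8192  # assumed page size for this toy model


def flush_page(buffer: bytearray) -> Tuple[bytes, int]:
    """Hypothetical write-out step: checksum whatever is in the buffer now."""
    data = bytes(buffer)
    return data, zlib.crc32(data)


def verify_page(data: bytes, stored_checksum: int) -> bool:
    """Hypothetical read-back step: recompute and compare."""
    return zlib.crc32(data) == stored_checksum


# A clean page sitting in (simulated) shared memory.
buf = bytearray(b"good tuple data".ljust(PAGE_SIZE, b"\0"))

# Case 1: something scribbles on the buffer BEFORE it is flushed.
buf[0:4] = b"XXXX"
on_disk, checksum = flush_page(buf)
scribble_detected = not verify_page(on_disk, checksum)

# Case 2: the storage layer damages the page AFTER it was flushed.
rotted = bytearray(on_disk)
rotted[100] ^= 0xFF
storage_damage_detected = not verify_page(bytes(rotted), checksum)

print("in-memory scribble detected:", scribble_detected)
print("on-disk damage detected:", storage_damage_detected)
```

In this model the scribbled page verifies cleanly because the checksum was computed over the already-corrupted bytes, while damage introduced after the write is caught. So under these assumptions, checksums would help separate bad storage from a shared memory scribbler only in the sense that a clean checksum does not rule out the latter.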
Huh. I have a new theory on this. Dealing with the reconstituted
database, I'm finding more things -- functions and such -- that are
simply gone and had to be rebuilt. They escaped notice because they were
not in primary code paths.

Recall that the original outage manifested as queries getting stuck,
possibly on a spinlock (we don't know for sure). After that, things
started to randomly disappear, possibly from system catalogs (but I now
need to go back and verify older data, I think). There were three
autovac processes running.

What if the subsequent data loss was in fact a symptom of the first
outage? Is it in theory possible for data to appear visible but then be
eaten up as the transactions making the data visible get voided out by
some other mechanism? I had to pull a quick restart the first time and
everything looked ok -- or so I thought. What I think was actually
happening is that data started to slip into the void. It's as if system
catalog entries were randomly dropping off. I bet other data was, too. I
can pull older backups and verify that. It's as if some creeping xmin
was snuffing everything out.

The confirmation of this should be obvious -- if that's indeed the case,
the backed-up and restored cluster should no longer present data loss.
Given that I was seeing losses every 1-2 days, we should be able to
figure that out pretty soon.

merlin