Hi,

On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
> INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using
> rsync off of an existing, correct primary (P1) for the base backup,
> and using WAL-E for WAL segment shipping. Both the primary and
> secondary were running 9.0.14. S1 properly connected to the primary
> once it was caught up on WAL segments, and S1 was then promoted as
> a primary using the trigger file.

Could you detail how exactly the base backup was created? Including the
*exact* logic for copying?

> No errors in the log files on either system.

Do you have the log entries for the startup after the base backup?

> Because the client's schema included a "last_updated" field, we were
> able to determine that all of the rows that were either missing or
> duplicated had been updated on P1 shortly (3-5 minutes) before S1 was
> promoted. It's possible, but not confirmed, that there were active
> queries (including updates) running on P1 at the moment of S1's
> promotion.

Any chance you have log_checkpoints enabled? If so, could you check
whether there was a checkpoint around the time of the base backup?

This server is gone, right? If not, could you do:
SELECT ctid, xmin, xmax, * FROM whatever WHERE duplicate_row?

> INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of
> S1y, after having dropped the primary and unique constraints, and restored
> into a new 9.3.1 server, P2. Duplicate rows were purged, and missing rows
> were added again. The database, a new primary, was then put back into
> production, and ran without incident.
>
> A new secondary, S2 was created off of the primary. This secondary was
> created using pg_basebackup using --xlog-method=stream, although the WAL-E
> archiving was still present.
>
> S2 attached to P2 without incident and no errors in the logs, but
> nearly-identical corruption was discovered (although this time without the
> duplicated rows, just missing rows). At this point, it's not clear if there
> was some clustering in the "last_updated" timestamp for the rows that are
> missing from S2. No duplicated rows were observed.
>
> P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as
> the file system) as the data volume.
>
> No errors in the log files on either system.

Could you list the *exact* steps you took to start up the cluster?

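For the checkpoint question above, a rough sketch of how the server log
could be scanned, in case it helps. It assumes log_checkpoints was on and
that log_line_prefix starts with a "%t"-style timestamp
(YYYY-MM-DD HH:MM:SS); adjust the parsing if your prefix differs. The
function name checkpoints_near and the 10-minute default window are made up
for illustration, and the time zone in the prefix is ignored.

```python
from datetime import datetime, timedelta


def checkpoints_near(log_lines, when, window=timedelta(minutes=10)):
    """Return (timestamp, rest_of_line) pairs for checkpoint messages
    logged within `window` of `when`.

    Assumption: each line begins with a 19-character timestamp such as
    "2013-11-18 10:58:26". Lines that do not parse are skipped.
    """
    hits = []
    for line in log_lines:
        # log_checkpoints emits "checkpoint starting: ..." and
        # "checkpoint complete: ..." messages.
        if "checkpoint starting" not in line and "checkpoint complete" not in line:
            continue
        try:
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # prefix did not match the assumed format
        if abs(ts - when) <= window:
            hits.append((ts, line[19:].strip()))
    return hits
```

Called with the lines of the log file and the approximate time the base
backup was taken, it returns the matching checkpoint entries in log order;
an empty list means no checkpoint was logged in that window.
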
Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services