On 15.05.2013 15:42, David Powers wrote:
> First, thanks for the replies. This sort of thing is frustrating and hard
> to diagnose at a distance, and any help is appreciated.
>
> Here is some more background:
>
> We have three 9.2.4 databases using the following setup:

The subject says 9.2.3. Are you sure you're running 9.2.4 on all the servers? There was a fix to a bug related to starting a standby server from a filesystem snapshot. I don't think it was quite the same case as yours, but pretty close.
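
Worth double-checking what each server actually reports, e.g. (connection options omitted, run against each of the three boxes):

# Shows the version of the running server, not just the installed binaries.
psql -At -c "SELECT version();"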

> - A primary box
> - A standby box running as a hot streaming replica from the primary
> - A testing box restored nightly from a static backup
>
> As noted, the static backup is taken off of the standby by taking an LVM
> snapshot of the database filesystem and rsyncing.  I don't think it's a
> likely problem but the rsync leverages the previous backup (using
> --link-dest) to make the rsync faster and the resulting backup smaller.

I guess that should work, as long as you make a full copy when you restore the backup, and never start a postgres server directly against a backup that contains the hardlinks.
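
Just so we're talking about the same thing, this is roughly the scheme I'm picturing; the paths and dates below are made up, not taken from your scripts:

# Nightly backup taken off the mounted LVM snapshot. With --link-dest,
# files unchanged since the previous backup become hardlinks into that
# directory instead of fresh copies.
rsync -a --delete \
    --link-dest=/backups/pgdata-2013-05-14/ \
    /mnt/pgsnap/pgdata/ \
    /backups/pgdata-2013-05-15/

# Restore to the testing box as an independent full copy, so the running
# server never writes through hardlinks shared with older backups.
rsync -a /backups/pgdata-2013-05-15/ /var/lib/pgsql/9.2/data/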

> We've been using the same system for quite some time, but until about a
> month ago had been taking the backup off of the primary (still using the
> LVM snapshot). The replication is a recent addition, and a very helpful
> one. LVM snapshots aren't lightweight in the face of writes, and in some
> circumstances a long-running rsync would spike the I/O load on the
> production box.
>
> Results of some additional tests:
>
> After the user noticed that the test restore showed the original problem,
> we ran `vacuum analyze` on all three testing databases, thinking that it
> had a good chance of quickly touching most of the underlying files. That
> gave us errors on two of the testing restores similar to:
>
> ERROR:  invalid page header in block 5427 of relation base/16417/199732075

Huh, that's surprising. You have to be quite unlucky to end up with a corrupt page header, even if there's something wrong with the underlying storage or backup routines. Could you grab a copy of that block? Something like:

dd skip=5427 bs=8192 count=1 if=base/16417/199732075 of=corrupt-block

I'm especially curious what the first 20 bytes or so look like. I'm guessing that they're all zeros, while some later part of the page contains data. I.e. the page was torn at some point in the process.
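
A hex dump of the start of that copy would show it, something like:

# Print the first 64 bytes of the block in hex. The page header starts
# with the page LSN, so nothing but zeros here would point to a torn page.
od -A d -t x1 -N 64 corrupt-block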

> I'm also running the vacuum analyze on the production machines to
> double-check that the base databases don't have a subtle corruption that
> simply hasn't been noticed.  They run with normal autovacuum settings, so
> I suspect they are fine and this won't show anything, because we should
> have seen this from the autovacuum daemon before.

Vacuum analyze doesn't always scan the whole table with default settings, so that's no proof that there are no missing pages. "set vacuum_freeze_table_age=0; vacuum <table>" will force a full scan of the table.
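
For example, something along these lines per database (the database and table names are placeholders):

# With vacuum_freeze_table_age = 0, VACUUM ignores the visibility map and
# scans every page of the table.
psql -d mydb <<'SQL'
SET vacuum_freeze_table_age = 0;
VACUUM VERBOSE some_table;
SQL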

> I'm happy to share the scripts we use for the backup/restore process if the
> information above isn't enough, as well as the logs - though the postgres
> logs don't seem to contain much of interest (the database system doesn't
> really get involved).

Yeah, please share the scripts. What you've described seems correct to me, but maybe there's something I'm missing.

> I also have the rsyncs of the failed snapshots available and could restore
> them for testing purposes.  It's also easy to look in them (they are just
> saved as normal directories on a big SAN) if I know what to look for.

Great. Not sure what to look at right now, but good that you have kept the evidence.

- Heikki

