On 15.05.2013 15:42, David Powers wrote:
> First, thanks for the replies. This sort of thing is frustrating and hard
> to diagnose at a distance, and any help is appreciated.
>
> Here is some more background:
>
> We have three 9.2.4 databases using the following setup:

The subject says 9.2.3. Are you sure you're running 9.2.4 on all the servers? There was a fix to a bug related to starting a standby server from a filesystem snapshot. I don't think it was quite the same case as yours, but pretty close.
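
Worth double-checking what each server actually reports, e.g. (connection options omitted, run against each of the three boxes):

# Shows the version of the running server, not just the installed binaries.
psql -At -c "SELECT version();"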

> - A primary box
> - A standby box running as a hot streaming replica from the primary
> - A testing box restored nightly from a static backup
>
> As noted, the static backup is taken off of the standby by taking an LVM
> snapshot of the database filesystem and rsyncing.  I don't think it's a
> likely problem but the rsync leverages the previous backup (using
> --link-dest) to make the rsync faster and the resulting backup smaller.

I guess that should work, as long as you make a full copy when you restore the backup, and never start a postgres server directly against a backup that contains the hardlinks.
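
Just so we're talking about the same thing, this is roughly the scheme I'm picturing; the paths and dates below are made up, not taken from your scripts:

# Nightly backup taken off the mounted LVM snapshot. With --link-dest,
# files unchanged since the previous backup become hardlinks into that
# directory instead of fresh copies.
rsync -a --delete \
    --link-dest=/backups/pgdata-2013-05-14/ \
    /mnt/pgsnap/pgdata/ \
    /backups/pgdata-2013-05-15/

# Restore to the testing box as an independent full copy, so the running
# server never writes through hardlinks shared with older backups.
rsync -a /backups/pgdata-2013-05-15/ /var/lib/pgsql/9.2/data/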

> We've been using the same system for quite some time, but until about a
> month ago had been taking the backup off of the primary (still using the
> LVM snapshot). The replication is a recent addition, and a very helpful
> one. LVM snapshots aren't lightweight in the face of writes, and in some
> circumstances a long-running rsync would spike the I/O load on the
> production box.
>
> Results of some additional tests:
>
> After the user noticed that the test restore showed the original problem,
> we ran `vacuum analyze` on all three testing databases, thinking that it
> had a good chance of quickly touching most of the underlying files. That
> gave us errors on two of the testing restores similar to:
>
> ERROR:  invalid page header in block 5427 of relation base/16417/199732075

Huh, that's surprising. You have to be quite unlucky to end up with a corrupt page header, even if there's something wrong with the underlying storage or backup routines. Could you grab a copy of that block? Something like:

dd skip=5427 bs=8192 count=1 if=base/16417/199732075 of=corrupt-block

I'm especially curious what the first 20 bytes or so look like. I'm guessing that they're all zeros, while some later part of the page contains data. I.e. the page was torn at some point in the process.
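
A hex dump of the start of that copy would show it, something like:

# Print the first 64 bytes of the block in hex. The page header starts
# with the page LSN, so nothing but zeros here would point to a torn page.
od -A d -t x1 -N 64 corrupt-block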

> I'm also running the vacuum analyze on the production machines to
> double-check that the base databases don't have a subtle corruption that
> simply hasn't been noticed.  They run with normal autovacuum settings, so
> I suspect they are fine and this won't show anything, because we should
> have seen this from the autovacuum daemon before.

Vacuum analyze doesn't always scan the whole table with default settings, so that's no proof that there are no missing pages. "set vacuum_freeze_table_age=0; vacuum <table>" will force a full scan of the table.
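
For example, something along these lines per database (the database and table names are placeholders):

# With vacuum_freeze_table_age = 0, VACUUM ignores the visibility map and
# scans every page of the table.
psql -d mydb <<'SQL'
SET vacuum_freeze_table_age = 0;
VACUUM VERBOSE some_table;
SQL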

> I'm happy to share the scripts we use for the backup/restore process if the
> information above isn't enough, as well as the logs - though the postgres
> logs don't seem to contain much of interest (the database system doesn't
> really get involved).

Yeah, please share the scripts. What you've described seems correct to me, but maybe there's something I'm missing.

> I also have the rsyncs of the failed snapshots available and could restore
> them for testing purposes.  It's also easy to look in them (they are just
> saved as normal directories on a big SAN) if I know what to look for.

Great. Not sure what to look at right now, but good that you have kept the evidence.

- Heikki

