On 16.03.2011 17:09, Kevin Grittner wrote:
> Tambet Matiisen <tambet.matii...@gmail.com> wrote:

>> Pre-live database is restored from the live database dump every night.

> How is that done?  A single pg_dump of the entire live database
> restored using psql?  Are both database servers at the same
> PostgreSQL version?

Yes, I run pg_dump on the live server and the result is transferred to the development server with rdiff-backup. The whole SQL dump is 12 GB uncompressed and the daily rdiff delta is about 10-20 MB. Then I drop the pre-live database on the development server and recreate it with createdb and psql.

For a while the development server was running 8.4 and the live server 8.1. Now both are on 8.4, but this shouldn't matter, since I back up and restore via plain SQL.
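In case it makes the setup clearer, this is roughly what the nightly jobs do. The host name, paths and database names below are just placeholders, not the real ones:

    # on the live server: plain SQL dump (names and paths are placeholders)
    pg_dump livedb > /backup/livedb.sql

    # on the development server: pull the dump directory with rdiff-backup,
    # which transfers only the daily delta and keeps the latest copy as a mirror
    rdiff-backup live.example.com::/backup /srv/backup/live

    # rebuild the pre-live database from the mirrored dump
    dropdb prelive
    createdb prelive
    psql -f /srv/backup/live/livedb.sql prelive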


>> So far the errors have been in the pre-live database,

> You're running pg_dump against a database you just restored from a
> pg_dump image?

Hmm, yeah. This sounds rather dumb, but I haven't gotten around to fixing that yet. The development server also contains some additional databases that do not exist on the live server.
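If it ever becomes worth fixing, the development server's own backup could simply skip the restored copy. A rough sketch, where 'prelive' and the paths are again placeholders:

    # dump every connectable database on the development server except the
    # restored pre-live copy (names and paths are placeholders)
    for db in $(psql -At -c "SELECT datname FROM pg_database
                             WHERE datallowconn AND datname NOT IN ('prelive', 'template1')"); do
        pg_dump "$db" > /srv/backup/dev/"$db".sql
    done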


>> Usually the next day the error was gone. I mostly blamed badly timed
>> backup and restore scripts, although this shouldn't result in
>> errors.

> No it shouldn't -- if you're following any of the documented backup
> and restore techniques.  I have a suspicion that you're just doing a
> file copy without stopping the live database or properly following
> the documented PITR backup and recovery techniques.

No, I don't do any advanced backup tricks. Just plain pg_dump and psql.


>> This time the error is not in the pre-live database and therefore it
>> doesn't go away.

> If I understand you, this sounds like corruption in the live
> database; nothing on the pre-live database is part of causing this
> problem.

That would be the case if I did a filesystem-level copy, but I do not.


>> The server is also running [...] Samba [...]

> I hope you're not trusting Samba too far.  For a while we were using
> it in backups across our WAN, and it mangled at least one file
> almost every day.  We had to take to running md5sum against both
> ends for each file to ensure we didn't get garbage (until we
> converted everything to use TCP communications, which have never
> mangled anything for us).

As I said, I'm using rdiff-backup to transfer plain SQL files.
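For what it's worth, a transfer like that is easy to spot-check the way Kevin describes. A minimal sketch, run on the development server and assuming the made-up paths from above plus ssh access to the live server:

    # compare checksums of the dump on both ends (host and paths are placeholders)
    ssh live.example.com md5sum /backup/livedb.sql
    md5sum /srv/backup/live/livedb.sql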


>> Both fsync and full_page_writes are on.

> Good.  Without those an OS or hardware crash can corrupt your
> database.

Actually they are commented out, but I assume that means they default to "on".
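Just to be sure, this shows the values the server is actually running with and where each one comes from, rather than guessing from postgresql.conf:

    # query the effective settings; 'source' tells whether each value is the
    # built-in default or something set in the configuration file
    psql -c "SELECT name, setting, source FROM pg_settings
             WHERE name IN ('fsync', 'full_page_writes')"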


>> OK, I don't have a UPS for this machine, but power has been stable.
>> Current uptime is 32 days, which I bet dates from the last kernel
>> update.

> OK.  A power outage wouldn't be too likely to matter if you have
> fsync and full_page_writes on.

That's a relief :).


>> Currently I blame either faulty memory or a faulty software RAID
>> driver.  I can easily eliminate the memory cause by running
>> memtest86 for a few hours

> Is this ECC memory?  If not, even a good test doesn't prove that a
> RAM problem didn't cause the corruption.

It's not ECC memory.


>> Now, off to buy a UPS...

> Not a bad idea, but it doesn't sound like lack of that is likely to
> have caused the corruption in your live database, based on the
> settings you mentioned.  (Assuming those settings are in use on the
> live server.)

I checked the live server; it also has fsync=on and full_page_writes=on. But that shouldn't matter, because backing up the live server doesn't produce any errors.

It is possible that restoring the pre-live database with psql takes so long that the pg_dump backup of the same database is already kicking in. But again, this shouldn't matter, and it doesn't explain why the latest error is in another database that hasn't changed for months.
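If that overlap ever does turn out to matter, the two cron jobs could be serialized with a shared lock. A rough sketch using flock, where the lock file, paths and database names are made up:

    # restore job: hold the lock while the pre-live database is rebuilt
    flock /var/lock/prelive.lock -c \
        "dropdb prelive && createdb prelive && psql -f /srv/backup/live/livedb.sql prelive"

    # backup job: wait for the same lock, so it never dumps a half-restored database
    flock /var/lock/prelive.lock -c "pg_dump prelive > /srv/backup/dev/prelive.sql"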

Now I have to find time to run memtest.

  Tambet

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
