Re: [GENERAL] database corruption questions

2012-10-23 Thread Kevin Grittner
Heine Ferreira wrote:

 Are there any best practices for avoiding database corruption?

First and foremost, do not turn off fsync or full_page_writes in your
configuration. After that the most common causes for database
corruption I've seen are bad RAM (ECC RAM is a requirement, not an
option for a serious database server in my book), failing hard drives
(use RAID and SMART monitoring, and have a sane hardware replacement
policy), or buggy device drivers (pay attention to releases which fix
bugs or security vulnerabilities which could affect you). It's
getting rare to see this now, but it used to be common for some
drives to lie to you about having written data when it was actually
just in cache -- a good BBU RAID controller will go out of its way to
keep the drives honest.
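
As a quick sanity check (a minimal sketch, assuming the cluster is
reachable via psql), you can confirm both settings from any session --
each should report "on":

    psql -Atc "SHOW fsync"
    psql -Atc "SHOW full_page_writes"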

-Kevin




Re: [GENERAL] database corruption questions

2012-10-17 Thread Daniel Serodio (lists)

Craig Ringer wrote:

On 10/14/2012 05:53 AM, Heine Ferreira wrote:

Hi

Are there any best practices for avoiding database
corruption?


* Maintain rolling backups with proper ageing. For example, keep one a 
day for the last 7 days, then one a week for the last 4 weeks, then 
one a month for the rest of the year, then one a year.

What kind of rolling backups? From pg_basebackup?


* Use warm standby with log shipping and/or replication to maintain a 
live copy of the DB.


* If you want point-in-time recovery, keep a few days' or weeks' worth 
of WAL archives and a basebackup around. That'll help you recover from 
those "oops, I meant DROP TABLE unimportant; not DROP TABLE 
vital_financial_records;" issues.


* Keep up to date with the latest PostgreSQL patch releases. Don't be 
one of those people still running 9.0.0 when 9.0.10 is out.
The problem is that updating the database usually results in downtime. 
Or can the downtime be avoided in a replication scenario?


* Plug-pull test your system when you're testing it before going live. 
Put it under load with something like pgbench, then literally pull the 
plug out. If your database doesn't come back up fine you have 
hardware, OS or configuration problems.


* Don't `kill -9` the postmaster. It should be fine, but it's still 
not smart.


* ABSOLUTELY NEVER DELETE postmaster.pid

* Use good quality hardware with proper cooling and a good quality 
power supply. If possible, ECC RAM is a nice extra.


* Never, ever, ever use cheap SSDs. Use good quality hard drives or 
(after proper testing) high end SSDs. Read the SSD reviews 
periodically posted on this mailing list if considering using SSDs. 
Make sure the SSD has a supercapacitor or other reliable option for 
flushing its write cache on power loss. Always do repeated plug-pull 
testing when using SSDs.


* Use a solid, reliable file system. zfs-on-linux, btrfs, etc. are not 
the right choices for a database you care about. Never, ever, ever use 
FAT32.


* If on Windows, do not run an anti-virus program on your
database server. Nobody should be using it for other things or running 
programs on it anyway.


* Avoid RAID 5, mostly because the performance is terrible, but also 
because I've seen corruption issues with rebuilds from parity on 
failing disks.


* Use a good quality hardware RAID controller with a battery backup 
cache unit if you're using spinning disks in RAID. This is as much for 
performance as reliability; a BBU will make an immense difference to 
database performance.


* If you're going to have a UPS (you shouldn't need one, as your system 
should be crash-safe), don't waste your money on a cheap one. Get a 
good online double-conversion unit that does proper power filtering. 
Cheap UPSs are just a battery with a fast switch; they provide no 
power filtering, and what little surge protection they offer is done 
with a component that wears out after absorbing a few surges, becoming 
totally ineffective. Since your system should be crash-safe, a cheap 
UPS will do nothing for corruption protection; it'll only help with 
uptime.


--
Craig Ringer


Thanks,
Daniel Serodio



Re: [GENERAL] database corruption questions

2012-10-17 Thread Craig Ringer

On 10/18/2012 01:06 AM, Daniel Serodio wrote:

Craig Ringer wrote:

On 10/14/2012 05:53 AM, Heine Ferreira wrote:

Hi

Are there any best practices for avoiding database
corruption?


* Maintain rolling backups with proper ageing. For example, keep one a
day for the last 7 days, then one a week for the last 4 weeks, then
one a month for the rest of the year, then one a year.

What kind of rolling backups? From pg_basebackup?


I'd recommend good old `pg_dump`; that way you're not assuming that your 
cluster's on-disk format is intact and happy. Regular dumps will also 
help detect any damage that might've crept in from file system 
corruption, HDD/RAID faults, etc. Not that that should happen, but we're 
talking preventative action here.
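
For instance, a minimal nightly dump with simple retention might look 
something like this (a sketch only; the paths, database name and 
seven-day retention are placeholders):

    # Custom-format dump, then prune anything older than a week.
    pg_dump -Fc -f /var/backups/pg/mydb-$(date +%F).dump mydb
    find /var/backups/pg -name 'mydb-*.dump' -mtime +7 -delete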


I elaborated somewhat here:


http://blog.ringerc.id.au/2012/10/avoiding-postgresql-database-corruption.html

Alternately, check out barman: http://www.pgbarman.org/ . I haven't 
tried it yet, but it looks very promising. I'd still want to take 
periodic dumps, as I'm reluctant to rely on `pg_basebackup` style 
cluster copies alone.



* Keep up to date with the latest PostgreSQL patch releases. Don't be one of 
those people still running 9.0.0 when 9.0.10 is out.

The problem is that updating the database usually results in downtime. Or can 
the downtime be avoided in a replication scenario?


Updating to a new minor (patch) release within the same major version 
requires only minimal downtime:

- Stop server
- Install new binaries
- Start server
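
On a packaged install that boils down to something like the following 
(package name and service commands are assumptions for a Debian-style 
system; adjust for your platform):

    sudo service postgresql stop
    sudo apt-get install --only-upgrade postgresql-9.2
    sudo service postgresql start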

How much downtime will a data corruption bug (yes, they've happened) 
that's fixed in a new version cost you if you don't patch and it bites 
you? Or a bug that causes a server crash and restart? Plan downtime, so 
you don't have potentially much longer unplanned downtime at the worst 
possible time.


You can do zero-downtime minor updates using hot standby and standby 
promotion; see http://www.repmgr.org/ .
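
Roughly (a sketch only; repmgr or your own failover tooling handles the 
details, and the data directory path is an assumption):

    # 1. Install the new minor-release binaries on the standby and restart it.
    # 2. Once the standby has caught up, promote it:
    pg_ctl promote -D /var/lib/postgresql/9.2/main
    # 3. Repoint applications at the promoted server, then rebuild the old
    #    primary as a standby and update it in turn.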


Updating to a new major release is a bigger job, but that's not what I'm 
talking about.



BTW, please trim your replies to quote just the relevant context.

--
Craig Ringer




Re: [GENERAL] database corruption questions

2012-10-15 Thread Bruce Momjian
On Sun, Oct 14, 2012 at 11:26:40AM +0800, Craig Ringer wrote:
 On 10/14/2012 11:00 AM, John R Pierce wrote:
 On 10/13/12 7:13 PM, Craig Ringer wrote:
 
 * Use a good quality hardware RAID controller with a battery backup
 cache unit if you're using spinning disks in RAID. This is as much for
 performance as reliability; a BBU will make an immense difference to
 database performance.
 
 a comment on this one... I have some test servers with lots of SAS
 and/or SATA drives on controllers like the LSI Logic 9261-8i, with 512MB or
 1GB of battery-backed cache. I can configure the controller for JBOD
 and use Linux mdraid RAID 10 and get the same performance as the
 controller's native RAID 10, as long as the write-back cache is
 enabled. Disable the write-back cache, and you might as well be using
 SATA JBOD.
 
 Yeah, without the write-back cache you don't gain much. I run a
 couple of DBs on plain old `md` RAID and I'm actually quite happy
 with it.
 
 I've expanded this into a blog post and improved that section there.
 
 http://blog.ringerc.id.au/2012/10/avoiding-postgresql-database-corruption.html

Craig, that is a great post.  Can you get it on Planet Postgres?

http://planet.postgresql.org/

I think you would just have to submit your blog's RSS feed.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [GENERAL] database corruption questions

2012-10-13 Thread Leif Biberg Kristensen
 On Saturday 13 October 2012 at 23:53:03, Heine Ferreira wrote:
 Hi
 
 Are there any best practices for avoiding database
 corruption?

In my experience, database corruption always comes down to flaky disk drives. 
Keep your disks new and shiny, e.g. less than 3 years old, and go for some 
kind of redundancy in a RAID configuration.

regards, Leif




Re: [GENERAL] database corruption questions

2012-10-13 Thread John R Pierce

On 10/13/12 3:04 PM, Leif Biberg Kristensen wrote:

  On Saturday 13 October 2012 at 23:53:03, Heine Ferreira wrote:

Hi

Are there any best practices for avoiding database
corruption?

In my experience, database corruption always comes down to flaky disk drives.
Keep your disks new and shiny, e.g. less than 3 years old, and go for some 
kind of redundancy in a RAID configuration.


Also, use ECC RAM so creeping bit rot doesn't slip in from memory without 
detection. If you use a RAID controller with a write-back cache, be 
sure it has a BBU or flash-backed cache.




--
john r pierce                            N 37, W 122
santa cruz ca                            mid-left coast






Re: [GENERAL] database corruption questions

2012-10-13 Thread Craig Ringer

On 10/14/2012 05:53 AM, Heine Ferreira wrote:

Hi

Are there any best practices for avoiding database
corruption?


* Maintain rolling backups with proper ageing. For example, keep one a 
day for the last 7 days, then one a week for the last 4 weeks, then one 
a month for the rest of the year, then one a year.


* Use warm standby with log shipping and/or replication to maintain a 
live copy of the DB.


* If you want point-in-time recovery, keep a few days' or weeks' worth of 
WAL archives and a basebackup around. That'll help you recover from 
those "oops, I meant DROP TABLE unimportant; not DROP TABLE 
vital_financial_records;" issues. (A rough archiving sketch follows at the 
end of this list.)


* Keep up to date with the latest PostgreSQL patch releases. Don't be 
one of those people still running 9.0.0 when 9.0.10 is out.


* Plug-pull test your system when you're testing it before going live. 
Put it under load with something like pgbench, then literally pull the 
plug out. If your database doesn't come back up fine you have hardware, 
OS or configuration problems.


* Don't `kill -9` the postmaster. It should be fine, but it's still not 
smart.


* ABSOLUTELY NEVER DELETE postmaster.pid

* Use good quality hardware with proper cooling and a good quality power 
supply. If possible, ECC RAM is a nice extra.


* Never, ever, ever use cheap SSDs. Use good quality hard drives or 
(after proper testing) high end SSDs. Read the SSD reviews periodically 
posted on this mailing list if considering using SSDs. Make sure the SSD 
has a supercapacitor or other reliable option for flushing its write 
cache on power loss. Always do repeated plug-pull testing when using SSDs.


* Use a solid, reliable file system. zfs-on-linux, btrfs, etc. are not 
the right choices for a database you care about. Never, ever, ever use 
FAT32.


* If on Windows, do not run an anti-virus program on your
database server. Nobody should be using it for other things or running 
programs on it anyway.


* Avoid RAID 5, mostly because the performance is terrible, but also 
because I've seen corruption issues with rebuilds from parity on failing 
disks.


* Use a good quality hardware RAID controller with a battery backup 
cache unit if you're using spinning disks in RAID. This is as much for 
performance as reliability; a BBU will make an immense difference to 
database performance.


* If you're going to have a UPS (you shouldn't need one, as your system 
should be crash-safe), don't waste your money on a cheap one. Get a good 
online double-conversion unit that does proper power filtering. Cheap 
UPSs are just a battery with a fast switch; they provide no power 
filtering, and what little surge protection they offer is done with a 
component that wears out after absorbing a few surges, becoming totally 
ineffective. Since your system should be crash-safe, a cheap UPS will do 
nothing for corruption protection; it'll only help with uptime.
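
For the point-in-time recovery item above, the setup might look roughly 
like this (a sketch only; the archive path, backup location and wal_level 
value are assumptions that depend on your release and layout):

    # postgresql.conf excerpt -- turn on WAL archiving:
    #   wal_level = archive        # or 'hot_standby' if you also run a standby
    #   archive_mode = on
    #   archive_command = 'test ! -f /var/lib/pg_archive/%f && cp %p /var/lib/pg_archive/%f'
    #
    # Then take a base backup to pair with the archived WAL:
    pg_basebackup -D /var/backups/pg/base -Ft -z -P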


--
Craig Ringer




Re: [GENERAL] database corruption questions

2012-10-13 Thread John R Pierce

On 10/13/12 7:13 PM, Craig Ringer wrote:


* Use a good quality hardware RAID controller with a battery backup 
cache unit if you're using spinning disks in RAID. This is as much for 
performance as reliability; a BBU will make an immense difference to 
database performance. 


a comment on this one... I have some test servers with lots of SAS 
and/or SATA drives on controllers like the LSI Logic 9261-8i, with 512MB or 
1GB of battery-backed cache. I can configure the controller for JBOD 
and use Linux mdraid RAID 10 and get the same performance as the 
controller's native RAID 10, as long as the write-back cache is 
enabled. Disable the write-back cache, and you might as well be using 
SATA JBOD.




--
john r pierce                            N 37, W 122
santa cruz ca                            mid-left coast





Re: [GENERAL] database corruption questions

2012-10-13 Thread Craig Ringer

On 10/14/2012 11:00 AM, John R Pierce wrote:

On 10/13/12 7:13 PM, Craig Ringer wrote:


* Use a good quality hardware RAID controller with a battery backup
cache unit if you're using spinning disks in RAID. This is as much for
performance as reliability; a BBU will make an immense difference to
database performance.


a comment on this one... I have some test servers with lots of SAS
and/or SATA drives on controllers like the LSI Logic 9261-8i, with 512MB or
1GB of battery-backed cache. I can configure the controller for JBOD
and use Linux mdraid RAID 10 and get the same performance as the
controller's native RAID 10, as long as the write-back cache is
enabled. Disable the write-back cache, and you might as well be using
SATA JBOD.


Yeah, without the write-back cache you don't gain much. I run a couple 
of DBs on plain old `md` RAID and I'm actually quite happy with it.


I've expanded this into a blog post and improved that section there.

http://blog.ringerc.id.au/2012/10/avoiding-postgresql-database-corruption.html

Comments appreciated.

--
Craig Ringer




Re: [GENERAL] database corruption questions

2012-10-13 Thread Craig Ringer

On 10/14/2012 05:53 AM, Heine Ferreira wrote:

Hi

Are there any best practices for avoiding database
corruption?


I forgot to mention, you should also read:

  http://www.postgresql.org/docs/current/static/wal-reliability.html

--
Craig Ringer




Re: [GENERAL] database corruption questions

2012-10-13 Thread Chris Angelico
On Sun, Oct 14, 2012 at 1:13 PM, Craig Ringer ring...@ringerc.id.au wrote:
 * Never, ever, ever use cheap SSDs. Use good quality hard drives or (after
 proper testing) high end SSDs. Read the SSD reviews periodically posted on
 this mailing list if considering using SSDs. Make sure the SSD has a
 supercapacitor or other reliable option for flushing its write cache on
 power loss. Always do repeated plug-pull testing when using SSDs.

Interesting. My boss just bought a set of SSDs for some test systems,
with the intention of using them for our next deployment. They're
giving really great performance under pgbench, but we haven't yet done
a plug-pull test on any of them. I'll make sure I do that next week.

Is there an article somewhere about how best to do a plug-pull test?
Or is it as simple as "fire up pgbench, kill the power, bring things
back up, and see if anything isn't working"?

ChrisA




Re: [GENERAL] database corruption questions

2012-10-13 Thread Craig Ringer

On 10/14/2012 12:02 PM, Chris Angelico wrote:


Is there an article somewhere about how best to do a plug-pull test?
Or is it as simple as "fire up pgbench, kill the power, bring things
back up, and see if anything isn't working"?


That's what I'd do and what I've always done in the past, but others 
here are much more experienced with testing gear into production.


You can also use pg_test_fsync and diskchecker.pl . See:

  http://www.postgresql.org/docs/current/static/wal-reliability.html

I do repeated plug-pull tests and make sure fsync is being honoured.
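
A bare-bones run looks something like this (a sketch; scale factor, 
duration, database name and data directory are placeholders):

    # Check the drive's raw fsync behaviour first.
    pg_test_fsync
    # Put the cluster under sustained write load, then pull the plug mid-run.
    createdb bench
    pgbench -i -s 50 bench
    pgbench -c 16 -T 600 bench
    # After powering back on, make sure the server recovers cleanly and the
    # data passes a basic sanity check.
    pg_ctl start -D /var/lib/postgresql/9.2/main
    psql -d bench -c "SELECT count(*) FROM pgbench_accounts"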

--
Craig Ringer

