Re: [PERFORM] select on 22 GB table causes "An I/O error occured while sending to the backend." exception

david Thu, 28 Aug 2008 19:42:55 -0700

On Thu, 28 Aug 2008, Scott Marlowe wrote:

On Thu, Aug 28, 2008 at 7:53 PM, Matthew Dennis <[EMAIL PROTECTED]> wrote:

On Thu, Aug 28, 2008 at 8:11 PM, Scott Marlowe <[EMAIL PROTECTED]>
wrote:

wait a min here, postgres is supposed to be able to survive a complete box
failure without corrupting the database, if killing a process can corrupt
the database it sounds like a major problem.


Yes it is a major problem, but not with postgresql.  It's a major
problem with the linux OOM killer killing processes that should not be
killed.

Would it be postgresql's fault if it corrupted data because my machine
had bad memory?  Or a bad hard drive?  This is the same kind of
failure.  The postmaster should never be killed.  It's the one thing
holding it all together.


I fail to see the difference between the OOM killing it and the power going
out.


Then you fail to understand.

scenario 1:  There's a postmaster, it owns all the child processes.
It gets killed.  The Postmaster gets restarted.  Since there isn't one

when the postmaster gets killed doesn't that kill all its children as well?

running, it comes up.  starts new child processes.  Meanwhile, the old
child processes that don't belong to it are busy writing to the data
store.  Instant corruption.

if so then the postmaster should not only check if there is an existing postmaster running, it should check for the presence of the child processes as well.
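
For what it's worth, on Linux and other Unix systems a SIGKILL delivered to a parent does not propagate to its children; they get re-parented and keep running. A minimal Python sketch of that behaviour follows. The two forked processes are only stand-ins for a postmaster and one backend, and the function name is made up for illustration; none of this is PostgreSQL code.

```python
import os
import signal
import time


def child_survives_parent_kill() -> bool:
    """Return True if a worker process outlives its SIGKILLed parent."""
    read_fd, write_fd = os.pipe()
    parent_pid = os.fork()
    if parent_pid == 0:
        # Stand-in "postmaster": start one worker, report its pid, then idle.
        os.close(read_fd)
        worker_pid = os.fork()
        if worker_pid == 0:
            # Stand-in "backend": just keeps running.
            os.close(write_fd)
            time.sleep(30)
            os._exit(0)
        os.write(write_fd, str(worker_pid).encode())
        os.close(write_fd)
        time.sleep(30)
        os._exit(0)

    # Original process: learn the worker's pid from the pipe.
    os.close(write_fd)
    worker_pid = int(os.read(read_fd, 32).decode())
    os.close(read_fd)

    os.kill(parent_pid, signal.SIGKILL)  # simulate the OOM killer
    os.waitpid(parent_pid, 0)            # reap the dead parent

    try:
        os.kill(worker_pid, 0)           # signal 0 only tests for existence
    except ProcessLookupError:
        return False
    os.kill(worker_pid, signal.SIGKILL)  # clean up the orphaned worker
    return True
```

A restarted parent therefore cannot assume that "no old parent" means "no old workers", which is the gap the extra check suggested above would close.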

scenario 2: Someone pulls the plug.  Every postgres child dies a quick
death.  Data on the drives is coherent and recoverable.

And yes, if the power went out and PG came up with a corrupted DB
(assuming I didn't turn off fsync, etc) I *would* blame PG.


Then you might be wrong: for example, if you were using LVM, or certain
levels of SW RAID, or a RAID controller with a write-back cache and no
battery backing, or an IDE or SATA drive / controller that didn't
support write barriers, or NFS mounts for database storage, and so on.


these all fall under "(assuming I didn't turn off fsync, etc)"

My point being that PostgreSQL HAS to
make certain assumptions about its environment that it simply cannot
directly control or test for.  Not having the postmaster shot in the
head while the children keep running is one of those things.

I understand
that killing the postmaster could stop all useful PG work, that it could
cause it to stop responding to clients, that it could even "crash" PG, et
cetera, but if a particular process dying causes corrupted DBs, that sounds
borked to me.


Well, design a better method and implement it.  If everything went
through the postmaster you'd be lucky to get 100 transactions per
second.

well, if you aren't going through the postmaster, what process is receiving network messages? it can't be a group of processes, only one can be listening to a socket at one time.
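
On the socket question, the usual Unix answer is accept-and-fork: one process listens, and each accepted connection is handed to a forked worker that then talks to the client directly. PostgreSQL's documentation describes the postmaster starting a separate server process per connection, but the sketch below is only a generic illustration of the pattern in Python. The names `serve_once` and `demo` are hypothetical, and the demo plays both client and listener in one process purely to keep it short.

```python
import os
import socket


def serve_once(listener: socket.socket) -> int:
    """Accept one connection, hand it to a forked worker, return the worker pid."""
    conn, _addr = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Worker: owns the client socket from here on; the listener
        # takes no further part in this conversation.
        listener.close()
        conn.sendall(f"served by {os.getpid()}".encode())
        conn.close()
        os._exit(0)
    conn.close()  # the listener drops its copy and is free to accept again
    return pid


def demo() -> tuple:
    """Return (listener pid, worker pid, pid reported to the client)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    worker_pid = serve_once(listener)

    # Read until the worker closes its end of the connection.
    chunks = []
    while True:
        data = client.recv(64)
        if not data:
            break
        chunks.append(data)
    reply = b"".join(chunks).decode()

    os.waitpid(worker_pid, 0)
    client.close()
    listener.close()
    return os.getpid(), worker_pid, int(reply.split()[-1])
```

Only one process ever listens, yet the reply comes from a different pid, so the listener is not in the data path once the connection has been handed off.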

and if the postmaster isn't needed for the child processes to write to the datastore, how are multiple child processes prevented from writing to the datastore normally? and why doesn't that mechanism continue to work?

 There are compromises between performance and reliability
under fire that have to be made.  It is not unreasonable to assume
that your OS is not going to randomly kill off processes because of a
dodgy VM implementation quirk.

P.s. I'm a big fan of linux, and I run my dbs on it.  But I turn off
overcommit and make a few other adjustments to make sure my database
is safe.  The OOM killer as a default is fine for workstations, but
it's an insane setting for servers, much like swappiness=60 is an
insane setting for a server too.
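
As a rough illustration of the adjustments mentioned above, the Python sketch below flags the two kernel settings named in this message. It assumes a Linux host where they are exposed as `/proc/sys/vm/overcommit_memory` and `/proc/sys/vm/swappiness`, and that "turning off overcommit" means mode 2 (strict accounting). The function names and the swappiness threshold of 60 (the value called insane above) are choices made for this example, not recommendations from the thread.

```python
def check_vm_settings(overcommit_memory: str, swappiness: str) -> list:
    """Return warnings for the two vm settings discussed in the post.

    Both arguments are the raw file contents, e.g. "0\n".
    """
    warnings = []
    # 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting.
    if int(overcommit_memory.strip()) != 2:
        warnings.append("vm.overcommit_memory is not 2: overcommit is enabled")
    # Threshold taken from the value criticised in the post.
    if int(swappiness.strip()) >= 60:
        warnings.append("vm.swappiness is 60 or higher")
    return warnings


def read_vm_setting(name: str) -> str:
    """Read one vm setting from /proc (Linux only)."""
    with open(f"/proc/sys/vm/{name}") as handle:
        return handle.read()
```

On a live Linux box it could be called as `check_vm_settings(read_vm_setting("overcommit_memory"), read_vm_setting("swappiness"))`.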

so are you saying that the only possible thing that can kill the postmaster is the OOM killer? it can't possibly exit in any other situation without the children being shut down first?


I would be surprised if that was really true.

David Lang

--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
