On Sat, Jun 15, 2002 at 04:32:23PM -0400, Greg A. Woods wrote:
> > > That's irrelevant from PostgreSQL's point of view.  There's no sure way
> > > to tell the postgresql process(es) to make the on-disk database image
> > > consistent before you create the snapshot.  The race condition is
> > > between the user-level process and the filesystem.  The only sure way to
> > > guarantee a self-consistent backup is to shut down the process so as to
> > > ensure all the files it had open are now closed.  PostgreSQL makes no
> > > claims that all data necessary to present a continually consistent view
> > > of the DB will be written in a single system call.  In fact this is
> > > impossible since the average database consistes of many files and you
> > > can only write to one file at a time through the UNIX system interface.
> > 
> > Yes it does, and no it's not impossible.
> > see http://www.postgresql.org/idocs/index.php?wal.html
> 
> Sorry but you're not talking about what I'm talking about.  Yes WAL will
> give you a means to hopefully recover a restored database into a
> consistent view.  It will NOT, and cannot possibly, guarantee the
> consistency of many files on the disk, not even with fsync() (though
> PostgreSQL does try very hard to order it's own metadata writes -- but
> that's not the point and it's still irrelevant to the question).

The term "consistency" has no meaning by itself unless there are
well-defined rules about the allowed data content. If the rule is that
all the PostgreSQL data files together hold the up-to-date state of the
database, then it is consistent at all times. If the rule is that each
file should hold an up-to-date and complete representation of one
particular database table, then it's only consistent after a clean
shutdown.

I'll argue that the first set of rules makes the most sense.

> Unless you can do a global database lock at a point where there are no
> open transactions and no uncommitted data in-core, the only way to
> guarantee a consistent database on disk is to close all the database
> files.  Even then if you're very paranoid and worried about a hard crash
> at a critical point during the snapshot operation you'll want to first
> flush all OS buffers too, which means first unmounting and then
> remounting the filesystem(s) (and possibly even doing whatever is
> necessary to flush buffers in your hardware RAID system).

OS buffers are flushed to disk when the snapshot is taken, at least on
Linux LVM. I don't know the technical details of other logical volume
managers.
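
For what it's worth, here is a rough sketch (in Python, just for
illustration) of how such a snapshot backup can be scripted on Linux
LVM. The volume group, logical volume, snapshot size, mount point and
archive path are all assumptions - adjust them for the actual system,
and run it as root:

import subprocess

VG = "vg0"                  # assumed volume group name
LV = "pgdata"               # assumed logical volume holding the data dir
SNAP = "pgdata-snap"
MNT = "/mnt/pgdata-snap"    # assumed (already existing) mount point
ARCHIVE = "/backup/pgdata-snap.tar.gz"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Create the snapshot; on Linux LVM the OS buffers for the volume
#    are flushed to disk at this point.
run(["lvcreate", "--snapshot", "--size", "1G",
     "--name", SNAP, f"/dev/{VG}/{LV}"])
try:
    # 2. Mount the snapshot read-only and archive it.
    run(["mount", "-o", "ro", f"/dev/{VG}/{SNAP}", MNT])
    try:
        run(["tar", "czf", ARCHIVE, "-C", MNT, "."])
    finally:
        run(["umount", MNT])
finally:
    # 3. Remove the snapshot so it stops consuming copy-on-write space.
    run(["lvremove", "-f", f"/dev/{VG}/{SNAP}"])

Restoring that archive and starting PostgreSQL on it then looks to the
database exactly like coming back up after a power failure, which is
where the WAL recovery comes in.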

> > Stopping the database means closing all the connections, and if you have
> > multiple applications doing long overlapping transactions that don't
> > recover well from shutting down the database, then you have a problem.
> 
> I agree, however this is an operational problem, not a technical
> problem. 

It doesn't matter much whether you call it an operational or a technical
problem - it's still a real problem for some database servers.

> If you choose to use filesystem backups for your database then
> you need to have a very clear and deep understanding of these issues.

Sure.
Unfortunately I think it's somewhat difficult to get a deep
understanding of this when posts in this and other threads contain
several statements that are outright false, or at the very least highly
confusing.


> It really Really is best to just use pg_dump and back up the result.

I agree that in most cases it is. My point is merely that there are also
advantages to doing snapshot backups, and in spite of messages stating
the opposite it _does_ actually work if set up correctly.


> > The newest release of postgresql always use fsync (on it's log) unless
> > you specificly turn it off. You shouldn't do that if you care about your
> > data.
> 
> I agree, but we're not discussing what you and I do, but rather what
> random John Doe DBA does.  There's ample quantity of suggestion out in
> the world already that makes it possible he will turn off fsync for
> performance reasons

I sure hope those suggestions come with a big warning! Turning off
fsync will obviously mean you lose the most recent data after a
crash/power failure, but AFAIK PostgreSQL doesn't make any guarantees
that the database will be runnable at all after such a crash - so you
risk losing not only the latest data but the whole database...


> > The only requirement on the filesystem is that it is journaling as well,
> > so it's always kept in a consistant state like the postgresql-database
> > is..
> 
> Now you're getting a little out of hand.  A journaling filesystem is a
> piling of one set of warts ontop of another.  Now you've got a situation
> where even though the filesystem might be 100% consistent even after a
> catastrophic crash, the database won't be.  There's no need to use a
> journaling filesystem with PostgreSQL (particularly if you use a proper
> hardware RAID subsystem with either full mirroring or full level 5
> protection).  

Let's keep things clear here.
RAID is totally irrelevant to this question. RAID protects you from
hardware failure, but it doesn't help to keep your filesystem or data
consistent.

I have no idea why you say journaling filesystems are "a piling of one
set of warts ontop of another", but the fact is that one is required to
always keep the filesystem consistent.

In theory, non-journaling filesystems make no guarantees that the
filesystem will be usable at all after a crash/power failure. In
practice it's not that bad, and in most cases you will be able to
recover everything but the last updates with fsck.

PostgreSQL needs a filesystem that guarantees that metadata is kept
up-to-date. E.g. when it appends to the WAL, the new file size must be
stored on disk, or the latest data will not be available after a
crash/power failure. BUT: it's possible that an fsync() of the file
itself will cause that metadata to be updated - at least on some
filesystems. Directory updates, on the other hand, are _not_ flushed to
disk by fsync(), so e.g. when PostgreSQL creates new files (when a
table/index grows bigger than 2 GB) you risk losing that new file if
the machine crashes. (Unless PostgreSQL does fsync() on the directory
after creating the file - I assume it doesn't, just to be safe.)

In short: PostgreSQL requires a journaling filesystem to guarantee
consistent data after a crash/power failure.
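
To illustrate the directory point, here is a small sketch of the system
calls involved (Python for brevity, Linux semantics assumed - this is
not PostgreSQL's actual code): to make a newly created file survive a
crash, an application has to fsync() the file itself _and_ the directory
holding its entry:

import os

def create_file_durably(path, data):
    # Write the file and flush it; this covers the data and, on most
    # filesystems, the file's own metadata such as its size.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

    # The directory entry pointing at the new file is separate
    # metadata; flush it by fsync()ing the directory itself.
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

create_file_durably("/tmp/new-table-segment", b"...")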


> Indeed there are potentially performance related reasons
> to avoid journaling filesystems!  I've heard people claim they are just
> as fast with PostgreSQL, and some even have claimed performance
> improvements, but none of the claimants could give a scientific
> description of their benchmark and I'm pretty sure they were seeing
> artifacts, not a representation of real-world throughput.

If you haven't seen a scientific description of this, you haven't
looked hard enough. (Check out the reiserfs mailing list archive, for
instance.)

Short explanation follows:

What journaling filesystems do is write all metadata updates to a log
rather than updating the actual filesystem directly (exactly like the
WAL does for databases). After the updates have been written to the log
in a transactional manner (so updates are atomic), they are written to
the actual filesystem. This means that all metadata updates are written
to disk twice, and this is what causes journaling filesystems to be
slower in general.

BUT: if the same metadata is updated multiple times within a short
timeframe (before it's written to the actual filesystem), the
filesystem gets by with _only_ writing the new data to the log. So
we're back to only one write, and since the writes are just appends to
the log rather than writes to random places on the disk, there are
fewer seeks, and this gives a performance boost.
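
A quick back-of-the-envelope way to see the effect: count the disk
writes in a toy model where n updates hit the same piece of metadata
before the next checkpoint. (This is only an illustration of the
argument above, not how any real filesystem accounts for its I/O.)

def non_journaling_writes(n_updates):
    # Every update is an in-place write at a more or less random
    # location on the disk.
    return {"random_writes": n_updates, "log_appends": 0}

def journaling_writes(n_updates):
    # Every update is a sequential append to the journal; the updates
    # are coalesced, so the final metadata block is written in place
    # only once per checkpoint.
    return {"random_writes": 1, "log_appends": n_updates}

for n in (1, 10, 100):
    print(n, non_journaling_writes(n), journaling_writes(n))

The journaling case trades n random in-place writes for n sequential
log appends plus a single in-place write, which is a win as soon as
seeks dominate.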

This only applies to metadata, as regular data is normally not logged
by the filesystem. (Journaling filesystems in general only guarantee
that the metadata is consistent - not the actual data.) However, some
filesystems can optionally do data logging as well. That feature gives
the same performance improvements for data updates (when the
application is doing random synchronous updates), but for PostgreSQL
there is no benefit, as updates are already done in a log without seeks
(the WAL).

As for benchmarks that demonstrate this effect, try running postmark on
both a journaling and a non-journaling filesystem (preferably the same
filesystem with journaling enabled/disabled). I doubt the effect is
visible in database benchmarks, as databases update data very
frequently compared to metadata.


> > Restoring a snapshot is the fastest possible way of getting the system
> > back online, and even a tape-backup of the snapshot will be faster than
> > importing the database.
> 
> This is true.  However before you give that fact 100% weight in your
> decision process you need to do a lot more risk assessment and disaster
> planning to understand whether or not the tradeoffs inherent will not
> take away from your real needs.

Yes.

> I still believe it really Really REALY is best to just use pg_dump and
> back up the result.  If the time to reload a dump is a major concern to
> you then you have other far more critical issues to deal with before you
> can make a sane choice about backup integrity and disaster recovery.

There are other advantages to snapshot backups in addition to the
recovery time:
* space savings (no need for temporary space for the pg_dump)
  (unless there are lots of writes going on, of course)
* generality
  This approach works for _all_ applications, not just databases.
* consistency between multiple databases
  (AFAIK pg_dumpall doesn't take an atomic dump of all databases,
  so if you have a weird frontend that uses multiple databases and
  expects them to be consistent....)
* no additional CPU load on the database server



The space problem with pg_dump is not a fundamental pg_dump problem,
BTW. If someone wrote an amanda plugin to back up the database directly
instead of writing it to a file first, there would be no additional
space requirement. It should be possible to write a pg_dump plugin that
works just like tar and dump, and enables you to back up the database
directly from amanda. (It would also eliminate the need for wrapper
scripts or cron jobs that do the pg_dump independently of amanda.)

Of course the "directory" argument doesn't make sense for pg_dump, but
one could use a syntax like "/database-name" or "/database-name/table"
to specify what to back up.
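
As a crude illustration of the idea: pg_dump already writes plain SQL
to stdout when no output file is given, so the dump can be consumed as
a stream with no intermediate file at all - the shell equivalent is
simply "pg_dump mydb | gzip > wherever". A Python sketch (the database
name and the destination are made up for the example):

import subprocess
import sys

DBNAME = "mydb"    # hypothetical database name

# pg_dump writes the dump to a pipe instead of a temporary file ...
dump = subprocess.Popen(["pg_dump", DBNAME], stdout=subprocess.PIPE)
# ... and gzip compresses the stream onto our stdout, which the backup
# software (amanda, dd to a tape device, ssh to another host, ...) can
# read directly.
gzip = subprocess.Popen(["gzip", "-c"], stdin=dump.stdout)
dump.stdout.close()          # let gzip see EOF when pg_dump exits
if gzip.wait() != 0 or dump.wait() != 0:
    sys.exit("backup failed")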



-- 
Ragnar Kjørstad
Big Storage
