On Sat, Jun 15, 2002 at 04:32:23PM -0400, Greg A. Woods wrote:
> > > That's irrelevant from PostgreSQL's point of view. There's no sure
> > > way to tell the postgresql process(es) to make the on-disk database
> > > image consistent before you create the snapshot. The race condition
> > > is between the user-level process and the filesystem. The only sure
> > > way to guarantee a self-consistent backup is to shut down the
> > > process so as to ensure all the files it had open are now closed.
> > > PostgreSQL makes no claims that all data necessary to present a
> > > continually consistent view of the DB will be written in a single
> > > system call. In fact this is impossible since the average database
> > > consists of many files and you can only write to one file at a time
> > > through the UNIX system interface.
> >
> > Yes it does, and no it's not impossible.
> > see http://www.postgresql.org/idocs/index.php?wal.html
>
> Sorry but you're not talking about what I'm talking about. Yes WAL will
> give you a means to hopefully recover a restored database into a
> consistent view. It will NOT, and cannot possibly, guarantee the
> consistency of many files on the disk, not even with fsync() (though
> PostgreSQL does try very hard to order its own metadata writes -- but
> that's not the point and it's still irrelevant to the question).
The term "consistency" has no meaning by itself unless there are well defined rules about the allowed data-content. If the rule is that alle the postgresql-datafiles together holds the up-to-date status of the database, then it is consistant at all times. If the rule is that each file should hold a up-to-date and complete representation of one particular database-table then it's only consistant after a clean shutdown. I'll argue that the first set of rules makes the most sense. > Unless you can do a global database lock at a point where there are no > open transactions and no uncommitted data in-core, the only way to > guarantee a consistent database on disk is to close all the database > files. Even then if you're very paranoid and worried about a hard crash > at a critical point during the snapshot operation you'll want to first > flush all OS buffers too, which means first unmounting and then > remounting the filesystem(s) (and possibly even doing whatever is > necessary to flush buffers in your hardware RAID system). OS buffers are flushed to disk when the snapshot is taken, at least on linux LVM. I don't know the technical details about alternative logical volume managers. > > Stopping the database means closing all the connections, and if you have > > multiple applications doing long overlapping transactions that don't > > recover well from shutting down the database, then you have a problem. > > I agree, however this is an operational problem, not a technical > problem. It doesn't matter much if you call it an operational or a technical problem - it's still a real problem for some database-servers. > If you choose to use filesystem backups for your database then > you need to have a very clear and deep understanding of these issues. Sure. Unfortenately I think it's somewhat difficult to get a deep understanding of this when there are multiple directly false statements (or at the very least highly confusing) posts in this and other threads. > It really Really is best to just use pg_dump and back up the result. I agree in most cases it is. My point is merely that there are also advantages of doing snapshot backups, and inspite of messages stating the opposite it _does_ actually work if set up correctly. > > The newest release of postgresql always use fsync (on it's log) unless > > you specificly turn it off. You shouldn't do that if you care about your > > data. > > I agree, but we're not discussing what you and I do, but rather what > random John Doe DBA does. There's ample quantity of suggestion out in > the world already that makes it possible he will turn off fsync for > performance reasons I sure hope thoose suggestions are marked with a big warning! Turning off fsync will obviously mean you loose the most resent data after a crash/powerfailure, but AFAIK postgreSQL doesn't make any guarantees that the database will be runnable at all after such a crash - so you risk loosing not only the latest data but the whole database... > > The only requirement on the filesystem is that it is journaling as well, > > so it's always kept in a consistant state like the postgresql-database > > is.. > > Now you're getting a little out of hand. A journaling filesystem is a > piling of one set of warts ontop of another. Now you've got a situation > where even though the filesystem might be 100% consistent even after a > catastrophic crash, the database won't be. 
> > Stopping the database means closing all the connections, and if you
> > have multiple applications doing long overlapping transactions that
> > don't recover well from shutting down the database, then you have a
> > problem.
>
> I agree, however this is an operational problem, not a technical
> problem.

It doesn't matter much whether you call it an operational or a technical
problem - it's still a real problem for some database servers.

> If you choose to use filesystem backups for your database then
> you need to have a very clear and deep understanding of these issues.

Sure. Unfortunately I think it's somewhat difficult to get a deep
understanding of this when there are multiple directly false (or at the
very least highly confusing) statements in posts in this and other
threads.

> It really Really is best to just use pg_dump and back up the result.

I agree that in most cases it is. My point is merely that there are also
advantages to doing snapshot backups, and in spite of messages stating
the opposite it _does_ actually work if set up correctly.

> > The newest release of postgresql always uses fsync (on its log)
> > unless you specifically turn it off. You shouldn't do that if you
> > care about your data.
>
> I agree, but we're not discussing what you and I do, but rather what
> random John Doe DBA does. There's ample quantity of suggestion out in
> the world already that makes it possible he will turn off fsync for
> performance reasons.

I sure hope those suggestions are marked with a big warning! Turning off
fsync will obviously mean you lose the most recent data after a
crash/power failure, but AFAIK PostgreSQL doesn't make any guarantees
that the database will be runnable at all after such a crash - so you
risk losing not only the latest data but the whole database...

> > The only requirement on the filesystem is that it is journaling as
> > well, so it's always kept in a consistent state like the postgresql
> > database is.
>
> Now you're getting a little out of hand. A journaling filesystem is a
> piling of one set of warts on top of another. Now you've got a
> situation where even though the filesystem might be 100% consistent
> even after a catastrophic crash, the database won't be. There's no need
> to use a journaling filesystem with PostgreSQL (particularly if you use
> a proper hardware RAID subsystem with either full mirroring or full
> level 5 protection).

Let's keep things clear here. RAID is totally irrelevant to this
question. RAID protects you from hardware failure, but it doesn't help
keep your filesystem or data consistent.

I have no idea why you say journaling filesystems are "a piling of one
set of warts on top of another", but the fact is that journaling is
required to keep the filesystem consistent at all times. In theory,
non-journaling filesystems make no guarantees that the filesystem will
be usable at all after a crash/power failure. In practice it's not that
bad, and in most cases you will be able to recover everything but the
last updates with fsck.

PostgreSQL needs a filesystem that guarantees the metadata to be
up-to-date. E.g. when it appends to the WAL, the new file size must be
stored on disk, or the latest data will not be available after a
crash/power failure. BUT: it's possible that the fsync() of the file
itself will cause that metadata to be updated - at least on some
filesystems. Directory updates, on the other hand, are _not_ flushed to
disk by fsync() on the file, so e.g. when PostgreSQL creates new files
(when a table/index becomes bigger than 2 GB) you risk losing that new
file if the machine crashes (unless PostgreSQL does fsync() on the
directory after creating the file - to be safe, I assume it doesn't).

In short: PostgreSQL requires a journaling filesystem to guarantee
consistent data after a crash / power failure.
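To make the directory-entry point above concrete, here is a minimal
sketch, assuming a POSIX-style system, of what "fsync() the directory
too" means when creating a new file. The path in the usage comment is
made up, and whether PostgreSQL itself does this is exactly the open
question above; this only illustrates the general technique.

import os

def create_file_durably(path, data):
    """Create a new file so that both its contents and its directory
    entry survive a crash.  fsync() on the file flushes the data and the
    file's own metadata (e.g. its size), but on many filesystems the new
    name in the parent directory only reaches the disk after an fsync()
    of the directory itself."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)                      # file data + inode
    finally:
        os.close(fd)

    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)                   # the directory entry itself
    finally:
        os.close(dirfd)

# Example (hypothetical path):
# create_file_durably("/var/lib/pgsql/data/base/1/16384.1", b"...")

The O_EXCL flag is only there to make the sketch safe to rerun; the
point is the second fsync(). On a journaling filesystem the directory
update is at least applied atomically from the log after a crash, which
is the guarantee argued for above.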
> Indeed there are potentially performance related reasons to avoid
> journaling filesystems! I've heard people claim they are just as fast
> with PostgreSQL, and some even have claimed performance improvements,
> but none of the claimants could give a scientific description of their
> benchmark and I'm pretty sure they were seeing artifacts, not a
> representation of real-world throughput.

If you haven't seen a scientific description of this, you haven't looked
hard enough (check out the reiserfs mailing list archive, for instance).
Short explanation follows:

What journaling filesystems do is write all metadata updates to a log
rather than updating the actual filesystem (exactly like WAL does for
databases). After the updates have been written to the log in a
transactional manner (so updates are atomic), they are written to the
actual filesystem. This means that all metadata updates are written to
disk twice, and this is what makes journaling filesystems slower in
general. BUT: if the same metadata is updated multiple times within a
short timeframe (before it's written to the actual filesystem), the
filesystem gets by with _only_ writing the new data to the log. So we're
back to only one write, and since those writes are only appends to the
log rather than writes to random places on the disk, there are fewer
seeks and this gives a performance boost.

This only applies to metadata, as regular data is normally not logged by
the filesystem (journaling filesystems in general only guarantee that
the metadata is consistent - not the actual data). However, some
filesystems can optionally also do data logging. That feature gives the
same performance improvement for data updates (when the application is
doing random synchronous updates), but for PostgreSQL there is no
benefit, as updates are already done in a log without seeks (the WAL).

As for benchmarks that demonstrate this effect, try running postmark on
both a journaling and a non-journaling filesystem (preferably the same
filesystem with journaling enabled/disabled). I doubt the effect is
visible in database benchmarks, as databases update data far more
frequently than metadata.

> > Restoring a snapshot is the fastest possible way of getting the
> > system back online, and even a tape backup of the snapshot will be
> > faster than importing the database.
>
> This is true. However before you give that fact 100% weight in your
> decision process you need to do a lot more risk assessment and disaster
> planning to understand whether or not the tradeoffs inherent will not
> take away from your real needs.

Yes.

> I still believe it really Really REALLY is best to just use pg_dump and
> back up the result. If the time to reload a dump is a major concern to
> you then you have other far more critical issues to deal with before
> you can make a sane choice about backup integrity and disaster
> recovery.

There are other advantages of snapshot backups in addition to the
recovery time:

* Space saving - no need for temporary space for the pg_dump (unless
  there are lots of writes going on, of course).
* Generality - this approach works for _all_ applications, not just
  databases.
* Consistency between multiple databases - AFAIK pg_dumpall doesn't take
  an atomic dump of all databases, so if you have a weird frontend that
  uses multiple databases and expects them to be consistent...
* No additional CPU load on the database server.

The space problem with pg_dump is not a fundamental pg_dump problem,
BTW. If someone wrote an amanda plugin to back up the database directly
instead of writing it to a file first, there would be no additional
space requirement. It should be possible to write a pg_dump plugin that
works just like tar and dump, and enables you to back up the database
directly from amanda (it would also eliminate the need for wrapper
scripts or cron jobs that do the pg_dump independently of amanda). Of
course the "directory" argument doesn't make sense for pg_dump, but one
could use a syntax like "/database-name" or "/database-name/table" to
specify what to back up.

-- 
Ragnar Kjørstad
Big Storage