On Nov 25, 2016, at 1:00 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
> On Wed, Nov 23, 2016 at 10:16 AM, Israel Brewster <isr...@ravnalaska.net <mailto:isr...@ravnalaska.net>> wrote:
>
> I was wondering if someone could inform me about, or point me to an online article about, best practices for backing up a PostgreSQL database cluster? At the moment, I have two servers running with streaming replication for failover purposes, and then I do nightly database dumps for recovery/backup purposes. However, I feel this approach is lacking in a couple of areas I can think of:
>
> - If a total failure occurs towards the end of the day, we could potentially lose a whole day's worth of data.
>
> Why wouldn't the streaming replica salvage that? Are they expected to fail together? Is the NFS share onto which you want to store your basebackup and WAL also expected to fail together with them?
That's why I specified *total* failure. If only the primary dies, that's a simple cutover to the secondary, and not a total failure - no need to go to backups of any sort under that scenario :-) I'm thinking more along the lines of the server room burning down, a natural disaster, etc. - something that causes a total failure of the DB system, necessitating recovery from an offsite backup.

> Similar argument for user error - there is no way to "undo" a catastrophic user data error without going all the way back to the previous day.
>
> - Less-than-ideal recovery under some scenarios. Since each database in the cluster is dumped individually, this is good should only *one* database need to be restored, but could get tedious should the entire cluster need to be restored.
>
> To mitigate these issues, I am thinking of supplementing the individual dumps with a full base backup and WAL archiving to an NFS share. This should enable (relatively) quick/easy recovery from backup, plus the ability to do Point-In-Time Recovery (PITR). I do have a few questions with this approach, however:
>
> - How do I figure out how often I should take a full base backup? I know this will depend on the amount of traffic my database is doing, and how long I am willing to wait for WAL files to be replayed - the more WAL files needing to be replayed, the longer recovery will take - but is there some rule of thumb I can use to calculate how often I need a new base backup? Perhaps based on the number of WAL files?
>
> You have to try it and see. Different types of WAL records will take different amounts of time to replay, so there is no rule of thumb. It would depend on the type of traffic you have in your database. And it could be limited by a single CPU, or by IO. If the restore_command needs to restore the WAL from a remote server, it is very likely to be limited by the latency of doing that.
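[For readers following the thread: a minimal sketch of the archive_command/restore_command arrangement being discussed. The NFS paths here are hypothetical examples, not a recommendation.]

```shell
# postgresql.conf on the primary - copy each completed WAL segment to the
# (hypothetical) NFS-mounted archive, refusing to overwrite existing files
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
archive_timeout = 300   # force a segment switch at least every 5 minutes

# recovery.conf on the server being restored - fetch segments back
# from the same archive during replay
restore_command = 'cp /mnt/server/archivedir/%f %p'
```

The cp-based commands are the stock examples from the PostgreSQL documentation; in practice a production archive_command would need to ensure the copy actually reaches durable storage.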
> In fact, this is often the bottleneck even if it is restoring from the local server, at least if archival is often driven by archive_timeout.
>
> When I need to re-clone production to get a fresh server to use for dev or testing, I do so using almost exactly the same method I would use for restoring production from a disaster (restore from the most recent basebackup, then recover from the WAL archive). So I know how long the recovery takes based on real experience, and I take a new basebackup when that length of time starts to annoy me.

Gotcha. Guess I'll have to start running some tests :-)

> - What is the "best" (or just a good) method of keeping the WAL archives under control? Obviously when I do a new basebackup I can "clean up" any old files that said backup doesn't need,
>
> You have said you might be interested in doing PITR. So you want to delay the cleanup so as not to compromise that ability. You need to develop a policy on how far back you want to be able to do a PITR.
>
> but how do I know what those are?
>
> pg_archivecleanup -n /mnt/server/archiverdir 000000010000000000000010.00000020.backup

Ok, but where does that "000000010000000000000010.00000020.backup" come from? I mean, I can tell it's a WAL segment file name (plus a backup label), but I don't have anything like that in my WAL archives, even though I've run pg_basebackup a couple of times. Do I have to call something to create that file? Some flag to pg_basebackup? At the moment I am running pg_basebackup such that it generates gzipped tar files, if that makes a difference.

> - Should I be looking at any other backup methods in addition to/instead of the basebackup/WAL archive scheme?
>
> You may want to consider pg_receivexlog to maintain your WAL archive, rather than archive_command. That way you don't have to worry about the trade-offs caused by setting archive_timeout.
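[A sketch of the pg_archivecleanup invocation quoted earlier in the thread, against the same hypothetical archive directory. The tool removes archived WAL segments older than the named backup history file.]

```shell
# Dry run: -n only prints the segment files that would be removed,
# so you can sanity-check before deleting anything
pg_archivecleanup -n /mnt/server/archiverdir 000000010000000000000010.00000020.backup

# Same invocation without -n actually removes the old segments
pg_archivecleanup /mnt/server/archiverdir 000000010000000000000010.00000020.backup
```

If the PITR policy is, say, one week, you would name the backup history file from the oldest base backup still inside that window, not the most recent one.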
> But unless you use it with a replication slot, it is not very safe, as pg_receivexlog could stop working and your database would happily run along without protection. Also, it is hard to be sure you are reliably issuing fsyncs over NFS, so with archive_command over NFS there is always the risk that your WAL data is not actually reaching disk in a timely fashion. So if you can run pg_receivexlog on the NFS-host machine, pointed at local storage rather than looping back over NFS, that is safer.

Thanks, I'll look into it, along with the other suggested tools. Perhaps they will solve all my confusion :)

> Cheers,
>
> Jeff
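[Appendix for readers: a rough sketch of the pg_receivexlog-plus-slot setup Jeff describes, run on the archive host itself. The host name, user, slot name, and directory are made-up examples.]

```shell
# Create a physical replication slot on the primary, so it retains WAL
# if the receiver process dies instead of silently recycling segments
psql -h db-primary -U postgres \
     -c "SELECT pg_create_physical_replication_slot('archive_slot');"

# Stream WAL continuously from the primary into local storage on the
# archive host (not looped back over NFS), attached to that slot
pg_receivexlog -h db-primary -U replication_user \
    -S archive_slot -D /var/lib/pgsql/wal_archive
```

A retained-WAL slot cuts both ways: if pg_receivexlog stays down for long, the primary's pg_xlog directory will grow until the slot is dropped, so the receiver needs monitoring either way.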