On Nov 25, 2016, at 1:00 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
> On Wed, Nov 23, 2016 at 10:16 AM, Israel Brewster <isr...@ravnalaska.net <mailto:isr...@ravnalaska.net>> wrote:
>
> I was wondering if someone could inform me about, or point me to an online article about, best practices for backing up a PostgreSQL database cluster? At the moment, I have two servers running with streaming replication for failover purposes, and then I do nightly database dumps for recovery/backup purposes. However, I feel this approach is lacking in a couple of areas I can think of:
>
> - If a total failure occurs towards the end of the day, we could potentially lose a whole day's worth of data.
>
> Why wouldn't the streaming replica salvage that? Are they expected to fail together? Is the NFS share onto which you want to store your basebackup and WAL also expected to fail together with them?
That's why I specified *total* failure. If only the primary dies, that's a simple cutover to the secondary, and not a total failure - no need to go to backups of any sort under that scenario :-) I'm thinking more along the lines of the server room burning down, a natural disaster, etc. - something that causes a total failure of the DB system, necessitating recovery from an offsite backup.

> Similar argument for user error - there is no way to "undo" a catastrophic user data error without going all the way back to the previous day.
>
> - Less-than-ideal recovery under some scenarios. Since each database in the cluster is dumped individually, this is good should only *one* database need to be restored, but could get tedious should the entire cluster need to be restored.
>
> To mitigate these issues, I am thinking of supplementing the individual dumps with a full base backup and WAL archiving to an NFS share. This should enable (relatively) quick/easy recovery from backup, plus the ability to do Point-In-Time Recovery (PITR). I do have a few questions with this approach, however:
>
> - How do I figure out how often I should take a full base backup? I know this will depend on the amount of traffic my database is doing, and how long I am willing to wait for WAL files to be replayed - the more WAL files needing to be replayed, the longer recovery will take - but is there some rule of thumb I can use to calculate how often I need a new base backup? Perhaps based on the number of WAL files?
>
> You have to try it and see. Different types of WAL records will take different amounts of time to replay, so there is no rule of thumb. It would depend on the type of traffic you have in your database. And it could be limited by a single CPU, or by IO. If the restore_command needs to restore the WAL from a remote server, it is very likely to be limited by the latency of doing that.
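[For readers following the thread: a minimal sketch of the archive_command/restore_command arrangement being discussed. The NFS paths here are hypothetical examples, not a recommendation.]

```shell
# postgresql.conf on the primary - copy each completed WAL segment to the
# (hypothetical) NFS-mounted archive, refusing to overwrite existing files
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
archive_timeout = 300   # force a segment switch at least every 5 minutes

# recovery.conf on the server being restored - fetch segments back
# from the same archive during replay
restore_command = 'cp /mnt/server/archivedir/%f %p'
```

The cp-based commands are the stock examples from the PostgreSQL documentation; in practice a production archive_command would need to ensure the copy actually reaches durable storage.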
> In fact, this is often the bottleneck even if it is restoring from the local server, at least if archival is often driven by archive_timeout.
>
> When I need to re-clone production to get a fresh server to use for dev or testing, I do so using almost exactly the same method I would use for restoring production from a disaster (restore from the most recent basebackup, then recover from the WAL archive). So I know how long the recovery takes based on real experience, and I take a new basebackup when that length of time starts to annoy me.

Gotcha. Guess I'll have to start running some tests :-)

> - What is the "best" (or just a good) method of keeping the WAL archives under control? Obviously when I do a new basebackup I can "clean up" any old files that said backup doesn't need,
>
> You have said you might be interested in doing PITR. So you want to delay the cleanup so as not to compromise that ability. You need to develop a policy on how far back you want to be able to do a PITR.
>
> but how do I know what those are?
>
> pg_archivecleanup -n /mnt/server/archiverdir 000000010000000000000010.00000020.backup

Ok, but where does that "000000010000000000000010.00000020.backup" come from? I mean, I can tell it's a WAL segment file name (plus a backup label), but I don't have anything like that in my WAL archives, even though I've run pg_basebackup a couple of times. Do I have to call something to create that file? Some flag to pg_basebackup? At the moment I am running pg_basebackup such that it generates gzipped tar files, if that makes a difference.

> - Should I be looking at any other backup methods in addition to/instead of the basebackup/WAL archive scheme?
>
> You may want to consider pg_receivexlog to maintain your WAL archive, rather than archive_command. That way you don't have to worry about the trade-offs caused by setting archive_timeout.
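[A sketch of the pg_archivecleanup invocation quoted earlier in the thread, against the same hypothetical archive directory. The tool removes archived WAL segments older than the named backup history file.]

```shell
# Dry run: -n only prints the segment files that would be removed,
# so you can sanity-check before deleting anything
pg_archivecleanup -n /mnt/server/archiverdir 000000010000000000000010.00000020.backup

# Same invocation without -n actually removes the old segments
pg_archivecleanup /mnt/server/archiverdir 000000010000000000000010.00000020.backup
```

If the PITR policy is, say, one week, you would name the backup history file from the oldest base backup still inside that window, not the most recent one.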
> But unless you use it with a replication slot, it is not very safe, as pg_receivexlog could stop working and your database would happily run along without protection. Also, it is hard to be sure you are reliably issuing fsyncs over NFS, so with archive_command over NFS there is always the risk that your WAL data is not actually reaching disk in a timely fashion. So if you can run pg_receivexlog on the NFS-host machine, pointed at local storage rather than looping back over NFS, that is safer.

Thanks, I'll look into it, along with the other suggested tools. Perhaps they will solve all my confusion :)

> Cheers,
>
> Jeff
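[Appendix for readers: a rough sketch of the pg_receivexlog-plus-slot setup Jeff describes, run on the archive host itself. The host name, user, slot name, and directory are made-up examples.]

```shell
# Create a physical replication slot on the primary, so it retains WAL
# if the receiver process dies instead of silently recycling segments
psql -h db-primary -U postgres \
     -c "SELECT pg_create_physical_replication_slot('archive_slot');"

# Stream WAL continuously from the primary into local storage on the
# archive host (not looped back over NFS), attached to that slot
pg_receivexlog -h db-primary -U replication_user \
    -S archive_slot -D /var/lib/pgsql/wal_archive
```

A retained-WAL slot cuts both ways: if pg_receivexlog stays down for long, the primary's pg_xlog directory will grow until the slot is dropped, so the receiver needs monitoring either way.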