Re: [GENERAL] Backup advice
On Mon, 15 Apr 2013 19:54:15 -0700 Jeff Janes wrote:

> On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras wrote:
> > On Mon, 8 Apr 2013 10:40:16 -0500 Shaun Thomas wrote:
> > >
> > > Anyone else?
> >
> > If his db has low inserts/updates/deletes he can use diff between
> > pg_dumps (with default -Fp) before compressing.
>
> Most "diff" implementations will read the entirety of both files into
> memory, so may not work well with 200GB of data, unless it is broken into
> a large number of much smaller files.
>
> open-vcdiff only reads one of the files into memory, but I couldn't
> really figure out what happens memory-wise when you try to undo the
> resulting patch; the documentation is a bit mysterious.
>
> xdelta3 will "work" on streamed files of unlimited size, but it doesn't
> work very well unless the files fit in memory, or have the analogous data
> in the same order between the two files.

I use this for my 12-13 GB dump files:

git diff -p 1.sql 2.sql > diff.patch

It uses 4MB for the first phase and up to 140MB on the last one, and makes a patch file from which the second dump can be recovered with:

patch -o 2.sql 1.sql < diff.patch

or by using git apply.

> A while ago I made some attempts to "co-compress" dump files, based on
> the notion that the pg_dump text format does not have \n within records,
> so it is sortable as ordinary text, and that tables usually have their
> "stable" columns, like a pk, near the beginning of the table and volatile
> columns near the end, so that sorting the lines of several dump files
> together will gather replicate or near-replicate lines together, where
> ordinary compression algorithms can work their magic. So if you tag each
> line with its line number and which file it originally came from, then
> sort the lines (skipping the tag), you get much better compression. But
> not nearly as good as open-vcdiff, assuming you have the RAM to spare.
> Using two dumps taken months apart on a slowly-changing database, it
> worked fairly well:
>
> cat 1.sql | pigz | wc -c
> 329833147
>
> cat 2.sql | pigz | wc -c
> 353716759
>
> cat 1.sql 2.sql | pigz | wc -c
> 683548147
>
> sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz | wc -c
> 436350774
>
> A certain file could be recovered by, for example:
>
> zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+b\t)// and print' > 2.sql2

Be careful: some z* utilities (zdiff, for example) decompress the whole file to /tmp.

> There are all kinds of shortcomings here, of course; it was just a quick
> and dirty proof of concept.

A nice one!

> For now I think storage is cheap enough for what I need to do to make
> this not worth fleshing it out any more.
>
> Cheers,
>
> Jeff

---
Eduardo Morras

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
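Eduardo's git-diff trick can also be done with the standard diff/patch pair. The following is an editor's sketch of the round trip on toy stand-in files (not the 12-13 GB dumps from the thread); it uses a plain unified diff rather than git's format, and patch's -o flag to write the reconstructed file to a new name instead of patching in place:

```shell
#!/bin/sh
# Editor's sketch: delta-encode one dump against another with plain
# diff/patch, then reconstruct the second dump from the first plus the patch.
printf 'CREATE TABLE t;\n1\talice\n2\tbob\n' > 1.sql
printf 'CREATE TABLE t;\n1\talice\n2\tbob\n3\tcarol\n' > 2.sql

# diff exits with status 1 when the files differ, so don't let that
# abort the script.
diff -u 1.sql 2.sql > diff.patch || true

# Rebuild 2.sql from 1.sql plus the (typically much smaller) patch.
patch -s -o 2.recovered.sql 1.sql < diff.patch

cmp 2.sql 2.recovered.sql && echo "round trip OK"
```

The same caveat Jeff raises applies: classic diff wants both files in memory, so this works comfortably only when the dumps fit in RAM.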
[GENERAL] Backup advice
On Tue, Apr 9, 2013 at 3:05 AM, Eduardo Morras wrote:

> On Mon, 8 Apr 2013 10:40:16 -0500 Shaun Thomas wrote:
> >
> > Anyone else?
>
> If his db has low inserts/updates/deletes he can use diff between
> pg_dumps (with default -Fp) before compressing.

Most "diff" implementations will read the entirety of both files into memory, so may not work well with 200GB of data, unless it is broken into a large number of much smaller files.

open-vcdiff only reads one of the files into memory, but I couldn't really figure out what happens memory-wise when you try to undo the resulting patch; the documentation is a bit mysterious.

xdelta3 will "work" on streamed files of unlimited size, but it doesn't work very well unless the files fit in memory, or have the analogous data in the same order between the two files.

A while ago I made some attempts to "co-compress" dump files, based on the notion that the pg_dump text format does not have \n within records, so it is sortable as ordinary text, and that tables usually have their "stable" columns, like a pk, near the beginning of the table and volatile columns near the end, so that sorting the lines of several dump files together will gather replicate or near-replicate lines together, where ordinary compression algorithms can work their magic. So if you tag each line with its line number and which file it originally came from, then sort the lines (skipping the tag), you get much better compression. But not nearly as good as open-vcdiff, assuming you have the RAM to spare.
Using two dumps taken months apart on a slowly-changing database, it worked fairly well:

cat 1.sql | pigz | wc -c
329833147

cat 2.sql | pigz | wc -c
353716759

cat 1.sql 2.sql | pigz | wc -c
683548147

sort -k2 <(perl -lne 'print "${.}a\t$_"' 1.sql) <(perl -lne 'print "${.}b\t$_"' 2.sql) | pigz | wc -c
436350774

A certain file could be recovered by, for example:

zcat group_compressed.gz | sort -n | perl -lne 's/^(\d+b\t)// and print' > 2.sql2

There are all kinds of shortcomings here, of course; it was just a quick and dirty proof of concept. For now I think storage is cheap enough for what I need to do to make this not worth fleshing it out any more.

Cheers,

Jeff
Re: [GENERAL] Backup advice
Hi Johann,

On Wed, 10 Apr 2013 09:58:05 +0200, Johann Spies wrote:

> I can specify how many versions of the files should be kept on Tivoli.

Another option you can evaluate is Barman (www.pgbarman.org), which gives you backup catalogues, retention policies and archiving. We use it in some contexts with Tivoli (currently only through file system backup, but I guess that's what you do anyway). However, the approach is totally different: it is based on physical backups and continuous archiving, allowing you to perform point-in-time recovery as well. Maybe it is worth evaluating.

Cheers,
Gabriele

--
Gabriele Bartolini - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gabriele.bartol...@2ndquadrant.it - www.2ndQuadrant.it
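For orientation, a minimal Barman server definition looks roughly like the sketch below. This is an editor's sketch, not taken from the thread: the host name, paths and retention window are hypothetical, and the option names should be checked against the Barman documentation for your version:

```ini
; /etc/barman.conf -- illustrative sketch only
[barman]
barman_home = /var/lib/barman
barman_user = barman
log_file = /var/log/barman/barman.log

; one section per managed PostgreSQL server
[research_db]
description = "Research database (~200GB)"
ssh_command = ssh postgres@dbserver        ; hypothetical host
conninfo = host=dbserver user=postgres
retention_policy = RECOVERY WINDOW OF 4 WEEKS
```

With such a setup, Barman keeps base backups plus the archived WAL needed to recover to any point inside the retention window, which is the point-in-time-recovery capability Gabriele mentions.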
Re: [GENERAL] Backup advice
Thanks everybody for your valuable comments.

I can specify how many versions of the files should be kept on Tivoli. The database will regularly get new data and there is a continuous process of data cleaning. It is a database mainly for research purposes and a few researchers are using it.

I will explore the options mentioned. After a first read it looks like continuing with pg_dump is not a bad idea - maybe with some optimization (like using diffs).

Regards
Johann

On 9 April 2013 12:05, Eduardo Morras wrote:

> On Mon, 8 Apr 2013 10:40:16 -0500 Shaun Thomas wrote:
> >
> > Anyone else?
>
> If his db has low inserts/updates/deletes he can use diff between
> pg_dumps (with default -Fp) before compressing.
>
> ---
> Eduardo Morras

--
Because experiencing your loyal love is better than life itself, my lips will praise you. (Psalm 63:3)
Re: [GENERAL] Backup advice
On Mon, 8 Apr 2013 10:40:16 -0500 Shaun Thomas wrote:

> Anyone else?

If his db has low inserts/updates/deletes he can use diff between pg_dumps (with default -Fp) before compressing.

---
Eduardo Morras
Re: [GENERAL] Backup advice
On Mon, Apr 8, 2013 at 6:14 AM, Johann Spies wrote:

> I would appreciate some advice from the experts on this list about the
> best backup strategy for my database.
>
> The setup:
>
> Size: might be about 200Gb
> The server uses a Tivoli backup client with daily backup
> At the moment there are pg_dumps for each database on the server on a
> daily, weekly and monthly basis. All this gets backed up to the Tivoli
> server.

Hi Johann,

This backup pattern means that after a year of operation you will have about 23 pg_dump files, 90+% of which have nothing to do with your requirement to restore the database to within the last 24 hours. In other words, your storage needs are being driven by your historical retention policy, not your immediate restoration policy.

So, can you elaborate on your historical retention policy? For example, if you need to restore your database to the state it was in 9 months ago, how fast do you need to be able to do that? If you had a 12-month-old pg_basebackup and 3 months of log files, how long would it take to replay those, and how big would that many log files be? (Both of those questions depend on your specific usage, so it is hard to make general guesses about them--they are questions that can only be answered empirically.)

pg_basebackup will almost surely be larger than pg_dumps. For one thing, it contains all the index data; for another, it contains any obsolete rows which have not yet been vacuumed and reused. So switching to that will save you space only if you need to keep fewer of them than you do of the pg_dumps.

Cheers,
Jeff
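The daily/weekly/monthly retention Jeff is reasoning about is typically enforced by a small pruning job. The sketch below is the editor's illustration, not anything from the thread: directory layout, file names and the 7-day window are invented, and the demo backdates a file with GNU touch -d so the pruning is visible:

```shell
#!/bin/sh
# Editor's sketch of a dump-retention policy: keep only the last 7 days
# of daily dumps. Paths and the retention window are illustrative.
BACKUP_DIR=./pg_backups_demo
mkdir -p "$BACKUP_DIR/daily"

# Simulate one fresh dump and one stale dump (backdated with GNU touch).
touch "$BACKUP_DIR/daily/fresh.sql.gz"
touch -d '30 days ago' "$BACKUP_DIR/daily/stale.sql.gz"

# Prune daily dumps whose modification time is more than 7 days old.
find "$BACKUP_DIR/daily" -name '*.sql.gz' -mtime +7 -delete

# Only fresh.sql.gz should remain.
ls "$BACKUP_DIR/daily"
```

Weekly and monthly directories would get the same treatment with larger -mtime thresholds; as Jeff notes, it is exactly these thresholds, not the restore-to-yesterday requirement, that determine how much backup storage you pay for.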
Re: [GENERAL] Backup advice
On 04/08/2013 08:14 AM, Johann Spies wrote:

> Size: might be about 200Gb
> The server uses a Tivoli backup client with daily backup
> At the moment There are pg_dumps for each database on the server on a
> daily, weekly and monthly basis. All this gets backed up to the Tivoli
> server.

Ok. So far, so good.

> I have read about using pg_basebackup in an article from Shaun Thomas'
> booklet on Packt Publishers (I will probably buy the booklet). That seems
> to be a possible solution.

Ok, with pg_basebackup, you'll get a binary backup of the actual data files involved in your database cluster. This will, in almost every case, be larger than pg_dump, and take about the same amount of time to produce. You also won't be able to keep the per-database scheme you're using in your Tivoli software, since pg_basebackup works on the entire install instead of each individual database.

One benefit, as Birta pointed out, is that you could use this backup as a base and apply WAL / transaction logs instead, and those are generally smaller if your database doesn't see a lot of daily overhead. Unfortunately, if you do a weekly base and need to recover far into the week, that can be a time-consuming process. Then again, so can restoring a pg_dump of a 200GB cluster, thanks to the index creation step.

One thing you might want to consider is that 9.3 will presumably have parallel pg_dump to complement parallel pg_restore. This would greatly reduce the amount of time a full dump->restore cycle requires.

Aside from that, you're already likely using the smallest backup you can have. Backing up a 600GB database takes about 120GB for us using tar, which is close to what pg_basebackup would give you in terms of size. A compressed pg_dump of the same data is around 50GB. Still, we use the binaries, because we need uptime more than size, and they are much faster to restore. We have daily binary backups going back over four years, because the backups are probably the most important thing in the company.
Lose those under bad circumstances, and we'd literally be out of business.

If size really is that important, you might want to check what kind of compression is going on with Tivoli. If you can produce a pg_dump and compress it with bzip2, or with a utility such as lrzip that makes use of better compression algorithms like LZMA, you may be able to back up much smaller files than your current process produces. I'll warn you, though: even parallel compression methods like lbzip2 are much slower than something like parallel gzip (pigz). You'll get a 20-30% smaller file at the cost of a 4-8x slower backup process. Since your data is likely to grow from its current size of 200GB, it's something to consider.

> Anyone else?

I'm so used to using basic utilities, I know I haven't covered things like deduplicating backup solutions. It may be that Tivoli isn't right for this, but I'm not familiar with that software.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com

__
See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
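The size-vs-speed tradeoff Shaun describes is easy to measure on your own data. An editor's sketch on a synthetic "dump" follows; plain gzip and bzip2 are used so it runs anywhere, but the comparison transfers directly to their parallel drop-ins pigz and lbzip2:

```shell
#!/bin/sh
# Editor's sketch: compare compressed sizes of the same text file under
# gzip and bzip2. The input is a synthetic stand-in for a pg_dump file.
seq 1 20000 | awk '{print $1 "\tuser" $1 "\tsome stable column text\t" $1 % 7}' > sample.sql

gzip  -9 -c sample.sql > sample.sql.gz
bzip2 -9 -c sample.sql > sample.sql.bz2

# Print the raw and compressed byte counts side by side.
wc -c sample.sql sample.sql.gz sample.sql.bz2
```

On real dumps, wrapping each compressor in time(1) gives the other half of the tradeoff; expect the bzip2 family to be markedly slower, per Shaun's 4-8x figure.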
Re: [GENERAL] Backup advice
2013/4/8 Johann Spies:

> I would appreciate some advice from the experts on this list about the
> best backup strategy for my database.

(...)

> I have read about using pg_basebackup in an article from Shaun Thomas'
> booklet on Packt Publishers (I will probably buy the booklet).

Get the booklet, it's worth it.

Regards

Ian Barwick
Re: [GENERAL] Backup advice
On 08/04/2013 16:14, Johann Spies wrote:

> I would appreciate some advice from the experts on this list about the
> best backup strategy for my database.
>
> The setup:
>
> Size: might be about 200Gb
> The server uses a Tivoli backup client with daily backup
> At the moment There are pg_dumps for each database on the server on a
> daily, weekly and monthly basis. All this gets backed up to the Tivoli
> server.
>
> I would like to reduce the size of the backups as far as possible as we
> have to pay for space on the backup server, and I do not want any
> downtime on the database server. I do not want replication, but want to
> be able to restore the database as recent as possible (at least as it was
> within the past 24 hours) as quickly as possible.
>
> I have read about using pg_basebackup in an article from Shaun Thomas'
> booklet on Packt Publishers (I will probably buy the booklet). That seems
> to be a possible solution.
>
> I am considering dropping the pg_dumps in favour of the pg_basebackup
> method. Will that be wise?
>
> Regards
> Johann

You should read this:
http://www.postgresql.org/docs/9.2/static/continuous-archiving.html

And you have to decide which method fits best. I dropped the dump method because of the size of the dumps. The weekly basebackup and the WAL archives are really smaller than the dumps -- in my case. But with PITR you can only restore the whole cluster... you cannot restore just one database. And hey... with PITR you can restore the cluster to any specified moment or transaction, not only to the time of a dump or basebackup.

Levi
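The continuous-archiving setup Levi points to boils down to two pieces of configuration in the 9.2-era file layout. The snippet below is an editor's sketch for orientation only -- the archive path and target time are hypothetical, and the linked documentation covers the details:

```
# postgresql.conf -- ship every completed WAL segment to an archive
wal_level = archive
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'   # hypothetical archive path

# recovery.conf -- placed in a data directory restored from a base backup
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2013-04-08 12:00:00'    # optional: stop replay here
```

Omitting recovery_target_time replays all available WAL, i.e. recovers to the latest archived state; setting it gives the "any specified moment" behaviour Levi describes.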
[GENERAL] Backup advice
I would appreciate some advice from the experts on this list about the best backup strategy for my database. The setup: Size: might be about 200Gb The server uses a Tivoli backup client with daily backup At the moment There are pg_dumps for each database on the server on a daily, weekly and monthly basis. All this gets backed up to the Tivoli server. I would like to reduce the size of the backups as far as possible as we have to pay for space on the backup server and I do not want any downtime on the database server. I do not want replication, but want to be able to restore the database as recent as possible (at least as it was within the past 24 hours) as quickly as possible. I have read about using pg_basebackup in an article from Shaun Thomas' booklet on Packt Publishers* *(I will probably buy the booklet)*. *That seems to be a possible solution. I am considering dropping the pg_dumps in favour of the pg_basebackup method. Will that be wise? Regards Johann -- Because experiencing your loyal love is better than life itself, my lips will praise you. (Psalm 63:3)