Re: [Bacula-users] Large backup to tape?
In the message dated: Thu, 08 Mar 2012 09:38:33 PST,
The pithy ruminations from Erich Weiler on Re: [Bacula-users] Large backup to tape? were:

= Thanks for the suggestions!
=
= We have a couple more questions that I hope have easy answers. So, it's
= been strongly suggested by several folks now that we back up our 200TB
= of data in smaller chunks. This is our structure:
=
= We have our 200TB in one directory. From there we have about 10,000
= subdirectories that each have two files in it, ranging in size between
= 50GB and 300GB (an estimate). All of those 10,000 directories add up
= to about 200TB. It will grow to 3 or so petabytes in size over the next
= few years.

Hmmm...maybe I'm misunderstanding your data structure. Let me do a little math:

	10,000 directories * 2 files each = 20,000 files
	200TB / 20,000 files = 10GB/file
	3PB (projected size) / 10GB (per file) =~ 314,572 files
	314,572 files / 2 (files per directory) =~ 157,286 directories

What filesystem are you using? Many filesystems have serious performance
problems when there are 10K objects (files, subdirectories) in a directory.

I am absolutely not a DBA, but I'd take a close look at the database you are
using for bacula, and open a discussion with the developers regarding tables,
indices, and performance with such a wide, shallow directory tree.

= Does anyone have an idea of how to break that up logically within
= bacula, such that we could just do a bunch of smaller Full backups of

Sure. I'm assuming that your directories have some kind of logical naming
convention:

	/data/AAA000123
	/data/AAA000124
	:
	:
	/data/ZZZ99

In that case, you can create multiple logical filesets within bacula, for
example:

	/data/AAA0[0-4]*
	/data/AAA0[5-9]*

As each of these filesets would be backed up from the same fileserver (unless
you're using a clustered filesystem), you'd want to restrict backup
concurrency to avoid running too many jobs at once.
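[Ed.: put concretely, each name range might become its own FileSet resource. This is an untested sketch against the Bacula 5.x Director configuration; the FileSet name and path are hypothetical, and the two-Options-block include/exclude idiom should be verified with bconsole's "estimate" command before relying on it.]

```
FileSet {
  Name = "Data-AAA0-0to4"
  Include {
    Options {
      # keep directories in this slice of the namespace
      WildDir = "/data/AAA0[0-4]*"
    }
    Options {
      # exclude every directory the first Options block did not match
      Exclude = yes
      WildDir = "/data/*"
    }
    File = /data
  }
}
```

Concurrency can then be restricted with the "Maximum Concurrent Jobs" directive on the Client, Storage, or Director resource.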
You could figure the size of each fileset from:

	subdirectories per fileset =
		acceptable backup window (in hours) * backup rate (GB/hr) / 20GB

(using the earlier calculation of an average 20GB per subdirectory), then
divide the total number of subdirectories by that to get the number of
filesets.

The regular expression used to determine the filesets would need to be based
on both the current subdirectory names and the subdirectories to be added in
the future, in order to keep the fileset sizes balanced. In other words, if
the new data subdirectories will all be in the range AAABBB*, then you'll
need to do something to split that range into reasonably sized chunks, not
one 3PB chunk.

= smaller chunks of the data? The data will never change, and will just
= be added to. As in, we will be adding more subdirectories with 2 files
= in them to the main directory, but will never delete or change any of
= the old data.
=
= Is there a way to tell bacula to back up all this, but do it in small
= 6TB chunks or something? So we would avoid the massive 200TB single
= backup job + hundreds of (eventual) small incrementals? Or some other idea?

You could use bacula's mechanism to generate filesets dynamically from an
external program. The external program could use a knapsack algorithm to
take all the directories and divide them into sets, with each set sized to
meet your acceptable backup window. The algorithm would need to be 'stable',
so that directory AAA000123 is placed with the same other subdirectories
each time.

See the description of a file-list in:

	http://www.bacula.org/5.2.x-manuals/en/main/main/Configuring_Director.html#SECTION00237

Mark

= Thanks again for all the feedback! Please reply-all to this email
= when replying.
=
= -erich
=
= On 3/1/12 10:18 AM, mark.berg...@uphs.upenn.edu wrote:
= In the message dated: Wed, 29 Feb 2012 20:23:14 PST,
= The pithy ruminations from Erich Weiler on
= [Bacula-users] Large backup to tape? were:
= = Hey Y'all,
= =
= = So I have a Dell ML6010 tape library that holds 41 LTO-5 tapes, all
=
= I've got a Dell ML6010, so I can offer some specific suggestions.
=
= [SNIP!]
= =
= = The fileset I'm backing up is about 200TB large total (each file is
= = about 300GB big). So, not only will it use every tape in the tape
= = library (41 tapes), but we'll have to refill the tape library about 6
= = times to get the whole thing backed up. After that I want to just do
=
= I agree with the other suggestions to break up the dataset into smaller
= chunks.
=
= [SNIP!]
=
= = So, I guess I have a couple basic questions. When it uses all the tapes
= = in the library in a single job (200TB! 41 tapes only hold 60TB), will it
=
= It'll depend a lot on the compressibility of your data.
=
= = simply pause, send me an email saying it's waiting for new media, then I
= = load 41 new tapes? Then tell it to resume, and it uses the next 41, ad
= = nauseam?
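[Ed.: the stable, externally generated file-list Mark describes above might be sketched in shell as follows. The bucket count and /data paths are assumptions; a real version would weigh buckets by directory size, knapsack-style, rather than splitting by count alone.]

```shell
#!/bin/sh
# Stable division of subdirectories into N filesets: each directory's
# basename is hashed with cksum (a deterministic POSIX CRC), so a given
# directory always lands in the same bucket on every run, no matter how
# many new directories appear later.
NBUCKETS=${NBUCKETS:-32}    # number of filesets - an assumption

bucket_of() {
    crc=$(printf '%s' "$(basename "$1")" | cksum | cut -d' ' -f1)
    echo $(( crc % NBUCKETS ))
}

# Print the directories that belong to fileset number $1
list_bucket() {
    want=$1
    shift
    for d in "$@"; do
        [ "$(bucket_of "$d")" -eq "$want" ] && echo "$d"
    done
    return 0
}
```

Bacula can consume such a program's output as a file-list via the program form of the File directive in the Include section (see the file-list documentation URL above); since cksum's CRC is specified by POSIX, the assignment is stable across runs and hosts.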
Re: [Bacula-users] Large backup to tape?
On 3/8/12 9:38 AM, Erich Weiler wrote:
> Thanks for the suggestions!
>
> We have a couple more questions that I hope have easy answers. So, it's
> been strongly suggested by several folks now that we back up our 200TB
> of data in smaller chunks. This is our structure:
>
> We have our 200TB in one directory. From there we have about 10,000
> subdirectories that each have two files in it, ranging in size between
> 50GB and 300GB (an estimate). All of those 10,000 directories add up
> to about 200TB. It will grow to 3 or so petabytes in size over the next
> few years.
>
> Does anyone have an idea of how to break that up logically within
> bacula, such that we could just do a bunch of smaller Full backups of
> smaller chunks of the data? The data will never change, and will just
> be added to. As in, we will be adding more subdirectories with 2 files
> in them to the main directory, but will never delete or change any of
> the old data.
>
> Is there a way to tell bacula to back up all this, but do it in small
> 6TB chunks or something? So we would avoid the massive 200TB single
> backup job + hundreds of (eventual) small incrementals? Or some other idea?
>
> Thanks again for all the feedback! Please reply-all to this email
> when replying.
>
> -erich

Assuming the subdirectory names are somewhat reasonably spread through the
alpha space, can you do something like:

FileSet {
  Name = A
  Include {
    File = /pathname/to/backup
    Options {
      Wild = [Aa]*
    }
  }
}

...

FileSet {
  Name = Z
  Include {
    File = /pathname/to/backup
    Options {
      Wild = [Zz]*
    }
  }
}

Then specify separate Jobs for each FileSet.

To break things up more you might need to split on the second or later
characters rather than the first one, and you'd also need FileSets covering
any directories starting with non-alpha characters. Certainly this could be
somewhat annoying to make sure you are covering all of your directories,
especially if the namespace is populated very lopsidedly, but I believe it
would work.
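[Ed.: one quick way to find directories that none of the per-letter FileSets above would match is a shell loop like this; /pathname/to/backup is a placeholder from the example.]

```shell
#!/bin/sh
# Flag any subdirectory whose name would not be matched by the
# [Aa]* .. [Zz]* wildcards; anything printed needs its own FileSet.
check_coverage() {
    for d in "$1"/*/; do
        name=$(basename "$d")
        case "$name" in
            [A-Za-z]*) ;;                 # covered by a per-letter FileSet
            *) echo "uncovered: $name" ;;
        esac
    done
    return 0
}
```

Running this periodically (e.g. from cron) would catch newly added directories that fall outside the existing FileSet patterns.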
Note that I have not tried this approach, but it does seem feasible. I hope
you are using a filesystem that behaves well with so many subdirectories
under one parent (for example, ext3 without dir_index would likely do
somewhat poorly).

-se

--
Virtualization Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing also
focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
Re: [Bacula-users] Large backup to tape?
On Thu, Mar 08, 2012 at 09:38:33AM -0800, Erich Weiler wrote:
> We have our 200TB in one directory. From there we have about 10,000
> subdirectories that each have two files in it, ranging in size between
> 50GB and 300GB (an estimate). All of those 10,000 directories add up
> to about 200TB. It will grow to 3 or so petabytes in size over the next
> few years.

As has already been said, on most filesystems it is not a good idea to have
10k items and up in a single directory. You might want to hash it on e.g.
directory names:

	ABC789 -> A/B/C/7/8/9

Depending on your directory structure this might already be a good
preselection for filesets.

The reason for my mail is another point, though: depending on your data you
might want to make sure it all has the same timestamp. You could do this by
using LVM and snapshots. By mounting and backing up the snapshots, you make
sure no files are modified while backing up.

Regards,
	Adrian
--
LiHAS - Adrian Reyer - Hessenwiesenstraße 10 - D-70565 Stuttgart
Fon: +49 (7 11) 78 28 50 90 - Fax: +49 (7 11) 78 28 50 91
Mail: li...@lihas.de - Web: http://lihas.de
Linux, Netzwerke, Consulting Support - USt-ID: DE 227 816 626 Stuttgart
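[Ed.: the per-character hashing Adrian suggests can be sketched in shell; the flat-to-nested naming is an assumption based on his ABC789 example, and in practice splitting on only the first few characters may be enough.]

```shell
#!/bin/sh
# Turn a flat directory name like ABC789 into a nested path A/B/C/7/8/9,
# so no single directory ends up with tens of thousands of entries.
hashpath() {
    # insert a / after every character, then drop the trailing /
    printf '%s\n' "$1" | sed -e 's|.|&/|g' -e 's|/$||'
}
```

A new data directory would then be created with something like mkdir -p "$(hashpath ABC789)" before its two files are written.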
Re: [Bacula-users] Large backup to tape?
> Thanks for the suggestions!
>
> We have a couple more questions that I hope have easy answers. So, it's
> been strongly suggested by several folks now that we back up our 200TB
> of data in smaller chunks. This is our structure:
>
> We have our 200TB in one directory. From there we have about 10,000
> subdirectories that each have two files in it, ranging in size between
> 50GB and 300GB (an estimate). All of those 10,000 directories add up
> to about 200TB. It will grow to 3 or so petabytes in size over the next
> few years.
>
> Does anyone have an idea of how to break that up logically within
> bacula, such that we could just do a bunch of smaller Full backups of
> smaller chunks of the data? The data will never change, and will just
> be added to. As in, we will be adding more subdirectories with 2 files
> in them to the main directory, but will never delete or change any of
> the old data.
>
> Is there a way to tell bacula to back up all this, but do it in small
> 6TB chunks or something? So we would avoid the massive 200TB single
> backup job + hundreds of (eventual) small incrementals? Or some other idea?
>
> Thanks again for all the feedback! Please reply-all to this email
> when replying.

I was in a similar situation... I had a directory that was only ever
appended to. It was only around 10GB, but the backup was over a 512kbps
link, so the initial 10GB couldn't be reliably backed up in one hit during
the after-hours window I had available.

So what I did was create a fileset that included around 5% of the files
(eg aa*-an*), then progressively changed that fileset to include more and
more files each backup. The important thing here is to use the
IgnoreFileSetChanges = yes flag in the fileset so Bacula doesn't want to do
a full backup every time the fileset changes. This is all using incremental
backups. Once I had it all backed up, it just backed up the overnight
changes and everything was good.
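[Ed.: a sketch of what such a progressively widened fileset might look like as a Director resource. The name, path, and patterns are hypothetical, and the include/exclude Options idiom is untested; the point is only where IgnoreFileSetChanges sits.]

```
FileSet {
  Name = "BigArchive"
  # Without this, Bacula wants a new Full backup every time
  # the fileset definition below is edited.
  IgnoreFileSetChanges = yes
  Include {
    Options {
      # First pass: roughly the first 5% of the namespace;
      # widen this pattern between runs (a[a-n]* -> a* -> [a-e]* ...).
      WildDir = "/data/a[a-n]*"
    }
    Options {
      Exclude = yes
      WildDir = "/data/*"
    }
    File = /data
  }
}
```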
My situation was different though, in that I was doing a weekly virtual full
to consolidate the backups into one volume, which is harder to do with 2PB
of data. But if your challenge is getting the initial data backed up in
pieces smaller than a single 200TB chunk, then you can do it by manipulating
the fileset, as long as you have IgnoreFileSetChanges = yes. I don't think
you would need accurate=yes to do the above either.

James
Re: [Bacula-users] Large backup to tape?
On 01/03/12 04:23, Erich Weiler wrote:
> The fileset I'm backing up is about 200TB large total (each file is
> about 300GB big). So, not only will it use every tape in the tape
> library (41 tapes), but we'll have to refill the tape library about 6
> times to get the whole thing backed up.

And if _anything_ goes wrong you'll have to start over - for a run time of
around 21 days assuming best speed. Break the job up into smaller sets. We
try to use maximum units of 1TB.

> After that I want to just do incrementals against the initial Full
> Backup; the files will never change, they will just be added to as time
> moves on.

You will still want periodic fulls or synthetic fulls. Only having 1 backup
is asking for trouble, especially if there are hundreds of subsequent
incrementals to spool through too. That's why GFS (grandfather-father-son)
backup strategies were developed - there are always at least 2 full backups
available in case one set goes bad.

> So, I guess I have a couple basic questions. When it uses all the tapes
> in the library in a single job (200TB! 41 tapes only hold 60TB), will it
> simply pause, send me an email saying it's waiting for new media, then I
> load 41 new tapes? Then tell it to resume, and it uses the next 41, ad
> nauseam?

Yes.

> And, if I want to make 2 copies of the tapes, can I simply configure 2
> differently named jobs that each back up the same fileset?

Yes. If you have multiple tape drives you can also copy between tapes.

> Also, do I need to manually label the tapes (electronically) as I load
> them, or will the fact that the autoloader automatically reads the new
> barcodes be enough?

Look up the "label barcodes" directive.
Re: [Bacula-users] Large backup to tape?
In the message dated: Wed, 29 Feb 2012 20:23:14 PST,
The pithy ruminations from Erich Weiler on [Bacula-users] Large backup to tape? were:

= Hey Y'all,
=
= So I have a Dell ML6010 tape library that holds 41 LTO-5 tapes, all

I've got a Dell ML6010, so I can offer some specific suggestions.

[SNIP!]

= The fileset I'm backing up is about 200TB large total (each file is
= about 300GB big). So, not only will it use every tape in the tape
= library (41 tapes), but we'll have to refill the tape library about 6
= times to get the whole thing backed up. After that I want to just do

I agree with the other suggestions to break up the dataset into smaller
chunks.

[SNIP!]

= So, I guess I have a couple basic questions. When it uses all the tapes
= in the library in a single job (200TB! 41 tapes only hold 60TB), will it

It'll depend a lot on the compressibility of your data.

= simply pause, send me an email saying it's waiting for new media, then I
= load 41 new tapes? Then tell it to resume, and it uses the next 41, ad
= nauseam?

Yes, sort of. You'll get lots of mail from bacula about needing to change
tapes.

In my experience, changing tapes in the library while a backup is running
must be done very carefully. I suggest that you not use the native ML6010
tools (touch pad on the library or web interface) to move tapes to-and-from
the mailbox slots.
Our procedure is:

	1. use mtx to transfer full tapes from library slots to the mailbox slots
	2. remove the full tapes from the mailbox slots
	3. add new tapes to the mailbox slots
	4. allow the library to scan the new tapes, then choose to add them to
	   partition 1 (or whatever you have named your non-system partition
	   within the library)
	5. use mtx to transfer the new tapes from the mailbox slots to
	   available slots in the library
	6. when complete, run "update slots" from within the Bacula 'bconsole'
	7. if the tapes have never been used within Bacula before, run
	   "label barcodes" from within 'bconsole'

= And, if I want to make 2 copies of the tapes, can I simply configure 2
= differently named jobs that each backup the same fileset?
=
= Also, do I need to manually label the tapes (electronically) as I load
= them, or will the fact that the autoloader automatically reads the new
= barcodes be enough?

You will need to logically label the tapes (writing a Bacula header to each
tape). This can be done automatically with "label barcodes".

= Thanks for any hints. And, if you know any gotchas I should watch for
= during this process, please let me know! I don't want bacula expiring
= the tapes ever, or re-using them, as the data will never change and we
= need to keep it forever.

Set the file/volume/job retention times to something really long. For us,
10 years =~ infinite, under the theory that after 10 years we'll have moved
to different tape hardware and the old data will need to be transferred to
the new media somehow.

Make a backup of the Bacula database as soon as the backup is complete.
Save that to both a backup tape and to some other media (external hard
drive? multiple Blu-ray discs? punch cards?) so that you can recover data
if there's ever a problem with the database--you do NOT want to be in a
position of needing to bscan ~100x LTO5 tapes in order to rebuild the
database.
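[Ed.: the mtx/bconsole half of the procedure above might be scripted roughly as follows. The changer device and slot numbers are assumptions - check "mtx status" for your library - and the script is a dry run by default, printing the commands instead of executing them.]

```shell
#!/bin/sh
# Dry-run sketch of swapping tapes via mtx and bconsole.  /dev/sg3 and
# the slot/mailbox numbering are assumptions for this library.
CHANGER=${CHANGER:-/dev/sg3}
RUN=${RUN-echo}     # set RUN= (empty) to actually execute the commands

swap_tapes() {
    # move a full tape out to a mailbox slot, e.g. swap_tapes 12 41
    src=$1 mailbox=$2
    $RUN mtx -f "$CHANGER" transfer "$src" "$mailbox"
    # ... operator removes the full tape, inserts a fresh one ...
    $RUN mtx -f "$CHANGER" transfer "$mailbox" "$src"
    # tell Bacula the slot contents changed, and label any new tapes
    printf 'update slots\nlabel barcodes\n' | $RUN bconsole
}
```

Moving tapes only through the mailbox slots with mtx, rather than the library's own touch pad or web interface, matches the caution above about changing tapes while a backup is running.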
Mark

=
= Many thanks,
= erich
=
[Bacula-users] Large backup to tape?
Hey Y'all,

So I have a Dell ML6010 tape library that holds 41 LTO-5 tapes, all barcoded
with labels from the manufacturer. I want to use bacula to back up our
dataset; I've not used bacula before and am sorting through the
documentation to get things set up. I have a few questions I was hoping
someone could answer!

The fileset I'm backing up is about 200TB large total (each file is about
300GB big). So, not only will it use every tape in the tape library (41
tapes), but we'll have to refill the tape library about 6 times to get the
whole thing backed up. After that I want to just do incrementals against the
initial Full Backup; the files will never change, they will just be added to
as time moves on. So this is kind of like an archival solution.

I can control the autoloader with the mtx-changer script just fine; it
totally works as expected.

So, I guess I have a couple basic questions. When it uses all the tapes in
the library in a single job (200TB! 41 tapes only hold 60TB), will it simply
pause, send me an email saying it's waiting for new media, then I load 41
new tapes? Then tell it to resume, and it uses the next 41, ad nauseam?

And, if I want to make 2 copies of the tapes, can I simply configure 2
differently named jobs that each back up the same fileset?

Also, do I need to manually label the tapes (electronically) as I load them,
or will the fact that the autoloader automatically reads the new barcodes be
enough?

Thanks for any hints. And, if you know any gotchas I should watch for during
this process, please let me know! I don't want bacula expiring the tapes
ever, or re-using them, as the data will never change and we need to keep it
forever.

Many thanks,
erich