Re: [Bacula-users] Large backup to tape?

2012-03-08 Thread mark . bergman
In the message dated: Thu, 08 Mar 2012 09:38:33 PST,
The pithy ruminations from Erich Weiler on 
Re: [Bacula-users] Large backup to tape? were:
= Thanks for the suggestions!
= 
= We have a couple more questions that I hope have easy answers.  So, it's 
= been strongly suggested by several folks now that we back up our 200TB 
= of data in smaller chunks.  This is our structure:
= 
= We have our 200TB in one directory.  From there we have about 10,000 
= subdirectories that each have two files in them, ranging in size between 
= 50GB and 300GB (an estimate).  All of those 10,000 directories add up 
= to about 200TB.  It will grow to 3 or so petabytes in size over the next 
= few years.


Hmmm...maybe I'm misunderstanding your data structure. Let me do a
little math:

10,000 directories * 2 files each = 20,000 files
200TB / 20,000 files = ~10GB/file (average)

3PB (projected size) / 10GB (per file) = ~314,572 files

314,572 files / 2 (files per directory) = ~157,286 directories

What filesystem are you using? Many filesystems have serious performance
problems when there are 10K objects (files, subdirectories) in a
directory.

I am absolutely not a DBA, but I'd take a close look at the database you
are using for bacula, and open a discussion with the developers regarding
tables, indices, and performance with such a wide & shallow directory
tree.

= 
= Does anyone have an idea of how to break that up logically within 
= bacula, such that we could just do a bunch of smaller Full backups of 


Sure. I'm assuming that your directories have some kind of logical
naming convention:

/data/AAA000123
/data/AAA000124
 :
 :
/data/ZZZ99

In that case, you can create multiple logical filesets within bacula,
for example:

/data/AAA0[0-4]*
/data/AAA0[5-9]*

Since each of these filesets would be backed up from the same fileserver
(unless you're using a clustered filesystem), you'd want to restrict
backup concurrency to avoid running too many jobs at once.

You could size them as:

directories per fileset =

    acceptable backup window (in hours) * backup rate (GB/hr)
        / 20GB (from the earlier calculation of an average
          20GB per subdirectory)

and then divide the total directory count by that to get the number of
filesets.
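
For example, with a 12-hour window and a backup rate of 400GB/hr (both
numbers invented purely for illustration), each fileset could hold
12 * 400 / 20 = 240 directories, so the current 10,000 directories would
split into roughly 42 filesets.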

The regular expression used to determine the filesets would need to be
based on both the current subdirectory names and the subdirectories to be
added in the future, in order to keep the fileset sizes balanced. In other
words, if the new data subdirectories will all be in the range AAABBB*,
then you'll need to split that range into reasonably sized chunks rather
than letting one fileset grow toward 3PB.


= smaller chunks of the data?  The data will never change, and will just 
= be added to.  As in, we will be adding more subdirectories with 2 files 
= in them to the main directory, but will never delete or change any of 
= the old data.

= 
= Is there a way to tell bacula to back up all this, but do it in small 
= 6TB chunks or something?  So we would avoid the massive 200TB single 
= backup job + hundreds of (eventual) small incrementals?  Or some other idea?

You could use bacula's mechanism to use filesets that are dynamically
generated from an external program. The external program could use a
knapsack algorithm to take all the directories and divide them into sets,
with each set sized to meet your acceptable backup window. The algorithm
would need to be 'stable', so that directory AAA000123 is placed with
the same other subdirectories each time.

See the description of a file-list in:


http://www.bacula.org/5.2.x-manuals/en/main/main/Configuring_Director.html#SECTION00237
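
For illustration, here is a minimal sketch of such an external program
(untested; the path, script name, and Bacula wiring are all assumptions on
my part). It uses stable hash bucketing rather than a true knapsack: each
directory's set is derived only from its own name, so a directory never
migrates between sets, though set sizes are balanced by count rather than
by bytes -- good enough when the directories are similar in size.

#!/bin/sh
# make_fileset.sh SETS WANT -- print the subdirectories belonging to
# backup set WANT (0..SETS-1).  Intended to be wired into a FileSet as
# a dynamically generated file-list, e.g.:
#     File = "|/usr/local/bin/make_fileset.sh 40 3"
TOP=/data          # hypothetical parent of the 10,000 subdirectories
SETS=$1
WANT=$2
for d in "$TOP"/*/; do
    name=$(basename "$d")
    # First 32 bits of the md5 of the name: stable across runs,
    # independent of what other directories exist.
    h=$(printf '%s' "$name" | md5sum | cut -c1-8)
    [ $(( 0x$h % SETS )) -eq "$WANT" ] && printf '%s\n' "${d%/}"
done

A true knapsack split (packing by measured directory size) would balance
bytes better, but to keep it stable you'd have to record each directory's
assignment somewhere persistent the first time it is seen.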

Mark

= 
= Thanks again for all the feedback!  Please reply-all to this email 
= when replying.
= 
= -erich
= 
= On 3/1/12 10:18 AM, mark.berg...@uphs.upenn.edu wrote:
= > In the message dated: Wed, 29 Feb 2012 20:23:14 PST,
= > The pithy ruminations from Erich Weiler on
= > [Bacula-users] Large backup to tape? were:
= > = Hey Y'all,
= > =
= > = So I have a Dell ML6010 tape library that holds 41 LTO-5 tapes, all
=
= > I've got a Dell ML6010, so I can offer some specific suggestions.
=
= [SNIP!]
=
= > =
= > = The fileset I'm backing up is about 200TB large total (each file is
= > = about 300GB big).  So, not only will it use every tape in the tape
= > = library (41 tapes), but we'll have to refill the tape library about 6
= > = times to get the whole thing backed up.  After that I want to just do
=
= > I agree with the other suggestions to break up the dataset into smaller
= > chunks.
=
=
= [SNIP!]
=
= > =
= > = So, I guess I have a couple basic questions.  When it uses all the tapes
= > = in the library in a single job (200TB! 41 tapes only hold 60TB), will it
=
= > It'll depend a lot on the compressibility of your data.
=
= > = simply pause, send me an email saying it's waiting for new media, then I
= > = load 41 new tapes?  Then tell it to resume, and it uses the next 41, ad

Re: [Bacula-users] Large backup to tape?

2012-03-08 Thread Steve Ellis
On 3/8/12 9:38 AM, Erich Weiler wrote:
> Thanks for the suggestions!
>
> We have a couple more questions that I hope have easy answers.  So, it's
> been strongly suggested by several folks now that we back up our 200TB
> of data in smaller chunks.  This is our structure:
>
> We have our 200TB in one directory.  From there we have about 10,000
> subdirectories that each have two files in them, ranging in size between
> 50GB and 300GB (an estimate).  All of those 10,000 directories add up
> to about 200TB.  It will grow to 3 or so petabytes in size over the next
> few years.
>
> Does anyone have an idea of how to break that up logically within
> bacula, such that we could just do a bunch of smaller Full backups of
> smaller chunks of the data?  The data will never change, and will just
> be added to.  As in, we will be adding more subdirectories with 2 files
> in them to the main directory, but will never delete or change any of
> the old data.
>
> Is there a way to tell bacula to back up all this, but do it in small
> 6TB chunks or something?  So we would avoid the massive 200TB single
> backup job + hundreds of (eventual) small incrementals?  Or some other idea?
>
> Thanks again for all the feedback!  Please reply-all to this email
> when replying.
>
> -erich
Assuming the subdirectory names are reasonably spread across the
alphabet, can you do something like:
FileSet {
  Name = "A"
  Include {
    File = /pathname/to/backup
    Options {
      # bacula wild-cards match against the full pathname, and "*" also
      # matches "/", so anchor the patterns at the top directory
      Wild = "/pathname/to/backup"          # keep the top directory itself
      Wild = "/pathname/to/backup/[Aa]*"
    }
    Options {
      Exclude = yes
      Wild = "*"    # prune everything the block above didn't match
    }
  }
}
...
FileSet {
  Name = "Z"
  Include {
    File = /pathname/to/backup
    Options {
      Wild = "/pathname/to/backup"
      Wild = "/pathname/to/backup/[Zz]*"
    }
    Options {
      Exclude = yes
      Wild = "*"
    }
  }
}

Then, specify separate Jobs for each FileSet.  To break things up more
you might need to split on second or later characters rather than the
first one, and you'd also need FileSets covering any directories that
start with non-alphabetic characters.  Making sure you cover all of your
directories could certainly be somewhat annoying, especially if the
namespace is populated very lopsidedly, but I believe it would work.
Note that I have not tried this approach, but it does seem feasible.
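
Each FileSet would then get its own Job, something like this (a sketch;
every name below is made up, and the Schedule/Storage/Pool resources
would have to exist in your configuration):

Job {
  Name = "Backup-A"
  Type = Backup
  Client = fileserver-fd
  FileSet = "A"
  Schedule = "WeeklyCycle"
  Storage = "ML6010"
  Pool = "Archive"
  Messages = "Standard"
}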

I hope you are using a filesystem that behaves well with so many 
subdirectories from one parent (for example, ext3 without dir_index 
would likely do somewhat poorly).

-se






Re: [Bacula-users] Large backup to tape?

2012-03-08 Thread Adrian Reyer
On Thu, Mar 08, 2012 at 09:38:33AM -0800, Erich Weiler wrote:
> We have our 200TB in one directory.  From there we have about 10,000
> subdirectories that each have two files in them, ranging in size between
> 50GB and 300GB (an estimate).  All of those 10,000 directories add up
> to about 200TB.  It will grow to 3 or so petabytes in size over the next
> few years.

As has already been said, on most filesystems it is not a good idea
to have 10k items and up in a single directory. You might want to hash
the directory names, e.g.
ABC789 -> A/B/C/7/8/9
Depending on your directory structure, this might already give you a good
preselection for filesets.
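
A toy illustration of that fan-out (the name is made up):

name=ABC789
printf '%s\n' "$name" | sed 's|.|&/|g; s|/$||'   # prints A/B/C/7/8/9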

The reason for my mail is another point, though:
Depending on your data you might want to make sure it all has the same
timestamp. You could do this by using LVM and snapshots. By mounting and
backing up the snapshots, you make sure no files are modified while
backing up.
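
A minimal sketch (volume group, LV name, snapshot size, and mount point
are all hypothetical):

# Snapshot the data volume so the backup sees one consistent point in time.
# The snapshot only needs enough space for blocks that change while it exists.
lvcreate --snapshot --size 10G --name data-snap /dev/vg0/data
mkdir -p /mnt/data-snap
mount -o ro /dev/vg0/data-snap /mnt/data-snap

# ...point the bacula FileSet at /mnt/data-snap and run the job...

umount /mnt/data-snap
lvremove -f /dev/vg0/data-snap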

Regards,
Adrian
-- 
LiHAS - Adrian Reyer - Hessenwiesenstraße 10 - D-70565 Stuttgart
Fon: +49 (7 11) 78 28 50 90 - Fax:  +49 (7 11) 78 28 50 91
Mail: li...@lihas.de - Web: http://lihas.de
Linux, Netzwerke, Consulting & Support - USt-ID: DE 227 816 626 Stuttgart



Re: [Bacula-users] Large backup to tape?

2012-03-08 Thread James Harper
> Thanks for the suggestions!
>
> We have a couple more questions that I hope have easy answers.  So, it's
> been strongly suggested by several folks now that we back up our 200TB of
> data in smaller chunks.  This is our structure:
>
> We have our 200TB in one directory.  From there we have about 10,000
> subdirectories that each have two files in them, ranging in size between 50GB
> and 300GB (an estimate).  All of those 10,000 directories add up to about
> 200TB.  It will grow to 3 or so petabytes in size over the next few years.
>
> Does anyone have an idea of how to break that up logically within bacula,
> such that we could just do a bunch of smaller Full backups of smaller
> chunks of the data?  The data will never change, and will just be added to.
> As in, we will be adding more subdirectories with 2 files in them to the main
> directory, but will never delete or change any of the old data.
>
> Is there a way to tell bacula to back up all this, but do it in small 6TB
> chunks or something?  So we would avoid the massive 200TB single backup job +
> hundreds of (eventual) small incrementals?  Or some other idea?
>
> Thanks again for all the feedback!  Please reply-all to this email when
> replying.
>

I was in a similar situation... I had a directory that was only ever
appended to. It was only around 10GB, but the backup went over a 512kbps
link, so the initial 10GB couldn't be reliably backed up in one hit during
the after-hours window I had available. So what I did was create a fileset
that included around 5% of the files (e.g. aa*-an*), then progressively
changed that fileset to include more and more files each backup. The
important thing here is to set IgnoreFileSetChanges = yes in the fileset
so Bacula doesn't want to do a full backup every time the fileset changes.
This is all using incremental backups. Once I had it all backed up, it
just backed up the overnight changes and everything was good.
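
That might look something like this (a sketch; the names and the initial
pattern are made up):

FileSet {
  Name = "GrowingSet"
  # Without this, widening the pattern below would make bacula
  # upgrade the next job to a Full.
  Ignore FileSet Changes = yes
  Include {
    File = /data
    Options {
      Wild = "/data"            # keep the top directory itself
      Wild = "/data/a[a-n]*"    # initial slice; widen it run by run
    }
    Options {
      Exclude = yes
      Wild = "*"
    }
  }
}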

My situation was different though, in that I was doing a weekly virtual full
to consolidate the backups into one volume, which is harder to do with 2PB of
data. But if your challenge is getting the initial data backed up in pieces
smaller than a single 200TB job, then you can do it by manipulating the
fileset, as long as you have IgnoreFileSetChanges = yes.

I don't think you would need accurate=yes to do the above either.

James




Re: [Bacula-users] Large backup to tape?

2012-03-01 Thread Alan Brown
On 01/03/12 04:23, Erich Weiler wrote:

> The fileset I'm backing up is about 200TB large total (each file is
> about 300GB big).  So, not only will it use every tape in the tape
> library (41 tapes), but we'll have to refill the tape library about 6
> times to get the whole thing backed up.

And if _anything_ goes wrong you'll have to start over - for a run time
of around 21 days assuming best speed.

Break the job up into smaller sets. We try to use maximum units of 1TB.

> After that I want to just do
> incrementals against the initial Full Backup; the files will never
> change, they will just be added to as time moves on.

You will still want periodic fulls or synthetic fulls.

Only having 1 backup is asking for trouble, especially if there are
hundreds of subsequent incrementals to spool through too.

That's why GFS (grandfather-father-son) backup strategies were developed
- there are always at least 2 full backups available in case one set
goes bad.

> So, I guess I have a couple basic questions.  When it uses all the tapes
> in the library in a single job (200TB! 41 tapes only hold 60TB), will it
> simply pause, send me an email saying it's waiting for new media, then I
> load 41 new tapes?  Then tell it to resume, and it uses the next 41, ad
> nauseam?

Yes

> And, if I want to make 2 copies of the tapes, can I simply configure 2
> differently named jobs that each backup the same fileset?

Yes. If you have multiple tape drives you can also copy between tapes.

> Also, do I need to manually label the tapes (electronically) as I load
> them, or will the fact that the autoloader automatically reads the new
> barcodes be enough?

Look up the "label barcodes" command in bconsole.






Re: [Bacula-users] Large backup to tape?

2012-03-01 Thread mark . bergman
In the message dated: Wed, 29 Feb 2012 20:23:14 PST,
The pithy ruminations from Erich Weiler on 
[Bacula-users] Large backup to tape? were:
= Hey Y'all,
= 
= So I have a Dell ML6010 tape library that holds 41 LTO-5 tapes, all 

I've got a Dell ML6010, so I can offer some specific suggestions.

[SNIP!]

= 
= The fileset I'm backing up is about 200TB large total (each file is 
= about 300GB big).  So, not only will it use every tape in the tape 
= library (41 tapes), but we'll have to refill the tape library about 6 
= times to get the whole thing backed up.  After that I want to just do 

I agree with the other suggestions to break up the dataset into smaller
chunks.


[SNIP!]

= 
= So, I guess I have a couple basic questions.  When it uses all the tapes 
= in the library in a single job (200TB! 41 tapes only hold 60TB), will it 

It'll depend a lot on the compressibility of your data.

= simply pause, send me an email saying it's waiting for new media, then I 
= load 41 new tapes?  Then tell it to resume, and it uses the next 41, ad 
= nauseam?

Yes, sort of.

You'll get lots of mail from bacula about needing to change tapes.

In my experience, changing tapes in the library while a backup is running must
be done very carefully. I suggest that you not use the native ML6010 tools
(touch pad on the library or web interface) to move tapes to-and-from the
mailbox slots. Our procedure is (a command-level sketch follows the list):

1. use mtx to transfer full tapes from library slots to the mailbox slots

2. remove the full tapes from the mailbox slots

3. add new tapes to the mailbox slots

4. allow the library to scan the new tapes, then choose to add them
   to partition 1 (or whatever you have named your non-system partition
   within the library)

5. use mtx to transfer the new tapes from the mailbox slots to available
   slots in the library

6. when complete, run "update slots" from within the Bacula 'bconsole'

7. if the tapes have never been used within Bacula before, run "label
   barcodes" from within 'bconsole'
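
For illustration, the mtx/bconsole steps might look like this (the changer
device and slot numbers are hypothetical and depend on your library layout):

# Move a full tape out to a mailbox slot, and a fresh one back in.
mtx -f /dev/sg3 transfer 5 41    # library slot 5 -> mailbox slot 41
# ...swap the cartridge by hand, let the library scan it and assign it
#    to your data partition, then bring the new tape back in...
mtx -f /dev/sg3 transfer 41 5    # mailbox slot 41 -> library slot 5

# Tell bacula about the new layout, and label any never-used tapes:
echo "update slots" | bconsole
echo "label barcodes" | bconsole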

= 
= And, if I want to make 2 copies of the tapes, can I simply configure 2 
= differently named jobs that each backup the same fileset?
= 
= Also, do I need to manually label the tapes (electronically) as I load 
= them, or will the fact that the autoloader automatically reads the new 
= barcodes be enough?

You will need to logically label the tapes (writing a Bacula header to each
tape). This can be done automatically with label barcodes.
= 
= Thanks for any hints.  And, if you know any gotchas I should watch for 
= during this process, please let me know!  I don't want bacula expiring 
= the tapes ever, or re-using them, as the data will never change and we 
= need to keep it forever.

Set the file/volume/job retention times to something really long. For us, 10
years =~ infinite, under the theory that after 10 years we'll have moved to
different tape hardware and the old data will need to be transferred to the
new media somehow.
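
In the Pool resource, that could look like this (a sketch; the pool name
is made up):

Pool {
  Name = "Archive"
  Pool Type = Backup
  Volume Retention = 10 years   # =~ infinite for our purposes
  AutoPrune = no                # never prune jobs/files from these volumes
  Recycle = no                  # never reuse (overwrite) these volumes
}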

Make a backup of the Bacula database as soon as the backup is complete. Save
that to both a backup tape and to some other media (external hard drive?
multiple Blu-ray discs? punch cards?) so that you can recover data if there's
ever a problem with the database--you do NOT want to be in a position of
needing to bscan ~100x LTO5 tapes in order to rebuild the database.
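
A sketch of such a catalog dump (this assumes a PostgreSQL catalog named
"bacula"; Bacula also ships make_catalog_backup scripts for exactly this):

# Dump the catalog so it can be restored without bscan'ing tapes.
pg_dump bacula | gzip > /safe/place/bacula-catalog-$(date +%Y%m%d).sql.gz
# (for a MySQL catalog: mysqldump bacula | gzip > ...)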

Mark

= 
= Many thanks,
= erich
= 



[Bacula-users] Large backup to tape?

2012-02-29 Thread Erich Weiler
Hey Y'all,

So I have a Dell ML6010 tape library that holds 41 LTO-5 tapes, all 
barcoded with labels from the manufacturer.  I want to use bacula to 
back up our dataset, and I've not used bacula before and am sorting 
through the documentation to get things set up.  I have a few questions 
I was hoping someone could answer!

The fileset I'm backing up is about 200TB large total (each file is 
about 300GB big).  So, not only will it use every tape in the tape 
library (41 tapes), but we'll have to refill the tape library about 6 
times to get the whole thing backed up.  After that I want to just do 
incrementals against the initial Full Backup; the files will never 
change, they will just be added to as time moves on.  So this is kind of 
like an archival solution.  I can control the autoloader with the 
mtx-changer script just fine, it totally works as expected.

So, I guess I have a couple basic questions.  When it uses all the tapes 
in the library in a single job (200TB! 41 tapes only hold 60TB), will it 
simply pause, send me an email saying it's waiting for new media, then I 
load 41 new tapes?  Then tell it to resume, and it uses the next 41, ad 
nauseam?

And, if I want to make 2 copies of the tapes, can I simply configure 2 
differently named jobs that each backup the same fileset?

Also, do I need to manually label the tapes (electronically) as I load 
them, or will the fact that the autoloader automatically reads the new 
barcodes be enough?

Thanks for any hints.  And, if you know any gotchas I should watch for 
during this process, please let me know!  I don't want bacula expiring 
the tapes ever, or re-using them, as the data will never change and we 
need to keep it forever.

Many thanks,
erich
