Les Mikesell <[EMAIL PROTECTED]> wrote on 01/26/2007 09:53:11 PM:

 > Timothy J. Massey wrote:
 > >
 > > As a start, how about a utility that simply clones one host to another
 > > using only the pc/host directory tree, and assumes that none of the
 > > source files are in the pool, just like it would during a brand-new
 > > rsync backup?
 >
 > That would be better than nothing, but if you have multiple full runs
 > that you want to keep you'll have to transfer a lot of duplicates that
 > could probably be avoided.

Correct.  But it's a proof of concept that can be refined.  I understand 
that some sort of inode or hash caching is required.  But the first step 
can be done with the parts we've already got.
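To make that first step concrete, here is a minimal sketch of the naive 
clone: copy a host's pc/<host> tree to a second server as if every file 
were brand new, then fold duplicates into that server's pool by content 
hash.  This deliberately ignores BackupPC's real on-disk details 
(compressed cpool, partial-file hashes, attrib files, collision chains); 
the paths, function names, and whole-file MD5 below are assumptions for 
illustration only, not the actual implementation.

import hashlib
import os
import shutil

POOL = "/var/lib/backuppc/pool"   # hypothetical destination pool directory

def content_digest(path, chunk=1 << 20):
    """Whole-file MD5; stands in for BackupPC's pool hash."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for buf in iter(lambda: f.read(chunk), b""):
            h.update(buf)
    return h.hexdigest()

def pool_file(path):
    """Hard-link `path` against the pooled copy of the same content,
    creating the pooled copy if this content is new."""
    digest = content_digest(path)
    pooled = os.path.join(POOL, digest[:2], digest)
    os.makedirs(os.path.dirname(pooled), exist_ok=True)
    if os.path.exists(pooled):
        os.unlink(path)
        os.link(pooled, path)      # duplicate: point at the existing copy
    else:
        os.link(path, pooled)      # new content: this becomes the pooled copy

def clone_host(src_pc_dir, dst_pc_dir):
    """Copy the whole pc/<host> tree (the network hop is elided here),
    then pool every regular file on the destination."""
    shutil.copytree(src_pc_dir, dst_pc_dir)
    for root, _dirs, names in os.walk(dst_pc_dir):
        for name in names:
            pool_file(os.path.join(root, name))

As Les points out, everything still crosses the wire once per backup; the 
refinement is about not doing even that for content the remote pool 
already has (see the sketch further down).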

 > But what is the advantage over just letting the remote server make its
 > run directly against the same targets?

I thought a lot of Holger's points were good.  But for me, it comes down 
to two points:

Point 1:  Distributing Load
===========================
I have hosts that take, across a LAN, 12 hours to back up.  The deltas 
are not necessarily very big:  there's just *lots* of files.  And these 
are reasonably fast hosts:  >2GHz Xeon processors, 10k and 15k RPM 
drives, hardware SCSI (and now SAS) RAID controllers, etc.

I want to store the data in multiple places, both on the local LAN and 
in at least 2 remote locations.  That would mean 3 backups.  It's 
probably not going to take 36 hours to do that, but it's going to take a 
*lot* more than 12...

Other times, it's not the host's fault, but the Internet connection. 
Maybe it's a host that's behind a low-end DSL that only offers 768k up 
(or worse).  It's hard enough to get *one* backup done over that, let 
alone two.

So how can I speed this up?

I could use a faster host.  Unfortunately, I've already *got* a pretty 
powerful host, and it is doing *its* job just fine, so why would I want to 
spend multiple thousands of dollars on this?  Short answer:  that is not 
possible.

I could use a faster Internet connection.  Usually, if a faster option 
were available affordably, they'd already have it.  Even a T1, at 
$400/month, only offers 1.5Mb up.  Not a lot.  So getting a dramatically 
faster Internet connection is not possible, either.

The other way to manage this is to distribute the load to multiple 
systems.  By being able to replicate between BackupPC servers, I can 
still limit the number of backups the host must perform to 1 (with a 
local BackupPC server).  The BackupPC server can then take on the load 
of performing multiple time-consuming replications with remote BackupPC 
servers.  I'm not kidding when I say that the task can take a week for 
all I care, as long as it can get one week's worth of backups done 
during that time.  And once one remote BackupPC server has the data, the 
rest can get it over the very fast Internet connections that they have 
between them.  So I only have to get the data across that slow link 
once, and I can still get it to multiple remote locations.

On top of this, the BackupPC server has a much easier job replicating 
its pool than the host had creating the backup in the first place. 
Pooling has already been taken care of:  we *know* which files are new 
and which ones are not.  There are only two things the replication needs 
to worry about:  1) transferring the new files and checking whether they 
already exist in the remote pool, and 2) integrating those new files into 
the remote server's own pool.
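To sketch what those two steps could look like (purely illustrative: 
BackupPC has no replication protocol today, and every name below is 
hypothetical), the key idea is a digest exchange so that file bodies 
only cross the slow link when the remote pool genuinely lacks them:

from typing import Callable, Iterable, Set

def digests_missing_remotely(new_digests: Iterable[str],
                             remote_has: Callable[[str], bool]) -> Set[str]:
    """Ask the remote server which of our newly pooled digests it lacks.
    remote_has() stands in for a round trip to the remote pool."""
    return {d for d in new_digests if not remote_has(d)}

def replicate_backup(new_digests: Iterable[str],
                     remote_has: Callable[[str], bool],
                     send_content: Callable[[str], None],
                     link_backup_remotely: Callable[[], None]) -> None:
    """1) ship only content the remote pool doesn't already have,
       2) have the remote side rebuild the pc/<host> tree by hard-linking
          against its own pool."""
    for digest in digests_missing_remotely(new_digests, remote_has):
        send_content(digest)       # each file body crosses the wire at most once
    link_backup_remotely()

Everything else -- the directory structure and the entries for files the 
remote pool already holds -- is cheap metadata by comparison.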

By distributing the load, we can get more backups replicated out to more 
places more quickly, with exactly zero increase in load on the most 
important devices in the entire process:  the hosts and their Internet 
connections.  Those are the machines that have "real" work to do, 
servicing real people with real tasks.  I cannot load these machines 
24x7.  The *only* person who cares about the BackupPC machines (until 
something is lost) is me.  They can stay 100% utilized 24x7 for all I care.

Point 2:  Long-term Management of Data
======================================
With BackupPC, you have a single, intertwined pool that stores data for 
all hosts.  Viewed as a static entity (the data and hosts I need today, 
or even over a couple of weeks), that's fine.  However, over time, I 
envision this getting unwieldy.  As hosts come and go, and as hosts' 
data needs change (usually upward), and as data storage requirements 
increase, this single, solid, unbreakable, indivisible pool still needs 
to be managed.

We are right now envisioning needing 2TB of space to back up a single 
host:  our mail server, which has less than 100GB of data.  The deltas 
on our mail server are currently in the neighborhood of 50GB/day. 
That's because we have 50GB of mail data, and we all receive at least 
one mail a day, so nearly every mail file changes daily and gets stored 
again.  Now, there are things like transaction logs which can reduce 
this, but they greatly increase the complexity of restoring individual 
mail files.

This is just the worst host, but far from unique.  We have other servers 
that have multi-GB daily deltas, invariably because of some sort of 
large database that is constantly changing.  Sometimes transaction 
logging is an option, sometimes not.

Our goal is to achieve 7 years of data retention.  For BackupPC, that 
works out to a FullKeepCnt of [4,0,12,0,0,10] or thereabouts.  That's 4 
"weekly" fulls, 12 "monthly" fulls (actually every 4 weeks), and 10 
"every 6 months" fulls (actually every 32 weeks).  Taken literally the 
labels don't add up to 7 years, but in practice it should work out to 
about 7.09 years.  In addition to the 
fulls, there are 6 incrementals.  That is a total of 32 copies.  With a 
daily delta of 50GB, that's 1.5TB, plus the space used beyond the daily 
delta (another 75GB or so for that one host).   That is for *one* host. 
  We're going to need *multiple* *terabytes* of space to hold all of this.
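As a back-of-the-envelope check on those figures (assuming FullPeriod is 
about a week and the usual FullKeepCnt aging of 1x, 2x, 4x, 8x, 16x, and 
32x the full period; the exact span depends on scheduling details):

full_keep_cnt = [4, 0, 12, 0, 0, 10]
spacing_weeks = [1, 2, 4, 8, 16, 32]      # interval covered by each slot

span_weeks = sum(n * w for n, w in zip(full_keep_cnt, spacing_weeks))
print(span_weeks, round(span_weeks / 52.18, 2))   # 372 weeks, about 7.1 years

copies = sum(full_keep_cnt) + 6           # 26 fulls + 6 incrementals = 32
print(copies * 50, "GB")                  # 1600 GB, i.e. roughly 1.5-1.6 TB

which lands right around the seven-year target and the multi-terabyte 
space estimate above.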

With a single, monolithic, unbreakable pool, I need to either provide 
that much space all at once, or look at moving a terabyte-sized pool to 
another storage solution at some point down the road.  That is not 
appealing.  And if I misjudge my storage requirements today, I *still* 
might have to move my entire pool, even if I give it *3 terabytes* today.

However, with the ability to migrate hosts from one BackupPC server to 
another, I 
can have tiers of BackupPC servers.  As hosts are retired, I still need 
to keep their data.  7 years was not chosen for the fun of it:  SOX 
compliance requires it.  However, I can migrate it *out* of my 
first-line backup servers onto secondary servers.  If my backup load 
increases to the point where one server can no longer handle its load, I 
can divide its load across multiple servers *without* losing its history.

I have a feeling that many people who run BackupPC might be somewhat 
cavalier with historical data.  I *know* I have been:  between 
permanently archiving content every 3 months and not keeping more than 
a few weeks' worth of data on a server, starting over with a fresh pool 
was never a hardship.  One BackupPC server overloaded?  No problem:  build a couple 
of new servers, divide up the load, start new backups on the new 
servers, and in a few weeks when they have enough historical 
information, make the original server go away.

Trying to keep historical information in place and accessible for 
*years* makes that much harder.  It is *guaranteed* that over the 7 year 
life of that data, it's going to live on at least 2 different servers, 
and likely three.  The idea of needing to stay locked into the 
configuration that was put in place 7 years ago is *not* appealing. 
Not being able to move hosts from one server to another--leaving the 
pool as one monolithic block forever--is worrisome.


All of this has led me to think about ways to distribute both the backup 
load and the storage needs, on a daily basis as well as over time. 
The ability to migrate a pool from one server to another seems to me 
like it would accomplish this, without increasing load on the hosts.

I would love to hear anyone else's thoughts on how to address these 
issues.  Currently, we *are* backing up a number of hosts to multiple 
BackupPC servers, both locally and remotely.  Unfortunately, this limits 
us to smaller servers, both to keep our BackupPC pools manageable and to 
keep our after-hours bandwidth usage in check.

Maybe I'm just abusing BackupPC beyond what it was intended to do. 
That's fine, too.  But adding the ability to migrate a host from one 
pool to another does not have to change a *single* thing about BackupPC 
at all.  It's kind of like archiving hosts:  it's a feature you can use, 
or ignore.

Tim Massey
