Hello! I'm trying to gain a better understanding of the mechanics behind BackupPC's backups, as well as the advantages and disadvantages of the different transfer methods, particularly for supporting large hosts (500GB of data, with 50-100GB of deltas every day) and remote hosts. To that end, I'm going over the BackupPC Operation documentation (http://backuppc.sourceforge.net/faq/BackupPC.html#backuppc_operation) and the source code. If you have a moment (or several moments: this is *really* long), I would appreciate your help. Please feel free to correct my interpretation of the documentation where I make mistakes.
Let's assume the following configuration: we have a BackupPC server that has been backing up a single host running (e.g.) RHEL4 Linux. There are a number of successful full and incremental backups for this host, as it has been running for some time. We now want to add a new host, which happens to be running the exact same operating system. They're not mirror images of each other, but they are naturally going to share a large number of common files. Let's assume that the new server contains 1GB worth of files, and that 90% of the files are identical to files on the existing host.

According to the docs, the data received from the host will be stored in __TOPDIR__/pc/$host/new. It then goes on to say that *before* new files are written to disk, they are checked against the pool. Files that match an item in the pool are not saved; instead, hardlinks are created to the matching item in the pool. Files that do not match an item in the pool (and are therefore new files) are saved, and will later be hardlinked into the pool by BackupPC_link. This means that the additional disk usage on the backup server caused by the backup of our new host will at no point exceed 100MB -- the 10% of unique data the new host contains -- even during the initial full backup.

This is true for rsync, rsyncd, tar and smb. How it happens, though, differs between them, and the documentation has very little in the way of details as to *how* the data is transferred from the host to BackupPC, or how that data is actually checked against the pool.

In the case of tar-based transfers (tar and smb), the docs mention that this function is performed by BackupPC_tarExtract. An examination of the source points to BackupPC::PoolWrite, and the comments there explain it very nicely. Because smbclient is used in tar mode, tar and smb are functionally identical. (There are small differences in file *selection* between the two, which are only relevant during incremental backups. When examining how data is transferred from the host to BackupPC, and what BackupPC does with it when it gets there, they are identical.)

According to the docs, the tar datastream generated by tar or smbclient is piped through BackupPC_tarExtract. An examination of the source for BackupPC_tarExtract shows that the tar datastream is parsed, and the file data it contains is sent to BackupPC::PoolWrite. According to the comments for BackupPC::PoolWrite, the incoming file is eventually read in its entirety. It will either match a file in the pool, in which case a hardlink is created, or it won't, in which case the data will be written to disk. Other than a small buffer, the file data received from the client is neither kept in memory nor written to disk during this time; the comments for BackupPC::PoolWrite describe how this is done. This keeps the amount of memory needed to a minimum, and avoids writing data that might duplicate something already in the pool, at the expense of possibly needing to read data from the pool twice (and, due to hash collisions, possibly reading multiple pool files, though that should happen less than 0.1% of the time).

This means that in the case of smb and tar, 100% of the file data stored on the host must be transferred from the host to BackupPC. There is no way to avoid this. BackupPC::PoolWrite allows BackupPC to avoid writing to disk any data that is already in the pool, but it cannot avoid the need to receive 100% of each and every file stored on the host, even if 100% of the host's data is already in the pool!
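To check my reading of BackupPC::PoolWrite, here is a minimal Python sketch of the logic as I understand it. This is *not* the actual Perl: the pool path, the candidate naming, and the buffer handling are simplified assumptions on my part (the real code derives the digest from the file's length plus portions of its contents, caps the in-memory buffer, and re-reads pool data rather than holding everything in memory). The caller is assumed to supply the pool digest.

import os

POOL_DIR = "/var/lib/backuppc/pool"   # assumed path; the real pool is split
                                      # into hashed subdirectories

def pool_candidates(digest):
    # The real pool resolves hash collisions by appending a suffix
    # (digest, digest_0, digest_1, ...); yield every existing candidate.
    base = os.path.join(POOL_DIR, digest)
    if os.path.exists(base):
        yield base
    i = 0
    while os.path.exists("%s_%d" % (base, i)):
        yield "%s_%d" % (base, i)
        i += 1

def pool_write(src, dest_path, digest):
    # Compare the incoming stream against all pool candidates as the data
    # arrives; start writing dest_path only once no candidate can match.
    candidates = [open(p, "rb") for p in pool_candidates(digest)]
    buffered = b""     # data held back while a pool match is still possible
    out = None         # output handle, opened only if every candidate fails
    for chunk in iter(lambda: src.read(65536), b""):
        candidates = [c for c in candidates if c.read(len(chunk)) == chunk]
        if candidates:
            buffered += chunk          # still hoping for a pool match
        else:
            if out is None:
                out = open(dest_path, "wb")
                out.write(buffered)    # give up: flush what we held back
            out.write(chunk)
    # A surviving candidate matches only if it has no bytes left over.
    match = next((c for c in candidates if c.read(1) == b""), None)
    if match is not None:
        os.link(match.name, dest_path)  # pool hit: costs only a hardlink
    elif out is None:
        with open(dest_path, "wb") as f:
            f.write(buffered)           # new file; data was still buffered
    else:
        out.close()                     # new file; data already written
    for c in candidates:
        c.close()

The point of the sketch is the control flow: nothing is written under __TOPDIR__ until every candidate pool file has been ruled out, and a complete match costs nothing but a hardlink -- which is exactly why the new host's initial full backup adds at most its unique 100MB of data to disk.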
The advantage of the tar/smb approach is that it puts near-zero extra load on the host. In the case of a full backup, reading 100% of the data on the host is unavoidable no matter what -- that's what makes it a full backup. Beyond that, using tar or smb puts no additional load on the host. Instead, it shifts the load to the network (by transferring 100% of the data to the BackupPC server) and to the BackupPC server itself (by making it do all of the hashing and comparison).

So far, I think I'm on pretty solid ground. I would still like to make sure that I'm correct, of course, but this is basically just a rehashing of what is contained in the documentation and the source code: nothing earth-shattering. However, the situation is quite different for rsync and rsyncd.

First of all, the difference between rsync and rsyncd is minimal. In the case of rsync, the rsync processes communicate via pipes through a remote shell; in the case of rsyncd, they communicate via a network socket. Other than that, the actual process and communication are identical, so for our purposes these two are functionally identical.

For rsync and rsyncd, BackupPC_dump handles the transfer directly: there is no program like BackupPC_tarExtract to handle hashing and pooling. It seems that BackupPC is depending on rsync to handle these details completely on its own. However, while I can see in the code where the transfer is started, I can't find the code that is actually *doing* the transfer. Without being able to find this code, I'm left with a number of questions.

How is the pool used in the case of a new host? With traditional rsync, the generator (in our case, this would run on the BackupPC server) would use any existing file in the destination path as a "basis file" and generate block checksums from it (see the sketch below for what that exchange looks like). In our case, there are no existing files in the destination path: it's a new host. From what I understand, there would *never* be any files in the destination path, since the destination path is always a newly created directory. So what does BackupPC do? How does it take advantage of pooling, especially on the initial transfer? I have to imagine it does: the documentation says, "As rsync runs, it checks each file in the backup to see if it is identical to an existing file from any previous backup of any PC." How exactly does it do that? Can anyone point me to the spot where this takes place?

From what I can tell, rsync by itself can't do that, even with a previously completed backup, for several reasons. For one thing, the file names are mangled, even in the pc directories. For another, BackupPC hashes are not related to rsync (or any other) hashes. Therefore, it seems the ability to use the pool must be managed by BackupPC on top of rsync. There are also other, unrelated things that rsync wouldn't do on its own, such as mangling new filenames, updating the attrib file, etc.

I can imagine a process where the rsync generator would be intelligent enough to use previous backups as the source of basis files, in spite of the fact that the names are mangled and stored in a different directory from the destination (and an intelligent receiver could mangle the filenames and update the attrib file). However, that is very far from checking against any "existing file from any previous backup of any PC." Knowing how it does all of this is pretty important in order to make the correct choices among transfer methods, especially with large or remote hosts.
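For reference, here is a minimal Python sketch of the classic rsync exchange I'm describing: the generator builds block checksums from a basis file, and the sender answers with block references plus literal bytes. The fixed block size, the weak checksum (recomputed at each offset here rather than rolled forward), and the missing strong-hash confirmation are all simplifications of the textbook algorithm -- this is not BackupPC's code:

BLOCK = 700   # rsync negotiates a block size; fixed here for simplicity

def weak_sum(data):
    # Adler-style weak checksum. Real rsync "rolls" this forward in O(1)
    # per byte; recomputing it from scratch keeps the sketch short.
    a = sum(data) & 0xFFFF
    b = sum((len(data) - i) * x for i, x in enumerate(data)) & 0xFFFF
    return (b << 16) | a

def signatures(basis):
    # Generator side: checksum every block of the basis file.
    return {weak_sum(basis[i:i + BLOCK]): i
            for i in range(0, len(basis), BLOCK)}

def delta(new, sigs):
    # Sender side: emit ("copy", offset) where a block of the new file
    # matches a basis block, ("literal", bytes) for everything else.
    # Real rsync confirms each weak match with a strong per-block hash.
    out, lit, i = [], bytearray(), 0
    while i < len(new):
        block = new[i:i + BLOCK]
        off = sigs.get(weak_sum(block)) if len(block) == BLOCK else None
        if off is not None:
            if lit:
                out.append(("literal", bytes(lit)))
                lit = bytearray()
            out.append(("copy", off))   # only a reference crosses the wire
            i += BLOCK
        else:
            lit.append(new[i])          # unmatched byte: must be sent
            i += 1
    if lit:
        out.append(("literal", bytes(lit)))
    return out

basis = b"A" * 1400 + b"B" * 700
new   = b"A" * 700 + b"C" * 10 + b"B" * 700
print(delta(new, signatures(basis)))    # mostly ("copy", ...) references
print(delta(new, {}))                   # no basis file: one big literal

The last line is the crux of my question: with an empty signature table -- no basis file, as on a brand-new host -- every byte goes across as a literal, which is exactly the same network cost as tar or smb.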
I would like to find out whether BackupPC, using rsync, is able to determine that a file is already in the pool for a new host. If it can, the next question is whether it does this *before* the file is transferred from the host. If so, that would be *tremendous*: it would make the initial backup of a remote host *much* easier. But if the file has to be completely transferred anyway, there is no real difference between rsync and smb or tar, at least on the initial backup.

Thank you *very* much for reading all of this. You have an amazing amount of patience! I would greatly appreciate any information you might be able to give me, even if it's just a pointer as to *where* in the source code I can look to chase down some of this, particularly how BackupPC uses rsync to transfer files.

Tim Massey