Hello! I'm trying to gain a better understanding of the mechanics behind BackupPC's backups, as well as the advantages and disadvantages of the different transfer methods, particularly for supporting large hosts (500GB of data, with 50-100GB of deltas every day) and remote hosts. To that end, I'm going over the BackupPC Operation documentation (http://backuppc.sourceforge.net/faq/BackupPC.html#backuppc_operation) and the source code. If you have a moment (or several moments: this is *really* long), I would appreciate your help. Please feel free to correct my interpretation of the documentation where I make mistakes.
Let's assume the following configuration: we have a BackupPC server that has been backing up a single host running (e.g.) RHEL4 Linux. There are a number of successful full and incremental backups for this host, as it has been running for some time. We now want to add a new host, which happens to be running the exact same operating system. They're not mirror images of each other, but they are naturally going to share a large number of common files. Let's assume that the new server contains 1GB worth of files, and that 90% of the files are identical to files on the existing host.

According to the docs, the data received from the host will be stored in __TOPDIR__/pc/$host/new. It then goes on to say that *before* new files are written to disk, they are checked against the pool. Files that match an item in the pool are not saved; instead, hardlinks are created to the matching item in the pool. Files that do not match an item in the pool (and are therefore new files) are saved, and will later be hardlinked into the pool by BackupPC_link. This means that the additional disk usage on the backup server caused by the backup of our new host will at no point exceed 100MB -- the 10% of unique data the new host contains -- even during the initial full backup.

This is true for rsync, rsyncd, tar and smb. How it happens, though, differs between them, and the documentation has very little in the way of details as to *how* the data is transferred from the host to BackupPC, or how that data is actually checked against the pool.

In the case of tar-based transfers (tar and smb), the docs mention that this function is performed by BackupPC_tarExtract. An examination of the source points to BackupPC::PoolWrite, and the comments there explain it very nicely. Because smbclient is used in tar mode, tar and smb are functionally identical. (There are small differences in file *selection* between the two, which are only relevant during incremental backups. When examining how data is transferred from the host to BackupPC, and what BackupPC does with it when it gets there, they are identical.)

According to the docs, the tar datastream generated by tar or smbclient is piped through BackupPC_tarExtract. An examination of the source for BackupPC_tarExtract shows that the tar datastream is parsed, and the file data it contains is sent to BackupPC::PoolWrite. According to the comments for BackupPC::PoolWrite, the incoming file is eventually read in its entirety. It will either match a file in the pool, in which case a hardlink is created, or it won't, in which case the data will be written to disk. Other than a small buffer, the file data received from the client is neither kept in memory nor written to disk during this time; the comments for BackupPC::PoolWrite describe how this is done. This keeps the amount of memory needed to a minimum, and avoids writing data that might duplicate something already in the pool, at the expense of possibly needing to read data from the pool twice (and, due to hash collisions, possibly reading multiple pool files, though that should happen less than 0.1% of the time).

This means that in the case of smb and tar, 100% of the file data stored on the host must be transferred from the host to BackupPC. There is no way to avoid this. BackupPC::PoolWrite allows BackupPC to avoid writing to disk any data that is already in the pool, but it cannot avoid the need to receive 100% of each and every file stored on the host, even if 100% of the host's data is already in the pool!
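To check my reading of BackupPC::PoolWrite, here is a minimal Python sketch of the logic as I understand it. This is *not* the actual Perl: the pool path, the candidate naming, and the buffer handling are simplified assumptions on my part (the real code derives the digest from the file's length plus portions of its contents, caps the in-memory buffer, and re-reads pool data rather than holding everything in memory). The caller is assumed to supply the pool digest.

import os

POOL_DIR = "/var/lib/backuppc/pool"   # assumed path; the real pool is split
                                      # into hashed subdirectories

def pool_candidates(digest):
    # The real pool resolves hash collisions by appending a suffix
    # (digest, digest_0, digest_1, ...); yield every existing candidate.
    base = os.path.join(POOL_DIR, digest)
    if os.path.exists(base):
        yield base
    i = 0
    while os.path.exists("%s_%d" % (base, i)):
        yield "%s_%d" % (base, i)
        i += 1

def pool_write(src, dest_path, digest):
    # Compare the incoming stream against all pool candidates as the data
    # arrives; start writing dest_path only once no candidate can match.
    candidates = [open(p, "rb") for p in pool_candidates(digest)]
    buffered = b""     # data held back while a pool match is still possible
    out = None         # output handle, opened only if every candidate fails
    for chunk in iter(lambda: src.read(65536), b""):
        candidates = [c for c in candidates if c.read(len(chunk)) == chunk]
        if candidates:
            buffered += chunk          # still hoping for a pool match
        else:
            if out is None:
                out = open(dest_path, "wb")
                out.write(buffered)    # give up: flush what we held back
            out.write(chunk)
    # A surviving candidate matches only if it has no bytes left over.
    match = next((c for c in candidates if c.read(1) == b""), None)
    if match is not None:
        os.link(match.name, dest_path)  # pool hit: costs only a hardlink
    elif out is None:
        with open(dest_path, "wb") as f:
            f.write(buffered)           # new file; data was still buffered
    else:
        out.close()                     # new file; data already written
    for c in candidates:
        c.close()

The point of the sketch is the control flow: nothing is written under __TOPDIR__ until every candidate pool file has been ruled out, and a complete match costs nothing but a hardlink -- which is exactly why the new host's initial full backup adds at most its unique 100MB of data to disk.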
The advantage of the tar/smb approach is that it puts near-zero extra load on the host. In the case of a full backup, reading 100% of the data on the host is unavoidable no matter what -- that's what makes it a full backup. Beyond that, using tar or smb puts no additional load on the host. Instead, it shifts the load to the network (by transferring 100% of the data to the BackupPC server) and to the BackupPC server itself (by making it do all of the hashing and comparison).

So far, I think I'm on pretty solid ground. I would still like to make sure that I'm correct, of course, but this is basically just a rehashing of what is contained in the documentation and the source code: nothing earth-shattering. However, the situation is quite different for rsync and rsyncd.

First of all, the difference between rsync and rsyncd is minimal. In the case of rsync, the rsync processes communicate via pipes through a remote shell; in the case of rsyncd, they communicate via a network socket. Other than that, the actual process and communication are identical, so for our purposes these two are functionally identical.

For rsync and rsyncd, BackupPC_dump handles the transfer directly: there is no program like BackupPC_tarExtract to handle hashing and pooling. It seems that BackupPC is depending on rsync to handle these details completely on its own. However, while I can see in the code where the transfer is started, I can't find the code that is actually *doing* the transfer. Without being able to find this code, I'm left with a number of questions.

How is the pool used in the case of a new host? With traditional rsync, the generator (in our case, this would run on the BackupPC server) would use any existing file in the destination path as a "basis file" and generate block checksums from it (see the sketch below for what that exchange looks like). In our case, there are no existing files in the destination path: it's a new host. From what I understand, there would *never* be any files in the destination path, since the destination path is always a newly created directory. So what does BackupPC do? How does it take advantage of pooling, especially on the initial transfer? I have to imagine it does: the documentation says, "As rsync runs, it checks each file in the backup to see if it is identical to an existing file from any previous backup of any PC." How exactly does it do that? Can anyone point me to the spot where this takes place?

From what I can tell, rsync by itself can't do that, even with a previously completed backup, for several reasons. For one thing, the file names are mangled, even in the pc directories. For another, BackupPC hashes are not related to rsync (or any other) hashes. Therefore, it seems the ability to use the pool must be managed by BackupPC on top of rsync. There are also other, unrelated things that rsync wouldn't do on its own, such as mangling new filenames, updating the attrib file, etc.

I can imagine a process where the rsync generator would be intelligent enough to use previous backups as the source of basis files, in spite of the fact that the names are mangled and stored in a different directory from the destination (and an intelligent receiver could mangle the filenames and update the attrib file). However, that is very far from checking against any "existing file from any previous backup of any PC." Knowing how it does all of this is pretty important in order to make the correct choices among transfer methods, especially with large or remote hosts.
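For reference, here is a minimal Python sketch of the classic rsync exchange I'm describing: the generator builds block checksums from a basis file, and the sender answers with block references plus literal bytes. The fixed block size, the weak checksum (recomputed at each offset here rather than rolled forward), and the missing strong-hash confirmation are all simplifications of the textbook algorithm -- this is not BackupPC's code:

BLOCK = 700   # rsync negotiates a block size; fixed here for simplicity

def weak_sum(data):
    # Adler-style weak checksum. Real rsync "rolls" this forward in O(1)
    # per byte; recomputing it from scratch keeps the sketch short.
    a = sum(data) & 0xFFFF
    b = sum((len(data) - i) * x for i, x in enumerate(data)) & 0xFFFF
    return (b << 16) | a

def signatures(basis):
    # Generator side: checksum every block of the basis file.
    return {weak_sum(basis[i:i + BLOCK]): i
            for i in range(0, len(basis), BLOCK)}

def delta(new, sigs):
    # Sender side: emit ("copy", offset) where a block of the new file
    # matches a basis block, ("literal", bytes) for everything else.
    # Real rsync confirms each weak match with a strong per-block hash.
    out, lit, i = [], bytearray(), 0
    while i < len(new):
        block = new[i:i + BLOCK]
        off = sigs.get(weak_sum(block)) if len(block) == BLOCK else None
        if off is not None:
            if lit:
                out.append(("literal", bytes(lit)))
                lit = bytearray()
            out.append(("copy", off))   # only a reference crosses the wire
            i += BLOCK
        else:
            lit.append(new[i])          # unmatched byte: must be sent
            i += 1
    if lit:
        out.append(("literal", bytes(lit)))
    return out

basis = b"A" * 1400 + b"B" * 700
new   = b"A" * 700 + b"C" * 10 + b"B" * 700
print(delta(new, signatures(basis)))    # mostly ("copy", ...) references
print(delta(new, {}))                   # no basis file: one big literal

The last line is the crux of my question: with an empty signature table -- no basis file, as on a brand-new host -- every byte goes across as a literal, which is exactly the same network cost as tar or smb.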
I would like to find out whether BackupPC, using rsync, is able to determine that a file is already in the pool for a new host. If it can, the next question is whether it does this *before* the file is transferred from the host. If so, that would be *tremendous*: it would make the initial backup of a remote host *much* easier. But if the file has to be completely transferred anyway, there is no real difference between rsync and smb or tar, at least on the initial backup.

Thank you *very* much for reading all of this. You have an amazing amount of patience! I would greatly appreciate any information you might be able to give me, even if it's just a pointer as to *where* in the source code I can look to chase down some of this, particularly how BackupPC uses rsync to transfer files.

Tim Massey