So, for anyone who cares (nobody on this list seems to have noticed), I found this post from 2006 that describes and analyzes my exact problem:
http://www.topology.org/linux/backuppc.html

On that page, search for "Design flaw: Avoidable re-transmission of massive amounts of data." For future reference and archiving, I quote it here in full:

"2006-6-7: During the last week while using BackupPC in earnest, I have noticed a very serious design flaw which is totally avoidable by making a small change to the software. First I will describe the flaw with an example.

1. First I back up the rsyncd "module" home from computer client1 to computer server1 using the "rsyncd" method. This uses the following line in the server1 config.pl file:

   $Conf{RsyncShareName} = ['home'];

2. Then I do an incremental back-up of module "home" from client1 to server1. This back-up correctly sends only the changes in the file-system module "home" over the network. So the back-up is very quick.

3. Now I modify the variable $Conf{RsyncShareName} on server1 as follows:

   $Conf{RsyncShareName} = ['home', 'home1'];

4. Next, I make an incremental back-up. Naturally, the home module is sent very efficiently over the LAN and home1 is sent in full, essentially uncompressed. Well, this isn't quite natural. In fact, it's quite avoidable, but I'll explain why this is so later.

5. Now I make a second incremental back-up of home and home1. Since I have already backed up these two modules, I expect them both to be very quick. But this does not happen. In fact, all of home1 is sent in full over the LAN, which in my case takes about 10 hours. This is a real nuisance. This problem occurs even if I have this in the config.pl file on server1:

   $Conf{IncrFill} = 1;

6. Next, I make a full back-up. This sends only the changes to home over the LAN, but sends the full contents of home1, uncompressed, over the LAN, even though I have already sent this module in full twice.

7. Now when I make future back-ups, the modules home and home1 are both sent efficiently and quickly.

The design flaw here is crystal clear. Consider a single file home1/xyz.txt.
The authors have designed the BackupPC system so that the file home1/xyz.txt is sent in full from client1 to server1 unless

1. the file home1/xyz.txt is already on server1 with the identical path in the identical module home1, and
2. the back-up in which home1/xyz.txt exists is a full back-up, not an incremental back-up.

If the above conditions do not both hold, the full file is transmitted by rsyncd on client1; then it is discarded by server1 if it is already present on server1 in either the same path in an earlier back-up, or in any path at all in any other module in any kind of earlier back-up. So the software correctly discards duplicate files when they arrive on server1, but they are still transmitted anyway.

The cure for this design flaw is very easy indeed, and it would save me several days of saturated LAN bandwidth when I make back-ups. It's very sad that the authors did not design the software correctly. Here is how the software design flaw can be fixed.

1. When an rsync file-system module module1 is to be transmitted from client1 to server1, first transmit the hash (e.g. MD5) of each file from client1 to server1. This can be done (a) on a file by file basis, (b) for all the files in module1 at the same time, or (c) in bundles of, say, a few hundred or thousand hashes at a time.

2. The BackupPC server server1 matches the received file hashes with the global hash table of all files on server1, both full back-up files and incremental back-up files.

3. Then server1 requests rsyncd on client1 to only transmit the files which are not already present on server1. Notice that the files on server1 do not have to be in the same path in the same module on server1 in a full back-up, which is the case in the current BackupPC software design.

4. Then client1 sends only the files which are requested, which are the files which are not already present on server1.
The above design concept would make BackupPC much more efficient even under normal circumstances where the variable $Conf{RsyncShareName} is unchanging. At present, rsyncd will only refrain from sending a file if it is present in the same path in the same module in a previous full back-up. If server1 already has the same identical file in any other location, the file is sent by rsyncd and then discarded after it arrives.

If the above serious design flaw is not fixed, it will not do much harm to people whose files are rarely changing and rarely moving. But if, for example, you move a directory tree from one place to another, BackupPC will re-send the whole lot across the LAN, and then it will discard the files when they arrive on the BackupPC server. This will keep on happening until after you have made a full back-up of the files in the new location."

-------- Original Message --------
> Date: Thu, 22 Oct 2009 22:31:32 +0200
> From: "Harald Amtmann" <hardlo...@gmx.de>
> To: backuppc-users@lists.sourceforge.net
> Subject: [BackupPC-users] RsyncP problem

> My problem is still that rsyncP with rsyncd as client retransmits unchanged files. I reduced the test case:
>
> 1) Full backup. All files are transmitted. This is the log output from the client:
>
> 2009/10/22 21:35:44 [3820] connect from UNKNOWN (192.168.5.9)
> 2009/10/22 21:35:55 [3820] rsync on . from bag...@unknown (192.168.5.9)
> 2009/10/22 21:35:56 [3820] send unknown [192.168.5.9] docsnsettings (baggub) .musikproject/musikCube_u.ini 1913 <f???????
> 2009/10/22 21:35:57 [3820] send unknown [192.168.5.9] docsnsettings (baggub) .musikproject/musik_collected_u.db 157696 <f???????
> 2009/10/22 21:39:32 [3820] send unknown [192.168.5.9] docsnsettings (baggub) .musikproject/musik_u.db 28868608 <f???????
> 2009/10/22 21:39:32 [3820] sent 28836048 bytes received 61235 bytes total size 29028217
>
> As you can see, roughly 30 MB are transmitted.
>
> 2) Incremental backup:
>
> 2009/10/22 21:40:46 [3940] 192.168.5.9 is not a known address for "localhost": spoofed address?
> 2009/10/22 21:40:46 [3940] connect from UNKNOWN (192.168.5.9)
> 2009/10/22 21:40:57 [3940] rsync on . from bag...@unknown (192.168.5.9)
> 2009/10/22 21:40:57 [3940] sent 212 bytes received 674 bytes total size 29028217
>
> Almost nothing is transmitted, as the client only checks the timestamps.
>
> 3) Another full backup: This looks exactly like the output of 1). All data is sent over the wire again. The rsync summary states that about 30 MB are transmitted.
>
> 4) Experiment:
>
> For testing, I added "--checksum" to $Conf{RsyncArgs} and ran a full backup again:
>
> 2009/10/22 21:55:09 [2172] rsync on . from bag...@unknown (192.168.5.9)
> 2009/10/22 21:55:10 [2172] send unknown [192.168.5.9] docsnsettings (baggub) .musikproject/musikCube_u.ini 1913 <f???????
> 2009/10/22 21:55:11 [2172] send unknown [192.168.5.9] docsnsettings (baggub) .musikproject/musik_collected_u.db 157696 <f???????
> 2009/10/22 21:55:11 [2172] sent 158068 bytes received 762 bytes total size 29028217
>
> Interestingly, this time only the two small files get retransmitted; the big one is left out.
>
> I then restored my configuration to include the complete client PC, keeping the --checksum parameter. Sadly, now all I get is fileListReceived errors on the server, so this didn't help either.
>
> And for the record, I tried both rsync 2.6.8 and 3.0.4 on the client.
>
> Craig, is this expected behaviour? Why does the full backup retransmit everything every time?
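To make the fix proposed in the quoted 2006 post a bit more concrete, here is a rough sketch in Python of the hash-first exchange it describes (steps 1 to 4 of the proposed fix). This is not BackupPC, RsyncP or rsync code. The names BackupServer, run_backup and digest are made up for illustration, files are modelled as in-memory byte strings, and I use SHA-256 where the post says "e.g. MD5". It only shows that matching on content hashes rather than on path avoids re-sending a file that has merely moved to another module.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash of a file (SHA-256 here; the quoted post suggests e.g. MD5)."""
    return hashlib.sha256(data).hexdigest()


class BackupServer:
    """Hypothetical server with a global, content-addressed pool of file bodies."""

    def __init__(self):
        self.pool = {}      # digest -> file contents, shared by all back-ups
        self.backups = []   # one {path: digest} manifest per back-up

    def missing(self, digests):
        """Steps 2 and 3: report which hashes are not in the pool yet."""
        return {d for d in digests if d not in self.pool}

    def store(self, manifest, payload):
        """Record the manifest and add the newly received file bodies."""
        self.pool.update(payload)
        self.backups.append(dict(manifest))


def run_backup(client_files, server):
    """Back up {path: bytes}; return the number of file bytes actually sent."""
    # Step 1: the client sends only the hashes first.
    manifest = {path: digest(data) for path, data in client_files.items()}
    # Steps 2 and 3: the server answers with the hashes it does not have.
    wanted = server.missing(manifest.values())
    # Step 4: the client sends only those file bodies, each at most once.
    payload = {}
    for path, data in client_files.items():
        d = manifest[path]
        if d in wanted and d not in payload:
            payload[d] = data
    server.store(manifest, payload)
    return sum(len(body) for body in payload.values())


if __name__ == "__main__":
    server = BackupServer()
    first = run_backup({"home/xyz.txt": b"x" * 1000}, server)
    # Same contents, now under a different module/path (home1 instead of home).
    moved = run_backup({"home1/xyz.txt": b"x" * 1000}, server)
    print(first, moved)
```

In this toy model the second back-up sends no file data at all, because the server already holds that content under another path. A real implementation would still have to deal with hash collisions, partial or delta transfers, and the cost of hashing every file on the client, none of which is covered here.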