Holger Parplies wrote at about 00:57:28 +0200 on Wednesday, June 3, 2009:
 > Hi,
 >
 > Les Mikesell wrote on 2009-06-02 17:32:24 -0500 [Re: [BackupPC-users]
 > Backing up a BackupPC server]:
 > > Jeffrey J. Kosowsky wrote:
 > > > [...]
 > > > Once we are talking about redoing things, I would prefer to use a
 > > > full md5sum hash for the name of the pool file. [...]
 > > > With this approach then you would automatically have "a common hashed
 > > > filename that is ['statistically'] unique across all instances for
 > > > every piece of content."
 > >
 > > Somehow the number of possible different file contents and the number
 > > of possible md5sums don't seem quite statistically equivalent to me.
 > > And then there's:
 > >
 > > http://www.mscs.dal.ca/~selinger/md5collision/
 >
 > first of all, if you are *not* using rsync, you *don't* get a *full* md5sum
 > hash for free or even cheap. You (Jeffrey) know the code well enough to
 > realize that BackupPC goes to great pains to avoid writing to the pool disk
 > unless necessary. If you need to transfer the whole file (of arbitrary
 > size) before you can look up the pool entry, you *have to* write a
 > temporary copy (probably compressed, too, giving up the benefits you gain
 > from only compressing once and decompressing when matching). You have to
 > handle collisions just the same (meaning re-reading your temporary copy
 > and comparing to the pool file). Yuck.
 >
 > Yes, you can special-case small files that fit into memory, but yuck just
 > the same.
 >
 > If you use a *partial* md5sum, there's no gain from rsync, and you
 > trivially get collisions just like you do now.
 >
 > That is not to say, if we end up using a database, that it would not be a
 > good idea to store the full md5sum in the database. In fact, with a
 > database, file names would be somewhat arbitrary, and I'd propose keeping
 > them *short* for the sake of rsync et al. and file lists.
 >
 > Regards,
 > Holger
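To make the collision point concrete, here is an illustrative Python sketch. This is *not* BackupPC's actual pool-hash algorithm; the 1 KB prefix length and the function names are invented for illustration. It shows why any partial hash (here: file length plus leading bytes) collides for distinct files sharing a prefix, while a full-content MD5 differs except with negligible probability:

```python
# Illustrative sketch only -- NOT BackupPC's actual pool-hash scheme.
import hashlib
import os
import tempfile

CHUNK = 1024  # bytes of leading content the partial hash examines (arbitrary)

def partial_hash(path):
    """MD5 of the file length plus its first CHUNK bytes."""
    md5 = hashlib.md5()
    md5.update(str(os.path.getsize(path)).encode())
    with open(path, "rb") as f:
        md5.update(f.read(CHUNK))
    return md5.hexdigest()

def full_hash(path):
    """Full-content MD5, streamed so arbitrarily large files fit in memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            md5.update(block)
    return md5.hexdigest()

# Two same-size files sharing their first CHUNK bytes but differing later:
with tempfile.TemporaryDirectory() as d:
    a, b = os.path.join(d, "a"), os.path.join(d, "b")
    prefix = b"x" * CHUNK
    with open(a, "wb") as f:
        f.write(prefix + b"tail-one")
    with open(b, "wb") as f:
        f.write(prefix + b"tail-two")
    assert partial_hash(a) == partial_hash(b)   # partial hashes collide
    assert full_hash(a) != full_hash(b)         # full md5sums do not
```

Note that `full_hash` reads the whole file in one streaming pass, which is the cost Holger is pointing at when the transfer method doesn't already provide the digest.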
I guess my point was as follows:

- If you use rsync, then you get the md5sums for free.

- Even if you don't use rsync, given the speed of current processors,
  calculating the md5sum doesn't take any longer than a full file compare.
  (True, a compare can stop as soon as a difference arises, but that isn't
  really relevant: if a file is different you will have to copy it over
  anyway, in which case the md5sum doesn't add significant overhead
  relative to the copy operation, since you have to read in the file
  anyway.)

- The md5sums for the pool only need to be calculated once and then
  appended (or prepended) to the pool file.

I'm tired and I haven't looked at the code in a few months, so maybe I'm
forgetting something, but I'm having trouble remembering the advantage of
using partial md5sum hashes on a fast (i.e. modern) computer where the
limitation is disk speed and/or network bandwidth. Any time you have to
read or write the entire file, calculating the md5sum introduces only
trivial overhead relative to the disk read/write or network transfer.

I like the idea of using the full md5sum for the following reasons:

1. It allows you to check file (and hence pool) integrity at any point.

2. It can be used to "uniquely" (from a statistical perspective) label pool
   files without any real chance of a collision. If you are still worried
   about a collision with 128-bit md5sums, simple ways can be found to
   extend the hash and make the chance of a collision even more
   infinitesimal.

3. If the md5sum is appended/prepended to the pool file, then the name of
   the pool file can be found by reading any of its hard links in the pc
   tree.

4. Full-file md5sums are consistent with protocol >= 30 rsync and come for
   "free" when using rsync. Since they are there anyway, why use an
   alternative, less precise (and confusing) partial md5sum hash when you
   can use the full md5sum?

5.
   Using md5sums would get rid of the confusion between the partial md5sums
   used for pool hash names, the md4sums used in protocol 28 rsync, and the
   regular *nix md5sum function.
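Points 2 and 3 above can be sketched roughly as follows. This is illustrative Python under invented assumptions, not BackupPC code: the pool layout, the trailing-digest convention, and the helper names `store_in_pool` / `pool_name_from_link` are all my own for the sketch. The pool file is named by its full MD5 and the hex digest is appended to the stored content, so the pool name can be recovered from any hard link in the pc tree:

```python
# Rough sketch of naming pool files by full MD5 and appending the digest,
# so any hard link reveals its pool file name. Illustrative only.
import hashlib
import os
import tempfile

DIGEST_LEN = 32  # hex characters in an MD5 digest

def store_in_pool(data, pool_dir):
    """Write data plus trailing hex digest; file name is the full MD5."""
    digest = hashlib.md5(data).hexdigest()
    path = os.path.join(pool_dir, digest)
    with open(path, "wb") as f:
        f.write(data + digest.encode())
    return path

def pool_name_from_link(path):
    """Recover the pool file name by reading the trailing digest."""
    with open(path, "rb") as f:
        f.seek(-DIGEST_LEN, os.SEEK_END)
        return f.read().decode()

with tempfile.TemporaryDirectory() as d:
    pool = os.path.join(d, "pool")
    pc = os.path.join(d, "pc")
    os.makedirs(pool)
    os.makedirs(pc)
    pool_path = store_in_pool(b"some backed-up content", pool)
    link = os.path.join(pc, "somefile")
    os.link(pool_path, link)  # a "pc tree" entry is just a hard link
    assert pool_name_from_link(link) == os.path.basename(pool_path)
```

A real implementation would also have to decide how the trailing digest interacts with compression and with reading the file back out, which this sketch ignores.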