Les Mikesell wrote at about 17:32:24 -0500 on Tuesday, June 2, 2009:
> Jeffrey J. Kosowsky wrote:
> > >
> > > Backing up other backuppc servers is really a special case that might
> > > deserve a special optimization. But, I'm not sure that adding a
> > > database automatically makes it any easier - unless you are thinking of
> > > a common database that could arbitrate a common hashed filename that is
> > > unique across all instances for every piece of content. That's an
> > > interesting idea but seems kind of fragile.
> > >
> >
> > Once we are talking about redoing things, I would prefer to use a
> > full md5sum hash for the name of the pool file. You end up
> > calculating this anyway for free when you use the rsync method
> > (although with protocol <=28, you get a full file md4sum but with
> > protocol >=30, I believe you have the true md5sum). This would
> > simplify the ambiguity of having multiple indexed chain entries with
> > the same partial md5sum.
> >
> > With this approach then you would automatically have "a common hashed
> > filename that is ['statistically'] unique across all instances for
> > every piece of content."
>
> Somehow the number of possible different file contents and the number
> possible md5sums don't seem quite statistically equivalent to me. And
> then there's:
>
> http://www.mscs.dal.ca/~selinger/md5collision/
>
That's the whole point. md5sum collisions are exceedingly rare with any
imaginable number of files, since there are 2^128 different md5sums. Even
if you have billions of files, the chance of a collision is infinitesimal.

Suppose you have 1 trillion (unique) files. That is just less than 2^40,
which means that the chance of at least one collision is approximately

    1 - e^(-2^40 * (2^40-1)/2^129) ~ 2^(-49)

which is less than 1 in 500 trillion [this is just a generalization of
the birthday problem]. If you have "only" 1 billion *unique* files (just
less than 2^30), the same formula gives a chance of at least one
collision of less than 2^(-69), which is less than 1 in 590 quintillion.

Yes, there are some known examples of md5sum collisions, but they are all
artificial. I don't believe anyone has ever "accidentally" come across
one in a real-world situation. In fact, since digital signatures rely on
statistics like this, if md5sum collisions were even remotely possible in
real life, the whole electronic financial system would be unreliable.

Hence, I stand by my statement that in any currently conceivable BackupPC
situation, the md5sums are "statistically" unique.

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
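
For anyone who wants to check the birthday-problem estimate in the message above, here is a minimal Python sketch. It assumes md5sums behave like uniformly random 128-bit values, which holds only for non-adversarial content. The names `collision_probability` and `hash_bits` are made up for illustration and are not part of BackupPC.

```python
import math


def collision_probability(n_files: int, hash_bits: int = 128) -> float:
    """Approximate chance of at least one collision among n_files hashes.

    Uses the birthday-problem approximation
        P ~ 1 - exp(-n * (n - 1) / 2^(hash_bits + 1)),
    assuming hashes are uniformly distributed (an assumption, not a
    guarantee for deliberately crafted files).
    """
    exponent = n_files * (n_files - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision for tiny exponents; a plain 1 - exp(-x)
    # would round to 0.0 in double precision here.
    return -math.expm1(-exponent)


# 1 billion files, 1 trillion files, and the 2^40 bound used in the estimate
for n in (10**9, 10**12, 2**40):
    p = collision_probability(n)
    print(f"{n:>20,d} files: p ~ {p:.3e} (about 2^{math.log2(p):.1f})")
```

The function takes the pool size in unique files, so it can be rerun with whatever number matches a particular installation.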