> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf
> Of Curtis Preston
> Sent: 01 October 2007 06:35
> To: [EMAIL PROTECTED]; veritas-bu@mailman.eng.auburn.edu
> Subject: Re: [Veritas-bu] Tapeless backup environments?
...
>
> These are odds based on the size of the key space. If you have 2^160
> odds, you have a 1:2^160 chance of a collision.
By saying that, the implication is that the keyspace is uniform. It's not. The probability of a hash collision is a function of the uniformity of the keyspace as well as the number of items you've hashed and the size of the key. There's lots of research in the crypto field that's relevant to de-dupe.

You should also consider how the de-dupe software behaves when it encounters a hash collision. Backups are the last line of defence for many, when all else (personal copies, replication, snapshots etc.) has failed. The 'acceptable risk' of a hash collision is of little comfort when you've got one. Does it fail silently, throw its hands in the air and core dump, or handle the situation gracefully and carry on without missing a beat? Ask the vendors what they do. As Curtis mentioned, not all de-dupe software relies purely on hashes.

Balance this with the /fact/ that there's already a chance of undetected corruption in the components you buy today, which is why most technologies that survive impose their own data validation checks instead of relying purely on the underlying technology in the stack to have checked it for them. The multi-layered checks that go on improve your overall confidence. At least one design in the SiS field also accepts that hashing algorithms will improve over time, and its designers have had the foresight to allow new hashing schemes to be dropped in later.

When picking de-dupe software you should also care about intellectual property. Who's got what isn't necessarily clear in this space, and the patent lawyers won't be far away. Picking the big boys helps here, but also look at people with a mature view of the marketplace (e.g.
some companies are prepared to talk about licensing deals rather than court cases when they encounter infringement).

There are lots of other things to consider in picking an algorithm, including how well it handles patterns that don't fall naturally on block boundaries (think of the challenges involved in de-duping 'the quick brown fox' and 'the quicker brown fox'), which will affect de-dupe ratios, and how that in turn affects performance. And the solution's not just about the algorithm.

De-dupe is a great advance, and a disruptive technology not just for backup but also for primary storage. Look forward to it, but go in with your eyes open.
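
P.S. To put rough numbers on the point that collision odds depend on how many items you've hashed as well as the key size, here is a minimal sketch using the standard birthday-bound approximation. It assumes a perfectly uniform hash, which real keyspaces are not, so treat the result as a best-case floor rather than a prediction. The function name and the sample block counts are my own illustration, not taken from any de-dupe product.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among num_items hashes.

    Birthday-bound approximation: P ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    ASSUMPTION: the hash output is uniformly distributed. Any
    non-uniformity can only make collisions more likely than this.
    """
    exponent = -num_items * (num_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is far below float epsilon.
    return -math.expm1(exponent)


# Hypothetical store sizes (number of unique blocks) against a 160-bit hash.
for blocks in (2**30, 2**40, 2**50):
    print(f"{blocks} blocks: {collision_probability(blocks, 160):.3e}")
```

The odds are not a flat 1:2^160. They grow roughly with the square of the number of blocks stored, so they are worth recomputing for your own data volumes.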