Chris Freemantle said:

>It's interesting that the probability of any 2 randomly selected hashes
>being the same is quoted, rather than the probability that at least 2
>out of a whole group are the same. That's probably because the minutely
>small chance becomes rather bigger when you consider many hashes. This
>will still be small, but I suspect not as reassuringly small.
>To illustrate this consider the 'birthday paradox'.

I'm really glad you point this out. The way I interpret it is that the odds of a hash collision in your environment increase with every new block of data you submit to the de-duplication system. I've talked to somebody who has researched this mathematically, and he says he's going to share his calculations with me. I'll pass them along if and when he does. As a proponent of these systems, I certainly don't want to misrepresent the odds they represent.

>For our data I would certainly not use de-duping, even if it did work
>well on image data.

I think you're under the misconception that all de-dupe systems use ONLY hashes to identify redundant data. While there are products that do this (and I still trust them more than you do), there are also products that do a full block comparison of the supposedly matching blocks before throwing one of them away.

In addition, there are ways to completely remove the risk you're worried about. If you back up to a de-dupe backup system, regardless of its design, and then use your backup software to copy from it to tape (or anything else), you verify the de-duped data, as any good backup software will check all data it copies against its own stored checksums.

_______________________________________________
Veritas-bu maillist - Veritas-bu@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu