>It's interesting that the probability of any 2 randomly selected hashes
>being the same is quoted, rather than the probability that at least 2
>out of a whole group are the same. That's probably because the minutely
>small chance becomes rather bigger when you consider many hashes. This
>will still be small, but I suspect not as reassuringly small.
>To illustrate this consider the 'birthday paradox'. 

I'm really glad you pointed this out.  The way I interpret this is that
the odds of there being a hash collision in your environment increase
with every new block of data you submit to the de-duplication system.
I've talked to somebody who has researched this mathematically, and he
says he's going to share with me his calculations.  I'll share them
if/when he shares them with me.  As a proponent of these systems, I
certainly don't want to misrepresent the odds they represent.
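
In the meantime, here is a rough sketch of the standard birthday-bound
approximation, to show how the "at least two out of the whole group"
odds grow with the number of blocks.  It assumes hashes are uniformly
distributed; the function name and the 160-bit example (a SHA-1-sized
hash) are my own illustration, not figures from any vendor.

```python
import math

def collision_probability(num_items: int, space_size: int) -> float:
    """Approximate chance that at least two of num_items uniformly random
    values drawn from space_size possibilities are equal.

    Uses the birthday-bound approximation p ~= 1 - exp(-n(n-1) / (2N)).
    """
    pairs = num_items * (num_items - 1) / 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / space_size)

# Classic birthday paradox: 23 people, 365 possible birthdays.
birthday = collision_probability(23, 365)

# Hypothetical de-dupe store: 2**40 blocks hashed with a 160-bit hash.
dedupe = collision_probability(2**40, 2**160)

print(f"birthday: {birthday:.4f}")
print(f"dedupe:   {dedupe:.3e}")
```

The point of the comparison is that the probability depends on the
number of pairs of blocks, which grows roughly with the square of the
block count, so it climbs much faster than the single-pair figure
suggests.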

>For our data I would certainly not use de-duping, even if it did work 
>well on image data.

I think you're under the misconception that all de-dupe systems use ONLY
hashes to identify redundant data.  While there are products that do
this (and I still trust them more than you do), there are also products
that do a full block comparison of the supposedly matching blocks before
throwing one of them away.
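
As a minimal sketch of that second approach (not any particular
product's implementation), the store below treats a hash match only as
a hint and compares the full block contents before discarding anything.
The class name, the choice of SHA-1, and the pluggable hash_fn parameter
are all assumptions made for illustration.

```python
import hashlib

class VerifyingDedupeStore:
    """Toy block store: a hash match is confirmed byte-for-byte."""

    def __init__(self, hash_fn=lambda data: hashlib.sha1(data).digest()):
        self._hash_fn = hash_fn
        self._buckets = {}  # digest -> list of distinct blocks with that digest

    def put(self, block: bytes):
        """Store a block and return a reference (digest, index) to it."""
        digest = self._hash_fn(block)
        bucket = self._buckets.setdefault(digest, [])
        for index, stored in enumerate(bucket):
            if stored == block:      # full comparison: a true duplicate
                return (digest, index)
        bucket.append(block)         # new data, or a genuine hash collision
        return (digest, len(bucket) - 1)

    def get(self, ref) -> bytes:
        digest, index = ref
        return self._buckets[digest][index]

    def stored_blocks(self) -> int:
        return sum(len(bucket) for bucket in self._buckets.values())
```

With this design a collision costs a little extra storage instead of
silently replacing one block's contents with another's.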

In addition, there are ways to completely remove the risk you're worried
about.  If you back up to a de-dupe backup system, regardless of its
design, and then use your backup software to copy from it to tape (or
anything), you verify the de-duped data, as any good backup software
will check all data it copies against its own stored checksums.

