On Thu, Nov 08, 2001 at 01:11:21PM -0500, Barrie Slaymaker wrote:

> Even a bit more OT: one thing to watch out for, especially if you
> plan on caching a *lot* of data, is that the Cache::* modules did
> not do collision detection on MD5 collisions the last time I looked.
> Forgive me if that's changed recently.

I knew that a collision was technically possible, but the odds seemed
astronomically low.  But if security is a concern, then I agree that
MD5-ing the key isn't the best way to create a unique identifier.  The
thing that I liked most about MD5 (other than its convenience) is that
it tends to distribute very evenly.  For example, file system caches
fill their directories roughly equally when their paths are created
from MD5 hashed keys.  Doing something simple and unique like
"URL-encoding" the key to make a legal identifier (legal in the sense
that it is a valid filename) wouldn't distribute as evenly.

Barrie raised this question privately to me earlier this year, and
this is what I wrote at the time:

   Good question.  They would overwrite each other.  You may want to
   check out this RFC for more information about the odds of that:

    http://www.faqs.org/rfcs/rfc1321.html                                         
   Specifically, they say that "it is conjectured that the difficulty
   of coming up with two messages having the same message digest is on
   the order of 2^64 operations, and that the difficulty of coming up
   with any message having a given message digest is on the order
   of 2^128 operations."

   This means you would need 18,446,744,073,709,551,616 keys to make
   it 100% likely that there would be a collision.

   Assuming you were only willing to risk a 0.01% chance that there is
   a collision, you could still fit 1,844,674,407,370,955 keys.  Since
   each value stored will take at least 1k in memory, that number of
   keys would require 1,844,674 Terrabytes of storage.
   
   In other words, it is highly unlikely that you could get a collision
   in real world usage.


Note that I'm not disagreeing with Barrie at all -- I just took the
easy way out and refuted the likelihood of an infinite number of
monkeys hitting my code.  :)

-DeWitt

Reply via email to