On Thu, Nov 08, 2001 at 01:11:21PM -0500, Barrie Slaymaker wrote: > Even a bit more OT: one thing to watch out for, especially if you > plan on caching a *lot* of data, is that the Cache::* modules did > not do collision detection on MD5 collisions the last time I looked. > Forgive me if that's changed recently.
I knew that a collision was technically possible, but the odds seemed astronomically low. But if security is a concern, then I agree that MD5-ing the key isn't the best way to create a unique identifier. The thing that I liked most about MD5 (other than its convenience) is that it tends to distribute very evenly. For example, file system caches fill their directories roughly equally when their paths are created from MD5 hashed keys. Doing something simple and unique like "URL-encoding" the key to make a legal identifier (legal in the sense that it is a valid filename) wouldn't distribute as evenly. Barrie raised this question privately to me earlier this year, and this is what I wrote at the time: Good question. They would overwrite each other. You may want to check out this RFC for more information about the odds of that: http://www.faqs.org/rfcs/rfc1321.html Specifically, they say that "it is conjectured that the difficulty of coming up with two messages having the same message digest is on the order of 2^64 operations, and that the difficulty of coming up with any message having a given message digest is on the order of 2^128 operations." This means you would need 18,446,744,073,709,551,616 keys to make it 100% likely that there would be a collision. Assuming you were only willing to risk a 0.01% chance that there is a collision, you could still fit 1,844,674,407,370,955 keys. Since each value stored will take at least 1k in memory, that number of keys would require 1,844,674 Terrabytes of storage. In other words, it is highly unlikely that you could get a collision in real world usage. Note that I'm not disagreeing with Barrie at all -- I just took the easy way out and refuted the likelihood of an infinite number of monkeys hitting my code. :) -DeWitt