Gunnar, > while reading the implementation notes for ccache, I came to the > question why it uses MD4 and not MD5. MD4 has a significantly > higher collision probability (2^20 instead of 2^64, cf. Menezes/ > Oorschot/Vanstone, 'Handbook of Applied Cryptography', p. 339, > <http://www.cacr.math.uwaterloo.ca/hac/> chapter 9). This means > that one can expect a 50 % collision probability after just one > million different compilations with MD4; it becomes so much less > probable with MD5 that I don't know how to express it in English.
You are badly misinterpreting that table in HAC. The table gives you an idea of how much computation (in terms of number of hash calculations) would be required for an _attacker_ to find a collision. This is totally different from the number of collisions found with data chosen by a non-attack method. As an example, think of a trivial, but long, hash based on a simple CRC. It would take an enormous number of random pieces of data to find a collision in a long CRC, but it would take an attacker with knowledge of cryptanalysis only a very few cycles. In case you don't believe me, I have in fact tried this. See http://samba.org/ftp/unpacked/junkcode/md4-collision/ for my test code that will tell you if you get a collision in the first approximately ten million random md4 calculations (beyond that there is a chance of a bucket overflow, and fixing that would require more than the 80 lines of code I have used). I have run this code and found no collisions. I believe that it is well established that MD4 is quite collision resistant against random data, and I imagine that the guys at RSA would have hung themselves with embarrassment if they had released a message digest with a random collision rate of just 2^20. As far as I know, the expected number of pairs of random data to produce a collision is 2^64, which is the same as MD5. What has been found is that MD4 is lousy as a _cryptographic_ cipher. But in ccache we aren't using its cryptographic properties. We are just using it as a convenient, fairly collision resistant hash. If you think you need the cryptographic properties of a good message digest function in ccache, then I wonder what C code you are compiling! It would have to be hand-crafted C from a crypt-analyst hell bent on causing you trouble. In that case I recommend not compiling their code for them. Now, despite all of the above, we could switch ccache over to MD5 if it gives some people a warm-fuzzy feeling, but we should do it when there is some need to change the hash function ccache is using anyway, to minimize disruption to existing caches. That tends to happen every few revisions when some fundamental change is made to what goes into the hash. As a final warm-fuzzy for all you skeptics out there, ccache also includes the number of bytes in the hash input in the hash filename, like this: .ccache/0/2/fa9002c30d7c9ef159664ba94bdaec-355123 the 355123 is the number of bytes being hashed. So even if MD4 did collide, it would have to collide on exactly the same length of cpp output (plus associated hash input data). Cheers, Tridge PS: If someone wants to extend my md4-collision finder to not be memory bound then please send me a patch