[ccache] Why not using MD5?

2004-11-15 Thread Gunnar Ritter
tri...@samba.org wrote:

> You are badly misinterpreting that table in HAC.

Yes, you are right. Sorry for the brainfart.

Gunnar


[ccache] Why not using MD5?

2004-11-15 Thread tri...@samba.org
Gunnar,

 > while reading the implementation notes for ccache, I came to the
 > question why it uses MD4 and not MD5. MD4 has a significantly
 > higher collision probability (2^20 instead of 2^64, cf. Menezes/
 > Oorschot/Vanstone, 'Handbook of Applied Cryptography', p. 339,
 >  chapter 9). This means
 > that one can expect a 50 % collision probability after just one
 > million different compilations with MD4; it becomes so much less
 > probable with MD5 that I don't know how to express it in English.

You are badly misinterpreting that table in HAC. The table gives you
an idea of how much computation (in terms of number of hash
calculations) would be required for an _attacker_ to find a
collision. This is totally different from the number of collisions
found with data chosen by a non-attack method.

As an example, think of a trivial, but long, hash based on a simple
CRC. It would take an enormous number of random pieces of data to
find a collision in a long CRC, but it would take an attacker with
knowledge of cryptanalysis only a very few cycles.

In case you don't believe me, I have in fact tried this. See
http://samba.org/ftp/unpacked/junkcode/md4-collision/ for my test code
that will tell you if you get a collision in the first approximately
ten million random md4 calculations (beyond that there is a chance of
a bucket overflow, and fixing that would require more than the 80
lines of code I have used). I have run this code and found no
collisions.

I believe that it is well established that MD4 is quite collision
resistant against random data, and I imagine that the guys at RSA
would have hung themselves with embarrassment if they had released a
message digest with a random collision rate of just 2^20. As far as I
know, the expected number of pairs of random data to produce a
collision is 2^64, which is the same as MD5.

What has been found is that MD4 is lousy as a _cryptographic_
cipher. But in ccache we aren't using its cryptographic properties. We
are just using it as a convenient, fairly collision resistant hash.

If you think you need the cryptographic properties of a good message
digest function in ccache, then I wonder what C code you are
compiling! It would have to be hand-crafted C from a crypt-analyst
hell bent on causing you trouble. In that case I recommend not
compiling their code for them.

Now, despite all of the above, we could switch ccache over to MD5 if
it gives some people a warm-fuzzy feeling, but we should do it when
there is some need to change the hash function ccache is using anyway,
to minimize disruption to existing caches. That tends to happen every
few revisions when some fundamental change is made to what goes into
the hash.

As a final warm-fuzzy for all you skeptics out there, ccache also
includes the number of bytes in the hash input in the hash filename,
like this:

  .ccache/0/2/fa9002c30d7c9ef159664ba94bdaec-355123

the 355123 is the number of bytes being hashed. So even if MD4 did
collide, it would have to collide on exactly the same length of cpp
output (plus associated hash input data). 

Cheers, Tridge

PS: If someone wants to extend my md4-collision finder to not be
memory bound then please send me a patch


[ccache] Why not using MD5?

2004-11-15 Thread Martin Pool
On 15 Nov 2004, Egmont Koblinger  wrote:
> PS. AFAIK no-one has ever found two different files with the same md5. Am I
> right?

People have constructed collisions.  I am not aware of anyone finding
a collision with ccache, but it's possible that they would just
attribute it to a mysterious failure.

--
Martin
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : 
http://lists.samba.org/archive/ccache/attachments/20041115/afed3f55/attachment.bin


[ccache] Why not using MD5?

2004-11-15 Thread Egmont Koblinger
On Mon, Nov 15, 2004 at 06:25:49PM +1100, Martin Pool wrote:

> The other factor is whether the cache will grow to include a million
> files.

We're building a distribution using a build system completely created by us
and it intensively uses ccache. We have approx 1000 packages, including, of
course, all the standard linux applications.

Through about 1.5 years we've had one huge ccache pool for all our packages
and sometimes manually dropped old files after a change that caused old
files not to match anymore (gcc, glibc etc. upgrade). The ccache size was
around 5-10 GB.

Recently we changed to use separate pool for each package, but for other
reasons, not because we've met any collision. Fortunately we didn't meet
any, but I didn't even realize that we've been not far from it...

(If you're curious I can easily count the number of entries in our ccache
pools later this day.)

So there's one more vote here for md5. It doesn't hurt if ccache is prepared
for "extreme use". :-)) And it's perfect time to do it now since the
2.4->2.5 version number change would reflect the md4->md5 change :-


PS. AFAIK no-one has ever found two different files with the same md5. Am I
right?


bye,

Egmont


[ccache] Why not using MD5?

2004-11-15 Thread Martin Pool
On 15 Nov 2004, Gunnar Ritter  wrote:

> But a million lies within reasonable bounds. For example, compiling
> a Linux kernel produces about 1000 objects in a typical setup. Then
> if one uses a cache for two years for twenty-five projects of that
> size in the average, changing the compiler version four times within
> that period, ten versions of each project that change a global header
> will suffice to produce one million different objects. A compilation
> server for a Linux distribution with that usage pattern will then
> reach a 50 % probability of a ccache collision within these two years.

The other factor is whether the cache will grow to include a million
files.  I have a 403MB cache containing 15196 object files, for an
average of 26kB per file.  To get to a million files would be on the
order of 26GB, which is probably larger than most people would use.
There is additionally a limit on the number of files in the cache.

But regardless of this, I agree it would be better to use MD5.

--
Martin
-- next part --
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : 
http://lists.samba.org/archive/ccache/attachments/20041115/899ce0b1/attachment.bin


[ccache] Why not using MD5?

2004-11-14 Thread Gunnar Ritter
Hi,

while reading the implementation notes for ccache, I came to the
question why it uses MD4 and not MD5. MD4 has a significantly
higher collision probability (2^20 instead of 2^64, cf. Menezes/
Oorschot/Vanstone, 'Handbook of Applied Cryptography', p. 339,
 chapter 9). This means
that one can expect a 50 % collision probability after just one
million different compilations with MD4; it becomes so much less
probable with MD5 that I don't know how to express it in English.

But a million lies within reasonable bounds. For example, compiling
a Linux kernel produces about 1000 objects in a typical setup. Then
if one uses a cache for two years for twenty-five projects of that
size in the average, changing the compiler version four times within
that period, ten versions of each project that change a global header
will suffice to produce one million different objects. A compilation
server for a Linux distribution with that usage pattern will then
reach a 50 % probability of a ccache collision within these two years.
(This analysis might not be totally correct because ccache does not
pass randomly generated input to MD4. However, it seems clear that
the risk of a collision cannot be neglected here.)

Using MD5 comes with a speed penalty (cf. p. 345 of the named book)
but it seems absolutely negligible for ccache. A quick profiling run
shows that MD4 consumes about 1/100 of the time the preprocessor needs
with Linux/glibc. If MD5 consumes 1/70 of its time, there is still no
user-visible difference. I have tested ccache with both MD4 and MD5
and could not measure any resulting difference without profiling again.

So I would suggest that future ccache versions use MD5 instead of MD4.

Changing ccache to use MD5 is trivial. I can send a patch if someone
really wishes, though.

Gunnar

-- 
http://omnibus.ruf.uni-freiburg.de/~gritter