Re: SHA1 hash safety

2005-04-16 Thread Tkil

>>>>> "Tkil" == Tkil <[EMAIL PROTECTED]> writes:

Tkil> but the chance of any collision at all wigs me out.

>>>>> "Paul" == Paul Jackson <[EMAIL PROTECTED]> writes:

Paul> Guess you're just going to get wigged out then.

Wig wig.  :)

I didn't mean "wigs me out to the point I won't use it" but more of
"wigs me out so that I'm curious whether there are backup schemes
worth considering".

In particular, the comparisons between hash collisions and hardware
failure seem contrived -- if I have bad RAM, or a bad block on my HD,
I can recover it from known good sources.  But if the actual known
good source is structured in such a way that a particular set of data
cannot be represented, that bothers me.

In this case, the fact that a colliding file would have to be the same
length, have the same SHA-1, and be correct C (and functionally
similar C at that) makes for a comforting cushion.  Further, git
wouldn't be the only representation; there would be periodic tarballs,
different trees, etc.

On the other paw, if "effectively random" MS Word docs gave true MD5
collisions (with a proper MD5 hash computed over the entire document)
in a "mere" 1e6 space, the ~10^6 documents Brian mentioned, that is
interesting/scary.
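
For scale, here is a rough sketch of the birthday-bound arithmetic.
It assumes an ideal hash whose outputs are uniformly random, so it
says nothing about deliberately constructed collisions.  The function
name collision_probability is made up for illustration; the document
counts are the ones mentioned in this thread.

```python
import math

def collision_probability(n_items, hash_bits):
    """Approximate chance that any two of n_items share a hash value.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / 2^(bits+1)),
    which assumes hash outputs are uniformly random.
    """
    exponent = n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)

for n in (10**6, 10**7):
    print(f"{n:>10} docs  MD5: {collision_probability(n, 128):.3e}"
          f"  SHA-1: {collision_probability(n, 160):.3e}")
```

Under that assumption, an accidental whole-document collision at
these sizes would be extraordinarily unlikely, which is part of why
the details of how the hash was computed matter.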

(I was also trying to add a few factoids to the MSW comment, as their
structure could lead to collisions if (say) only the first 512 bytes
were considered -- it's possible that nothing but size and date might
change in that, and /those/ I can see colliding in 1e6 documents.)
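
A minimal sketch of that failure mode, assuming two documents whose
first 512 bytes happen to be identical.  The header bytes, the
document bodies, and the helper names md5_prefix and md5_full are all
hypothetical placeholders, not a real Word file layout.

```python
import hashlib

# Hypothetical shared 512-byte header (placeholder bytes, not a real file).
HEADER = b"\xd0\xcf\x11\xe0" + b"\x00" * 508

doc_a = HEADER + b"Quarterly report, draft 1"
doc_b = HEADER + b"Completely different memo"

def md5_prefix(data, n=512):
    """Hash only the first n bytes, as a careless system might."""
    return hashlib.md5(data[:n]).hexdigest()

def md5_full(data):
    """Hash the entire document."""
    return hashlib.md5(data).hexdigest()

print("prefix hashes equal:", md5_prefix(doc_a) == md5_prefix(doc_b))
print("full hashes equal:  ", md5_full(doc_a) == md5_full(doc_b))
```

The prefix hashes match because the hashed bytes are identical, which
is a property of the sampling scheme rather than a weakness in MD5.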

Finally, I apologize for taking your time.  I'm just watching this
from the sidelines, and the questions above are just intellectual
curiosity.  :-/

(The only other thread I'm really following is people trying to chunk
files in a way that would increase storage efficiency.  Reading the
Venti paper, I was wondering how efficient it would be, since a
one-byte addition at the top of a file would generate all-new blocks;
the rsync-ish approach seems to offer substantial relief there.  But
if the "interesting history" fits in 10 USD worth of HD, that might be
enough.  Babble.)
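
A toy comparison of the two chunking ideas, under loud assumptions:
this is neither Venti's nor rsync's actual algorithm.  fixed_chunks
cuts at fixed offsets; content_chunks is a simple content-defined
chunker that cuts wherever the sum of the last few bytes hits a magic
value, in the spirit of a rolling checksum.  All names, the window
size, and the divisor are illustrative choices.

```python
import hashlib
import random

def fixed_chunks(data, size=64):
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data, window=16, divisor=64):
    """Cut after any byte where the sum of the last `window` bytes,
    taken modulo `divisor`, equals divisor - 1."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        if i >= window - 1 and rolling % divisor == divisor - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def digests(chunks):
    """Set of SHA-1 names a content-addressed store would use."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}

random.seed(0)
original = bytes(random.randrange(256) for _ in range(8192))
shifted = b"X" + original  # one-byte addition at the top of the file

for name, chunker in (("fixed", fixed_chunks), ("content", content_chunks)):
    old, new = digests(chunker(original)), digests(chunker(shifted))
    print(f"{name:>7}: {len(old & new)} of {len(old)} blocks reused")
```

With fixed offsets every block boundary moves, so no block is reused;
with content-defined cuts only the blocks touching the insertion
change.  A real chunker would also enforce minimum and maximum block
sizes and use a stronger rolling hash.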

Thanks,
t.

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SHA1 hash safety

2005-04-16 Thread Tkil
>>>>> "Brian" == Brian O'Mahoney <[EMAIL PROTECTED]> writes:

Brian> (1) I _have_ seen real-life collisions with MD5, in the context
Brian> of Document management systems containing ~10^6 ms-WORD
Brian> documents.

Was this whole-document based, or was it blocked or otherwise chunked?

I'm wondering, because (SFAIK) the MS Word on-disk format is some
serialized version of one or more containers, possibly nested.  If
your blocks are sized so that the first block is the same across
multiple files, this could cause collisions -- but they're the good
kind, that allow us to save disk space, so they're not a problem.
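
A minimal sketch of why identical blocks are the "good kind": in a
block store keyed by content hash, two files with the same first block
share one stored copy.  The store_blocks helper, the 512-byte block
size, and the sample data are assumptions for illustration only.

```python
import hashlib

def store_blocks(store, data, block_size=512):
    """Split data into blocks, store each under its SHA-1, return the keys."""
    keys = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        key = hashlib.sha1(block).hexdigest()
        store.setdefault(key, block)  # identical content is stored once
        keys.append(key)
    return keys

store = {}
common_first_block = b"H" * 512       # hypothetical shared container header
file_a = common_first_block + b"a" * 512
file_b = common_first_block + b"b" * 512

keys_a = store_blocks(store, file_a)
keys_b = store_blocks(store, file_b)
print("blocks written:", len(keys_a) + len(keys_b), "blocks stored:", len(store))
```

Here the matching keys refer to byte-identical blocks, so nothing is
lost; the worrying case is two different blocks mapping to one key.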

Are you saying that, within those ~1e6 documents, you found two
documents with the same MD5 hash yet different contents?

That's not an accusation, btw; I'm just trying to get clarity on the
terminology.  I'm fascinated by the idea of using this sort of
content-addressable filesystem, but the chance of any collision at all
wigs me out.  I look at the probabilities, but still.

Thanks,
t.