Re: SHA1 hash safety
On Tue, Apr 19, 2005 at 06:48:57PM -0400, C. Scott Ananian wrote:
> On Tue, 19 Apr 2005, David Meybohm wrote:
> > But doesn't this require assuming the distribution of MD5 is uniform,
> > and don't the papers finding collisions in less show it's not? So,
> > your birthday-argument for calculating the probability wouldn't
> > apply, because it rests on the assumption MD5 is uniform, and it
> > isn't.
>
> No, the collision papers don't show this at all.

I didn't mean they showed it directly. I meant that by finding collisions in MD5 quickly, MD5 would have to have some non-uniformity. But that's wrong nevertheless, because uniformity and the ability to find collisions aren't related. Sorry to have wasted everyone's time.

Dave
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SHA1 hash safety
On Tue, 19 Apr 2005, David Meybohm wrote:
> But doesn't this require assuming the distribution of MD5 is uniform,
> and don't the papers finding collisions in less show it's not? So,
> your birthday-argument for calculating the probability wouldn't apply,
> because it rests on the assumption MD5 is uniform, and it isn't.

No, the collision papers don't show this at all.
 --scott

atomic strategic HBDRILL SARANAC COBRA JUDY Ft. Meade assassination politics
Mossad HOPEFUL ZPSEMANTIC DTFROGS HTKEEPER LITEMPO LIONIZER operation
 ( http://cscott.net/ )
Re: SHA1 hash safety
[trimmed cc list, nobody wants to read this noise]

On Sat, Apr 16, 2005 at 11:35:39PM +0200, Brian O'Mahoney wrote:
> (1) I _have_ seen real-life collisions with MD5, in the context of
> Document management systems containing ~10^6 ms-WORD documents.

Dude! You could have been *famous*! Why the aitch-ee-double-hockey-sticks didn't you publish this when you found it? Seriously, man.

> The MD5 hash was fine, or at least the code (a) produced the correct
> results on the published test cases, and, (b) was properly applied to
> all bytes of the file(s). I was surprised when it happened, which is
> why I bothered to post to this list at this time, so I make two more
> points

OK, I guess it's time for some remedial math. There are 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 different MD5 hashes. You are suggesting that you found a collision using ~1e6 = ~1,000,000 plaintexts. Let's suppose there were actually 100,000,000 = 1e8 plaintexts, just in case you underestimated the number.

Applying the birthday paradox, we have a 50% probability that you'd find one collision if there were ~7,213,475,309,916,173 possible hash values. If you extend the birthday argument (what is the probability of a collision if you take N samples from a set of size M?) you get the following results, with N = 1e8:

  50%   (1 in 2)    probability of collision in 7,213,475,309,916,173.
  1%    (1 in 100)  probability of collision in 497,496,027,172,833,194.
  0.05% (1 in 1845) probability of collision in 9,223,372,036,854,775,806.

That's where my quick-and-dirty solver craps out, but we're still a really long way from 340,282,366,920,938,463,463,374,607,431,768,211,456. A simple linear extrapolation (certainly wrong, but not by more than a few dozen orders of magnitude) says that the odds would be 1 in 68,056,473,384,187,692,692 for the full MD5 hash (I'm not even going to dignify that with a percentage).
I'm not going to do the sums, but I would hazard a guess that it's more likely your PC suffered a cosmic-ray-induced memory fault - EACH OF THE FOUR TIMES YOU TESTED IT - causing it to report the same MD5, than that you actually discovered a collision with a measly million (or even hundred million) plaintexts. (Of course, I don't know how many tests of the hash you actually did. But the point stands.) Hell, if you're *that* lucky, what are you doing in IT? You could be making a killing at the roulette table.

Or, even more likely, there was some other factor in the system (most likely that it was only using a few bits, probably 32, of the hash when looking for collisions) that resulted in a false alarm. If you had actual evidence of a collision, I'd love to see it - even if it's just the equivalent of

  % md5 foo
  d3b07384d113edec49eaa6238ad5ff00 foo
  % md5 bar
  d3b07384d113edec49eaa6238ad5ff00 bar
  % cmp foo bar
  foo bar differ: byte 25, line 1
  %

But in the absence of actual evidence, we have to assume (just based on the probabilities) that there was some error in your testing.

-andy
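Andy's birthday-bound figures can be reproduced with a few lines of Python. This sketch is an editorial illustration, not from the original thread; it uses the standard approximation P ≈ 1 - exp(-n(n-1)/2m) for the chance of at least one collision among n uniform samples from m possible values (math.expm1 is used so the tiny probabilities don't round to zero):

```python
import math

def collision_probability(n, m):
    """Approximate probability of at least one collision among n
    uniform samples from m possible values (birthday bound):
    P ~= 1 - exp(-n*(n-1) / (2*m)).  expm1 keeps precision when
    the probability is far below float epsilon."""
    return -math.expm1(-n * (n - 1) / (2.0 * m))

# One million full 128-bit MD5 hashes: the collision probability is
# vanishingly small.
print(collision_probability(10**6, 2**128))  # ~1.5e-27

# Truncate the hash to 32 bits and a collision in a million documents
# becomes essentially certain -- the likely real explanation.
print(collision_probability(10**6, 2**32))
```

The second figure is why Andy's "only using a few bits, probably 32" guess is so plausible: at 32 bits the birthday bound crosses 50% at only ~77,000 documents.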
Re: SHA1 hash safety
On Mon, 18 Apr 2005, Andy Isaacson wrote:
> If you had actual evidence of a collision, I'd love to see it - even
> if it's just the equivalent of
>   % md5 foo
>   d3b07384d113edec49eaa6238ad5ff00 foo
>   % md5 bar
>   d3b07384d113edec49eaa6238ad5ff00 bar
>   % cmp foo bar
>   foo bar differ: byte 25, line 1
>   %
> But in the absence of actual evidence, we have to assume (just based
> on the probabilities) that there was some error in your testing.

I've already had a long correspondence with this poster. He claims that this happened 7 years ago, involved a commercial contract covered by Swiss Banking Law (with caps!) and that, of course, he certainly doesn't retain [his] client's documents, and even if he *did*, he wouldn't show them to *me*. And then he was unable to comprehend that I couldn't accept his word alone as prima facie evidence that the laws of probability did not apply to him or his clients.

I've been a coder far too long to attribute to The Mysterious Hand Of God what can adequately be described by subtle programmer error. The most reasonable explanation, given the (lack of) evidence, is that the programmer involved quickly took refuge in a (wildly improbable, but his clients'll never know) MD5 collision instead of buckling down and finding the bug in his code.
 --scott

ODOATH Ortega FBI SGUAT AEBARMAN India Peking ODACID operation RYBAT
[Hello to all my fans in domestic surveillance] for Dummies KUCLUB
 ( http://cscott.net/ )
Re: SHA1 hash safety
* David Lang [EMAIL PROTECTED] wrote:

> this issue was raised a few days ago in the context of someone
> tampering with the files and it was decided that the extra checks
> were good enough to prevent this (at least for now), but what about
> accidental collisions?
>
> if I am understanding things right the objects get saved in the
> filesystem in filenames that are the SHA1 hash. if two legitimate
> files have the same hash I don't see any way for both of them to
> exist. yes the risk of any two files having the same hash is low, but
> in the earlier thread someone chimed in and said that they had two
> files on their system that had the same hash..

you can add -DCOLLISION_CHECK to Makefile:CFLAGS to turn on collision checking (disabled currently). If there indeed exist two files that have different content but the same hash, could someone send those two files?

	Ingo
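The collision check Ingo mentions is conceptually simple: before trusting that an existing object with the same name really holds the same content, compare the stored bytes. Git's actual implementation is C code guarded by -DCOLLISION_CHECK; the sketch below is an editorial illustration in Python (the function name, store layout, and error handling are assumptions, not git's API):

```python
import hashlib
import os

def write_object(store_dir, data):
    """Store data under its SHA-1 name.  With the collision check, an
    already-present object with the same name must match byte-for-byte,
    otherwise we have found two different contents with one hash."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(store_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            if f.read() != data:
                raise RuntimeError("SHA-1 collision detected for " + name)
        return name  # identical content already stored; nothing to do
    with open(path, "wb") as f:
        f.write(data)
    return name
```

This also answers David Lang's later question about what the check does: it errors out rather than silently keeping one of the two colliding contents.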
Re: SHA1 hash safety
Three points:

(1) I _have_ seen real-life collisions with MD5, in the context of Document management systems containing ~10^6 ms-WORD documents.

(2) The MAC (ethernet-hardware-address) of any interface _should_ help to make a unique Id.

(3) While I haven't looked at the details of the plumbing, this is the time to make sure we can, easily, drop in SHA-160, SHA-256 (or whatever comes from NIST) when needed.

David Lang wrote:
> On Sat, 16 Apr 2005, Ingo Molnar wrote:
> > * David Lang [EMAIL PROTECTED] wrote:
> > > this issue was raised a few days ago in the context of someone
> > > tampering with the files and it was decided that the extra checks
> > > were good enough to prevent this (at least for now), but what
> > > about accidental collisions?
> > >
> > > if I am understanding things right the objects get saved in the
> > > filesystem in filenames that are the SHA1 hash. if two legitimate
> > > files have the same hash I don't see any way for both of them to
> > > exist. yes the risk of any two files having the same hash is low,
> > > but in the earlier thread someone chimed in and said that they
> > > had two files on their system that had the same hash..
> >
> > you can add -DCOLLISION_CHECK to Makefile:CFLAGS to turn on
> > collision checking (disabled currently). If there indeed exist two
> > files that have different content but the same hash, could someone
> > send those two files?
>
> remember that the flap over SHA1 being 'broken' a couple weeks ago
> was not from researchers finding multiple files with the same hash,
> but finding that it was more likely than expected that files would
> have the same hash.
>
> there was a discussion on LKML within the last year about using MD5
> hashes for identifying unique filesystem blocks (with the idea of
> being able to merge identical blocks) and in that discussion it was
> pointed out that collisions are a known real-life issue.
>
> so if collision detection is turned on in git, does that make it
> error out if it runs into a second file with the same hash, or does
> it do something else?
> David Lang

-- Brian
Re: Re: SHA1 hash safety
Dear diary, on Sat, Apr 16, 2005 at 04:58:15PM CEST, I got a letter where C. Scott Ananian [EMAIL PROTECTED] told me that...
> On Sat, 16 Apr 2005, Brian O'Mahoney wrote:
> > (1) I _have_ seen real-life collisions with MD5, in the context of
> > Document management systems containing ~10^6 ms-WORD documents.
>
> Dude! You could have been *famous*! Why the aitch-ee-double-hockey-sticks
> didn't you publish this when you found it? Seriously, man.
> Even given the known weaknesses in MD5, it would take much more than a
> million documents to find MD5 collisions. I can only conclude that the
> hash was being used incorrectly; most likely truncated (my wild-ass
> guess would be to 32 bits; a collision is likely with 50% probability
> in a million document store for a hash of less than 40 bits). I know
> the current state of the art here. It's going to take more than just
> hearsay to convince me that full 128-bit MD5 collisions are likely.

I believe there are only two or so known to exist so far, and those were found by a research team in China (which, yes, is fairly famous among the cryptographic community now after publishing a paper consisting of little apart from the two collisions themselves).

	http://cryptography.hyperlink.cz/MD5_collisions.html

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
Re: SHA1 hash safety
that's the difference between CS researchers and sysadmins.

sysadmins realize that there are an infinite number of files that map to the same hash value and plan accordingly (because we KNOW we will run across them eventually), and don't see it as a big deal when we finally do.

CS researchers quote statistics that show how hard it is to intentionally create two files with the same hash and insist it just doesn't happen until presented with the proof, at which point it is a big deal.

a difference in viewpoints.

David Lang

On Sat, 16 Apr 2005, C. Scott Ananian wrote:
> Date: Sat, 16 Apr 2005 10:58:15 -0400 (EDT)
> From: C. Scott Ananian [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Cc: David Lang [EMAIL PROTECTED], Ingo Molnar [EMAIL PROTECTED],
>     git@vger.kernel.org
> Subject: Re: SHA1 hash safety
>
> On Sat, 16 Apr 2005, Brian O'Mahoney wrote:
> > (1) I _have_ seen real-life collisions with MD5, in the context of
> > Document management systems containing ~10^6 ms-WORD documents.
>
> Dude! You could have been *famous*! Why the aitch-ee-double-hockey-sticks
> didn't you publish this when you found it? Seriously, man.
> Even given the known weaknesses in MD5, it would take much more than a
> million documents to find MD5 collisions. I can only conclude that the
> hash was being used incorrectly; most likely truncated (my wild-ass
> guess would be to 32 bits; a collision is likely with 50% probability
> in a million document store for a hash of less than 40 bits). I know
> the current state of the art here. It's going to take more than just
> hearsay to convince me that full 128-bit MD5 collisions are likely.
> Please, let's talk about hash collisions responsibly.
I posted earlier about the *actual computed probability* of finding two files with an SHA-1 collision before the sun goes supernova. It's 10^28 to 1 against.

The recent cryptographic work has shown that there are certain situations where a decent amount of computer work (2^69 operations) can produce two sequences with the same hash, but these sequences are not freely chosen; they've got very specific structure. This attack does not apply to (effectively) random files sitting in a SCM.

	http://www.schneier.com/blog/archives/2005/02/sha1_broken.html

That said, Linux's widespread use means that it may not be unimaginable for an attacker to devote this amount of resources to an attack, which would probably involve first committing some specially structured file to the SCM (but would Linus accept it?) and then silently corrupting said file via a SHA1 collision to toggle some bits (which would presumably Do Evil). Thus hashes other than SHA1 really ought to be considered...

...but the cryptographic community has not yet come to a conclusion on what the replacement ought to be. These attacks are so new that we don't really understand what it is about the structure of SHA1 which makes them possible, which makes it hard to determine which other hashes are similarly vulnerable. It will take time.

I believe Linus has already stated on this list that his plan is to eventually provide a tool for bulk migration of an existing SHA1 git repository to a new hash type. Basically munging through the repository in bulk, replacing all the hashes. This seems a perfectly adequate strategy at the moment.
 --scott

WASHTUB Panama Minister Moscow explosives KUGOWN hack Marxist LPMEDLEY
genetic immediate radar SCRANTON COBRA JANE KGB Shoal Bay atomic Bejing
 ( http://cscott.net/ )
--
There are two ways of constructing a software design. One way is to
make it so simple that there are obviously no deficiencies. And the
other way is to make it so complicated that there are no obvious
deficiencies.
		-- C.A.R. Hoare
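Scott's 2^69 figure is worth putting in perspective with some quick arithmetic. This is an editorial sketch, not from the original post; the billion-hashes-per-second rate is an arbitrary assumption for scale:

```python
# Generic birthday attack on a 160-bit hash costs about 2^80 hash
# operations; the Wang et al. result Scott references reduces SHA-1
# collision search to about 2^69.
generic = 2**80
shortcut = 2**69
print(generic // shortcut)  # 2048: the attack is 2^11 times cheaper

# Even 2^69 is enormous.  At an (assumed) billion hashes per second
# on one machine, the search still takes tens of millennia.
ops_per_sec = 10**9
years = shortcut / ops_per_sec / (3600 * 24 * 365)
print(round(years))  # on the order of 18,700 years
```

Which is why the realistic worry in 2005 was well-funded attackers with special hardware, not accidental collisions in a repository.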
Re: SHA1 hash safety
> what I'm talking about is the chance that somewhere, sometime there
> will be two different documents that end up with the same hash

I have a vastly greater chance of a file colliding due to a hardware or software glitch than from a random message digest collision of two legitimate documents. I've lost quite a few files in 25 years of computing to just such glitches, sometimes without knowing it until months or years later.

We've already computed the chances of a random pure hash collision with SHA1 - it's something like an average of 1 collision every 10 billion years if we have 10,000 coders generating 1 new file version every minute, non-stop, 24 hours a day, 365 days a year.

Get real. There are _many_ sources of random error in our tools. When some sources are billions of billions of times more likely to occur, it makes sense to worry about them first. Reminds me of the drunk looking under the lamp post for the house keys he dropped - because that's where the light is.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 1.925.600.0401
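Paul's scenario is easy to sanity-check, and if anything it understates the safety margin. The arithmetic below is an editorial sketch (the birthday approximation P ≈ n²/2m is mine, not from the post):

```python
# Paul's scenario: 10,000 coders, one new file version per minute
# each, non-stop, for 10 billion years.
files_per_year = 10_000 * 60 * 24 * 365        # ~5.3 billion objects/year
n = files_per_year * 10_000_000_000            # total objects accumulated
m = 2**160                                     # distinct SHA-1 values

# Birthday approximation, valid while the probability is small:
# P(at least one collision) ~= n^2 / (2m).
p = n * n / (2 * m)
print(p)  # ~9e-10: roughly one chance in a billion, even on that timescale
```

So even after 10 billion years of that workload, the expected number of accidental SHA-1 collisions is still effectively zero; hardware glitches really are the dominant risk.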
Re: SHA1 hash safety
> sysadmins realize that there are an infinite number of files that map to

Sysadmins know that there are infinite ways for their systems to crap out, and try to cover for the ones that have better than a snowball's chance in Hades of being seen in their lifetime.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 1.925.600.0401
Re: SHA1 hash safety
Hi!

> We've already computed the chances of a random pure hash collision
> with SHA1 - it's something like an average of 1 collision every 10
> billion years if we have 10,000 coders generating 1 new file version
> every minute, non-stop, 24 hours a day, 365 days a year.

GIT is safe even for the millions of monkeys writing Shakespeare :-)

				Have a nice fortnight
-- 
Martin `MJ' Mares [EMAIL PROTECTED] http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Homo homini lupus, frater fratri lupior, bohemus bohemo lupissimus.
Re: SHA1 hash safety
"Brian" == Brian O'Mahoney [EMAIL PROTECTED] writes:

Brian> (1) I _have_ seen real-life collisions with MD5, in the context
Brian> of Document management systems containing ~10^6 ms-WORD
Brian> documents.

Was this whole-document based, or was it blocked or otherwise chunked? I'm wondering, because (SFAIK) the MS Word on-disk format is some serialized version of one or more containers, possibly nested. If your blocks are sized so that the first block is the same across multiple files, this could cause collisions - but they're the good kind, that allow us to save disk space, so they're not a problem.

Are you saying that, within 1e7 documents, you found two documents with the same MD5 hash yet different contents? That's not an accusation, btw; I'm just trying to get clarity on the terminology.

I'm fascinated by the idea of using this sort of content-addressable filesystem, but the chance of any collision at all wigs me out. I look at the probabilities, but still.

Thanks,
t.
Re: SHA1 hash safety
> but the chance of any collision at all wigs me out.

Guess you're just going to get wigged out then.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 1.925.600.0401
Re: SHA1 hash safety
Paul Jackson wrote:
> > what I'm talking about is the chance that somewhere, sometime there
> > will be two different documents that end up with the same hash
>
> I have vastly greater chance of a file colliding due to hardware or
> software glitch than a random message digest collision of two
> legitimate documents.

The probability of an accidental overlap for SHA-1 for two different files is absurdly remote; it's just not worth worrying about. However, the possibility of an INTENTIONAL overlap is a completely different matter. I think the hash algorithm should change in the future; I have a proposal below.

Someone has ALREADY broken into a server to modify the Linux kernel code, so the idea of an attack on kernel code is not an idle fantasy. MD5 is dead, and SHA-1's work factor has already been sufficiently broken that people have already been told to walk to the exits (i.e., DO NOT USE SHA-1 for new programs like git).

The fact that blobs are compressed first, with a length header in front, _may_ make it harder to attack. But maybe not. I haven't checked for this case, but most decompression algorithms I know of have a "don't change" mode that essentially just copies the data behind it. If the one used in git has such a mode (I bet it does!), an attacker could use that mode to make it MUCH easier to create an attack vector than it would appear at first. Now the attacker just needs to create a collision (hmmm, where was that paper?). Remember, you don't need to run a hash algorithm over an entire file; you can precompute to near the end, and then try your iterations from there. A little hardware (inc. FPGAs) would speed the attack.

Of course, that assumes you actually check everything to make sure that an attacker can't slip in something different. After each rsync, are all new files' hash values checked? Do they uncompress to the right length? Do they have excess data after the decompression?
I'm hoping that sort of input-checking (since the data might be from an attacker, if indirectly!) is already going on, though I haven't reviewed the git source code.

While the jury's still out, the current belief by most folks I talk to is that SHA-1 variants with more bits, such as SHA-256, are the way to go now. The SHA-1 attack simply reduces the work factor (it's not a COMPLETE break), so adding more bits is believed to increase the work factor enough to counter it. Adding more information to the hash can make attacking even harder. Here's one idea: whenever that hash algorithm switch occurs, create a new hash value as this:

  SHA-256 + uncompressed-length

where SHA-256 is computed just like SHA-1 is now, e.g., SHA-256(file) where file = typecode + length + compressed data. Leave the internal format as-is (with the length embedded as well). This means that an attacker has to come up with an attack that creates the same length uncompressed, yet has the same hash of the compressed result. That's harder to do. Length is also really, really cheap to compute :-).

That also might help convince the "what happens if there's an accidental collision" crowd: now, if the file lengths are different, you're GUARANTEED that the hash values are different, though that's not the best reason to do that.

One reason to think about switching sooner rather than later is that it'd be really nice if the object store also included signatures, so that in one fell swoop you could check who signed what (and thus you could later on CONFIRM with much more certainty who REALLY submitted a given change... say if it was clearly malicious). If you switch hash algorithms, the signatures might not work, depending on how you do it.

--- David A. Wheeler
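Wheeler's "hash plus uncompressed length" idea can be sketched in a few lines. Everything below is an editorial illustration: the header layout loosely mimics git's "type length\0" object convention but is an assumption, and `object_id` is a hypothetical helper, not a proposed API:

```python
import hashlib
import zlib

def object_id(typecode, payload):
    """Name an object by SHA-256 over (header + compressed payload),
    paired with the uncompressed length.  Two payloads of different
    lengths then get different IDs no matter what happens to the hash."""
    compressed = zlib.compress(payload)
    header = ("%s %d" % (typecode, len(payload))).encode() + b"\0"
    digest = hashlib.sha256(header + compressed).hexdigest()
    return "%s-%d" % (digest, len(payload))

print(object_id("blob", b"hello world"))
```

The appended length is what buys the "GUARANTEED different if lengths differ" property: an attacker must now find a colliding input of exactly the same uncompressed size, not merely any colliding input.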
Re: SHA1 hash safety
I have nothing further to contribute to this subtopic. Good luck with it.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 1.925.600.0401