On Fri, Mar 8, 2013 at 4:32 PM, Donald Stufft <don...@stufft.io> wrote:
> Here's some more information pulled straight from Wikipedia:

Trust me, I've read a LOT of Wikipedia (and even more from other sites, including at least the conclusions of a number of cryptography papers) about hashing attacks recently, because I was seeing inconsistencies in what people were saying about hashes and their weaknesses.

99.9% of the discussion about attacks on hashes has to do with collision attacks, prefix attacks, and length extension attacks, all of which are extremely relevant for *cryptographic* purposes: specifically, the use of hashes to verify identity, authority, repudiability, etc., which emphatically do *not* apply to the use of an MD5 as a checksum to verify a correct download.

All of these attacks depend on *something else* being at stake besides the integrity of the original message. For example, length-extension attacks bypass the need to know a "secret" used in a naive hash-based signature scheme (which is why you're supposed to use HMAC for such things), while collision attacks let you trick a signer into signing something that you can later replace with something altered.

The current use of #md5 tags isn't subject to either of these kinds of attack, because:

1. There is no "secret" to be revealed, and
2. The author and signer are the same person

So the only type of attack I've found out about thus far, in my (admittedly few) hours of study on the subject, that is relevant to the way we use MD5 on PyPI at present is the so-called "second pre-image" attack, which is when you're given an existing message and its hash, and have to create a new message with the same hash... while also incorporating something useful in the new message.

The most recent report I saw on second pre-image attacks against full MD5 estimated a 2**127 strength, meaning that even if you could process a great many billion tries per second, it would still take you many orders of magnitude longer than thousands of years to come up with a file that could masquerade as an existing download.

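For a rough sense of scale, here is a back-of-the-envelope sketch of that 2**127 figure. The trillion-tries-per-second rate and the helper name are purely illustrative assumptions of mine, not numbers from any report:

```python
# Rough estimate only: the attempt rate below is an assumed, illustrative figure.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_to_exhaust(work_bits: int, tries_per_second: float) -> float:
    """Years needed to perform 2**work_bits hash attempts at a fixed rate."""
    return 2 ** work_bits / tries_per_second / SECONDS_PER_YEAR


# Assume a (very generous) trillion MD5 attempts per second.
years = years_to_exhaust(127, 1e12)
print(f"{years:.2e} years")
```

Even at that assumed rate the result is on the order of 10**18 years, so the exact hardware budget hardly matters to the argument.
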
(And most people's computers and/or internet connections would choke on the massive file sizes needed for the still-theoretical Kelsey-Schneier generalized preimage attack, which in any case would apply equally to just about any other hash we could currently put out in the field; i.e., it's not specific to a particular hash algorithm, it just relies on certain properties of the algorithm.)

So, yeah, MD5 is *cryptographically* broken, sure. But it's not broken for *data integrity*. And in the PyPI use case, the "cryptographic" part is all in the SSL being used to fetch the MD5 link in the first place.

> Here's the important highlights:
>
> - specifically, a group of researchers described how to create a pair of
> files that share the same MD5 checksum

Right, that's what's called a "collision attack". It means that you can go out *ahead of time*, and make two files with the same checksum, one good, one evil. It does *not* mean you get to take an existing file, and then make a second file with the same checksum. (The latter is a "second preimage" attack, which is *not* broken.)

Hash collision attacks in PyPI would basically require an author to upload a special version of their package that looked innocent, and then they could later switch that version out with one that's harmful. And the *way* that this works is that you specially generate *both* files, in advance. Which means that the author themselves is compromised, so the threat is moot.

The author can already upload compromised code (either through being evil or having their PC hijacked), and what #md5 it has is 100% irrelevant. That is, there's nothing stopping an evil author or an author with a compromised PC from simply uploading a new file with a new MD5, because PyPI will pass it along in exactly the same way. Changing hash algorithms will not affect this threat vector in the slightest.

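To make the "data integrity" role concrete, the check an installer performs against an #md5 tag amounts to something like the sketch below. The function name and the `#md5=<hex digest>` fragment layout are assumptions for illustration (written for Python 3), not a copy of any installer's actual code:

```python
import hashlib
from urllib.parse import urldefrag


def md5_matches(link: str, data: bytes) -> bool:
    """Compare downloaded bytes with the #md5=<hex> fragment of a link.

    Hypothetical helper: it only detects a corrupted or mismatched
    download and says nothing about who produced the file.
    """
    _, fragment = urldefrag(link)
    algo, _, expected = fragment.partition("=")
    if algo != "md5" or not expected:
        raise ValueError("link carries no #md5= fragment")
    return hashlib.md5(data).hexdigest() == expected.lower()


# Usage with a made-up link and payload.
payload = b"pretend this is an sdist"
link = "https://example.invalid/pkg-1.0.tar.gz#md5=" + hashlib.md5(payload).hexdigest()
print(md5_matches(link, payload))
```

Note that nothing in this check involves a secret or a separate signer; whatever trust there is comes from the channel the link itself was fetched over.
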
Given these facts, it makes no sense to fuss over the hash algorithm in current use, since a concurrent goal here is to switch to file formats that can be directly signed using, you know, *actual* cryptography. ;-)

The new .wheel format makes provisions for modern signature techniques. It'd be good if sdists also did. Then the #md5 tag can die a natural death, hopefully replaced within the year by a hashtag that, say, fingerprints the author's public key as registered with PyPI, or something of that sort.

In the meantime, there's no actual threat here, so bikeshedding what to replace it with *while keeping the current system* is like rearranging office furniture in a building that's about to have demolition charges set underneath it. ;-)

_______________________________________________
Catalog-SIG mailing list
Catalog-SIG@python.org
http://mail.python.org/mailman/listinfo/catalog-sig