[See below a proposal that libsvn_wc not use any fixed hash function.] Martin Edgar Furter Rathod wrote on Sat, 21 Jan 2023 05:22 +00:00: > On 20.01.23 22:48, Karl Fogel wrote: >> On 20 Jan 2023, Nathan Hartman wrote: >>> We already can't store files with identical SHA1 hashes, but AFAIK the >>> only meaningful impact we've ever heard is that security researchers >>> cannot track files they generate with deliberate collisions. The same >>> would be true with any hash type, for collisions within that hash >>> type. >> >> Yes. A hash is considered "broken" the moment security researches can >> generate a collision. > > No matter what hash function you choose now, sooner or later it will be > broken. > > But a broken hash function can still be good enough for use in tools > like subversion if it is used correctly. Instead of just storing the > hash value subversion should also store a sequence number. Whenever a > collision happens subversion has to compare the two (or more) files > which have the same hash value.
So, basically, just do what the implementation of hashes (the data structure mapping keys to values) does? I think this would work in most of our uses of checksums, and make it possible to have collisions in both the repository and the wc. However, what about running `svn status` when there's an unhydrated file that has been modified in a way that changes the fulltext but doesn't change the checksum value? In this case the BASE fulltext isn't available locally to compare with. ---- I think there is actually something we can do about this: stop hardcoding any particular hash function in libsvn_wc's internals. The server is aware of what algorithm the wc uses on the wire, which is SHA-1 in ra_serf's download optimization and MD5 in svn_delta_editor_t::apply_textdelta() and svn_delta_editor_t::close_file(). However, the algorithm(s) used by the wc for naming pristines and, in f32, for detecting local mods are implementation details of the wc. So, suppose the wc didn't hardcode _any particular_ hash function for naming pristines and for status walks — not md5, not sha1, not sha256 — but had each «svn checkout» run pick a hash function uniformly at random out of a large enough family of hash functions[1]. (Intuitively, think of a family of hash functions as a hash function with a random salt, similar to [2].) This way, even if someone tried to deliberately create a collision, they wouldn't be able to pick a collision "off the shelf", as with shattered.io; they'd need to compute a collision for the specific hash function ("salt") used by that particular wc. That's more difficult than creating a collision in a well-known hash function, regardless of whether we treat the salt's value as a secret of the wc (as in, stored in a mode-0400 file in under .svn directory and not disclosed to the server) or as a value the attacker is assumed to know. So, that's one way to address the scenario kfogel described. Thanks for speaking up, Martin. Daniel [1] I'm not making this term up; see, for instance, page 143 of https://cseweb.ucsd.edu/~mihir/papers/gb.pdf. "𝒦" is keyspace, "D" is domain, "R" is range. A random element K ∈ 𝒦 is chosen and the hash function H_K [aka H with currying of the first parameter] is used thereafter. [2] def f(foo): return sha1(str(foo) + f.salt) f.salt = str(random_thing()) > If the files are identical the old > hash+number pair is stored. If they differ the new file gets a new > sequence number and that hash+number pair is stored. Since collisions > almost never happen even if md5 is used the performance penalty will be > almost zero. > > The same thing has been discussed earlier and changing the hash function > will just solve the problem for a few years... > > Best regards, > Martin