[See below a proposal that libsvn_wc not use any fixed hash function.]

Martin Edgar Furter Rathod wrote on Sat, 21 Jan 2023 05:22 +00:00:
> On 20.01.23 22:48, Karl Fogel wrote:
>> On 20 Jan 2023, Nathan Hartman wrote:
>>> We already can't store files with identical SHA1 hashes, but AFAIK the
>>> only meaningful impact we've ever heard is that security researchers
>>> cannot track files they generate with deliberate collisions. The same
>>> would be true with any hash type, for collisions within that hash
>>> type.
>> 
>> Yes.  A hash is considered "broken" the moment security researchers can 
>> generate a collision.
>
> No matter what hash function you choose now, sooner or later it will be 
> broken.
>
> But a broken hash function can still be good enough for use in tools 
> like subversion if it is used correctly. Instead of just storing the 
> hash value subversion should also store a sequence number. Whenever a 
> collision happens subversion has to compare the two (or more) files 
> which have the same hash value.

So, basically, just do what the implementation of hashes (the data
structure mapping keys to values) does?

I think this would work in most of our uses of checksums, and make it
possible to have collisions in both the repository and the wc.
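To make the idea concrete, here's a minimal sketch (not actual Subversion code; names and the in-memory dict are hypothetical) of the hash+sequence-number scheme Martin describes: the store keeps every distinct fulltext seen under a given hash value, and a file's stable key is the pair (hash, sequence number).

```python
import hashlib

# Hypothetical sketch of the hash+sequence-number scheme.
# 'store' maps a hash digest to the list of distinct fulltexts
# that have been seen with that digest.  A file's stable key is
# (digest, index of its fulltext in that list).
def store_file(store, data):
    digest = hashlib.md5(data).hexdigest()
    texts = store.setdefault(digest, [])
    for seqno, known in enumerate(texts):
        if known == data:            # same hash, same fulltext: reuse key
            return (digest, seqno)
    texts.append(data)               # same hash, new fulltext: new seqno
    return (digest, len(texts) - 1)
```

The full-content comparison only runs for fulltexts that share a digest, so, as Martin notes, the cost is negligible until a collision actually occurs.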

However, what about running `svn status` when there's an unhydrated file
that has been modified in a way that changes the fulltext but doesn't
change the checksum value?  In this case the BASE fulltext isn't
available locally to compare with.

----

I think there is actually something we can do about this: stop
hardcoding any particular hash function in libsvn_wc's internals.

The server is aware of what algorithm the wc uses on the wire, which is
SHA-1 in ra_serf's download optimization and MD5 in 
svn_delta_editor_t::apply_textdelta()
and svn_delta_editor_t::close_file().  However, the algorithm(s) used by
the wc for naming pristines and, in f32, for detecting local mods are
implementation details of the wc.

So, suppose the wc didn't hardcode _any particular_ hash function for
naming pristines and for status walks — not md5, not sha1, not sha256 —
but had each «svn checkout» run pick a hash function uniformly at random
out of a large enough family of hash functions[1].  (Intuitively, think
of a family of hash functions as a hash function with a random salt,
similar to [2].)

This way, even if someone tried to deliberately create a collision, they
wouldn't be able to pick a collision "off the shelf", as with
shattered.io; they'd need to compute a collision for the specific hash
function ("salt") used by that particular wc.  That's more difficult than
creating a collision in a well-known hash function, regardless of
whether we treat the salt's value as a secret of the wc (as in, stored
in a mode-0400 file under the .svn directory and not disclosed to the
server) or as a value the attacker is assumed to know.
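As a sketch of what "pick a hash function at random per checkout" could look like (hypothetical code, not a proposal for the actual wc schema): draw a random key once at checkout time and name pristines with a keyed hash of the fulltext.

```python
import hashlib
import hmac
import os

# Hypothetical sketch: each checkout draws a random key (the "salt")
# once, and pristine names are a keyed hash of the fulltext.  Names
# are stable within one wc but differ across wcs, so a precomputed
# collision in the bare hash function doesn't automatically yield
# colliding pristine names in any given wc.
def make_pristine_namer():
    key = os.urandom(32)                   # chosen once, at checkout time
    def name(fulltext):
        return hmac.new(key, fulltext, hashlib.sha1).hexdigest()
    return name
```

Whether the key is kept secret or assumed known to the attacker, the attacker still has to target that wc's specific function rather than reuse an off-the-shelf collision.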

So, that's one way to address the scenario kfogel described.

Thanks for speaking up, Martin.

Daniel

[1] I'm not making this term up; see, for instance, page 143 of
    https://cseweb.ucsd.edu/~mihir/papers/gb.pdf.  "𝒦" is keyspace,
    "D" is domain, "R" is range.  A random element K ∈ 𝒦 is chosen and the
    hash function H_K [aka H with currying of the first parameter] is
    used thereafter.

[2]
    import os
    from hashlib import sha1
    def f(foo):
        return sha1(str(foo).encode() + f.salt).hexdigest()
    f.salt = os.urandom(16)

> If the files are identical the old 
> hash+number pair is stored. If they differ the new file gets a new 
> sequence number and that hash+number pair is stored. Since collisions 
> almost never happen even if md5 is used the performance penalty will be 
> almost zero.
>
> The same thing has been discussed earlier and changing the hash function 
> will just solve the problem for a few years...
>
> Best regards,
> Martin
