Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Daniel Shahaf Thu, 26 Jan 2023 01:34:19 -0800

Evgeny Kotkov via dev wrote on Mon, Jan 23, 2023 at 02:28:50 +0300:
> Daniel Shahaf <d...@daniel.shahaf.name> writes:
> 
> > > I can complete the work on this branch and bring it to a production-ready
> > > state, assuming there are no objections.
> >
> > Your assumption is counterfactual:
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
> >
> > https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E
> 
> I don't see any explicit objections in these two emails (here I assume that
> if something is not clear to a PMC member, it doesn't automatically become
> an objection).  If the "why?" question is indeed an objection, then I would
> say it has already been discussed and responded to in the thread.
>


The "Why?" was sent _after_ the post you're quoting, and in any case was
just an elevator pitch summary of something I had explained more verbosely.

The first post in this thread asserts X is a problem and Y is a solution
to it, and argues that Y is a good thing.  However, that post does not
explain /why/ X is a problem, does not consider alternatives to Y, and
does not consider possible cons of Y.  That's what's missing.

> Now, returning to the problem:
> 
> As described in the advisory [1], we have a supported configuration that
> makes data forgery possible:
> 
> - A repository with disabled rep-sharing allows storing different files with
>   colliding SHA-1 values.
> - Having a repository with disabled rep-sharing is a supported configuration.
>   There may be a certain number of such repositories in the wild
>   (for example, created with SVN < 1.6 and not upgraded afterwise).
> - A working copy uses an assumption that the pristine contents are equal if
>   their SHA-1 hashes are equal.
> - So committing different files with colliding SHA-1 values makes it possible
>   to forge the contents of a file that will be checked-out and used by the
>   client.
> 
> I would say that this state is worrying just by itself.
> 

I assume this situation could happen accidentally, say, if someone adds
shattered-1.pdf and shattered-2.pdf to the same wc in a particular way.
That is, I'm not assuming "forgery" (which implies Mallory is involved).

Still, this is a potential data integrity issue with the new-in-1.15 wc
format, so we should address it before the release.  What are our
options to address that?  Switching to another checksum is an option,
yes, but we [as in, dev@] don't seem to have considered any alternatives
to that.

Just off the top of my head, we could:

- Encourage or require use of rep-sharing
  [the advisory already recommends this]

- Encourage or require use of 
tools/hook-scripts/reject-detected-sha1-collisions.sh
  [the advisory already recommends this]

- Have f32 wc's refuse to talk to servers that don't detect SHA-1
  collisions.  (1.15 users will still be able to interoperate with old
  servers by using f31.)

And there may be more options.  (Lurkers are invited to speak up!)

> However, with the feasibility of chosen-prefix attacks on SHA-1 [2], it's
> probably only a matter of time until the situation becomes worse.
> 

Quoting the third hunk of 
<https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3C20221220201300.GH32332%40tarpaulin.shahaf.local2%3E>:

    What's the acceptance test we use for candidate checksum algorithms?
    
    You say we should switch to a checksum algorithm that doesn't have known
    collisions, but, why should we require that?  Consider the following
    160-bit checksum algorithm:
    .
        1. If the input consists of 40 ASCII lowercase hex digits and
           nothing else, return the input.
        2. Else, return the SHA-1 of the input.
    
    This algorithm has a trivial first preimage attack.  If a wc used this
    identity-then-sha1 algorithm instead of SHA-1, then… what?

> That could happen after a public disclosure of a pair of executable
> files/scripts where the forged version allows for remote code execution.
> Or maybe something similar with a file format that is often stored in
> repositories and that can be executed or used by a build script, etc.
> 

Err, hang on.  Your reference described a chosen-prefix attack, while
this scenario concerns a single public collision.  These are two
different things.

Disclosure of of a pair of executable files/scripts isn't by itself
a problem unless one of the pair ("file A") is in a repository
somewhere.  Now, was the colliding file ("file B") generated _before_ or
_after_ file A was committed?

- If _before_, then it would seem Mallory had somehow managed to:

  1. get a file of his choosing committed to Alice's repository; and

  2. get a wc of Alice's repository into one of the codepaths that
     assume SHA-1 is one-to-one / collission-free (currently that's the
     ra_serf optimization and the 1.15 wc status).

  Now, step #1 seems plausible enough.  As to step #2, it's not clear to
  me how file B would reach the wc in step #2… but insofar as security
  assumptions go, it seems reasonable to assume Mallory can make this
  happen.

  So, I agree it's a scenario we should address.  What options do we
  have to address it?  (I grant that migrating away from SHA-1 is one
  option.)

- If _after_, then you're presuming not simply a collision attack but
  a second preimage attack.  Should we assume Mallory to be able to
  mount a second preimage attack?

Chosen-prefix collision attacks can help Mallory in a variant of the
"before" case: Mallory computes a collision, sends file A to Alice (who
commits it), and invokes his assumed ability to inject file B into
Alice's wc.  This would work for file formats that ignore the unchosen
suffix.

> [1] https://subversion.apache.org/security/sha1-advisory.txt
> [2] https://sha-mbles.github.io/
> 
> 
> Speaking of the proposed switch to SHA-256 or a different checksum, there's
> an argument by contradiction: if we were designing the pristineless working
> copy from scratch today, would we choose SHA-1 as the best available hash
> that can be used to assert content equality?

If we were designing f32 from the ground up, I hope we'd first nail down
our requirements and then check what are the possible ways to address
them.


We might specify that "The probability of birthday collisions in
<USE-CASE> must not exceed <PERCENTAGE>.".

[E.g., my parents named each of their kids for the CRC32 of the
timestamp,longitude,latitude,altitude of that kid's birth, and that was
fine: they had no collisions.  In contrast, sea turtles shouldn't use
CRC32 to name their kids, since they'd have a ≈5% chance of a collision
due to their larger number of offspring.  However, sea turtles would
have no collisions if they used MD5.]


We might specify that "The hash of an <N>-byte file can be computed
within <T> milliseconds on <SUCH AND SUCH> hardware."

[E.g., the existence of https://en.wikipedia.org/wiki/Intel_SHA_extensions
is a consideration.]


We might specify that "An attacker who is capable of <SUCH AND SUCH>
will not be able to cause a false positive or a false negative in the wc
status optimization.".

[E.g., see above about second preimage attacks.]


And then we'd brainstorm possible solutions (plural) and run each of
them through the specifications, which would be our acceptance test
checklist.

(And since we aren't designing from scratch, our actual acceptance test
would also include implementation and maintenance costs for us and
upgrade costs for our users.)

> If yes, how can one prove that?

Well, for starters, rep-sharing was released in in 2009, the first
public collision (shattered) was published in 2017, a chosen-prefix
attack (shambled) in 2020, and we haven't had any complaints since then
<fine print>other than from people literally trying to store
shattered-1.pdf and shattered-2.pdf in their repos</fine print>?

And so long as we're doing thought experiments, here's another: If we
switched to using only MD5 internally, would anyone notice?  (Cf. above
about identity-then-sha1, which is even weaker than MD5.)

> > Objections have been raised, been left unanswered, and now implementation
> > work has commenced following the original design.  That's not acceptable.
> > I'm vetoing the change until a non-rubber-stamp design discussion has
> > been completed on the public dev@ list.
> 
> I would like to note that vetoing a code modification should be accompanied
> with a technical justification, and I have certain doubts that the above
> arguments qualify as such:
> 
> https://www.apache.org/foundation/voting.html
> [[[
> To prevent vetoes from being used capriciously, the voter must provide
> with the veto a technical justification showing why the change is bad
> (opens a security exposure, negatively affects performance, etc. ).
> A veto without a justification is invalid and has no weight.
> ]]]
> 
> (I'm not saying that the above rules have to be used in this particular case
>  and that a veto is invalid, but still thought it’s worth mentioning.)
> 

I vetoed the change because it hadn't been designed on the dev@ list,
had not garnered dev@'s consensus, and was being railroaded through.
(as far as I could tell)

> Anyway, I'll stop working on the branch, because a veto has been casted.

That's your decision.  Implementing one design on a branch while other
options are being considered by dev@ /is/ possible, but there are some
risks with that; cf. my remarks in
<https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230121092231.GA3174%40tarpaulin.shahaf.local2%3E>.

And once again, for clarity: I'm not vetoing migrating away from SHA-1.
(In fact, my intuition was that it'd be a good idea.)

Daniel

> 
> Regards,
> Evgeny Kotkov

Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

Reply via email to