On Tue, Oct 16, 2012 at 02:27:51PM -0400, Jeff King wrote:
> > The one reason why we *might* want to use SHA-3, BTW, is that it is a
> > radically different design from SHA-1 and SHA-2.  And if there is a
> > crypto hash failure which is bad enough that the security of git would
> > be affected, there's a chance that the same attack could significantly
> > affect SHA-2 as well.  The fact that SHA-3 is fundamentally different
> > from a cryptographic design perspective means that an attack that
> > impacts SHA-1/SHA-2 will not likely impact SHA-3, and vice versa.
> 
> Right. The point of having the SHA-3 contest was that we thought SHA-1's
> breakage meant that SHA-2 was going to fall next. But Schneier's
> comments before the winners were announced were basically "it turns out
> that SHA-2 is not broken like we thought, so there's no reason to ditch
> it, and the fact that it is well-studied and well-deployed may mean it's
> a good choice".
> 
> So I could go either way. This is not a decision we should make today,
> though, so we can wait and see which direction the world goes before
> picking an algorithm.

Do you really need to pick an algorithm and go through a full-on flag
day ten years down the road all over again? People don't really care
that a git revision is actually the hex-encoded SHA1 hash of a commit.
They just know it's this long string of "stuff" that uniquely identifies
a revision globally somehow. They know if they copy and paste the first
few characters of the string there is a small chance two revisions will
have the same first few characters, and if they copy and paste the whole
string the chance drops to "your whole dev team will be eaten by
wolves in tragic unrelated incidents" unlikely.

So why bake in a single algorithm? We'll have to extend the length of a
whole revision string anyway - the alternatives start at 256 bits - and
people are going to want to be able to specify the whole revision string
at least sometimes. Once you've gone through that pain, why have to
repeat it again in ten years?


Let's make revisions be a long but variable length string. A revision by
itself is meaningless of course. However, if you know of a repo that
contains that revision, you can convert it into something useful, like a
commit and associated tree. If you don't know, well, you'd be stuck
anyway right now. 

Now when you push and pull from a remote repo what'll happen is the repo
will figure out what type(s) of hash algorithm your client supports. A
Git 3000 user with a repo using SHA3072 can talk to a v0.1 client just
fine: they send the v0.1 client revisions calculated with an algorithm
they support, and when they pull revisions from that repo they calculate
new revisions with their preferred algorithm. If they want to do this a
lot, they maintain the two sets of digest tables next to each other,
with the SHA3072 table marked as preferred, and the rest kept only so
pushes and pulls can be fast. In most cases a project will convert to
one hash algorithm, but by having multi-hash support that conversion
doesn't have to be a flag day, and at the same time it's still easy to
look up old revisions by their old digests. Meanwhile the crypto-wonks
get to have their fun PGP signing and timestamping long, secure digests.
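
The side-by-side digest tables described above could look roughly like
this. It is a minimal sketch and not how Git stores objects:
`MultiHashStore`, `negotiate`, and the preference list are hypothetical
names, SHA-256 stands in for the imaginary "SHA3072", and raw bytes are
hashed directly instead of real Git object encodings.

```python
import hashlib

# Hypothetical preference order, strongest first; nothing Git defines.
PREFERRED = ["sha256", "sha1"]


class MultiHashStore:
    """Toy object store keeping one digest table per hash algorithm."""

    def __init__(self, algorithms=PREFERRED):
        self.algorithms = list(algorithms)
        # One table per algorithm: hex digest -> raw object bytes.
        self.tables = {name: {} for name in self.algorithms}

    def add(self, data: bytes) -> dict:
        """Store data under every algorithm and return all of its digests."""
        names = {}
        for name in self.algorithms:
            digest = hashlib.new(name, data).hexdigest()
            self.tables[name][digest] = data
            names[name] = digest
        return names

    def lookup(self, digest: str) -> bytes:
        """Find an object by a digest from any table, old or new."""
        for table in self.tables.values():
            if digest in table:
                return table[digest]
        raise KeyError(digest)

    def translate(self, digest: str, target: str) -> str:
        """Re-express a known digest under another algorithm."""
        return hashlib.new(target, self.lookup(digest)).hexdigest()


def negotiate(ours, theirs):
    """Pick our most preferred algorithm that the peer also supports."""
    for name in ours:
        if name in theirs:
            return name
    raise ValueError("no hash algorithm in common")
```

An old client that only advertises `["sha1"]` would make
`negotiate(PREFERRED, ["sha1"])` fall back to `"sha1"`, while the store
can still answer lookups by either digest.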

Note that we don't even have to shut out non-upgraded users from
participating. Machine-to-machine communication is not a problem as
outlined above, but even with stuff like mailing lists we can start
passing around concatenated revisions like the following: 

da39a3ee5e6b4b0d3255bfef95601890afd80709.e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

Old users just use the first bit (the period isn't even really
required). If you think that's too long, there's a simple solution that
keeps <your requirements>-bit security, albeit one whose implications
lead you right to Linus's lines of thinking:

da39a3ee5e6b4b0d3255bfef95601890afd80709

That's just a SHA1 again. Of course, if you actually care about this
stuff you already have cryptographic infrastructure, and that
infrastructure can simply store *trusted* metadata in your repos
saying that the string 'foo' happens to be a valid alias for the actual
digest that *the user* can specify instead of that digest. It may even
be that for your security needs just timestamping those aliases is
enough. Either way, while something needs to be calculating secure
hashes, preferably Git mainline so that push and pull work without
having to examine every last line of code, you can get away without
changing the UI very much.
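
For the concatenated form shown earlier, a small sketch of how a tool
might build and split such a string. The function names and the
length-to-algorithm table are assumptions made for illustration, and the
input is hashed as raw bytes, not as a real Git object. The example
string in this mail happens to be the SHA1 and SHA256 of empty input,
which is what the sketch reproduces.

```python
import hashlib

# Hex digest lengths used to guess the algorithm; an assumption for this sketch.
ALGO_BY_HEX_LEN = {40: "sha1", 64: "sha256"}


def join_revision(data: bytes, algorithms=("sha1", "sha256")) -> str:
    """Concatenate the digests of data, oldest algorithm first."""
    return ".".join(hashlib.new(a, data).hexdigest() for a in algorithms)


def split_revision(revision: str) -> dict:
    """Split a concatenated revision into {algorithm: hex digest}."""
    parts = {}
    for piece in revision.split("."):
        algo = ALGO_BY_HEX_LEN.get(len(piece))
        if algo is None:
            raise ValueError("unrecognised digest length: %d" % len(piece))
        parts[algo] = piece
    return parts


def legacy_view(revision: str) -> str:
    """What a SHA1-only client would use: the first 40 hex characters."""
    return revision[:40]


rev = join_revision(b"")
print(rev)
print(split_revision(rev)["sha256"])
print(legacy_view(rev))
```

Because `legacy_view` only takes a fixed-length prefix, it works whether
or not the period is present.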


Anyway, in the short term the people who care can write parallel digest
calculators; I personally have a use-case for one right now. Better code
to handle the cases where individual blobs have colliding hashes is
required as well in the medium term. Finally, those who require it could
very well write parallel gits to effectively do the pulling and pushing
of their parallel calculated revision hashes if they really wanted to.
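
A parallel digest calculator for blobs can be quite small. The sketch
below assumes the usual Git blob encoding (the header "blob <size>"
followed by a NUL byte, then the content) and feeds it to several
hashers at once; `parallel_blob_digests` is a hypothetical name, and the
choice of SHA-256 as the second algorithm is only an example.

```python
import hashlib


def parallel_blob_digests(data: bytes, algorithms=("sha1", "sha256")) -> dict:
    """Hash one blob under several algorithms, returning {algorithm: hex}."""
    # Git hashes a short header plus the content, not the content alone.
    header = b"blob %d\x00" % len(data)
    hashers = {name: hashlib.new(name) for name in algorithms}
    for hasher in hashers.values():
        hasher.update(header)
        hasher.update(data)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


digests = parallel_blob_digests(b"")
# The SHA1 entry matches the well-known id of the empty blob.
print(digests["sha1"])  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(digests["sha256"])
```

The SHA1 result should agree with what `git hash-object` reports for
the same content, so the extra digests can be kept alongside existing
ids without touching the repository.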

But if this problem gets to the point where git-core has to change, the
reality is organizations are not going to be happy if it has to be a big
flag day. Git v2/v3 interoperability *will* be implemented. Once we're
at that point, let's make sure Git 5.0's big feature isn't SHA5000. 

-- 
'peter'[:-1]@petertodd.org
