On Mon, Aug 26, 2019 at 08:27:15AM -0400, Austin S. Hemmelgarn wrote:
> On 2019-08-23 13:08, Adam Borowski wrote:
> > the improved collision
> > resistance of xxhash64 is not a reason as if you intend to dedupe you want
> > a crypto hash so you don't need to verify.
> 
> The improved collision resistance is a roughly 10 orders of magnitude
> reduction in the chance of a collision.  That may not matter for most, but
> it's a significant improvement for anybody operating at large enough scale
> that media errors are commonplace.

Hash size doesn't matter against media errors.  You don't get billions of
mismatches: the first one is cause for alarm, so a 1-in-4294967296 chance of
failing to notice it hardly ever matters (even though, unlike the collisions
discussed below, it _can_ happen in real life).

I can think of three cases where a bigger hash is useful:
* recovering from a split-brain RAID
* recovering from one disk of a RAID having had a large piece scribbled upon
* finding candidates for deduplication (but see below why not 64-bit)

> Also, you would still need to verify even if you're using whatever the
> fanciest new collision resistant cryptographic hash is, because the number
> of possible input values is still more than _nine thousand_ orders of
> magnitude larger than the total number of output values even if we use a
> 512-bit cryptographic hash.

You're underestimating how rare crypto-strength hash collisions are.

There are two scenarios: unintentional, and malicious.

Let's go with unintentional first: the age of the Universe is 2^58.5
seconds.  The fastest disk (non-pmem) is NVMe-connected Optane, at 240000
IOPS.  That's 2^17.8.  With a 256-bit hash, the mass of machines needed for
a single expected collision within the age of the Universe exceeds the mass
of the observable Universe itself.
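
For anyone who wants to check the arithmetic above, here is a minimal Python
sketch.  It assumes the estimate counts about 2^256 hash evaluations for one
expected match (no birthday shortcut), which is a guess at how the figure was
reached; the function names are made up for illustration.

```python
import math

# Figures taken from the email.
LOG2_AGE_OF_UNIVERSE_S = 58.5   # age of the Universe, as a power of two in seconds
IOPS = 240_000                  # one hash per I/O on the fastest disk mentioned
HASH_BITS = 256


def log2_hashes_per_machine(log2_seconds=LOG2_AGE_OF_UNIVERSE_S, iops=IOPS):
    """log2 of the number of hashes one machine computes over the given time."""
    return log2_seconds + math.log2(iops)


def log2_machines_needed(hash_bits=HASH_BITS):
    """log2 of the machine count needed for ~2^hash_bits evaluations.

    Assumption (not stated in the email): one expected match takes about
    2^hash_bits evaluations, i.e. no birthday-style storage is used.
    """
    return hash_bits - log2_hashes_per_machine()


print(f"hashes per machine: 2^{log2_hashes_per_machine():.1f}")
print(f"machines needed:    2^{log2_machines_needed():.1f}")
```

Under these assumptions one machine gets through roughly 2^76.4 hashes, which
leaves about 2^179.6 (on the order of 10^54) machines.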

So, malicious.  We demand a non-broken hash, which in crypto-speak means
there's no known attack better than brute force.  An iterative approach is
right out; the best space-time tradeoff is a birthday attack, which requires
storage on the order of the square root of the number of combinations (i.e.,
2^(n/2) entries for an n-bit hash).  It's drastically better: at current best
storage densities, you'd need only the mass of the Earth.

Please let me know when you'll build that Earth-sized computer, so I can
migrate from weak SHA256 to e.g. BLAKE2b.
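
The birthday-attack storage estimate can be sanity-checked with a short Python
sketch.  The storage density below is an illustrative assumption (a 1 TB
microSD card weighing about 0.25 g), not a figure from this email, and the
function names are hypothetical.

```python
EARTH_MASS_KG = 5.972e24

# Illustrative assumption, not from the email: 1 TB in a card of about 0.25 g.
ASSUMED_BYTES_PER_KG = 1e12 / 0.00025


def birthday_entries_log2(hash_bits):
    """log2 of the number of stored hashes a birthday attack needs (~2^(n/2))."""
    return hash_bits // 2


def birthday_storage_bytes(hash_bits):
    """Bytes needed to store 2^(n/2) digests, counting only the digests."""
    digest_bytes = hash_bits // 8
    return (2 ** birthday_entries_log2(hash_bits)) * digest_bytes


def storage_mass_in_earths(hash_bits, bytes_per_kg=ASSUMED_BYTES_PER_KG):
    """Mass of that storage, in Earth masses, at the assumed density."""
    return birthday_storage_bytes(hash_bits) / bytes_per_kg / EARTH_MASS_KG


print(f"entries: 2^{birthday_entries_log2(256)}")
print(f"bytes:   {birthday_storage_bytes(256):.3e}")
print(f"mass:    {storage_mass_in_earths(256):.2f} Earth masses")
```

With the assumed density the result lands within an order of magnitude of one
Earth mass; a different density shifts it proportionally.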

On the other hand, computers and memories get hit by cosmic rays, thermal
noise, and so on at a non-negligible rate.  Any theoretical chance of a hash
collision is dwarfed by the flaws of the technology we have.  Or, e.g., by the
chance that you'll get hit by multiple lightning strikes the next time you
leave your house.

Thus: no, you don't need to recheck after SHA256.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋  The root of a real enemy is an imaginary friend.
⠈⠳⣄⠀⠀⠀⠀
