On Mon, Aug 26, 2019 at 08:27:15AM -0400, Austin S. Hemmelgarn wrote:
> On 2019-08-23 13:08, Adam Borowski wrote:
> > the improved collision
> > resistance of xxhash64 is not a reason as if you intend to dedupe you want
> > a crypto hash so you don't need to verify.
>
> The improved collision resistance is a roughly 10 orders of magnitude
> reduction in the chance of a collision.  That may not matter for most, but
> it's a significant improvement for anybody operating at large enough scale
> that media errors are commonplace.

Hash size doesn't matter vs media errors.  You don't have billions of
mismatches: the first one is a cause for alarm, so a 1-in-4294967296 chance
of failing to notice it hardly ever matters (even though it _can_ happen in
real life, as opposed to the collisions discussed below).

I can think of three cases where a bigger hash is useful:
* recovering from a split-brain RAID
* recovering from one disk of a RAID having had a large piece scribbled upon
* finding candidates for deduplication (but see below why not 64-bit)

> Also, you would still need to verify even if you're using whatever the
> fanciest new collision resistant cryptographic hash is, because the number
> of possible input values is still more than _nine thousand_ orders of
> magnitude larger than the total number of output values even if we use a
> 512-bit cryptographic hash.

You're underestimating how rare crypto-strength hash collisions are.

There are two scenarios: unintentional, and malicious.

Let's go with unintentional first: the age of the Universe is 2^58.5
seconds.  The fastest disk (non-pmem) is NVMe-connected Optane, at 240000
IOPS.  That's 2^17.8.  With a 256-bit hash, the mass of the machines needed
for a single expected collision within the age of the Universe exceeds the
mass of the observable Universe itself.

So, malicious.  We demand a non-broken hash, which in crypto speak means
there's no known attack better than brute force.  An iterative approach is
right out; the best space-time tradeoff is a birthday attack, which requires
storage on the order of the square root of the number of combinations (ie,
half the hash length in bits).  It's drastically better: at current best
storage densities, you'd need only the mass of the Earth.

Please let me know when you'll build that Earth-sized computer, so I can
migrate from weak SHA256 to eg. BLAKE2b.

On the other hand, computers and memories get hit by cosmic rays, thermal
noise, and so on at a non-negligible rate.  Any theoretical chance of a hash
collision is dwarfed by the flaws of the technology we have.
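
The exponents above can be reproduced with a few lines of Python.  This is
only a rough sketch: the function names are made up for illustration, hashes
are assumed to behave as uniformly random values, and the collision count
uses the usual birthday approximation k^2 / 2^(n+1) for k hashes of n bits.
It says nothing about the mass or storage density estimates.

```python
import math

AGE_OF_UNIVERSE_LOG2_S = 58.5   # from the text: 2^58.5 seconds
IOPS = 240_000                  # from the text: NVMe-connected Optane


def log2_hashes_per_drive():
    """log2 of the blocks one drive could hash over the age of the Universe."""
    return AGE_OF_UNIVERSE_LOG2_S + math.log2(IOPS)


def log2_expected_collisions(log2_k, bits):
    """log2 of the expected colliding pairs among 2^log2_k random hashes.

    Birthday approximation: k^2 / 2^(bits + 1).
    """
    return 2 * log2_k - 1 - bits


def orders_of_magnitude(extra_bits):
    """Decimal orders of magnitude gained by widening a hash by extra_bits."""
    return extra_bits * math.log10(2)


if __name__ == "__main__":
    k = log2_hashes_per_drive()
    print(f"IOPS as a power of two:      2^{math.log2(IOPS):.2f}")
    print(f"hashes per drive-lifetime:   2^{k:.2f}")
    print(f"expected 256-bit collisions: 2^{log2_expected_collisions(k, 256):.2f}")
    print(f"32 -> 64 bits gains {orders_of_magnitude(32):.2f} orders of magnitude")
```

Going from a 32-bit to a 64-bit checksum gives about 9.6 decimal orders of
magnitude, which matches the "roughly 10" quoted above.  For a 256-bit hash,
one drive hashing for the whole age of the Universe comes out at about
2^-104 expected collisions, a chance dwarfed by the hardware flaws just
mentioned.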
A collision is also far less likely than, eg, you getting hit by multiple
lightning strikes the next time you leave your house.

Thus: no, you don't need to recheck after SHA256.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋  The root of a real enemy is an imaginary friend.
⠈⠳⣄⠀⠀⠀⠀