Moin Robert,

On 2019-11-22 20:01, Robert Haas wrote:
> On Fri, Nov 22, 2019 at 1:10 PM David Steele <da...@pgmasters.net> wrote:
>> Well, the maximum amount of data that can be protected with a 32-bit
>> CRC is 512MB according to all the sources I found (NIST, Wikipedia,
>> etc). I presume that's what we are talking about since I can't find
>> any 64-bit CRC code in core or this patch.

> Could you give a more precise citation for this? I can't find a
> reference to that in the Wikipedia article off-hand and I don't know
> where to look in NIST. I apologize if I'm being dense here, but I
> don't see why there should be any limit on the amount of data that can
> be protected. The important thing is that if the original file F is
> altered to F', we hope that CHECKSUM(F) != CHECKSUM(F'). The
> probability of that, assuming that the alteration is random rather
> than malicious and that the checksum function is equally likely to
> produce every possible output, is just 1-2^-${CHECKSUM_BITS},
> regardless of the length of the message (except that there might be
> some special cases for very short messages, which don't matter here).
>
> This analysis by me seems to match
> https://en.wikipedia.org/wiki/Cyclic_redundancy_check, which says:
>
> "Typically an n-bit CRC applied to a data block of arbitrary length
> will detect any single error burst not longer than n bits, and the
> fraction of all longer error bursts that it will detect is (1 −
> 2^−n)."
>
> Notice the phrase "a data block of arbitrary length" and the formula
> "1 - 2^-n".

It is related to the number of states, and the birthday problem factors into it, too:

   https://en.wikipedia.org/wiki/Birthday_problem

If you have a 32-bit checksum or hash, it can represent at most 2**32
different states (or fewer, if the algorithm isn't really good).

Each byte is 8 bits, so 2 ** 32 / 8 is 512 MByte. If you process your
data bit by bit, each new bit would add a new state (consider: missing
bit == 0, added bit == 1). If each new state is represented by a
different checksum, all possible 2 ** 32 values are exhausted after
processing 512 MByte; after that you get one of the former states
again - aka a collision.

There is no way around it with so few bits, no matter what algorithm
you choose.
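
To put the birthday problem into numbers, here is a tiny Python sketch
(my own illustration, nothing from the patch): it evaluates the usual
approximation 1 - exp(-n*(n-1) / (2 * 2**bits)) for the chance of at
least one collision among n random checksum values:

  import math

  def collision_probability(n_items, bits=32):
      # Birthday-problem approximation: chance of at least one collision
      # among n_items values drawn uniformly from 2**bits possible states.
      states = 2 ** bits
      return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * states))

  # With only 2**32 states, collisions become likely surprisingly fast ...
  for n in (1_000, 10_000, 100_000, 1_000_000):
      print(n, "random 32-bit checksums ->", round(collision_probability(n), 4))
  # ... while a 256-bit hash stays effectively collision-free for any
  # realistic number of files.
  print(1_000_000, "random 256-bit hashes ->", collision_probability(1_000_000, 256))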

>>> Phrased more positively, if you want a cryptographic hash
>>> at all, you should probably use one that isn't widely viewed as too
>>> weak.

>> Sure.  There's another advantage to picking an algorithm with lower
>> collision rates, though.
>>
>> CRCs are fine for catching transmission errors (as caveated above) but
>> not as great for comparing two files for equality.  With strong hashes
>> you can confidently compare local files against the path, size, and hash
>> stored in the manifest and save yourself a round-trip to the remote
>> storage to grab the file if it has not changed locally.

> I agree in part. I think there are two reasons why a cryptographically
> strong hash is desirable for delta restore. First, since the checksums
> are longer, the probability of a false match happening randomly is
> lower, which is important. Even if the above analysis is correct and
> the chance of a false match is just 2^-32 with a 32-bit CRC, if you
> back up ten million files every day, you'll likely get a false match
> within a few years or less, and once is too often. Second, unlike what
> I supposed above, the contents of a PostgreSQL data file are not
> chosen at random, unlike transmission errors, which probably are more
> or less random. It seems somewhat possible that there is an adversary
> who is trying to choose the data that gets stored in some particular
> record so as to create a false checksum match. A CRC is a lot easier
> to fool than a cryptographic hash, so I think that using a CRC of *any*
> length for this kind of use case would be extremely dangerous no
> matter the probability of an accidental match.

Agreed. See above.
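
Just to put a number on the ten-million-files example above, a quick
back-of-the-envelope in Python, assuming each comparison has an
independent 2**-32 chance of an accidental match:

  # Ten million files compared per day, each with an independent
  # 2**-32 chance of an accidental CRC-32 match.
  files_per_day = 10_000_000
  p_false_match = 2 ** -32

  per_year = files_per_day * 365 * p_false_match
  print("expected false matches per year:", round(per_year, 2))        # ~0.85
  print("expected years until the first one:", round(1 / per_year, 2)) # ~1.18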

However, if you choose a hash, please do not go below SHA-256. Both MD5
and SHA-1 have already had collision attacks, and these are only bound
to get worse.

  https://www.mscs.dal.ca/~selinger/md5collision/
  https://shattered.io/

It might even be a wise idea to encode the hash algorithm used into the
manifest file, so it can be changed later. The hash length alone might
not be enough to decide which algorithm was used.
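
A tiny sketch of what I mean (the field layout here is purely my
invention, not the format of the proposed patch): store the checksum as
"<algorithm>:<hexdigest>", then nobody has to guess from the digest
length:

  import hashlib

  def make_checksum(data, algorithm="sha256"):
      # Hypothetical manifest field: algorithm name plus hex digest,
      # e.g. "sha256:<64 hex chars>", so the algorithm can change later.
      return algorithm + ":" + hashlib.new(algorithm, data).hexdigest()

  def verify_checksum(data, field):
      algorithm, _, expected = field.partition(":")
      return hashlib.new(algorithm, data).hexdigest() == expected

  block = b"example file contents"
  field = make_checksum(block)
  assert verify_checksum(block, field)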

>> This is the basic premise of what we call delta restore which can speed
>> up restores by orders of magnitude.
>>
>> Delta restore is the main advantage that made us decide to require SHA1
>> checksums.  In most cases, restore speed is more important than backup
>> speed.

> I see your point, but it's not the whole story. We've encountered a
> bunch of cases where the time it took to complete a backup exceeded
> the user's desired backup interval, which is obviously very bad, or
> even more commonly where it exceeded the length of the user's
> "low-usage" period when they could tolerate the extra overhead imposed
> by the backup. A few percentage points is probably not a big deal, but
> a user who has an 8-hour window to get the backup done overnight will
> not be happy if it's taking 6 hours now and we tack 40%-50% on to
> that. So I think that we either have to disable backup checksums by
> default, or figure out a way to get the overhead down to something a
> lot smaller than what current tests are showing -- which we could
> possibly do without changing the algorithm if we can somehow make it a
> lot cheaper, but otherwise I think the choice is between disabling the
> functionality altogether by default and adopting a less-expensive
> algorithm. Maybe someday when delta restore is in core and widely used
> and CPUs are faster, it'll make sense to revise the default, and
> that's cool, but I can't see imposing a big overhead by default to
> enable a feature core doesn't have yet...

Modern algorithms are amazingly fast on modern hardware; some are even
implemented in hardware nowadays:

 https://software.intel.com/en-us/articles/intel-sha-extensions

Quote from:

https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring-sha-extensions-to-intels-cpus/

 "Despite the extremely limited availability of SHA extension support
  in modern desktop and mobile processors, crypto libraries have already
upstreamed support to great effect. Botan’s SHA extension patches show a significant 3x to 5x performance boost when taking advantage of the hardware extensions, and the Linux kernel itself shipped with hardware SHA support with version 4.4, bringing a very respectable 3.6x performance upgrade over
  the already hardware-assisted SSE3-enabled code."

If you need to load the data from disk and shove it over a network, the
hashing will certainly add very little overhead; it might even be
completely invisible, since it can run in parallel to all the other
things. Sure, there is the thing called zero-copy networking, but if you
have to compress the data before sending it to the network, you have to
put it through the CPU anyway. And if you have more than one core, the
second one can do the hashing in parallel to the first one doing the
compression.
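
That "second core does the hashing" idea in a nutshell, as a toy Python
sketch (my own illustration, not how the patch or any PostgreSQL tool is
actually built): hash each chunk on a worker thread while the main
thread compresses it.

  import hashlib
  import zlib
  from concurrent.futures import ThreadPoolExecutor

  CHUNK = 1024 * 1024  # 1 MByte read buffer

  def compress_and_hash(path):
      # hashlib releases the GIL for large buffers, so the hash update
      # really can run on another core while this core compresses.
      sha = hashlib.sha256()
      comp = zlib.compressobj()
      out = []
      with open(path, "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
          pending = None
          while chunk := f.read(CHUNK):
              if pending is not None:
                  pending.result()                  # previous chunk hashed
              pending = pool.submit(sha.update, chunk)
              out.append(comp.compress(chunk))      # compress on this core
          if pending is not None:
              pending.result()
      out.append(comp.flush())
      return b"".join(out), sha.hexdigest()

The point is only that the hash work does not have to sit on top of the
compression time; it can hide behind it.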

To get a feeling, one can use:

   openssl speed md5 sha1 sha256 sha512

On my really-not-fast desktop CPU (i5-4690T @ 2.50GHz) it says:

 The 'numbers' are in 1000s of bytes per second processed.
 type          16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
 md5          122638.55k   277023.96k   487725.57k   630806.19k   683892.74k   688553.98k
 sha1         127226.45k   313891.52k   632510.55k   865753.43k   960995.33k   977215.19k
 sha256        77611.02k   173368.15k   325460.99k   412633.43k   447022.92k   448020.48k
 sha512        51164.77k   205189.87k   361345.79k   543883.26k   638372.52k   645933.74k

Or in other words, it can hash nearly 931 MByte/s with SHA-1 and about
427 MByte/s with SHA-256 (if I haven't miscalculated something). You'd
need a pretty fast disk (aka an M.2 SSD) and network (aka > 1 Gbit) to
top these speeds, and then you'd use a real CPU for your server, not
some poor Intel power-saving surfing thingy-majingy :)
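
The conversion behind those two numbers, in case someone wants to check
my math (openssl's 'k' means 1000 bytes per second):

  # Convert the 16384-byte column of 'openssl speed' output
  # (1000s of bytes per second) into MByte/s (MiB/s).
  for name, kbytes in (("sha1", 977215.19), ("sha256", 448020.48)):
      print(name, round(kbytes * 1000 / (1024 * 1024), 1), "MByte/s")
  # sha1 931.9 MByte/s, sha256 427.3 MByte/s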

Best regards,

Tels

