On Tue, Mar 31, 2009 at 09:07:31AM -0700, paul wrote:
> Nice.
>
> As a minor thought with respect to naming conventions, as most
> aren't likely to bother attempting to understand the relative merits
> of available checksum implementations, I can't help but wonder
> if the naming conventions should be ordered relative to their
> performance/goodness.
>
> For example:
>
> fletcher1 (future, 64b add-w/c or 128b sums, 64b data, fast/robust)*
> fletcher2 (present, using 64b adds, 64b data, fast/less-robust)
> fletcher3 (proposed, using 64b adds, 32b data, slower/robust)
> fletcher4 (present, using 64b adds, 32b data, very-slow/robust)
>
> *where a future fletcher1 may possibly use either a 128b-sum or 64b
> add-with-carry implementation (coded in assembly, or leveraging compiler
> intrinsics if/when available), and thereby be as fast as the current
> fletcher2 and as robust as the proposed fletcher3 (fletcher8).
I've been investigating this.  My main performance result is that the checksum computation is overwhelmingly dominated by the data motion; from a performance standpoint, fletcher2 and fletcher4 are completely equivalent, despite the fact that on cached data, fletcher4 is ~4x slower than fletcher2.  (checksum=none is faster, since it doesn't touch the data at all, and checksum=sha256 is slower, since it does a significant amount of computation per word.)

fletcher2 has two issues:

1. Since it does not propagate its carries, it is vulnerable to 2-bit errors in the higher bits.  It is also (as is any 2nd-order fletcher) vulnerable to three-bit errors in a variety of spacings.

2. Since there is no performance boost associated with using fletcher2, there is no motivation to continue using it, given its various vulnerabilities.

Fletcher-4 is still somewhat vulnerable; here's part of a block comment I've written:

 * Fletcher-4 is defined by the recurrence relation:
 *
 *    a_i = a_(i-1) + f_(i-1)
 *    b_i = b_(i-1) + a_i
 *    c_i = c_(i-1) + b_i
 *    d_i = d_(i-1) + c_i
 *
 * Where:
 *    a_0 = b_0 = c_0 = d_0 = 0
 * and
 *    f_0 .. f_(n-1) are the input data.
 *
 * Using standard series transformations, these become the following sums:
 *
 *           __n_                            __n_
 *           \   |                           \   |
 *     a  =   >     f                  b  =   >     i * f
 *      n    /___|   n - i              n    /___|       n - i
 *           i = 1                           i = 1
 *
 *           __n_                            __n_
 *           \   |  i*(i+1)                  \   |  i*(i+1)*(i+2)
 *     c  =   >     ------- f          d  =   >     ------------- f
 *      n    /___|     2     n - i      n    /___|        6        n - i
 *           i = 1                           i = 1
 *
 * Now, f_i are 32-bit values, and [abcd]_n are 64-bit accumulators.  In
 * order to not lose data, we need to make sure that we don't overflow.
 * The worst case for this is if f_i = 0xffffffff for all i.  We can
 * conservatively estimate how big n can get before we overflow by running:
 *
 *    % bc
 *    a=2^32-1;for (i=1;a < 2^64; i++) {a += i*(i+1)*(i+2)/6*(2^32-1)};i-1
 *    566
 *    quit

566 * 4 bytes/datum is slightly above 2k of data, so a 128k block is ~64 times longer than the earliest truncation.  There are two ways of dealing with this:

1. Leave fletcher4 alone.  Our largest buffer is 128k; that's 32k * sizeof (uint32_t).  That means that n = 32768, and we want to look at the factorization of the coefficients

        D_i = (i*(i+1)*(i+2))/6,    i = 1..n.

   Writing each one as

        D_i = (2^(E_i)) * (F_i)

   where F_i is odd, we can ask about the distribution of E_i.  Some scripting later, we get, for the Ds:

        COUNT   2-FACTOR
         8192   2^0
         4096   2^1
        10240   2^2
         5120   2^3
         2560   2^4
         1280   2^5
          640   2^6
          320   2^7
          160   2^8
           80   2^9
           40   2^10
           20   2^11
           10   2^12
            5   2^13
            3   2^14
            2   2^15

   and for the Cs:

        COUNT   2-FACTOR
        16384   2^0
         8192   2^1
         4096   2^2
         2048   2^3
         1024   2^4
          512   2^5
          256   2^6
          128   2^7
           64   2^8
           32   2^9
           16   2^10
            8   2^11
            4   2^12
            2   2^13
            1   2^14

   This means that even in the worst case, f_i is added scaled by at most 2^15 (times an odd factor), and 2^15 * (32-bit quantity) still fits in a 64-bit accumulator.  It does mean that words early in the buffer have *less* of an effect on the result.  Of course, without more math, I don't have a compelling case for this being effective.

2. Invent a fletcher4c, which does something about the overflow.  My suggestion would be to add a cleanup pass, run every 512 iterations and at the end, which does:

        c_a += (a >> 32);
        a = (a & 0xffffffff) + (a >> 32);
        if ((a >> 32) != 0) {
                a = (a & 0xffffffff) + 1;
                c_a++;
        }

   (and similarly for b, c, and d), and then at the end does:

        while ((c_a >> 32) != 0)
                c_a = (c_a & 0xffffffff) + (c_a >> 32);
        a |= (c_a) << 32;

   This computes a, b, c, and d (mod 2^32 - 1), then puts the carry count in the top bits.  The advantage of putting the carry count there is that it counteracts a weakness of computing (mod 2^32 - 1), which is that 0 == 2^32 - 1 (mod 2^32 - 1).  But 0 won't add a carry, and 2^32 - 1 will, so the final counts will end up different.
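To make that concrete, here's a rough sketch of how the pieces might fit together in C.  The function names, signature, and checksum packing below are just for illustration (they're not the existing ZFS interfaces); the only parts taken from the proposal above are the per-accumulator cleanup, the every-512-words schedule, and the final carry fold:

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Illustrative sketch only: a fletcher4-style loop over 32-bit words,
     * with the proposed carry-cleanup pass run every 512 words and once
     * more at the end.  Names and the result layout are hypothetical.
     */

    /* Fold the high 32 bits of *acc back in (mod 2^32 - 1), counting carries. */
    static void
    carry_cleanup(uint64_t *acc, uint64_t *carry)
    {
            *carry += (*acc >> 32);
            *acc = (*acc & 0xffffffff) + (*acc >> 32);
            if ((*acc >> 32) != 0) {
                    *acc = (*acc & 0xffffffff) + 1;
                    (*carry)++;
            }
    }

    /* Reduce the carry counter to 32 bits and pack it into the top half. */
    static uint64_t
    carry_finish(uint64_t acc, uint64_t carry)
    {
            while ((carry >> 32) != 0)
                    carry = (carry & 0xffffffff) + (carry >> 32);
            return (acc | (carry << 32));
    }

    void
    fletcher_4c(const uint32_t *data, size_t nwords, uint64_t cksum[4])
    {
            uint64_t a = 0, b = 0, c = 0, d = 0;
            uint64_t ca = 0, cb = 0, cc = 0, cd = 0;
            size_t i;

            for (i = 0; i < nwords; i++) {
                    a += data[i];
                    b += a;
                    c += b;
                    d += c;

                    /* clean up well before the ~566-word overflow bound */
                    if ((i & 511) == 511) {
                            carry_cleanup(&a, &ca);
                            carry_cleanup(&b, &cb);
                            carry_cleanup(&c, &cc);
                            carry_cleanup(&d, &cd);
                    }
            }
            carry_cleanup(&a, &ca);
            carry_cleanup(&b, &cb);
            carry_cleanup(&c, &cc);
            carry_cleanup(&d, &cd);

            cksum[0] = carry_finish(a, ca);
            cksum[1] = carry_finish(b, cb);
            cksum[2] = carry_finish(c, cc);
            cksum[3] = carry_finish(d, cd);
    }

The 512-word interval is simply a convenient power of two safely below the ~566-word bound from the bc estimate above.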
The main complication in introducing fletcher4c is that there are several places which hardcode the use of fletcher4 (the send/receive code in particular), and so versioning could be an issue.

Thoughts?

Cheers,
- jonathan