On Tue, Mar 31, 2009 at 09:07:31AM -0700, paul wrote:
> Nice.
>
> As a minor thought with respect to naming conventions, as most
> aren't likely to bother attempting to understand the relative merits
> of available checksum implementations, I can't help but wonder
> if the naming conventions should be ordered relative to their
> performance/goodness.
>
> For example:
>
> fletcher1 (future, 64b add-w/c or 128b sums, 64b data, fast/robust)*
> fletcher2 (present, using 64b adds, 64b data, fast/less-robust)
> fletcher3 (proposed, using 64b adds, 32b data, slower/robust)
> fletcher4 (present, using 64b adds, 32b data, very-slow/robust)
>
> *where a future fletcher1 may possibly use either 128b sums or 64b
> add-with-carry implementation (coded in assembly or leverage compiler
> intrinsics if/when available), and thereby be as fast as the current
> fletcher2 and as robust as the proposed fletcher3 (fletcher8).
I've been investigating this. My main performance result is that the
checksum computation is overwhelmingly dominated by data motion; from a
performance standpoint, fletcher2 and fletcher4 are completely equivalent,
despite the fact that on cached data, fletcher4 is ~4x slower than fletcher2.
(checksum=none is faster, since it doesn't touch the data at all, and
checksum=sha256 is slower, since it does significant amounts of computation
per word.)
fletcher2 has two issues:
1. Since it does not propagate its carries, it is vulnerable to
2-bit errors in the higher bits. It is (as is any 2nd-order
fletcher) vulnerable to 3-bit errors in a variety of spacings.
2. Since there is no performance boost associated with using fletcher2,
there is no motivation to continue using it, given its various
vulnerabilities.
Fletcher-4 is still somewhat vulnerable; here's part of a block comment
I've written:
* Fletcher-4 is defined by the recurrence relation:
* a_i = a_(i-1) + f_(i-1)
* b_i = b_(i-1) + a_i
* c_i = c_(i-1) + b_i
* d_i = d_(i-1) + c_i
*
* Where:
* a_0 = b_0 = c_0 = d_0 = 0
* and
* f_0 .. f_(n-1) are the input data.
*
* Using standard series transformations, these become the following sums:
*
* __n_ __n_
* \ | \ |
* a = > f b = > i * f
* n /___| n - i n /___| n - i
* i = 1 i = 1
*
* __n_ __n_
* \ | i*(i+1) \ | i*(i+1)*(i+2)
* c = > ------- f d = > ------------- f
* n /___| 2 n - i n /___| 6 n - i
* i = 1 i = 1
*
*
* Now, f_i are 32-bit values, and [abcd]_n are 64-bit accumulators. In
 * order to not lose data, we need to make sure that we don't overflow.
* The worst-case for this is if f_i = 0xffffffff for all i. We can
* conservatively estimate how big n can get before we overflow by running:
*
* % bc
* a=2^32-1;for (i=1;a < 2^64; i++) {a += i*(i+1)*(i+2)/6*(2^32-1)};i-1
* 566
* quit
566 * 4 bytes/datum is slightly above 2k of data, so a 128k block is ~64
times longer than the point where the first overflow can occur.
There are two ways of dealing with this:
1. Leave fletcher4 alone.
Our largest buffer is 128k; that's 32k*sizeof (uint32_t). That
means that n = 32768, and we want to look at the factorization
of:
D_i = (i*(i+1)*(i+2))/6
i = 1..n. Writing it out as:
D_i = (2^(E_i)) * (F_i)
where F_i is odd, we can ask about the distribution of E_i.
Some scripting later, we get:
for the Ds:
COUNT 2-FACTOR
8192 2^0
4096 2^1
10240 2^2
5120 2^3
2560 2^4
1280 2^5
640 2^6
320 2^7
160 2^8
80 2^9
40 2^10
20 2^11
10 2^12
5 2^13
3 2^14
2 2^15
And the Cs:
COUNT 2-FACTOR
16384 2^0
8192 2^1
4096 2^2
2048 2^3
1024 2^4
512 2^5
256 2^6
128 2^7
64 2^8
32 2^9
16 2^10
8 2^11
4 2^12
2 2^13
1 2^14
This means that even in the worst case, f_i will be added at
2^15 * (32-bit quantity), which fits in a 64-bit accumulator.
This does mean that words early in the buffer have *less* of an
effect on the result.
Of course, without more math, I don't have a compelling case for
this being effective.
2. Invent a fletcher4c, which does something about the overflow.
My suggestion would be to add a cleanup pass, run every 512 iterations
and at the end, which does:
	c_a += (a >> 32);
	a = (a & 0xffffffff) + (a >> 32);
	if ((a >> 32) != 0) {
		a = (a & 0xffffffff) + 1;
		c_a++;
	}
(and similarly for b, c, and d), then at the end, do:
	while ((c_a >> 32) != 0)
		c_a = (c_a & 0xffffffff) + (c_a >> 32);
	a |= c_a << 32;
This computes a, b, c, and d (mod 2^32 - 1), then puts the carry
count in the top bits. The advantage of putting the carry count there
is that it counteracts a weakness of computing (mod 2^32 - 1), which
is that 0 == 2^32 - 1 (mod 2^32 - 1). But 0 won't add a carry, and
2^32 - 1 will, so the final counts will end up different.
The main complication in introducing fletcher4c is there are several places
which hardcode the use of fletcher4 (the send/receive code in particular), and
so versioning could be an issue.
Thoughts?
Cheers,
- jonathan