On Tue, Mar 31, 2009 at 09:07:31AM -0700, paul wrote:
> Nice.
> 
> As a minor thought with respect to naming conventions, as most
> aren't likely to bother attempting to understand the relative merits
> of available checksum implementations, I can't help but wonder
> if the naming conventions should be ordered relative to their
> performance/goodness.
> 
> For example:
> 
> fletcher1 (future, 64b add-w/c or 128b sums, 64b data, fast/robust)*
> fletcher2 (present, using 64b adds, 64b data, fast/less-robust)
> fletcher3 (proposed, using 64b adds, 32b data, slower/robust)
> fletcher4 (present, using 64b adds, 32b data, very-slow/robust)
> 
> *where a future fletcher1 may possibly use either 128b sums or 64b
> add-with-carry implementation (coded in assembly or leverage compiler
> intrinsics if/when available), and thereby be as fast as the current
> fletcher2 and as robust as the proposed fletcher3 (fletcher8).

I've been investigating this.  My main performance results are that the
checksum computation is overwhelmingly dominated by the data motion;  from
a performance standpoint, fletcher2 and fletcher4 are completely equivalent,
despite the fact that on cached data, fletcher4 is ~4x slower than fletcher2.

(checksum=none is faster, since it doesn't touch the data at all, and
checksum=sha256 is slower, since it does significant amounts of computation
per word)

fletcher2 has two issues:

        1. Since it is not pulling around its carries, it is vulnerable to
           2-bit errors in the higher bits.  It is (as is any 2nd-order
           fletcher) vulnerable to three-bit errors in a variety of spacings.

        2. Since there is no performance boost associated with using fletcher2,
           there is no motivation to continue using it, given its various
           vulnerabilities.

Fletcher-4 is still somewhat vulnerable;  here's part of a block comment
I've written:

 * Fletcher-4 is defined by the recurrence relation:
 *    a_i = a_(i-1) + f_(i-1)
 *    b_i = b_(i-1) + a_i
 *    c_i = c_(i-1) + b_i
 *    d_i = d_(i-1) + c_i
 *
 * Where:
 *   a_0 = b_0 = c_0 = d_0 = 0
 * and
 *   f_0 .. f_(n-1) are the input data.
 *
 * Using standard series transformations, these become the following sums:
 *
 *           __n_                            __n_
 *           \   |                           \   |
 *      a  =  >     f                   b  =  >     i * f
 *       n   /___|   n - i               n   /___|       n - i
 *           i = 1                           i = 1
 *
 *           __n_                            __n_
 *           \   |  i*(i+1)                  \   |  i*(i+1)*(i+2)
 *      c  =  >     ------- f           d  =  >     ------------- f
 *       n   /___|     2     n - i       n   /___|        6        n - i
 *           i = 1                           i = 1
 *
 *
 * Now, f_i are 32-bit values, and [abcd]_n are 64-bit accumulators.  In
 * order not to lose data, we need to make sure that we don't overflow.
 * The worst-case for this is if f_i = 0xffffffff for all i.  We can
 * conservatively estimate how big n can get before we overflow by running:
 *
 * % bc
 * a=2^32-1;for (i=1;a < 2^64; i++) {a += i*(i+1)*(i+2)/6*(2^32-1)};i-1
 * 566
 * quit

566 * 4 bytes/datum is slightly above 2k of data, so a 128k block is ~64 times
longer than the earliest truncation.
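For concreteness, the recurrence above is just this loop (a sketch only --
the in-tree fletcher_4_native() has a different interface; the function
name here is mine):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Straightforward sketch of the Fletcher-4 recurrence:
 * data is treated as 32-bit words, summed into 64-bit accumulators.
 * `size' is in bytes and assumed to be a multiple of sizeof (uint32_t).
 */
static void
fletcher_4_sketch(const void *buf, size_t size, uint64_t sums[4])
{
	const uint32_t *ip = buf;
	const uint32_t *ipend = ip + (size / sizeof (uint32_t));
	uint64_t a, b, c, d;

	for (a = b = c = d = 0; ip < ipend; ip++) {
		a += ip[0];	/* a_i = a_(i-1) + f_(i-1) */
		b += a;		/* b_i = b_(i-1) + a_i */
		c += b;		/* c_i = c_(i-1) + b_i */
		d += c;		/* d_i = d_(i-1) + c_i */
	}
	sums[0] = a;
	sums[1] = b;
	sums[2] = c;
	sums[3] = d;
}
```

(The d accumulator is the one the bc run above bounds:  it is the first
to overflow, at 566 words of all-ones data.)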

There are two ways of dealing with this:

  1. Leave fletcher4 alone.

        Our largest buffer is 128k; that's 32k*sizeof (uint32_t). That
        means that n = 32768, and we want to look at the factorization
        of:

                D_i = (i*(i+1)*(i+2))/6

        i = 1..n.  Writing it out as:

                D_i = (2^(E_i)) * (F_i)

        where F_i is odd, we can ask about the distribution of E_i.
        Some scripting later, we get:

        for the Ds:

           COUNT 2-FACTOR
            8192 2^0
            4096 2^1
           10240 2^2
            5120 2^3
            2560 2^4
            1280 2^5
             640 2^6
             320 2^7
             160 2^8
              80 2^9
              40 2^10
              20 2^11
              10 2^12
               5 2^13
               3 2^14
               2 2^15

        And the Cs:
           16384 2^0
            8192 2^1
            4096 2^2
            2048 2^3
            1024 2^4
             512 2^5
             256 2^6
             128 2^7
              64 2^8
              32 2^9
              16 2^10
               8 2^11
               4 2^12
               2 2^13
               1 2^14

        This means that even in the worst case, f_i will be added at
        2^15 * (32-bit quantity), which fits in a 64-bit accumulator.
        This does mean that early-on words in the buffer have *less* of
        an effect on the result.

        Of course, without more math, I don't have a compelling case for
        this being effective.
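(For the record, the "some scripting" for the Ds amounts to the following;
the function name and N are mine.  Writing D_i = 2^(E_i) * F_i with F_i
odd, it histograms E_i for i = 1..32768:)

```c
#include <stdint.h>

#define	N	32768

/*
 * Histogram the 2-factor E_i of D_i = i*(i+1)*(i+2)/6 for i = 1..N.
 * counts[] must be zeroed by the caller.  64-bit arithmetic is plenty:
 * the product i*(i+1)*(i+2) stays below 2^45 for i <= 32768.
 */
static void
d_two_factor_hist(int counts[64])
{
	for (uint64_t i = 1; i <= N; i++) {
		uint64_t d = i * (i + 1) * (i + 2) / 6;
		int e = 0;

		while ((d & 1) == 0) {	/* strip factors of two */
			d >>= 1;
			e++;
		}
		counts[e]++;
	}
}
```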

  2. Invent a fletcher4c, which does something about the overflow.

        My suggestion would be to add a cleanup pass, run every 512 iterations
        and at the end, which does:

                c_a += (a >> 32);
                a = (a & 0xffffffff) + (a >> 32);
                if ((a >> 32) != 0) {
                        a = (a & 0xffffffff) + 1;
                        c_a++;
                }

        (and similarly for b, c, and d), then at the end, do:

        while ((c_a >> 32) != 0)
                c_a = (c_a & 0xffffffff) + (c_a >> 32);

        a |= (c_a) << 32;

        This computes a, b, c, and d (mod 2^32 - 1), then puts the carry
        count in the top bits.  The advantage of putting the carry count there
        is that it counteracts a weakness of computing (mod 2^32 - 1), which
        is that 0 == 2^32 - 1 (mod 2^32 - 1).  But 0 won't add a carry, and
        2^32 - 1 will, so the final counts will end up different.
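As a self-contained sketch (for one accumulator;  the function name is
mine), the cleanup pass is:

```c
#include <stdint.h>

/*
 * Fold `a' down below 2^32 and accumulate the carries into `*c_a'.
 * Since 2^32 == 1 (mod 2^32 - 1), the folded value is congruent to
 * the original (mod 2^32 - 1).  After the first fold a < 2^33 - 1,
 * so (a >> 32) is at most 1 and a single conditional finishes the job.
 */
static void
fletcher_fold(uint64_t *a, uint64_t *c_a)
{
	*c_a += (*a >> 32);
	*a = (*a & 0xffffffff) + (*a >> 32);
	if ((*a >> 32) != 0) {
		*a = (*a & 0xffffffff) + 1;
		(*c_a)++;
	}
}
```

Running this every 512 iterations keeps the accumulators comfortably
below the 566-word overflow bound computed above.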

The main complication in introducing fletcher4c is that there are several
places which hardcode the use of fletcher4 (the send/receive code in
particular), and so versioning could be an issue.

Thoughts?

Cheers,
- jonathan
