Here is a better example. Suppose you are checksumming an 8 KB block that 
consists of all one bits.  fletcher2 runs two parallel quasi-fletcher checksums 
on alternating 64 bit words, such that the linear term of each checksum over an 
8KB block is the sum of 512 additions.  The highest linear term you can get for 
an 8KB block this way is 512 * (2^32-1), which is somewhat less than 2^73.  
However, fletcher2 uses a 64bit accumulator, so those extra nine bits are 
effectively discarded. The equivalent bits are discarded even quicker from the 
non-linear term.

Because of that weakness, with fletcher2 errors in any of the _eight_ most 
significant bits in any of the 64bit words in the first half of an 8KB block 
consisting of all one bits will be completely ignored.  In addition the errors 
in the seven most significant bits in the next quarter of the block will be 
ignored, errors in the next eighth of the block, and so on until you reach the 
end of the block.  All told, errors in just under 8192 bit positions of an 8KB 
block consisting of all ones will be completely ignored due to carry overflow.  
With larger blocks, the problem is naturally worse.

I therefore propose an alternative to fletcher2 that I call "fletcher8" that 
doesn't have the same weakness with detecting problems near the MSB of each 64 
bit word that fletcher2 does.  The only difference here is that the high 32 
bits of each 64 bit word are added twice, first in their normal position, and 
then shifted downward 32 bits.

On an Intel Core 2 Q6600 (2.3 Ghz) in 32 bit mode with gcc 4.1.2, "fletcher8" 
runs 3.3x faster than fletcher4, and about twice as slow as the flawed 
fletcher2.  Of course, if you don't care very much about detecting errors near 
the MSB of each 64 bit word, the flawed fletcher2 is much preferable.  I ran 
each checksum 500,000 times on an 8K block of quasi-random data and measured 
the following:

fletcher2:  4137 MB/s  
fletcher4: 612 MB/s
fletcher8 (proposed):  2038 MB/s
---

void
fletcher_8_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
    const uint64_t *ip = buf;
    const uint64_t *ipend = ip + (size / sizeof (uint64_t));
    uint64_t a0, b0, a1, b1;

    for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) {
        a0 += ip[0] + (ip[0] >> 32);
        a1 += ip[1] + (ip[1] >> 32);
        b0 += a0;
        b1 += a1;
    }

    ZIO_SET_CHECKSUM(zcp, a0, a1, b0, b1);
}
-- 
This message posted from opensolaris.org

Reply via email to