Here is a better example. Suppose you are checksumming an 8 KB block that
consists of all one bits. fletcher2 runs two parallel quasi-fletcher checksums
on alternating 64 bit words, such that the linear term of each checksum over an
8KB block is the sum of 512 additions. The highest linear term you can get for
an 8KB block this way is 512 * (2^32-1), which is somewhat less than 2^73.
However, fletcher2 uses a 64bit accumulator, so those extra nine bits are
effectively discarded. The equivalent bits are discarded even quicker from the
non-linear term.
Because of that weakness, with fletcher2 errors in any of the _eight_ most
significant bits in any of the 64bit words in the first half of an 8KB block
consisting of all one bits will be completely ignored. In addition the errors
in the seven most significant bits in the next quarter of the block will be
ignored, errors in the next eighth of the block, and so on until you reach the
end of the block. All told, errors in just under 8192 bit positions of an 8KB
block consisting of all ones will be completely ignored due to carry overflow.
With larger blocks, the problem is naturally worse.
I therefore propose an alternative to fletcher2 that I call "fletcher8" that
doesn't have the same weakness with detecting problems near the MSB of each 64
bit word that fletcher2 does. The only difference here is that the high 32
bits of each 64 bit word are added twice, first in their normal position, and
then shifted downward 32 bits.
On an Intel Core 2 Q6600 (2.3 Ghz) in 32 bit mode with gcc 4.1.2, "fletcher8"
runs 3.3x faster than fletcher4, and about twice as slow as the flawed
fletcher2. Of course, if you don't care very much about detecting errors near
the MSB of each 64 bit word, the flawed fletcher2 is much preferable. I ran
each checksum 500,000 times on an 8K block of quasi-random data and measured
the following:
fletcher2: 4137 MB/s
fletcher4: 612 MB/s
fletcher8 (proposed): 2038 MB/s
---
void
fletcher_8_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
const uint64_t *ip = buf;
const uint64_t *ipend = ip + (size / sizeof (uint64_t));
uint64_t a0, b0, a1, b1;
for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) {
a0 += ip[0] + (ip[0] >> 32);
a1 += ip[1] + (ip[1] >> 32);
b0 += a0;
b1 += a1;
}
ZIO_SET_CHECKSUM(zcp, a0, a1, b0, b1);
}
--
This message posted from opensolaris.org