Here is a better example. Suppose you are checksumming an 8 KB block that consists of all one bits. fletcher2 runs two parallel quasi-fletcher checksums on alternating 64 bit words, such that the linear term of each checksum over an 8KB block is the sum of 512 additions. The highest linear term you can get for an 8KB block this way is 512 * (2^32-1), which is somewhat less than 2^73. However, fletcher2 uses a 64bit accumulator, so those extra nine bits are effectively discarded. The equivalent bits are discarded even quicker from the non-linear term.
Because of that weakness, with fletcher2 errors in any of the _eight_ most significant bits in any of the 64bit words in the first half of an 8KB block consisting of all one bits will be completely ignored. In addition the errors in the seven most significant bits in the next quarter of the block will be ignored, errors in the next eighth of the block, and so on until you reach the end of the block. All told, errors in just under 8192 bit positions of an 8KB block consisting of all ones will be completely ignored due to carry overflow. With larger blocks, the problem is naturally worse. I therefore propose an alternative to fletcher2 that I call "fletcher8" that doesn't have the same weakness with detecting problems near the MSB of each 64 bit word that fletcher2 does. The only difference here is that the high 32 bits of each 64 bit word are added twice, first in their normal position, and then shifted downward 32 bits. On an Intel Core 2 Q6600 (2.3 Ghz) in 32 bit mode with gcc 4.1.2, "fletcher8" runs 3.3x faster than fletcher4, and about twice as slow as the flawed fletcher2. Of course, if you don't care very much about detecting errors near the MSB of each 64 bit word, the flawed fletcher2 is much preferable. I ran each checksum 500,000 times on an 8K block of quasi-random data and measured the following: fletcher2: 4137 MB/s fletcher4: 612 MB/s fletcher8 (proposed): 2038 MB/s --- void fletcher_8_native(const void *buf, uint64_t size, zio_cksum_t *zcp) { const uint64_t *ip = buf; const uint64_t *ipend = ip + (size / sizeof (uint64_t)); uint64_t a0, b0, a1, b1; for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) { a0 += ip[0] + (ip[0] >> 32); a1 += ip[1] + (ip[1] >> 32); b0 += a0; b1 += a1; } ZIO_SET_CHECKSUM(zcp, a0, a1, b0, b1); } -- This message posted from opensolaris.org