Nice analysis. I understand that, on the processor/memory-subsystem architecture you've experimented with, you've found that if the data being summed isn't already cached, its access will likely dominate its use in calculations:
- Although this makes sense (presuming no mechanism is used to pre-cache data so that it is only checksummed once it is cache-resident), processor/memory-system architectures are still evolving, so it may be prudent to account for that possibility; for example, future streaming vector/memory units might sum multiple data streams in parallel, and/or the checksum might be integrated into the I/O channel so that it is computed without processor intervention while the data is being retrieved.

- With respect to fletcher4: the first two running sums (which constitute a traditional Fletcher checksum) are sufficient to yield a Hamming distance of at least 3 (meaning all 1- and 2-bit errors are detectable, and 1-bit errors are therefore also potentially correctable, if ever desired) for block sizes somewhat larger than 256KB, which is larger than ZFS requires. I can't help but wonder whether those two sums alone may be sufficient, rather than worrying about maintaining two more running sums, each dependent on the previous one, without potential overflow. Maintaining 4 sums, with the latter 3 each dependent on its predecessor's sum, may also impede performance, since 3 such dependencies must be scheduled per word instead of the 1 needed when maintaining only 2 sums (presuming support for more efficient data access on future platform implementations); see the first sketch after this list. (As a caveat, it's not clear to me how much more resilient the checksum becomes by maintaining 2 more sums than traditionally used; so when I previously observed that fletcher4 was fine, that was based on its first two terms not overflowing for the data sizes required, without regard to the possibility that the upper two terms may overflow, as I didn't consider them significant, or rather didn't rely on their significance.)

- In conclusion: without pre-fetching, streaming, and/or channel integration, fletcher2 (corrected to use 32-bit data with 64-bit sums) may be no faster than the current implementation of fletcher4 (which is possibly not significantly more resilient than a corrected fletcher2, but which could be refined to warrant that the upper two terms also do not overflow, improving its resilience to some degree). Even so, I personally suspect it's best to BOTH refine fletcher4 to warrant that the upper two sums do not overflow, by wrapping the upper bits of those two sums back in every N (pick your N) iterations (see the second sketch below), AND fix fletcher2, because when fixed it has the potential to be significantly faster than a fixed fletcher4 on future or other platforms leveraging mechanisms to pre-fetch data and/or synchronize computation with access. (Both being relatively easy fixes, so why not?) Again, merely IMHO.
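
To make the fletcher2 fix concrete, here is a minimal sketch of one way it could look, assuming 32-bit input words accumulated into 64-bit running sums; the function name and prototype are illustrative only, not the actual zio_checksum interface. For blocks up to 256KB neither sum can overflow, and the loop body carries only the single b-on-a dependency.

#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch only: a "fixed" fletcher2 consuming 32-bit words
 * into 64-bit sums.  For blocks of up to 64K words (256KB) neither sum
 * can exceed 64 bits, so no overflow handling is needed at ZFS sizes.
 */
void
fletcher2_fixed(const void *buf, size_t size, uint64_t cksum[2])
{
        const uint32_t *ip = buf;
        const uint32_t *ipend = ip + (size / sizeof (uint32_t));
        uint64_t a = 0, b = 0;

        for (; ip < ipend; ip++) {
                a += *ip;       /* plain running sum of the data words */
                b += a;         /* the only cross-sum dependency */
        }
        cksum[0] = a;
        cksum[1] = b;
}

With only two accumulators and one dependency between them, this is also the form most likely to benefit from the kind of streaming or channel-integrated data access speculated about above.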
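
And here is a sketch of the suggested fletcher4 refinement: the same four running sums, with the upper bits of the third and fourth sums wrapped back into the low bits every N words so that they cannot silently overflow. The fold interval, the fold operation (a mod-(2^32 - 1)-style reduction), and the naming are assumptions for illustration, not what ZFS currently implements; N must be chosen small enough that the upper sums cannot wrap between folds (for 128KB blocks, where the second sum peaks near 2^61, folding every 2 words is sufficient).

#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch only: fletcher4 with the upper two sums folded
 * every FOLD_EVERY words so they stay well below 2^64.  For 128KB
 * blocks the second sum (b) peaks around 2^61, so with FOLD_EVERY = 2
 * neither c nor d can overflow between folds.
 */
#define FOLD_EVERY      2

static inline uint64_t
fold64(uint64_t x)
{
        /* fold the high 32 bits back into the low 32; the result is
           congruent to x mod (2^32 - 1), just not fully reduced */
        return ((x & 0xffffffffULL) + (x >> 32));
}

void
fletcher4_folded(const void *buf, size_t size, uint64_t cksum[4])
{
        const uint32_t *ip = buf;
        const uint32_t *ipend = ip + (size / sizeof (uint32_t));
        uint64_t a = 0, b = 0, c = 0, d = 0;
        unsigned n = 0;

        for (; ip < ipend; ip++) {
                a += *ip;       /* a and b are kept exact, as today */
                b += a;
                c += b;         /* b, c and d each depend on the previous */
                d += c;         /* sum: three serial dependencies per word */

                if (++n == FOLD_EVERY) {
                        c = fold64(c);  /* keep the upper sums bounded */
                        d = fold64(d);
                        n = 0;
                }
        }
        cksum[0] = a;
        cksum[1] = b;
        cksum[2] = fold64(c);
        cksum[3] = fold64(d);
}

Because only c and d are folded, a and b remain bit-for-bit what the traditional two-sum Fletcher would produce, so the Hamming-distance-3 property of the first two terms is unaffected; the folding merely keeps whatever extra resilience the upper two terms provide from being eroded by uncontrolled overflow.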