On Wed, Apr 17, 2013 at 5:21 PM, Greg Smith <g...@2ndquadrant.com> wrote: > Let me see if I can summarize where the messages flying by are at since > you'd like to close this topic for now: > > -Original checksum feature used Fletcher checksums. Its main problems, to > quote wikipedia, include that it "cannot distinguish between blocks of all 0 > bits and blocks of all 1 bits". > > -Committed checksum feature uses truncated CRC-32. This has known good > error detection properties, but is expensive to compute. There's reason to > believe that particular computation will become cheaper on future platforms > though. But taking full advantage of that will require adding CPU-specific > code to the database. > > -The latest idea is using the Fowler–Noll–Vo hash function: > https://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash There's 20 years of > research around when that is good or bad. The exactly properties depend on > magic "FNV primes": http://isthe.com/chongo/tech/comp/fnv/#fnv-prime that > can vary based on both your target block size and how many bytes you'll > process at a time. For PostgreSQL checksums, one of the common > problems--getting an even distribution of the hashed values--isn't important > the way it is for other types of hashes. Ants and Florian have now dug into > how exactly that and specific CPU optimization concerns impact the best > approach for 8K database pages. This is very clearly a 9.4 project that is > just getting started.
I was curious about the activity in this thread and wanted to understand the tradeoffs, and came to the same understanding as you when poking around. It seems the tough aspect of the equation is that the most well studied thing is slow (CRC-32C) unless you have special ISA support Trying to find as much information and conclusive research on FNV was a lot more challenging. Fletcher is similar in that regard. Given my hasty attempt to understand each of the alternatives, my qualitative judgement is that, strangely enough, the most conservative choice of the three (in terms of being understood and treated in the literature more than ten times over) is CRC-32C, but it's also the one being cast as only suitable inside micro-optimization. To add another, theoretically-oriented dimension to the discussion, I'd like suggest it's also the most thoroughly studied of all the alternatives. I really had a hard time finding follow-up papers about the two alternatives, but to be fair, I didn't try very hard...then again, I didn't try very hard for any of the three, it's just that CRC32C was by far the easiest find materials on. The original paper is often shorthanded "Castagnoli 93", but it exists in the IEEE's sphere of influence and is hard to find a copy of. Luckily, a pretty interesting survey paper discussing some of the issues was written by Koopman in 2002 and is available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8323 As a pedagolgical note, it's pretty interesting and accessible piece of writing (for me, as someone who knows little of error detection/correction) and explains some of the engineering reasons that provoke such exercises. Basically...if it comes down to understand what the heck is going on and what the trade-offs are, it was a lot easier to brush up on CRC32-C in my meandering around the Internet. One might think this level of scrutiny would constitute a viable explanation of why CRC32C found its way into several standards and then finally in silicon. All in all, if the real world costs of CRC32C on not-SSE4.2 are allowable, I think it's the most researched and and conservative option, although perhaps some of the other polynomials seen in Koopman could also be desirable. It seems there's a tradeoff in CRC polynomials between long-message and short-message error detection, and the paper above may allow for a more informed selection. CRC32C is considered a good trade-off for both, but I haven't assessed the paper in enough detail to suggest whether there are specialized long-run polynomials that may be better still (although, then, there is also the microoptimization question, which postdates the literature I was looking at by a lot). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers