Hi hackers, Andres Freund asked me off list what I thought about this part of execGrouping.c, which builds a hash from the hashes of several datums in a loop:
/* rotate hashkey left 1 bit at each step */ hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0); ... hashkey ^= hkey; I think we should consider improving it and putting it somewhere central. 1. The same code also appears in nodeHash.c, and we also need to do the same thing in a couple of new projects that my EDB colleagues and I are working on for proposal as 10.0 features, based on DSM-backed hash tables. So I think it would make sense to define a hash_combine function or macro to be reused by all of such places rather than repeating it everywhere. 2. I suspect that this algorithm for combining hashes is weak, and could amplify weaknesses in the hash functions feeding it. Compare Boost's hash_combine, which does some more work before XORing: seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2); That constant approximates the golden ratio (as a fraction of the 32 bit hash space), and it also appears in our hash_any and hash_uint32 functions. I don't claim to understand the mathematics but I think this may have to do with TACP section 6, theorem S and exercises 8 and 9, though I'm not sure if it's being used for the same purpose in algorithms that add it (is it just some random bits? [1][2]) and algorithms that multiply by it [3][4]. Perhaps we could write a reusable hash_combine function/macro using existing code from hashfunc.c, or use the formula from Boost, or something else, to improve the quality of our hash combinations. It would be very interesting to hear from someone with expertise in this area. Hash indexes don't currently support multiple column keys. If they ever do in future, they would also benefit from high quality combination. Assuming we'd use the same algorithm there too, changing the combination algorithm after we start persisting the results to disk in hash indexes would obviously be difficult. There doesn't seem to be any reason why we can't change the algorithm before then. [1] http://stackoverflow.com/questions/4948780/magic-number-in-boosthash-combine [2] https://xkcd.com/221/ [3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a [4] http://community.haskell.org/~simonmar/base/src/Data-HashTable.html -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers