Re: [HACKERS] Hash support for arrays
On 3 November 2010 09:24, Nicolas Barbier <nicolas.barb...@gmail.com> wrote:
> 2010/11/2 Kenneth Marshall <k...@rice.edu>:
>> Given that our hash implementation mixes the input data well (It does. I tested it.) then a simple rotate-and-xor method is all that should be needed to maintain all of the needed information. The original hash function has done the heavy lifting in this case.
>
> Even with a perfect hash function for the elements, certain combinations of elements could still lead to massive collisions. E.g., if repeated values are typical in the input data we are talking about, then the rotate-and-xor method would still lead to collisions between arrays of the same values of certain lengths, regardless of the value. In Tom's implementation, as he mentioned before, those problematical lengths would be multiples of 32 (e.g., an array of 32 1s would collide with an array of 32 2s, which would collide with an array of 32 3s, etc).

Yeah, rotate-and-xor is a pretty weak hashing algorithm, since any array of 32 identical elements will hash to either 0 or -1. Similarly, various permutations or multiples of that array length will cause it to perform badly.

The multiply-by-m algorithm doesn't have that weakness, provided m is chosen carefully. There are a couple of qualities a good algorithm should possess:

1). The bits from the individual element hash values should be distributed evenly, so that no 2 different hash values would result in the same contribution to the final value. This is easy to achieve - just make sure that m is odd.

2). The way that each element's hash value bits are distributed should be different from the way that every other element's hash value bits are distributed. m=31 achieves this pretty well, although there are plenty of other equally valid choices.

Regards,
Dean

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash support for arrays
On Thu, Nov 04, 2010 at 10:00:40AM +, Dean Rasheed wrote:
> On 3 November 2010 09:24, Nicolas Barbier <nicolas.barb...@gmail.com> wrote:
>> 2010/11/2 Kenneth Marshall <k...@rice.edu>:
>>> Given that our hash implementation mixes the input data well (It does. I tested it.) then a simple rotate-and-xor method is all that should be needed to maintain all of the needed information. The original hash function has done the heavy lifting in this case.
>>
>> Even with a perfect hash function for the elements, certain combinations of elements could still lead to massive collisions. E.g., if repeated values are typical in the input data we are talking about, then the rotate-and-xor method would still lead to collisions between arrays of the same values of certain lengths, regardless of the value. In Tom's implementation, as he mentioned before, those problematical lengths would be multiples of 32 (e.g., an array of 32 1s would collide with an array of 32 2s, which would collide with an array of 32 3s, etc).
>
> Yeah, rotate-and-xor is a pretty weak hashing algorithm, since any array of 32 identical elements will hash to either 0 or -1. Similarly, various permutations or multiples of that array length will cause it to perform badly.
>
> The multiply-by-m algorithm doesn't have that weakness, provided m is chosen carefully. There are a couple of qualities a good algorithm should possess:
>
> 1). The bits from the individual element hash values should be distributed evenly, so that no 2 different hash values would result in the same contribution to the final value. This is easy to achieve - just make sure that m is odd.
>
> 2). The way that each element's hash value bits are distributed should be different from the way that every other element's hash value bits are distributed. m=31 achieves this pretty well, although there are plenty of other equally valid choices.
>
> Regards,
> Dean

Hi Dean,

In my comment yesterday, I included a simple function that would allow us to leverage our current hash function's mixing process to scramble the bits effectively while retaining the maximum amount of information in the hash.

Regards,
Ken
Re: [HACKERS] Hash support for arrays
2010/11/2 Kenneth Marshall <k...@rice.edu>:
> Given that our hash implementation mixes the input data well (It does. I tested it.) then a simple rotate-and-xor method is all that should be needed to maintain all of the needed information. The original hash function has done the heavy lifting in this case.

Even with a perfect hash function for the elements, certain combinations of elements could still lead to massive collisions. E.g., if repeated values are typical in the input data we are talking about, then the rotate-and-xor method would still lead to collisions between arrays of the same values of certain lengths, regardless of the value. In Tom's implementation, as he mentioned before, those problematical lengths would be multiples of 32 (e.g., an array of 32 1s would collide with an array of 32 2s, which would collide with an array of 32 3s, etc).

Nicolas
Re: [HACKERS] Hash support for arrays
On Wed, Nov 03, 2010 at 10:24:16AM +0100, Nicolas Barbier wrote:
> 2010/11/2 Kenneth Marshall <k...@rice.edu>:
>> Given that our hash implementation mixes the input data well (It does. I tested it.) then a simple rotate-and-xor method is all that should be needed to maintain all of the needed information. The original hash function has done the heavy lifting in this case.
>
> Even with a perfect hash function for the elements, certain combinations of elements could still lead to massive collisions. E.g., if repeated values are typical in the input data we are talking about, then the rotate-and-xor method would still lead to collisions between arrays of the same values of certain lengths, regardless of the value. In Tom's implementation, as he mentioned before, those problematical lengths would be multiples of 32 (e.g., an array of 32 1s would collide with an array of 32 2s, which would collide with an array of 32 3s, etc).
>
> Nicolas

True. I just took another look at our defined hash functions, and it looks like we can make a simple variant of hash_uint32() that we can use as a stream checksum. The only thing missing is the ability to pass in the current 32-bit hash value as a starting seed when adding the next 32-bit value. Something like this should work:

Datum
hash_uint32(uint32 k, uint32 initval)
{
	register uint32	a, b, c;

	a = b = c = 0x9e3779b9 + (uint32) sizeof(uint32) + 3923095 + initval;
	a += k;

	final(a, b, c);

	/* report the result */
	return UInt32GetDatum(c);
}

Then, if you pass in the current value as the initval, it should mix each additional 32-bit hash value well.

Regards,
Ken
Re: [HACKERS] Hash support for arrays
On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> marcin mank <marcin.m...@gmail.com> writes:
>> On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> 3. To hash, apply the element type's hash function to each array element. Combine these values by rotating the accumulator left one bit at each step and xor'ing in the next element's hash value.
>>> Thoughts? In particular, is anyone aware of a better method for combining the element hash values?
>> This would make the hash the same for arrays with elements 32 apart swapped.
>
> Well, there are *always* going to be cases where you get the same hash value for two different inputs; it's unavoidable given that you have to combine N 32-bit hash values into only one 32-bit output.

Sure. The goal is to make those hard to predict, though.

I think "multiply by 31 and add the next value" is a fairly standard way of getting that behavior. It mixes up the bits a lot more than just left-shifting by a variable offset.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash support for arrays
Robert Haas <robertmh...@gmail.com> writes:
> On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> marcin mank <marcin.m...@gmail.com> writes:
>>> This would make the hash the same for arrays with elements 32 apart swapped.
>> Well, there are *always* going to be cases where you get the same hash value for two different inputs; it's unavoidable given that you have to combine N 32-bit hash values into only one 32-bit output.
> Sure. The goal is to make those hard to predict, though.

Really? I think "I don't understand when this fails" isn't obviously better than being able to predict when it fails ...

> I think "multiply by 31 and add the next value" is a fairly standard way of getting that behavior. It mixes up the bits a lot more than just left-shifting by a variable offset.

What concerns me about that is that it tends to push the bits to the left --- I think the effects of the earlier inputs are going to overflow out as you incorporate a lot of newer inputs. What you want is a scheme where every input item affects all the bits of the final result about equally, and my gut feeling is this doesn't provide that.

I'd be happier about this approach if there were some actual theory behind it ... maybe there's some analysis out there, but the one link that was given was just about entirely unconvincing.

			regards, tom lane
Re: [HACKERS] Hash support for arrays
Excerpts from Tom Lane's message of mar nov 02 15:21:31 -0300 2010:
> What concerns me about that is that it tends to push the bits to the left --- I think the effects of the earlier inputs are going to overflow out as you incorporate a lot of newer inputs. What you want is a scheme where every input item affects all the bits of the final result about equally, and my gut feeling is this doesn't provide that.

Ahh, now I understand what you meant by rotating the bits in the original email in this thread. You're not simply shifting. Clever.

--
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Hash support for arrays
On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> Sure. The goal is to make those hard to predict, though.
>
> Really? I think "I don't understand when this fails" isn't obviously better than being able to predict when it fails ...

Isn't that the whole point of hash functions? Collisions are inevitable, but you want them to be unpredictable. If you want a hash function with predictable collision behavior, just hash the first element. It'll be highly predictable AND wicked fast.

>> I think "multiply by 31 and add the next value" is a fairly standard way of getting that behavior. It mixes up the bits a lot more than just left-shifting by a variable offset.
>
> What concerns me about that is that it tends to push the bits to the left --- I think the effects of the earlier inputs are going to overflow out as you incorporate a lot of newer inputs. What you want is a scheme where every input item affects all the bits of the final result about equally, and my gut feeling is this doesn't provide that.

I don't think so. That would be a problem if you multiplied by an even number. You can see that you don't get dups if you just add in the same value over and over:

#include <stdio.h>

int
main(int argc, char **argv)
{
	unsigned	hv = 1;
	unsigned	i;

	for (i = 0; i < 100; ++i)
	{
		hv = hv * 31 + 1;
		printf("%08x\n", hv);
	}
	return 0;
}

No dups. Also, if you change that first "hv = 1" line to "hv = 2" and rerun, that sequence also has no dups, and no collisions with the "hv = 1" sequence.

> I'd be happier about this approach if there were some actual theory behind it ... maybe there's some analysis out there, but the one link that was given was just about entirely unconvincing.

I think it's from Knuth, though unfortunately I don't have a copy to check. There are probably better algorithms out there, but this one's pretty simple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash support for arrays
Robert Haas <robertmh...@gmail.com> wrote:
> Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmh...@gmail.com> writes:
>>> The goal is to make those hard to predict, though.
>> Really? I think "I don't understand when this fails" isn't obviously better than being able to predict when it fails ...
> Isn't that the whole point of hash functions? Collisions are inevitable, but you want them to be unpredictable.

Are you sure you're not confusing attributes of a good random number generator with hash generation? (They have much in common, but I've only seen concern with "hard to predict" when working on random number algorithms, as for financial audits or jury selection.)

> If you want a hash function with predictable collision behavior, just hash the first element. It'll be highly predictable AND wicked fast.

You're confusing "unpredictable" with "widely distributed", I think. There's no reason that the hash value of an integer of the same size as the hash shouldn't be the integer itself, for example. It's hard to get more predictable than that, yet it works well for hash lookups.

I know that for a hash of a character string, total = (total * 31) + nextcharacter has been shown to be effective. That's hardly random or hard to predict, but it does tend to spread out the hash values well. Whether it is as good for arrays, I don't know. It seems like a reasonable place to start, though, since a character string is rather like an array of characters.

>> What concerns me about that is that it tends to push the bits to the left --- I think the effects of the earlier inputs are going to overflow out as you incorporate a lot of newer inputs. What you want is a scheme where every input item affects all the bits of the final result about equally, and my gut feeling is this doesn't provide that.
> I don't think so. That would be a problem if you multiplied by an even number. You can see that you don't get dups if you just add in the same value over and over:

Yeah, that 1 in the low-order position ensures that the impact of any value which is ever added into the total is never entirely lost.

-Kevin
Re: [HACKERS] Hash support for arrays
On Tue, Nov 2, 2010 at 12:34 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
>> Isn't that the whole point of hash functions? Collisions are inevitable, but you want them to be unpredictable.
>
> Are you sure you're not confusing attributes of a good random number generator with hash generation? (They have much in common, but I've only seen concern with "hard to predict" when working on random number algorithms, as for financial audits or jury selection.)
>
>> If you want a hash function with predictable collision behavior, just hash the first element. It'll be highly predictable AND wicked fast.
>
> You're confusing "unpredictable" with "widely distributed", I think. There's no reason that the hash value of an integer of the same size as the hash shouldn't be the integer itself, for example. It's hard to get more predictable than that, yet it works well for hash lookups.

Well, no, not really. For example, it may be that you have a hash table with N buckets and values of the form N+k, 2N+k, 3N+k, and everything will collide. If you do some mixing of the bits, that case is far less likely to occur in practice. See for example http://www.azillionmonkeys.com/qed/hash.html

> I know that for a hash of a character string, total = (total * 31) + nextcharacter has been shown to be effective. That's hardly random or hard to predict, but it does tend to spread out the hash values well. Whether it is as good for arrays, I don't know. It seems like a reasonable place to start, though, since a character string is rather like an array of characters.

That was my thought.

>>> What concerns me about that is that it tends to push the bits to the left --- I think the effects of the earlier inputs are going to overflow out as you incorporate a lot of newer inputs. What you want is a scheme where every input item affects all the bits of the final result about equally, and my gut feeling is this doesn't provide that.
>> I don't think so. That would be a problem if you multiplied by an even number. You can see that you don't get dups if you just add in the same value over and over:
>
> Yeah, that 1 in the low-order position ensures that the impact of any value which is ever added into the total is never entirely lost.

Right...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash support for arrays
Robert Haas <robertmh...@gmail.com> wrote:
>> There's no reason that the hash value of an integer of the same size as the hash shouldn't be the integer itself, for example. It's hard to get more predictable than that, yet it works well for hash lookups.
>
> Well, no, not really. For example, it may be that you have a hash table with N buckets and values of the form N+k, 2N+k, 3N+k, and everything will collide.

That hardly seems convincing if the integer is a sequentially assigned number, as with many ID columns; but if you want an algorithm which will work well with numbers which might be assigned in a pattern with a high correlation to some unpredictable number of buckets -- sure, it won't work well without some scrambling. For sequentially assigned numbers, that scrambling would have a cost and would be of no value. I guess it depends a bit on your use case and goals.

-Kevin
Re: [HACKERS] Hash support for arrays
Robert Haas <robertmh...@gmail.com> writes:
> On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Really? I think "I don't understand when this fails" isn't obviously better than being able to predict when it fails ...
> Isn't that the whole point of hash functions? Collisions are inevitable, but you want them to be unpredictable. If you want a hash function with predictable collision behavior, just hash the first element. It'll be highly predictable AND wicked fast.

That seems like a rather poor straw man, since it suffers from exactly the defect I'm complaining about, namely failing to consider all the input values equally.

>> I'd be happier about this approach if there were some actual theory behind it ... maybe there's some analysis out there, but the one link that was given was just about entirely unconvincing.
> I think it's from Knuth, though unfortunately I don't have a copy to check. There are probably better algorithms out there, but this one's pretty simple.

I don't see anything in Knuth suggesting a multiplier of 31. His recommendation for a multiplier, if you're going to use multiplicative hashing, is wordsize/phi (phi being the golden ratio) ... and he also wants you to keep the high-order, not the low-order, bits of the product.

However, this is largely beside the point, because that theory, as well as the Java code you're arguing from, has to do with the initial hashing of a raw sequence of input items, not with combining some existing hash values. The rotate-and-xor method I suggested for that is borrowed exactly from section 6.4 of Knuth (page 512, in the first edition of volume 3).

			regards, tom lane
Re: [HACKERS] Hash support for arrays
On Tue, Nov 02, 2010 at 04:42:19PM -0400, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Really? I think "I don't understand when this fails" isn't obviously better than being able to predict when it fails ...
>> Isn't that the whole point of hash functions? Collisions are inevitable, but you want them to be unpredictable. If you want a hash function with predictable collision behavior, just hash the first element. It'll be highly predictable AND wicked fast.
>
> That seems like a rather poor straw man, since it suffers from exactly the defect I'm complaining about, namely failing to consider all the input values equally.
>
>>> I'd be happier about this approach if there were some actual theory behind it ... maybe there's some analysis out there, but the one link that was given was just about entirely unconvincing.
>> I think it's from Knuth, though unfortunately I don't have a copy to check. There are probably better algorithms out there, but this one's pretty simple.
>
> I don't see anything in Knuth suggesting a multiplier of 31. His recommendation for a multiplier, if you're going to use multiplicative hashing, is wordsize/phi (phi being the golden ratio) ... and he also wants you to keep the high-order, not the low-order, bits of the product.
>
> However, this is largely beside the point, because that theory, as well as the Java code you're arguing from, has to do with the initial hashing of a raw sequence of input items, not with combining some existing hash values. The rotate-and-xor method I suggested for that is borrowed exactly from section 6.4 of Knuth (page 512, in the first edition of volume 3).
>
> 			regards, tom lane

Given that our hash implementation mixes the input data well (It does. I tested it.) then a simple rotate-and-xor method is all that should be needed to maintain all of the needed information. The original hash function has done the heavy lifting in this case.

Regards,
Ken
Re: [HACKERS] Hash support for arrays
On Nov 2, 2010, at 1:42 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> However, this is largely beside the point, because that theory, as well as the Java code you're arguing from, has to do with the initial hashing of a raw sequence of input items, not with combining some existing hash values. The rotate-and-xor method I suggested for that is borrowed exactly from section 6.4 of Knuth (page 512, in the first edition of volume 3).

It seems undesirable to me to have a situation where transposing two array elements doesn't change the hash value. But I just work here.

...Robert
Re: [HACKERS] Hash support for arrays
Robert Haas <robertmh...@gmail.com> writes:
> On Nov 2, 2010, at 1:42 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> However, this is largely beside the point, because that theory, as well as the Java code you're arguing from, has to do with the initial hashing of a raw sequence of input items, not with combining some existing hash values. The rotate-and-xor method I suggested for that is borrowed exactly from section 6.4 of Knuth (page 512, in the first edition of volume 3).
> It seems undesirable to me to have a situation where transposing two array elements doesn't change the hash value. But I just work here.

[ shrug... ] There are always going to be collisions, and you're overstating the importance of this one (only some transpositions will result in a duplicate hash, not any transposition). What's actually *important*, for our purposes, is that all bits of the final hash value depend in a reasonably uniform way on all bits of the input hash values. If we don't have that property, then bucket numbers (which we form by taking the low-order N bits of the final hash, where N isn't known in advance) won't be as well distributed as we'd like.

It's possible that the multiply-by-31 method is as good as the rotate-and-xor method by that measure, or even better; but it's far from obvious that it's better. And I'm not convinced that the multiply method has a pedigree that should encourage me to take it on faith.

			regards, tom lane
Re: [HACKERS] Hash support for arrays
On Sat, Oct 30, 2010 at 01:01:44PM -0400, Tom Lane wrote:
> marcin mank <marcin.m...@gmail.com> writes:
>> This is what boost does:
>> http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html
>
> Hmm. I am reminded of Knuth's famous dictum: "never generate random numbers with a method chosen at random". Is there any actual theory behind that algorithm, and if so what is it? The combination of shifting with addition (not xor) seems more likely to lead to weird cancellations than any improvement in the hash behavior.

As far as I can tell, the boost combiner comes from:

http://goanna.cs.rmit.edu.au/~jz/fulltext/jasist-tch.pdf

which seems to be solving a completely different problem, and I can't see how relevant it is to the design of a hash combiner. The fact that they get trivial details wrong, like 32-bit integers going up to 2^32 rather than 2^32-1, doesn't inspire confidence.

One subtle point I noticed on the boost mailing list was that they didn't want [[1],2] hashing to the same value as [1,2]. I'm pretty sure that this sort of construction isn't possible to express in PG, but thought I should mention it just in case.

--
Sam  http://samason.me.uk/
Re: [HACKERS] Hash support for arrays
Tom Lane wrote:
> It's possible that the multiply-by-31 method is as good as the rotate-and-xor method by that measure, or even better; but it's far from obvious that it's better. And I'm not convinced that the multiply method has a pedigree that should encourage me to take it on faith.

Short summary:
* some history of its trivia follows
* (nothing here suggests it's better - just old and common and cheap)

Longer - some trivia regarding its pedigree:

It (or at least a *31 variant) seems to have a history of advocacy going back to Chris Torek in 1990:
http://groups.google.com/group/comp.lang.c/browse_thread/thread/28c2095282f0c1b5/193be99e9836791b?q=#193be99e9836791b

	#define HASH(str, h, p) \
		for (p = str, h = 0; *p;) h = (h << 5) - h + *p++

and it gets referred to in Usenix papers in the early 90's as well:
http://www.usenix.com/publications/library/proceedings/sa92/salz.pdf

Regarding why the magic number 31 (or 33, which also often comes up): apparently the only thing magic about it is that it's an odd number != 1. The rest of the odd numbers work about as well, according to this guy who tried to explain it:
http://svn.eu.apache.org/repos/asf/apr/apr/trunk/tables/apr_hash.c

 * The magic of number 33, i.e. why it works better than many other
 * constants, prime or not, has never been adequately explained by
 * anyone. So I try an explanation: if one experimentally tests all
 * multipliers between 1 and 256 (as I did while writing a low-level
 * data structure library some time ago) one detects that even
 * numbers are not useable at all. The remaining 128 odd numbers
 * (except for the number 1) work more or less all equally well.
 * They all distribute in an acceptable way and this way fill a hash
 * table with an average percent of approx. 86%.
 *
 * If one compares the chi^2 values of the variants (see
 * Bob Jenkins' "Hashing Frequently Asked Questions" at
 * http://burtleburtle.net/bob/hash/hashfaq.html for a description
 * of chi^2), the number 33 not even has the best value. But the
 * number 33 and a few other equally good numbers like 17, 31, 63,
 * 127 and 129 have nevertheless a great advantage to the remaining
 * numbers in the large set of possible multipliers: their multiply
 * operation can be replaced by a faster operation based on just one
 * shift plus either a single addition or subtraction operation. And
 * because a hash function has to both distribute good _and_ has to
 * be very fast to compute, those few numbers should be preferred.

...
Re: [HACKERS] Hash support for arrays
> Hmm. I am reminded of Knuth's famous dictum: "never generate random numbers with a method chosen at random". Is there any actual theory behind that algorithm, and if so what is it? The combination of shifting with addition (not xor) seems more likely to lead to weird cancellations than any improvement in the hash behavior.

For what it's worth: the standard way in Java is multiplying by 31, which amounts to a 5-bit shift and a subtraction. In general a prime number is recommended, though one would wish Knuth had done some analysis here... it all seems mostly empirical.

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

Perhaps the 5-bit shift is related to the spread of ASCII code points, as that hashCode() implementation was mainly tested on and implemented for Strings. But now it's also suggested for general objects, and it's even specified for the standard collections: for example
http://download.oracle.com/javase/6/docs/api/java/util/List.html#hashCode()

Hernán J. González
[HACKERS] Hash support for arrays
I was reminded today that I'd promised to do $SUBJECT awhile back. It's worth having so that hash joins and hash aggregation can work on array values. I believe this is a fairly straightforward exercise:

1. Add a hash opclass that accepts ANYARRAY, similar to the existing btree opclass for ANYARRAY. It will work for any array type whose element type has a hash opclass.

2. Coding is much like the array_cmp support code, including caching the lookup of the element type's hash function.

3. To hash, apply the element type's hash function to each array element. Combine these values by rotating the accumulator left one bit at each step and xor'ing in the next element's hash value.

Thoughts? In particular, is anyone aware of a better method for combining the element hash values?

			regards, tom lane
Re: [HACKERS] Hash support for arrays
On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: 3. To hash, apply the element type's hash function to each array element. Combine these values by rotating the accumulator left one bit at each step and xor'ing in the next element's hash value. Thoughts? In particular, is anyone aware of a better method for combining the element hash values? This would make the hash the same for two arrays that differ only by swapping a pair of elements 32 positions apart. This is what boost does: http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html Greetings
Re: [HACKERS] Hash support for arrays
marcin mank marcin.m...@gmail.com writes: On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane t...@sss.pgh.pa.us wrote: 3. To hash, apply the element type's hash function to each array element. Combine these values by rotating the accumulator left one bit at each step and xor'ing in the next element's hash value. Thoughts? In particular, is anyone aware of a better method for combining the element hash values? This would make the hash the same for arrays with elements 32 apart swapped. Well, there are *always* going to be cases where you get the same hash value for two different inputs; it's unavoidable given that you have to combine N 32-bit hash values into only one 32-bit output. This is what boost does: http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html Hmm. I am reminded of Knuth's famous dictum: never generate random numbers with a method chosen at random. Is there any actual theory behind that algorithm, and if so what is it? The combination of shifting with addition (not xor) seems more likely to lead to weird cancellations than any improvement in the hash behavior. regards, tom lane
Re: [HACKERS] Hash support for arrays
On Sat, Oct 30, 2010 at 9:21 AM, Tom Lane t...@sss.pgh.pa.us wrote: Thoughts? In particular, is anyone aware of a better method for combining the element hash values? The obvious thing to do seems like it would be to feed the individual values back into the regular hash function. -- greg
Re: [HACKERS] Hash support for arrays
Greg Stark gsst...@mit.edu writes: On Sat, Oct 30, 2010 at 9:21 AM, Tom Lane t...@sss.pgh.pa.us wrote: Thoughts? In particular, is anyone aware of a better method for combining the element hash values? The obvious thing to do seems like it would be to feed the individual values back into the regular hash function. Hmm, you mean use hash_any on the sequence of hash values? Interesting idea; but hash_any doesn't look very amenable to being invoked incrementally, and I don't think I want to construct a temp array with enough space for the hashes of all the items in the input array ... regards, tom lane
Re: [HACKERS] Hash support for arrays
On Sat, Oct 30, 2010 at 1:48 PM, Tom Lane t...@sss.pgh.pa.us wrote: Hmm, you mean use hash_any on the sequence of hash values? Interesting idea; but hash_any doesn't look very amenable to being invoked incrementally, and I don't think I want to construct a temp array with enough space for the hashes of all the items in the input array ... I had rather hoped without evidence that it would be easy enough to use incrementally. Looking now I see your point; it doesn't lend itself to that at all. I suppose you could use crc, where our interface does allow incremental use. I'm not really familiar enough with the math to know whether CRC is any better than your XOR mixing. I suspect they're pretty similar. I suppose if you have arrays of various lengths containing repetitions of a single value, then the non-CRC would produce a hash of 0 for all arrays with a length that's a multiple of 32. I'm not sure what CRC would do. Oh, one question that occurred to me: you didn't mention including the bounds of the array or the null bitmap in the hash function. I suppose it's correct with or without them (or in the case of the null bitmap, another way to put it is that the choice of the hash value to represent null is arbitrary). -- greg
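The incremental-CRC idea can be illustrated with any CRC interface that accepts a running value. A hedged Python sketch using the standard library's `zlib.crc32` (a stand-in for the Postgres CRC macros, which have the same incremental shape):

```python
import zlib

def crc_combine(element_hashes):
    """Fold each 32-bit element hash into a running CRC, one element at a
    time, using zlib.crc32's optional running-value argument."""
    crc = 0
    for h in element_hashes:
        crc = zlib.crc32(h.to_bytes(4, "little"), crc)
    return crc
```

Like the rotate-and-xor scheme, this is order-sensitive; unlike it, repeated identical elements do not systematically cancel at multiples of 32.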
Re: [HACKERS] Hash support for arrays
Greg Stark gsst...@mit.edu writes: I suppose you could use crc where our interface does allow incremental use. Hm, that's a thought. I'm not sure though whether CRC really has the behavior you want from a hash function, in particular that all the bits are independent (a/k/a taking N low-order bits is a reasonable way to cut the hash down to a suitable bucket index). Anybody know? I'm not really familiar enough with the math to know whether CRC is any better than your XOR mixing. I suspect they're pretty similar. I suppose if you have arrays of various lengths containing repetitions of a single value, then the non-CRC would produce a hash of 0 for all arrays with a length that's a multiple of 32. I'm not sure what CRC would do. Some quick experimenting suggests that you get -1 from an array of 32 of the same value, then 0 from 64 of the same, etc. This doesn't bother me particularly though. There are always going to be some situations that produce degenerate hash values. Oh, one question that occurred to me: you didn't mention including the bounds of the array or the null bitmap in the hash function. I suppose it's correct with or without them (or in the case of the null bitmap, another way to put it is that the choice of the hash value to represent null is arbitrary). Yeah. I have it coded to treat a null element as if it had a hash value of zero. I don't see a great need to consider the array bounds per se. regards, tom lane
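The -1/0 result Tom reports is easy to reproduce with a model of the rotate-and-xor fold (a Python sketch, not the Postgres code). After 32 steps the accumulator is the XOR of all 32 rotations of the element hash, so every output bit equals the parity of that hash's popcount: all-ones (-1) when the popcount is odd, zero when it is even. After 64 steps each rotation appears twice and everything cancels to zero:

```python
def rotxor_fold(hashes):
    """Rotate-left-1-then-XOR fold over 32-bit element hashes."""
    acc = 0
    for h in hashes:
        acc = (((acc << 1) | (acc >> 31)) & 0xFFFFFFFF) ^ h
    return acc

# 32 copies of a hash with odd popcount -> 0xFFFFFFFF (i.e. -1 as signed);
# 64 copies of anything -> 0; 32 copies of an even-popcount hash -> 0.
```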