Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote: Trying to develop and document a set of standardized, stable hash functions covering a wide range of possible use cases sounds like it may be better served by an extension. I suspect that some of the participants in this thread have PL/Proxy in mind. PL/Proxy should probably ship its own set of hash functions. If we ever get built-in partitioning by hash (see thread nearby and most previous ones like it), we should also think about providing a hash function that doesn't change output over versions and is independent of hash index implementation concerns. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Thu, 2009-10-29 at 09:47 +0200, Peter Eisentraut wrote: On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote: Trying to develop and document a set of standardized, stable hash functions covering a wide range of possible use cases sounds like it may be better served by an extension. I suspect that some of the participants in this thread have PL/Proxy in mind. Yes, I think that pl/proxy is the prime example of such use. PL/Proxy should probably ship its own set of hash functions. Or maybe we could just extract the hashes form some version of stock postgresql (say 8.3) and then make those available in contrib under the name stable_hashes ? If we ever get built-in partitioning by hash (see thread nearby and most previous ones like it), we should also think about providing a hash function that doesn't change output over versions and is independent of hash index implementation concerns. And maybe even documented ;) -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Hannu Krosing ha...@2ndquadrant.com writes: Or maybe we could just extract the hashes form some version of stock postgresql (say 8.3) and then make those available in contrib under the name stable_hashes ? A better name would be wishful_thinking ... contrib does not have control over some of the main risk factors, such as changes to the internal representation of datatypes. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Thu, 2009-10-29 at 09:32 -0400, Tom Lane wrote: Hannu Krosing ha...@2ndquadrant.com writes: Or maybe we could just extract the hashes form some version of stock postgresql (say 8.3) and then make those available in contrib under the name stable_hashes ? A better name would be wishful_thinking ... contrib does not have control over some of the main risk factors, such as changes to the internal representation of datatypes. Good point. But the internal layout of data types popular for use as hash key changes rarely. The hash function change, however, is a showstopper for everyone concerned. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Thu, 2009-10-29 at 09:32 -0400, Tom Lane wrote: Hannu Krosing ha...@2ndquadrant.com writes: Or maybe we could just extract the hashes form some version of stock postgresql (say 8.3) and then make those available in contrib under the name stable_hashes ? A better name would be wishful_thinking ... contrib does not have control over some of the main risk factors, such as changes to the internal representation of datatypes. My understanding was, that contrib is actually the only possibility to have something maintained in sync with core postgreSQL. regards, tom lane -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Wed, 2009-02-11 at 11:22 -0500, Tom Lane wrote: Asko Oja asc...@gmail.com writes: Did this change hashtext() visible to users? We have been using it quite widely for partitioning our databases. If so then it should be marked quite visibly in release notes as there might be others who will be hit by this. The hash functions are undocumented, have changed in the past, and are likely to change again in the future. If you are using them in a way that depends on them to give the same answers across versions, you'd better stop. Is at least the fact that they are undocumented, have changed in the past, and are likely to change again in the future documented ? I'm sure this is something that has hit unwary users in the past and will hit again in the future, so some words about it in the doc's would be appropriate. search for hashtext on http://www.postgresql.org/docs/8.4/interactive/index.html returned no results, so I guess even theyr undocumented, will surprise you status is not documented. Hashing is a quite fundamental thing in computing, so I was quite surprised to find out it had silently changed. I had never checked the docs for hash functions, but I had assumed, that internal functions are prefixed by pg_ and anything else is public, free to use functionality. Changing hash functions also makes in-place upgrades a lot harder, as they can't be done incrementally anymore for tables which use hash indexes. regards, tom lane -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Hannu Krosing ha...@2ndquadrant.com writes: I had never checked the docs for hash functions, but I had assumed, that internal functions are prefixed by pg_ and anything else is public, free to use functionality. Sure, it's free to use. It's not free to assume that we promise never to change it. Changing hash functions also makes in-place upgrades a lot harder, as they can't be done incrementally anymore for tables which use hash indexes. Hash indexes are so far from being production-grade that this argument is not significant. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Wed, Oct 28, 2009 at 03:31:17PM -0400, Tom Lane wrote: Hannu Krosing ha...@2ndquadrant.com writes: I had never checked the docs for hash functions, but I had assumed, that internal functions are prefixed by pg_ and anything else is public, free to use functionality. Sure, it's free to use. It's not free to assume that we promise never to change it. Changing hash functions also makes in-place upgrades a lot harder, as they can't be done incrementally anymore for tables which use hash indexes. Hash indexes are so far from being production-grade that this argument is not significant. regards, tom lane In addition that change from 8.3 - 8.4 to store only the hash and not the value in the index means that a reindex would be required in any event. Cheers, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Kenneth Marshall k...@rice.edu writes: On Wed, Oct 28, 2009 at 03:31:17PM -0400, Tom Lane wrote: Hash indexes are so far from being production-grade that this argument is not significant. In addition that change from 8.3 - 8.4 to store only the hash and not the value in the index means that a reindex would be required in any event. Indeed, and I fully expect there will be some more on-disk format changes required before we get to the point where hash indexes are actually interesting for production. If we start insisting that they be in-place-upgradable now, we will pretty much guarantee that they never become useful enough to justify the restriction :-( (As examples, the hash bucket size probably needs revisiting, and we ought to think very hard about whether we shouldn't switch to 64-bit hash values. And that's not even considering some of the more advanced suggestions that have been made.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Wed, 2009-10-28 at 21:09 +0200, Hannu Krosing wrote: Is at least the fact that they are undocumented, have changed in the past, and are likely to change again in the future documented ? That's a little confusing to me: how do we document that something is undocumented? And where do we stop? Hashing is a quite fundamental thing in computing, so I was quite surprised to find out it had silently changed. There are many reasons to use a hash, and we don't want people to use these functions for the wrong purpose. I have seen people use a performance hash for security purposes before, and I had to demonstrate some hash collisions to show why that was a bad idea. So, if we do provide documented functions, it should be done carefully. Trying to develop and document a set of standardized, stable hash functions covering a wide range of possible use cases sounds like it may be better served by an extension. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote: On Wed, 2009-10-28 at 21:09 +0200, Hannu Krosing wrote: Is at least the fact that they are undocumented, have changed in the past, and are likely to change again in the future documented ? That's a little confusing to me: how do we document that something is undocumented? And where do we stop? My previous e-mail message documents that the undocumentedness is undocumented, so no need to go any further here ;) Though undocumented, the hash functions are easily discoverable by doing \df *hash* in psql Hashing is a quite fundamental thing in computing, so I was quite surprised to find out it had silently changed. There are many reasons to use a hash, and we don't want people to use these functions for the wrong purpose. I don't think that not documenting a hash function helps here at all. I have seen people use a performance hash for security purposes before, and I had to demonstrate some hash collisions to show why that was a bad idea. I've seen people use CRC32 as hash and then hit a collisions in 15 tries with quite large keys. So, if we do provide documented functions, it should be done carefully. Any user-visible behavior change should be done carefully, even if the original behavior is not documented. Careful upgrade of hasxxx functions would have kept the old functions, and introduced the new ones with _v2 suffix, and then used these in appropriate places. then kept the old ones for a few versions, with maybe a deprecation warning and then moved them to contrib for a few more versions. Doing it this way could leave them undocumented and still not break peoples applications in mysterious ways. Trying to develop and document a set of standardized, stable hash functions covering a wide range of possible use cases sounds like it may be better served by an extension. I guess there are enough security/crypt/strong hashes in pgcrypto package so that should not be a problem. -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Wed, 2009-10-28 at 15:31 -0400, Tom Lane wrote: Hannu Krosing ha...@2ndquadrant.com writes: I had never checked the docs for hash functions, but I had assumed, that internal functions are prefixed by pg_ and anything else is public, free to use functionality. Sure, it's free to use. It's not free to assume that we promise never to change it. Changing hash functions also makes in-place upgrades a lot harder, as they can't be done incrementally anymore for tables which use hash indexes. Hash indexes are so far from being production-grade that this argument is not significant. AFAIK in-place upgrade is also not quite production-grade, so this was meant as a forward-looking note for next time the hashxxx functions will change. regards, tom lane -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Did this change hashtext() visible to users? We have been using it quite widely for partitioning our databases. If so then it should be marked quite visibly in release notes as there might be others who will be hit by this. regards Asko On Mon, Feb 9, 2009 at 11:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: Kenneth Marshall k...@rice.edu writes: I have updated the patch posted by Jeff Davis on January 9th to include the micro-patch above as well as updated the polymorphism regressions tests. This applies cleanly to the latest CVS pull. Applied --- thanks for being persistent about resolving the doubts on this. One thing that apparently neither of you realized was that the polymorphism results were varying between bigendian and littleendian machines; I suppose you are using different hardware and that's why you didn't agree on what the results should be. Since we already agreed we were going to tolerate endianness dependence in the hash functions, I fixed that by adding some ORDER BYs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Asko Oja asc...@gmail.com writes: Did this change hashtext() visible to users? We have been using it quite widely for partitioning our databases. If so then it should be marked quite visibly in release notes as there might be others who will be hit by this. The hash functions are undocumented, have changed in the past, and are likely to change again in the future. If you are using them in a way that depends on them to give the same answers across versions, you'd better stop. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Kenneth Marshall k...@rice.edu writes: I have updated the patch posted by Jeff Davis on January 9th to include the micro-patch above as well as updated the polymorphism regressions tests. This applies cleanly to the latest CVS pull. Applied --- thanks for being persistent about resolving the doubts on this. One thing that apparently neither of you realized was that the polymorphism results were varying between bigendian and littleendian machines; I suppose you are using different hardware and that's why you didn't agree on what the results should be. Since we already agreed we were going to tolerate endianness dependence in the hash functions, I fixed that by adding some ORDER BYs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Sun, Jan 25, 2009 at 10:27:03PM -0600, Kenneth Marshall wrote: On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes: I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. The question that has been carefully evaded throughout the discussion of this patch is whether the randomness of the hash result is decreased, and if so what is that likely to cost us in performance of the various hash-dependent algorithms. I would still like to have an answer to that before we make a change to gain marginal performance improvement in the hash function itself (which is already shown to be barely measurable in the total context of a hash-dependent operation...) regards, tom lane Dear Hackers and Reviewers, In response to Tom's questioning the randomness of the hash_any when the mixing functions are split into two, the first used when sequentially processing the input and the second for the final mix, I have generated a more detailed analysis of the two hash_any functions. First, one change to the 11/2008 patch, the keylen is added to a, b and c initially so we do not need to add it later on. The is the micro-diff: - --- hashfunc.c_TWOMIX 2009-01-22 14:07:34.0 -0600 +++ hashfunc.c_TWOMIX2 2009-01-22 14:17:32.0 -0600 @@ -336,7 +336,6 @@ /* handle the last 11 bytes */ k = (const unsigned char *) ka; - c += keylen; #ifdef WORDS_BIGENDIAN switch (len) { @@ -439,7 +438,6 @@ } /* handle the last 11 bytes */ - c += keylen; #ifdef WORDS_BIGENDIAN switch (len)/* all the case statements fall through */ - The remainder of this document will use the runs from my initial results broken out using various power-of-two bucket sizes to simulate our actual use in PostgreSQL as the number of items hashed increases and we use more and more bits of our hash to identify the appropriate bucket. I have run each test twice, once with our current hash_any() with the single mix() function and then a second time using my patch from the November commitfest plus the patch above to produce a new hash_any() with two separate mixing functions mix() and final(). For each run I have generated a sequence of unique inputs, up to the number of buckets, hashed them with the hash functions (both old and new), then I calculate the expected number of collision p(n) using the poisson formula for each number of buckets, where the number of buckets are 2**16, 2**18, 2**20, 2**22, 2**24, and 2**26. For my initial run, I used a string consisting of the letter 'a' followed by the integer representation of the numbers from 0 to the (number of buckets - 1): 1) auint32 ((i.e. a1,a0002...) Number of buckets: 65536 Total number of items: 65536 Expected number with n items: 24109 24109 12054 4018 1004 200 33 4 Actual number mix(): 24044 24172 12078 4036 980 186 30 10 Actual number mix()/final(): 24027 24232 12060 3972 1001 207 31 5 1 Number of buckets: 262144 Total number of items: 262144 Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2 Actual number mix(): 96224 96730 48240 15951 4094 744 143 17 1 Actual number mix()/final(): 96335 96646 48071 16128 4018 801 122 21 2 Number of buckets: 1048576 Total number of items: 1048576 Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9 Actual number mix(): 385716 385596 193243 64115 16053 3285 478 77 12 1 Actual number mix()/final(): 385955 385016 193789 63768 16259 3190 511 79 8 1 Number of buckets: 4194304 Total number of items: 4194304 Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38 Actual number mix(): 1542536 1543489 771351 25 63830 12847 2123 326 19 5 1 Actual number mix()/final(): 1541828 1544429 772072 256178 64579 12774 2129 288 22 5 Number of buckets: 16777216 Total number of items: 16777216 Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153 Actual number mix(): 6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23 Actual number mix()/final(): 6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3 Number of buckets: 67108864 Total number of items: 67108864 Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612 Actual number mix(): 24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7 Actual number mix()/final(): 24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5 Here is a
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes: I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. The question that has been carefully evaded throughout the discussion of this patch is whether the randomness of the hash result is decreased, and if so what is that likely to cost us in performance of the various hash-dependent algorithms. I would still like to have an answer to that before we make a change to gain marginal performance improvement in the hash function itself (which is already shown to be barely measurable in the total context of a hash-dependent operation...) regards, tom lane Dear Hackers and Reviewers, In response to Tom's questioning the randomness of the hash_any when the mixing functions are split into two, the first used when sequentially processing the input and the second for the final mix, I have generated a more detailed analysis of the two hash_any functions. First, one change to the 11/2008 patch, the keylen is added to a, b and c initially so we do not need to add it later on. The is the micro-diff: - --- hashfunc.c_TWOMIX 2009-01-22 14:07:34.0 -0600 +++ hashfunc.c_TWOMIX2 2009-01-22 14:17:32.0 -0600 @@ -336,7 +336,6 @@ /* handle the last 11 bytes */ k = (const unsigned char *) ka; - c += keylen; #ifdef WORDS_BIGENDIAN switch (len) { @@ -439,7 +438,6 @@ } /* handle the last 11 bytes */ - c += keylen; #ifdef WORDS_BIGENDIAN switch (len)/* all the case statements fall through */ - The remainder of this document will use the runs from my initial results broken out using various power-of-two bucket sizes to simulate our actual use in PostgreSQL as the number of items hashed increases and we use more and more bits of our hash to identify the appropriate bucket. I have run each test twice, once with our current hash_any() with the single mix() function and then a second time using my patch from the November commitfest plus the patch above to produce a new hash_any() with two separate mixing functions mix() and final(). For each run I have generated a sequence of unique inputs, up to the number of buckets, hashed them with the hash functions (both old and new), then I calculate the expected number of collision p(n) using the poisson formula for each number of buckets, where the number of buckets are 2**16, 2**18, 2**20, 2**22, 2**24, and 2**26. For my initial run, I used a string consisting of the letter 'a' followed by the integer representation of the numbers from 0 to the (number of buckets - 1): 1) auint32 ((i.e. a1,a0002...) Number of buckets: 65536 Total number of items: 65536 Expected number with n items: 24109 24109 12054 4018 1004 200 33 4 Actual number mix(): 24044 24172 12078 4036 980 186 30 10 Actual number mix()/final(): 24027 24232 12060 3972 1001 207 31 5 1 Number of buckets: 262144 Total number of items: 262144 Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2 Actual number mix(): 96224 96730 48240 15951 4094 744 143 17 1 Actual number mix()/final(): 96335 96646 48071 16128 4018 801 122 21 2 Number of buckets: 1048576 Total number of items: 1048576 Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9 Actual number mix(): 385716 385596 193243 64115 16053 3285 478 77 12 1 Actual number mix()/final(): 385955 385016 193789 63768 16259 3190 511 79 8 1 Number of buckets: 4194304 Total number of items: 4194304 Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38 Actual number mix(): 1542536 1543489 771351 25 63830 12847 2123 326 19 5 1 Actual number mix()/final(): 1541828 1544429 772072 256178 64579 12774 2129 288 22 5 Number of buckets: 16777216 Total number of items: 16777216 Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153 Actual number mix(): 6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23 Actual number mix()/final(): 6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3 Number of buckets: 67108864 Total number of items: 67108864 Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612 Actual number mix(): 24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7 Actual number mix()/final(): 24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5 Here is a second run with number of items = (number of buckets)/2: Number of buckets: 65536 Total number of items: 32768 Expected number with n items: 39749 19874 4968 828 103 10 Actual
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Fri, Jan 09, 2009 at 02:00:39PM -0800, Jeff Davis wrote: On Fri, 2009-01-09 at 14:29 -0600, Kenneth Marshall wrote: Jeff, Thanks for the review. I would not really expect any differences in hash index build times other than normal noise variances. The most definitive benchmark that I have seen was done with my original patch submission in March posted by Luke of Greenplum: We just applied this and saw a 5 percent speedup on a hash aggregation query with four columns in a 'group by' clause run against a single TPC-H table. I wonder if they would be willing to re-run their test? Thanks again. Separating mix() and final() should have some performance benefit, right? Regards, Jeff Davis Yes, it does but the results can be swamped by other latencies in the code path. Tests such as Tom's benchmark of the underlying functions is needed to isolate the timings effectively or a benchmark like Greenplum's that will benefit from a more efficient function. Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Sat, 2009-01-10 at 11:06 -0600, Kenneth Marshall wrote: Separating mix() and final() should have some performance benefit, right? Yes, it does but the results can be swamped by other latencies in the code path. Tests such as Tom's benchmark of the underlying functions is needed to isolate the timings effectively or a benchmark like Greenplum's that will benefit from a more efficient function. Ok. I isolated the function itself by just doing: -- 10 million rows of random()::text EXPLAIN ANALYZE SELECT hashtext(t) FROM randomtext; I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. I tried quickly with a few other data types and got similar results. It's obviously a small microbenchmark, but that's good enough for me. Thanks! Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Jeff Davis pg...@j-davis.com writes: I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. The question that has been carefully evaded throughout the discussion of this patch is whether the randomness of the hash result is decreased, and if so what is that likely to cost us in performance of the various hash-dependent algorithms. I would still like to have an answer to that before we make a change to gain marginal performance improvement in the hash function itself (which is already shown to be barely measurable in the total context of a hash-dependent operation...) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Sat, 2009-01-10 at 13:36 -0500, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes: I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. The question that has been carefully evaded throughout the discussion of this patch is whether the randomness of the hash result is decreased, and if so what is that likely to cost us in performance of the various hash-dependent algorithms. I would still like to have an answer to that before we make a change to gain marginal performance improvement in the hash function itself (which is already shown to be barely measurable in the total context of a hash-dependent operation...) In: http://archives.postgresql.org/message-id/20081104202655.gp18...@it.is.rice.edu Ken showed that the number of hash collisions is the same in the old and the new code for his test data. Not a direct measurement of randomness, but it's some kind of indication. I'm not an authority on either hash functions or statistics, so I'll have to defer to Ken or someone else to actually prove that the randomness is maintained. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Tom Lane t...@sss.pgh.pa.us writes: Jeff Davis pg...@j-davis.com writes: I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. The question that has been carefully evaded throughout the discussion of this patch is whether the randomness of the hash result is decreased, In fairness that doesn't seem to be the case. The original patch was posted with such an analysis using cracklib and raw binary data: http://article.gmane.org/gmane.comp.db.postgresql.devel.general/105675 marginal performance improvement in the hash function itself (which is already shown to be barely measurable in the total context of a hash-dependent operation...) If it's a 6% gain in the speed of Hash Join or HashAggregate it would be very interesting. But I gather it's a 6% speedup in the time spent actually in the hash function. Is that really where much of our time is going? If it's 10% of the total time to execute one of these nodes then we're talking about a 0.6% optimization... -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Sat, 2009-01-10 at 13:57 -0500, Gregory Stark wrote: But I gather it's a 6% speedup in the time spent actually in the hash function. That's correct. I was not able to detect any difference at all between the old and new code using HashAggregate, which is where most of my testing effort went: http://archives.postgresql.org/message-id/1231531455.25019.75.ca...@jdavis (see test results attached to that message, if you're interested) I tested with two data distributions and 6 data types. The 6-10% speedup I saw was a super-micro benchmark testing the hash function, and I didn't spend much time with that. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Sat, Jan 10, 2009 at 10:56:15AM -0800, Jeff Davis wrote: On Sat, 2009-01-10 at 13:36 -0500, Tom Lane wrote: Jeff Davis pg...@j-davis.com writes: I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. The question that has been carefully evaded throughout the discussion of this patch is whether the randomness of the hash result is decreased, and if so what is that likely to cost us in performance of the various hash-dependent algorithms. I would still like to have an answer to that before we make a change to gain marginal performance improvement in the hash function itself (which is already shown to be barely measurable in the total context of a hash-dependent operation...) In: http://archives.postgresql.org/message-id/20081104202655.gp18...@it.is.rice.edu Ken showed that the number of hash collisions is the same in the old and the new code for his test data. Not a direct measurement of randomness, but it's some kind of indication. I'm not an authority on either hash functions or statistics, so I'll have to defer to Ken or someone else to actually prove that the randomness is maintained. Regards, Jeff Davis First, while I am not an expert by any means, basic statistics will give you the probability of a collision when packing N items into M buckets chosen at random. In all of my tests, I used 1.6M items into 2^32 buckets. If the bits mixing is complete and random, the number of collisions should be close to the same no matter what arrangement of bits are used for inputs. I think my test cases indicate quite clearly that the new hash function is as independent of bit- order as the older functions, in that the number of bucket collisions is basically the same for all layouts. I have the test harness and can test any other input that you would like me to. Second, the author of the basic hash function (old and new) has performed more extensive testing and has shown that the new functions are better in randomizing bits than his original function. I will try and run a micro-benchmark of my original patch in March and the result of the incremental approach that is the result of my Novermber patch tomorrow. Cheers, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Sat, Jan 10, 2009 at 01:57:27PM -0500, Gregory Stark wrote: Tom Lane t...@sss.pgh.pa.us writes: Jeff Davis pg...@j-davis.com writes: I ran 5 times on both old and new code, eliminating the top and bottom and taking the average of the remaining 3, and I got a 6.9% performance improvement with the new code. The question that has been carefully evaded throughout the discussion of this patch is whether the randomness of the hash result is decreased, In fairness that doesn't seem to be the case. The original patch was posted with such an analysis using cracklib and raw binary data: http://article.gmane.org/gmane.comp.db.postgresql.devel.general/105675 marginal performance improvement in the hash function itself (which is already shown to be barely measurable in the total context of a hash-dependent operation...) If it's a 6% gain in the speed of Hash Join or HashAggregate it would be very interesting. But I gather it's a 6% speedup in the time spent actually in the hash function. Is that really where much of our time is going? If it's 10% of the total time to execute one of these nodes then we're talking about a 0.6% optimization... The Greenplum test did show a 5% increase in performance with the replacement functions in March. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote: Dear PostgreSQL developers, I am re-sending this to keep this last change to the internal hash function on the radar. Hi Ken, A few comments: 1. New patch with very minor changes attached. 2. I reverted the change you made to indices.sgml. We still don't use WAL for hash indexes, and in my opinion we should continue to discourage their use until we do use WAL. We can add back in the comment that hash indexes are suitable for large keys if we have some results to show that. 3. There was a regression test failure in union.sql because the ordering of the results was different. I updated the regression test. 4. Hash functions affect a lot more than hash indexes, so I ran through a variety of tests that use a HashAggregate plan. Test setup and results are attached. These results show no difference between the old and the new code (about 0.1% better). 5. The hash index build time shows some improvement. The new code won in every instance in which a there were a lot of duplicates in the table (100 distinct values, 50K of each) by around 5%. The new code appeared to be the same or slightly worse in the case of hash index builds with few duplicates (100 distinct values, 5 of each). The difference was about 1% worse, which is probably just noise. Note: I'm no expert on hash functions. Take all of my tests with a grain of salt. I would feel a little better if I saw at least one test that showed better performance of the new code on a reasonable-looking distribution of data. The hash index build that you showed only took a second or two -- it would be nice to see a test that lasted at least a minute. Regards, Jeff Davis test_results.tar.gz Description: application/compressed-tar diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c index 96d5643..8a236b5 100644 --- a/src/backend/access/hash/hashfunc.c +++ b/src/backend/access/hash/hashfunc.c @@ -200,39 +200,94 @@ hashvarlena(PG_FUNCTION_ARGS) * hash function, see http://burtleburtle.net/bob/hash/doobs.html, * or Bob's article in Dr. Dobb's Journal, Sept. 1997. * - * In the current code, we have adopted an idea from Bob's 2006 update - * of his hash function, which is to fetch the data a word at a time when - * it is suitably aligned. This makes for a useful speedup, at the cost - * of having to maintain four code paths (aligned vs unaligned, and - * little-endian vs big-endian). Note that we have NOT adopted his newer - * mix() function, which is faster but may sacrifice some randomness. + * In the current code, we have adopted Bob's 2006 update of his hash + * which fetches the data a word at a time when it is suitably aligned. + * This makes for a useful speedup, at the cost of having to maintain + * four code paths (aligned vs unaligned, and little-endian vs big-endian). + * It also two separate mixing functions mix() and final(), instead + * of a slower multi-purpose function. */ /* Get a bit mask of the bits set in non-uint32 aligned addresses */ #define UINT32_ALIGN_MASK (sizeof(uint32) - 1) +#define rot(x,k) (((x)(k)) | ((x)(32-(k /*-- * mix -- mix 3 32-bit values reversibly. - * For every delta with one or two bits set, and the deltas of all three - * high bits or all three low bits, whether the original value of a,b,c - * is almost all zero or is uniformly distributed, - * - If mix() is run forward or backward, at least 32 bits in a,b,c - * have at least 1/4 probability of changing. - * - If mix() is run forward, every bit of c will change between 1/3 and - * 2/3 of the time. (Well, 22/100 and 78/100 for some 2-bit deltas.) + * + * This is reversible, so any information in (a,b,c) before mix() is + * still in (a,b,c) after mix(). + * + * If four pairs of (a,b,c) inputs are run through mix(), or through + * mix() in reverse, there are at least 32 bits of the output that + * are sometimes the same for one pair and different for another pair. + * This was tested for: + * * pairs that differed by one bit, by two bits, in any combination + * of top bits of (a,b,c), or in any combination of bottom bits of + * (a,b,c). + * * differ is defined as +, -, ^, or ~^. For + and -, I transformed + * the output delta to a Gray code (a^(a1)) so a string of 1's (as + * is commonly produced by subtraction) look like a single 1-bit + * difference. + * * the base values were pseudorandom, all zero but one bit set, or + * all zero plus a counter that starts at zero. + * + * This does not achieve avalanche. There are input bits of (a,b,c) + * that fail to affect some output bits of (a,b,c), especially of a. The + * most thoroughly mixed value is c, but it doesn't really even achieve + * avalanche in c. + * + * This allows some parallelism. Read-after-writes are good at doubling + * the number of bits affected, so the goal of mixing pulls in the opposite + * direction as the goal of parallelism. I did what
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Fri, Jan 09, 2009 at 12:04:15PM -0800, Jeff Davis wrote: On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote: Dear PostgreSQL developers, I am re-sending this to keep this last change to the internal hash function on the radar. Hi Ken, A few comments: 1. New patch with very minor changes attached. 2. I reverted the change you made to indices.sgml. We still don't use WAL for hash indexes, and in my opinion we should continue to discourage their use until we do use WAL. We can add back in the comment that hash indexes are suitable for large keys if we have some results to show that. 3. There was a regression test failure in union.sql because the ordering of the results was different. I updated the regression test. 4. Hash functions affect a lot more than hash indexes, so I ran through a variety of tests that use a HashAggregate plan. Test setup and results are attached. These results show no difference between the old and the new code (about 0.1% better). 5. The hash index build time shows some improvement. The new code won in every instance in which a there were a lot of duplicates in the table (100 distinct values, 50K of each) by around 5%. The new code appeared to be the same or slightly worse in the case of hash index builds with few duplicates (100 distinct values, 5 of each). The difference was about 1% worse, which is probably just noise. Note: I'm no expert on hash functions. Take all of my tests with a grain of salt. I would feel a little better if I saw at least one test that showed better performance of the new code on a reasonable-looking distribution of data. The hash index build that you showed only took a second or two -- it would be nice to see a test that lasted at least a minute. Regards, Jeff Davis Jeff, Thanks for the review. I would not really expect any differences in hash index build times other than normal noise variances. The most definitive benchmark that I have seen was done with my original patch submission in March posted by Luke of Greenplum: We just applied this and saw a 5 percent speedup on a hash aggregation query with four columns in a 'group by' clause run against a single TPC-H table. I wonder if they would be willing to re-run their test? Thanks again. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Fri, 2009-01-09 at 14:29 -0600, Kenneth Marshall wrote: Jeff, Thanks for the review. I would not really expect any differences in hash index build times other than normal noise variances. The most definitive benchmark that I have seen was done with my original patch submission in March posted by Luke of Greenplum: We just applied this and saw a 5 percent speedup on a hash aggregation query with four columns in a 'group by' clause run against a single TPC-H table. I wonder if they would be willing to re-run their test? Thanks again. Separating mix() and final() should have some performance benefit, right? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Dear PostgreSQL developers, I am re-sending this to keep this last change to the internal hash function on the radar. Ken Sorry about the delay for this update to the new hash index implementation. I was trying to get the WAL logging in place and forgot to post the actual patch. The WAL for hash indexes will need to wait for 8.5, but I did want to add back in the piece of the Bob Jenkins 2006 hash function that was stripped out of the initial patch on application due to concerns about the randomness of the resulting hash values. Here is a re-post of my initial findings comparing the old/new Jenkins hash from lookup2 and lookup3. I have added a third column containing the results for the hash_any() resulting from the attached patch as well as simple timing test for a DB reindex both before and after patching. Also attached is a simple documentation patch updating the note attached to the hash index description. Regards, Ken Hi, I have finally had a chance to do some investigation on the performance of the old hash mix() function versus the updated mix()/final() in the new hash function. Here is a table of my current results for both the old and the new hash function. In this case cracklib refers to the cracklib-dict containing 1648379 unique words massaged in various ways to generate input strings for the hash functions. The result is the number of collisions in the hash values generated. hash inputoldnew newv2 -------- - cracklib 338316 338 cracklib x 2 (i.e. clibclib) 305319 300 cracklib x 3 (clibclibclib) 323329 315 cracklib x 10 302310 329 cracklib x 100350335 298 cracklib x 1000 314309 315 cracklib x 100 truncated to char(100) 311327 320 uint32 from 1-1648379 309319 347 (uint32 1-1948379)*256309314 304 (uint32 1-1948379)*16 310314 324 auint32 (i.e. a1,a0002...) 320321 312 uint32uint32 (i.e. uint64)321287 309 The different result columns are old = Jenkins 1996 hash function(lookup2.c), new = Jenkins 2006 hash function (lookup3.c), and newv2 = adaptation of current hash_any() to incorporate the separate mix()/final() functions. As you can see from the results, spliting the mix() and final() apart does not result in any perceptible loss of randomness in the hash assignment. I also ran a crude timing for a reindex of the following database: CREATE TABLE dict (word text); CREATE INDEX wordhash ON dict USING hash (word); INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo'); INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict); ... (21 times) REINDEX TABLE ... The average time to reindex the table using our current hash_any() without the separate mix()/final() was 1696ms and 1482ms with the separate mix()/final() stages giving almost 13% better performance for this stupid metric. --- indices.sgml2008-10-13 14:40:06.0 -0500 +++ indices.sgml.NEW2008-11-04 12:42:35.0 -0600 @@ -190,13 +190,11 @@ note para -Testing has shown productnamePostgreSQL/productname's hash -indexes to perform no better than B-tree indexes, and the -index size and build time for hash indexes is much worse. -Furthermore, hash index operations are not presently WAL-logged, +productnamePostgreSQL/productname's hash indexes provide +the fast O(1) lookups, even for very large objects. +Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with commandREINDEX/ -after a database crash. -For these reasons, hash index use is presently discouraged. +after a database crash. /para /note --- hashfunc.c.ORIG 2008-09-03 13:07:14.0 -0500 +++ hashfunc.c.NEW 2008-11-04 08:36:16.0 -0600 @@ -200,39 +200,94 @@ * hash function, see http://burtleburtle.net/bob/hash/doobs.html, * or Bob's article in Dr. Dobb's Journal, Sept. 1997. * - * In the current code, we have adopted an idea from Bob's 2006 update - * of his hash function, which is to fetch the data a word at a time when - * it is suitably aligned. This makes for a useful speedup, at the cost - * of having to maintain four code paths (aligned vs unaligned, and - * little-endian vs big-endian). Note that we have NOT adopted his newer - * mix() function, which is faster but may sacrifice some randomness. + * In the current code, we have adopted Bob's 2006 update of his hash + * which fetches the data a word at a time when it is suitably aligned. + * This makes for a useful speedup, at the cost of having to maintain + * four code paths (aligned vs unaligned, and little-endian vs big-endian). + * It also two separate mixing functions
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Kenneth Marshall wrote: Dear PostgreSQL developers, I am re-sending this to keep this last change to the internal hash function on the radar. Please add it to the commitfest wiki page, http://wiki.postgresql.org/wiki/CommitFest_2008-11 Thanks -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Just interested if you repeat your tests not with cracklib-dict, but using 8-bit words. From our experience we found many hash functions are optimized for 7-bit words and produce too many collisions for 8-bit words. That's why we use crc32. Oleg On Tue, 4 Nov 2008, Kenneth Marshall wrote: Sorry about the delay for this update to the new hash index implementation. I was trying to get the WAL logging in place and forgot to post the actual patch. The WAL for hash indexes will need to wait for 8.5, but I did want to add back in the piece of the Bob Jenkins 2006 hash function that was stripped out of the initial patch on application due to concerns about the randomness of the resulting hash values. Here is a re-post of my initial findings comparing the old/new Jenkins hash from lookup2 and lookup3. I have added a third column containing the results for the hash_any() resulting from the attached patch as well as simple timing test for a DB reindex both before and after patching. Also attached is a simple documentation patch updating the note attached to the hash index description. Regards, Ken Hi, I have finally had a chance to do some investigation on the performance of the old hash mix() function versus the updated mix()/final() in the new hash function. Here is a table of my current results for both the old and the new hash function. In this case cracklib refers to the cracklib-dict containing 1648379 unique words massaged in various ways to generate input strings for the hash functions. The result is the number of collisions in the hash values generated. hash inputoldnew newv2 -------- - cracklib 338316 338 cracklib x 2 (i.e. clibclib) 305319 300 cracklib x 3 (clibclibclib) 323329 315 cracklib x 10 302310 329 cracklib x 100350335 298 cracklib x 1000 314309 315 cracklib x 100 truncated to char(100) 311327 320 uint32 from 1-1648379 309319 347 (uint32 1-1948379)*256309314 304 (uint32 1-1948379)*16 310314 324 auint32 (i.e. a1,a0002...) 320321 312 uint32uint32 (i.e. uint64)321287 309 The different result columns are old = Jenkins 1996 hash function(lookup2.c), new = Jenkins 2006 hash function (lookup3.c), and newv2 = adaptation of current hash_any() to incorporate the separate mix()/final() functions. As you can see from the results, spliting the mix() and final() apart does not result in any perceptible loss of randomness in the hash assignment. I also ran a crude timing for a reindex of the following database: CREATE TABLE dict (word text); CREATE INDEX wordhash ON dict USING hash (word); INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo'); INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict); ... (21 times) REINDEX TABLE ... The average time to reindex the table using our current hash_any() without the separate mix()/final() was 1696ms and 1482ms with the separate mix()/final() stages giving almost 13% better performance for this stupid metric. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote: Just interested if you repeat your tests not with cracklib-dict, but using 8-bit words. From our experience we found many hash functions are optimized for 7-bit words and produce too many collisions for 8-bit words. That's why we use crc32. Oleg I think that the lines: uint32 from 1-1648379 309319 347 (uint32 1-1948379)*256309314 304 (uint32 1-1948379)*16 310314 324 auint32 (i.e. a1,a0002...) 320321 312 uint32uint32 (i.e. uint64)321287 309 can count as 8-bit words if taken a byte at a time. In fact that is how hash_any() treats them, as a character string and a length. One of the design goals of the original 1997 hash function in lookup2 and the 2006 update in lookup3 is to support keys of arbitrary arrangements of bits. I can run any additional checks that you want since the test harness is perl with Inline::C. If you are using crc32 his article in Dr. Dobbs shows that CRC has a 2 into 1 funnel-15 and an 11 into 10 funnel-100 unless you are using a generalized CRC. Also, unless you can inline your CRC the Jenkins lookup3 is 5n+20 where CRC is 9n+3. Regards, Ken On Tue, 4 Nov 2008, Kenneth Marshall wrote: Sorry about the delay for this update to the new hash index implementation. I was trying to get the WAL logging in place and forgot to post the actual patch. The WAL for hash indexes will need to wait for 8.5, but I did want to add back in the piece of the Bob Jenkins 2006 hash function that was stripped out of the initial patch on application due to concerns about the randomness of the resulting hash values. Here is a re-post of my initial findings comparing the old/new Jenkins hash from lookup2 and lookup3. I have added a third column containing the results for the hash_any() resulting from the attached patch as well as simple timing test for a DB reindex both before and after patching. Also attached is a simple documentation patch updating the note attached to the hash index description. Regards, Ken Hi, I have finally had a chance to do some investigation on the performance of the old hash mix() function versus the updated mix()/final() in the new hash function. Here is a table of my current results for both the old and the new hash function. In this case cracklib refers to the cracklib-dict containing 1648379 unique words massaged in various ways to generate input strings for the hash functions. The result is the number of collisions in the hash values generated. hash inputoldnew newv2 -------- - cracklib 338316 338 cracklib x 2 (i.e. clibclib) 305319 300 cracklib x 3 (clibclibclib) 323329 315 cracklib x 10 302310 329 cracklib x 100350335 298 cracklib x 1000 314309 315 cracklib x 100 truncated to char(100) 311327 320 uint32 from 1-1648379 309319 347 (uint32 1-1948379)*256309314 304 (uint32 1-1948379)*16 310314 324 auint32 (i.e. a1,a0002...) 320321 312 uint32uint32 (i.e. uint64)321287 309 The different result columns are old = Jenkins 1996 hash function(lookup2.c), new = Jenkins 2006 hash function (lookup3.c), and newv2 = adaptation of current hash_any() to incorporate the separate mix()/final() functions. As you can see from the results, spliting the mix() and final() apart does not result in any perceptible loss of randomness in the hash assignment. I also ran a crude timing for a reindex of the following database: CREATE TABLE dict (word text); CREATE INDEX wordhash ON dict USING hash (word); INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo'); INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict); ... (21 times) REINDEX TABLE ... The average time to reindex the table using our current hash_any() without the separate mix()/final() was 1696ms and 1482ms with the separate mix()/final() stages giving almost 13% better performance for this stupid metric. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Oleg, Here is a little more information on the use of CRC32 as a hash function, with some warning caveats: http://home.comcast.net/~bretm/hash/8.html Regards, Ken On Tue, Nov 04, 2008 at 03:15:44PM -0600, Kenneth Marshall wrote: On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote: Just interested if you repeat your tests not with cracklib-dict, but using 8-bit words. From our experience we found many hash functions are optimized for 7-bit words and produce too many collisions for 8-bit words. That's why we use crc32. Oleg I think that the lines: uint32 from 1-1648379 309319 347 (uint32 1-1948379)*256309314 304 (uint32 1-1948379)*16 310314 324 auint32 (i.e. a1,a0002...) 320321 312 uint32uint32 (i.e. uint64)321287 309 can count as 8-bit words if taken a byte at a time. In fact that is how hash_any() treats them, as a character string and a length. One of the design goals of the original 1997 hash function in lookup2 and the 2006 update in lookup3 is to support keys of arbitrary arrangements of bits. I can run any additional checks that you want since the test harness is perl with Inline::C. If you are using crc32 his article in Dr. Dobbs shows that CRC has a 2 into 1 funnel-15 and an 11 into 10 funnel-100 unless you are using a generalized CRC. Also, unless you can inline your CRC the Jenkins lookup3 is 5n+20 where CRC is 9n+3. Regards, Ken -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1
Sorry about the delay for this update to the new hash index implementation. I was trying to get the WAL logging in place and forgot to post the actual patch. The WAL for hash indexes will need to wait for 8.5, but I did want to add back in the piece of the Bob Jenkins 2006 hash function that was stripped out of the initial patch on application due to concerns about the randomness of the resulting hash values. Here is a re-post of my initial findings comparing the old/new Jenkins hash from lookup2 and lookup3. I have added a third column containing the results for the hash_any() resulting from the attached patch as well as simple timing test for a DB reindex both before and after patching. Also attached is a simple documentation patch updating the note attached to the hash index description. Regards, Ken Hi, I have finally had a chance to do some investigation on the performance of the old hash mix() function versus the updated mix()/final() in the new hash function. Here is a table of my current results for both the old and the new hash function. In this case cracklib refers to the cracklib-dict containing 1648379 unique words massaged in various ways to generate input strings for the hash functions. The result is the number of collisions in the hash values generated. hash inputoldnew newv2 -------- - cracklib 338316 338 cracklib x 2 (i.e. clibclib) 305319 300 cracklib x 3 (clibclibclib) 323329 315 cracklib x 10 302310 329 cracklib x 100350335 298 cracklib x 1000 314309 315 cracklib x 100 truncated to char(100) 311327 320 uint32 from 1-1648379 309319 347 (uint32 1-1948379)*256309314 304 (uint32 1-1948379)*16 310314 324 auint32 (i.e. a1,a0002...) 320321 312 uint32uint32 (i.e. uint64)321287 309 The different result columns are old = Jenkins 1996 hash function(lookup2.c), new = Jenkins 2006 hash function (lookup3.c), and newv2 = adaptation of current hash_any() to incorporate the separate mix()/final() functions. As you can see from the results, spliting the mix() and final() apart does not result in any perceptible loss of randomness in the hash assignment. I also ran a crude timing for a reindex of the following database: CREATE TABLE dict (word text); CREATE INDEX wordhash ON dict USING hash (word); INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo'); INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict); ... (21 times) REINDEX TABLE ... The average time to reindex the table using our current hash_any() without the separate mix()/final() was 1696ms and 1482ms with the separate mix()/final() stages giving almost 13% better performance for this stupid metric. --- indices.sgml2008-10-13 14:40:06.0 -0500 +++ indices.sgml.NEW2008-11-04 12:42:35.0 -0600 @@ -190,13 +190,11 @@ note para -Testing has shown productnamePostgreSQL/productname's hash -indexes to perform no better than B-tree indexes, and the -index size and build time for hash indexes is much worse. -Furthermore, hash index operations are not presently WAL-logged, +productnamePostgreSQL/productname's hash indexes provide +the fast O(1) lookups, even for very large objects. +Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with commandREINDEX/ -after a database crash. -For these reasons, hash index use is presently discouraged. +after a database crash. /para /note --- hashfunc.c.ORIG 2008-09-03 13:07:14.0 -0500 +++ hashfunc.c.NEW 2008-11-04 08:36:16.0 -0600 @@ -200,39 +200,94 @@ * hash function, see http://burtleburtle.net/bob/hash/doobs.html, * or Bob's article in Dr. Dobb's Journal, Sept. 1997. * - * In the current code, we have adopted an idea from Bob's 2006 update - * of his hash function, which is to fetch the data a word at a time when - * it is suitably aligned. This makes for a useful speedup, at the cost - * of having to maintain four code paths (aligned vs unaligned, and - * little-endian vs big-endian). Note that we have NOT adopted his newer - * mix() function, which is faster but may sacrifice some randomness. + * In the current code, we have adopted Bob's 2006 update of his hash + * which fetches the data a word at a time when it is suitably aligned. + * This makes for a useful speedup, at the cost of having to maintain + * four code paths (aligned vs unaligned, and little-endian vs big-endian). + * It also two separate mixing functions mix() and final() instead + * of a single multi-purpose function, that is slower as a result. */ /* Get a bit mask of the bits set in