Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-29 Thread Peter Eisentraut
On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote:
 Trying to develop and document a set of standardized, stable hash
 functions covering a wide range of possible use cases sounds like it may
 be better served by an extension.

I suspect that some of the participants in this thread have PL/Proxy in
mind.  PL/Proxy should probably ship its own set of hash functions.

If we ever get built-in partitioning by hash (see thread nearby and most
previous ones like it), we should also think about providing a hash
function that doesn't change output over versions and is independent of
hash index implementation concerns.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-29 Thread Hannu Krosing
On Thu, 2009-10-29 at 09:47 +0200, Peter Eisentraut wrote:
 On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote:
  Trying to develop and document a set of standardized, stable hash
  functions covering a wide range of possible use cases sounds like it may
  be better served by an extension.
 
 I suspect that some of the participants in this thread have PL/Proxy in
 mind.  

Yes, I think that PL/Proxy is the prime example of such use.

 PL/Proxy should probably ship its own set of hash functions.

Or maybe we could just extract the hashes from some version of stock
PostgreSQL (say 8.3) and then make those available in contrib under the
name stable_hashes?

 If we ever get built-in partitioning by hash (see thread nearby and most
 previous ones like it), we should also think about providing a hash
 function that doesn't change output over versions and is independent of
 hash index implementation concerns.

And maybe even documented ;)

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training





Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-29 Thread Tom Lane
Hannu Krosing ha...@2ndquadrant.com writes:
 Or maybe we could just extract the hashes from some version of stock
 PostgreSQL (say 8.3) and then make those available in contrib under the
 name stable_hashes?

A better name would be wishful_thinking ... contrib does not have
control over some of the main risk factors, such as changes to the
internal representation of datatypes.

regards, tom lane



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-29 Thread Peter Eisentraut
On Thu, 2009-10-29 at 09:32 -0400, Tom Lane wrote:
 Hannu Krosing ha...@2ndquadrant.com writes:
  Or maybe we could just extract the hashes from some version of stock
  PostgreSQL (say 8.3) and then make those available in contrib under the
  name stable_hashes?
 
 A better name would be wishful_thinking ... contrib does not have
 control over some of the main risk factors, such as changes to the
 internal representation of datatypes.

Good point.  But the internal layout of the data types popular for use
as hash keys changes rarely.  A hash function change, however, is a
showstopper for everyone concerned.




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-29 Thread Hannu Krosing
On Thu, 2009-10-29 at 09:32 -0400, Tom Lane wrote:
 Hannu Krosing ha...@2ndquadrant.com writes:
  Or maybe we could just extract the hashes from some version of stock
  PostgreSQL (say 8.3) and then make those available in contrib under the
  name stable_hashes?
 
 A better name would be wishful_thinking ... contrib does not have
 control over some of the main risk factors, such as changes to the
 internal representation of datatypes.

My understanding was that contrib is actually the only way to have
something maintained in sync with core PostgreSQL.

   regards, tom lane
-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training





Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-28 Thread Hannu Krosing
On Wed, 2009-02-11 at 11:22 -0500, Tom Lane wrote:
 Asko Oja asc...@gmail.com writes:
  Did this change hashtext() in a way visible to users? We have been using it
  quite widely for partitioning our databases. If so, it should be marked quite
  visibly in the release notes, as there might be others who will be hit by this.
 
 The hash functions are undocumented, have changed in the past, and are
 likely to change again in the future.  If you are using them in a way
 that depends on them to give the same answers across versions, you'd
 better stop.

Is at least the fact that they are undocumented, have changed in the
past, and are likely to change again in the future documented?

I'm sure this is something that has hit unwary users in the past and
will hit again in the future, so some words about it in the docs would
be appropriate.

A search for hashtext on
http://www.postgresql.org/docs/8.4/interactive/index.html returned no
results, so I guess even their "undocumented, will surprise you" status
is not documented.

Hashing is a quite fundamental thing in computing, so I was quite
surprised to find out it had silently changed. 

I had never checked the docs for hash functions, but I had assumed that
internal functions are prefixed with pg_ and anything else is public,
free-to-use functionality.

Changing hash functions also makes in-place upgrades a lot harder, as
they can't be done incrementally anymore for tables which use hash
indexes.


   regards, tom lane
 
-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training





Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-28 Thread Tom Lane
Hannu Krosing ha...@2ndquadrant.com writes:
 I had never checked the docs for hash functions, but I had assumed that
 internal functions are prefixed with pg_ and anything else is public,
 free-to-use functionality.

Sure, it's free to use.  It's not free to assume that we promise never
to change it.

 Changing hash functions also makes in-place upgrades a lot harder, as
 they can't be done incrementally anymore for tables which use hash
 indexes.

Hash indexes are so far from being production-grade that this argument
is not significant.

regards, tom lane



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-28 Thread Kenneth Marshall
On Wed, Oct 28, 2009 at 03:31:17PM -0400, Tom Lane wrote:
 Hannu Krosing ha...@2ndquadrant.com writes:
  I had never checked the docs for hash functions, but I had assumed that
  internal functions are prefixed with pg_ and anything else is public,
  free-to-use functionality.
 
 Sure, it's free to use.  It's not free to assume that we promise never
 to change it.
 
  Changing hash functions also makes in-place upgrades a lot harder, as
  they can't be done incrementally anymore for tables which use hash
  indexes.
 
 Hash indexes are so far from being production-grade that this argument
 is not significant.
 
   regards, tom lane

In addition, the change from 8.3 to 8.4 to store only the hash and not
the value in the index means that a reindex would be required in any event.

Cheers,
Ken



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-28 Thread Tom Lane
Kenneth Marshall k...@rice.edu writes:
 On Wed, Oct 28, 2009 at 03:31:17PM -0400, Tom Lane wrote:
 Hash indexes are so far from being production-grade that this argument
 is not significant.

 In addition, the change from 8.3 to 8.4 to store only the hash and not
 the value in the index means that a reindex would be required in any event.

Indeed, and I fully expect there will be some more on-disk format
changes required before we get to the point where hash indexes are
actually interesting for production.  If we start insisting that they
be in-place-upgradable now, we will pretty much guarantee that they
never become useful enough to justify the restriction :-(

(As examples, the hash bucket size probably needs revisiting,
and we ought to think very hard about whether we shouldn't switch
to 64-bit hash values.  And that's not even considering some of the
more advanced suggestions that have been made.)

regards, tom lane



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-28 Thread Jeff Davis
On Wed, 2009-10-28 at 21:09 +0200, Hannu Krosing wrote:
 Is at least the fact that they are undocumented, have changed in the
 past, and are likely to change again in the future documented?

That's a little confusing to me: how do we document that something is
undocumented? And where do we stop?

 Hashing is a quite fundamental thing in computing, so I was quite
 surprised to find out it had silently changed. 

There are many reasons to use a hash, and we don't want people to use
these functions for the wrong purpose. I have seen people use a
performance hash for security purposes before, and I had to demonstrate
some hash collisions to show why that was a bad idea. So, if we do
provide documented functions, it should be done carefully.

Trying to develop and document a set of standardized, stable hash
functions covering a wide range of possible use cases sounds like it may
be better served by an extension.

Regards,
Jeff Davis 




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-28 Thread Hannu Krosing
On Wed, 2009-10-28 at 12:51 -0700, Jeff Davis wrote:
 On Wed, 2009-10-28 at 21:09 +0200, Hannu Krosing wrote:
  Is at least the fact that they are undocumented, have changed in the
  past, and are likely to change again in the future documented?
 
 That's a little confusing to me: how do we document that something is
 undocumented? And where do we stop?

My previous e-mail message documents that the undocumentedness is
undocumented, so no need to go any further here ;)

Though undocumented, the hash functions are easily discoverable by doing

\df *hash*

in psql

  Hashing is a quite fundamental thing in computing, so I was quite
  surprised to find out it had silently changed. 
 
 There are many reasons to use a hash, and we don't want people to use
 these functions for the wrong purpose. 

I don't think that not documenting a hash function helps here at all.

 I have seen people use a
 performance hash for security purposes before, and I had to demonstrate
 some hash collisions to show why that was a bad idea. 

I've seen people use CRC32 as a hash and then hit a collision in 15 tries
with quite large keys.

 So, if we do provide documented functions, it should be done carefully.

Any user-visible behavior change should be done carefully, even if the
original behavior is not documented.

A careful upgrade of the hashxxx functions would have kept the old
functions, introduced the new ones with a _v2 suffix, and used those in
the appropriate places; then kept the old ones for a few versions, maybe
with a deprecation warning, and then moved them to contrib for a few
more versions.

Doing it this way would have left them undocumented and still not broken
people's applications in mysterious ways.

 Trying to develop and document a set of standardized, stable hash
 functions covering a wide range of possible use cases sounds like it may
 be better served by an extension.

I guess there are enough security/crypto/strong hashes in the pgcrypto
package, so that should not be a problem.

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training





Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-10-28 Thread Hannu Krosing
On Wed, 2009-10-28 at 15:31 -0400, Tom Lane wrote:
 Hannu Krosing ha...@2ndquadrant.com writes:
  I had never checked the docs for hash functions, but I had assumed that
  internal functions are prefixed with pg_ and anything else is public,
  free-to-use functionality.
 
 Sure, it's free to use.  It's not free to assume that we promise never
 to change it.

  Changing hash functions also makes in-place upgrades a lot harder, as
  they can't be done incrementally anymore for tables which use hash
  indexes.
 
 Hash indexes are so far from being production-grade that this argument
 is not significant.

AFAIK in-place upgrade is also not quite production-grade, so this was
meant as a forward-looking note for the next time the hashxxx functions
change.

   regards, tom lane
-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability 
   Services, Consulting and Training





Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-02-11 Thread Asko Oja
Did this change hashtext() in a way visible to users? We have been using it
quite widely for partitioning our databases. If so, it should be marked quite
visibly in the release notes, as there might be others who will be hit by this.

regards
Asko

On Mon, Feb 9, 2009 at 11:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Kenneth Marshall k...@rice.edu writes:
  I have updated the patch posted by Jeff Davis on January 9th
  to include the micro-patch above as well as updated the polymorphism
  regressions tests. This applies cleanly to the latest CVS pull.

 Applied --- thanks for being persistent about resolving the doubts on this.

 One thing that apparently neither of you realized was that the
 polymorphism results were varying between bigendian and littleendian
 machines; I suppose you are using different hardware and that's why you
 didn't agree on what the results should be.

 Since we already agreed we were going to tolerate endianness dependence
 in the hash functions, I fixed that by adding some ORDER BYs.

regards, tom lane




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-02-11 Thread Tom Lane
Asko Oja asc...@gmail.com writes:
 Did this change hashtext() in a way visible to users? We have been using it
 quite widely for partitioning our databases. If so, it should be marked quite
 visibly in the release notes, as there might be others who will be hit by this.

The hash functions are undocumented, have changed in the past, and are
likely to change again in the future.  If you are using them in a way
that depends on them to give the same answers across versions, you'd
better stop.

regards, tom lane



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-02-09 Thread Tom Lane
Kenneth Marshall k...@rice.edu writes:
 I have updated the patch posted by Jeff Davis on January 9th
 to include the micro-patch above as well as updated the polymorphism
 regressions tests. This applies cleanly to the latest CVS pull.

Applied --- thanks for being persistent about resolving the doubts on this.

One thing that apparently neither of you realized was that the
polymorphism results were varying between bigendian and littleendian
machines; I suppose you are using different hardware and that's why you
didn't agree on what the results should be.

Since we already agreed we were going to tolerate endianness dependence
in the hash functions, I fixed that by adding some ORDER BYs.

regards, tom lane



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-27 Thread Kenneth Marshall
On Sun, Jan 25, 2009 at 10:27:03PM -0600, Kenneth Marshall wrote:
 On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote:
  Jeff Davis pg...@j-davis.com writes:
   I ran 5 times on both old and new code, eliminating the top and bottom
   and taking the average of the remaining 3, and I got a 6.9% performance
   improvement with the new code.
  
  The question that has been carefully evaded throughout the discussion
  of this patch is whether the randomness of the hash result is decreased,
  and if so what is that likely to cost us in performance of the various
  hash-dependent algorithms.  I would still like to have an answer to that
  before we make a change to gain marginal performance improvement in
  the hash function itself (which is already shown to be barely measurable
  in the total context of a hash-dependent operation...)
  
  regards, tom lane
  
 
 Dear Hackers and Reviewers,
 
 In response to Tom's questioning the randomness of the hash_any
 when the mixing functions are split into two, the first used when
 sequentially processing the input and the second for the final
 mix, I have generated a more detailed analysis of the two hash_any
 functions.
 
 First, one change to the 11/2008 patch: the keylen is added to a, b and
 c initially so we do not need to add it later on. This is the
 micro-diff:
 -
 
 --- hashfunc.c_TWOMIX   2009-01-22 14:07:34.0 -0600
 +++ hashfunc.c_TWOMIX2  2009-01-22 14:17:32.0 -0600
 @@ -336,7 +336,6 @@
  
 /* handle the last 11 bytes */
 k = (const unsigned char *) ka;
 -   c += keylen;
  #ifdef WORDS_BIGENDIAN
 switch (len)
 {
 @@ -439,7 +438,6 @@
 }
  
 /* handle the last 11 bytes */
 -   c += keylen;
  #ifdef WORDS_BIGENDIAN
 	switch (len)			/* all the case statements fall through */
  
 -
 
 The remainder of this document will use the runs from my initial results
 broken out using various power-of-two bucket sizes to simulate our actual
 use in PostgreSQL as the number of items hashed increases and we use more
 and more bits of our hash to identify the appropriate bucket. I have run
 each test twice, once with our current hash_any() with the single mix()
 function and then a second time using my patch from the November commitfest
 plus the patch above to produce a new hash_any() with two separate mixing
 functions mix() and final(). For each run I have generated a sequence of
 unique inputs, up to the number of buckets, hashed them with the hash
 functions (both old and new), then I calculate the expected number of
 collisions p(n) using the Poisson formula for each number of buckets,
 where the number of buckets are 2**16, 2**18, 2**20, 2**22, 2**24, and
 2**26. For my initial run, I used a string consisting of the letter 'a'
 followed by the integer representation of the numbers from 0 to the
 (number of buckets - 1):
 
 1) auint32 (i.e. a1, a0002...)
 Number of buckets: 65536
 Total number of items: 65536
 Expected number with n items: 24109 24109 12054 4018 1004 200 33 4
 Actual number mix():  24044 24172 12078 4036 980 186 30 10
 Actual number mix()/final():  24027 24232 12060 3972 1001 207 31 5 1
 
 Number of buckets: 262144
 Total number of items: 262144
 Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2
 Actual number mix():  96224 96730 48240 15951 4094 744 143 17 1
 Actual number mix()/final():  96335 96646 48071 16128 4018 801 122 21 2
 
 Number of buckets: 1048576
 Total number of items: 1048576
 Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9
 Actual number mix():  385716 385596 193243 64115 16053 3285 478 77 12 1
 Actual number mix()/final():  385955 385016 193789 63768 16259 3190 511 79 8 1
 
 Number of buckets: 4194304
 Total number of items: 4194304
 Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38
 Actual number mix():  1542536 1543489 771351 25 63830 12847 2123 326 19 5 1
 Actual number mix()/final():  1541828 1544429 772072 256178 64579 12774 2129 288 22 5
 
 Number of buckets: 16777216
 Total number of items: 16777216
 Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153
 Actual number mix():  6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23
 Actual number mix()/final():  6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3
 
 Number of buckets: 67108864
 Total number of items: 67108864
 Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612
 Actual number mix():  24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7
 Actual number mix()/final():  24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5
 
 Here is a 

Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-25 Thread Kenneth Marshall
On Sat, Jan 10, 2009 at 01:36:25PM -0500, Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
  I ran 5 times on both old and new code, eliminating the top and bottom
  and taking the average of the remaining 3, and I got a 6.9% performance
  improvement with the new code.
 
 The question that has been carefully evaded throughout the discussion
 of this patch is whether the randomness of the hash result is decreased,
 and if so what is that likely to cost us in performance of the various
 hash-dependent algorithms.  I would still like to have an answer to that
 before we make a change to gain marginal performance improvement in
 the hash function itself (which is already shown to be barely measurable
 in the total context of a hash-dependent operation...)
 
   regards, tom lane
 

Dear Hackers and Reviewers,

In response to Tom's questioning the randomness of the hash_any
when the mixing functions are split into two, the first used when
sequentially processing the input and the second for the final
mix, I have generated a more detailed analysis of the two hash_any
functions.

First, one change to the 11/2008 patch: the keylen is added to a, b and
c initially so we do not need to add it later on. This is the
micro-diff:
-

--- hashfunc.c_TWOMIX   2009-01-22 14:07:34.0 -0600
+++ hashfunc.c_TWOMIX2  2009-01-22 14:17:32.0 -0600
@@ -336,7 +336,6 @@
 
/* handle the last 11 bytes */
k = (const unsigned char *) ka;
-   c += keylen;
 #ifdef WORDS_BIGENDIAN
switch (len)
{
@@ -439,7 +438,6 @@
}
 
/* handle the last 11 bytes */
-   c += keylen;
 #ifdef WORDS_BIGENDIAN
	switch (len)			/* all the case statements fall through */
 
-

The remainder of this document will use the runs from my initial results
broken out using various power-of-two bucket sizes to simulate our actual
use in PostgreSQL as the number of items hashed increases and we use more
and more bits of our hash to identify the appropriate bucket. I have run
each test twice, once with our current hash_any() with the single mix()
function and then a second time using my patch from the November commitfest
plus the patch above to produce a new hash_any() with two separate mixing
functions mix() and final(). For each run I have generated a sequence of
unique inputs, up to the number of buckets, hashed them with the hash
functions (both old and new), then I calculate the expected number of
collisions p(n) using the Poisson formula for each number of buckets,
where the number of buckets are 2**16, 2**18, 2**20, 2**22, 2**24, and
2**26. For my initial run, I used a string consisting of the letter 'a'
followed by the integer representation of the numbers from 0 to the
(number of buckets - 1):

1) auint32 (i.e. a1, a0002...)
Number of buckets: 65536
Total number of items: 65536
Expected number with n items: 24109 24109 12054 4018 1004 200 33 4
Actual number mix():  24044 24172 12078 4036 980 186 30 10
Actual number mix()/final():  24027 24232 12060 3972 1001 207 31 5 1

Number of buckets: 262144
Total number of items: 262144
Expected number with n items: 96437 96437 48218 16072 4018 803 133 19 2
Actual number mix():  96224 96730 48240 15951 4094 744 143 17 1
Actual number mix()/final():  96335 96646 48071 16128 4018 801 122 21 2

Number of buckets: 1048576
Total number of items: 1048576
Expected number with n items: 385749 385749 192874 64291 16072 3214 535 76 9
Actual number mix():  385716 385596 193243 64115 16053 3285 478 77 12 1
Actual number mix()/final():  385955 385016 193789 63768 16259 3190 511 79 8 1

Number of buckets: 4194304
Total number of items: 4194304
Expected number with n items: 1542998 1542998 771499 257166 64291 12858 2143 306 38
Actual number mix():  1542536 1543489 771351 25 63830 12847 2123 326 19 5 1
Actual number mix()/final():  1541828 1544429 772072 256178 64579 12774 2129 288 22 5

Number of buckets: 16777216
Total number of items: 16777216
Expected number with n items: 6171992 6171992 3085996 1028665 257166 51433 8572 1224 153
Actual number mix():  6170866 6174079 3085912 1027140 257808 51385 8638 1219 146 23
Actual number mix()/final():  6172058 6171991 3086279 1027916 257535 51465 8554 1243 149 23 3

Number of buckets: 67108864
Total number of items: 67108864
Expected number with n items: 24687971 24687971 12343985 4114661 1028665 205733 34288 4898 612
Actual number mix():  24686110 24690897 12344232 4113515 1028232 205682 34546 4942 629 72 7
Actual number mix()/final():  24708515 24669248 12333034 4114796 1034256 208424 34888 5023 598 77 5

Here is a second run with number of items = (number of buckets)/2:
Number of buckets: 65536
Total number of items: 32768
Expected number with n items: 39749 19874 4968 828 103 10
Actual 

Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-10 Thread Kenneth Marshall
On Fri, Jan 09, 2009 at 02:00:39PM -0800, Jeff Davis wrote:
 On Fri, 2009-01-09 at 14:29 -0600, Kenneth Marshall wrote:
  Jeff,
  
  Thanks for the review. I would not really expect any differences in hash
  index build times other than normal noise variances. The most definitive
  benchmark that I have seen was done with my original patch submission
  in March posted by Luke of Greenplum:
  
We just applied this and saw a 5 percent speedup on a hash aggregation
 query with four columns in a 'group by' clause run against a single
 TPC-H table.
  
  I wonder if they would be willing to re-run their test? Thanks again.
 
 Separating mix() and final() should have some performance benefit,
 right?
 
 Regards,
   Jeff Davis
 
 
Yes, it does, but the results can be swamped by other latencies in the
code path. Tests such as Tom's benchmark of the underlying functions are
needed to isolate the timings effectively, or a benchmark like
Greenplum's that will benefit from a more efficient function.

Ken



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-10 Thread Jeff Davis
On Sat, 2009-01-10 at 11:06 -0600, Kenneth Marshall wrote:
  Separating mix() and final() should have some performance benefit,
  right?
  
 Yes, it does but the results can be swamped by other latencies in the
 code path. Tests such as Tom's benchmark of the underlying functions is
 needed to isolate the timings effectively or a benchmark like Greenplum's
 that will benefit from a more efficient function.
 

Ok. I isolated the function itself by just doing:

-- 10 million rows of random()::text
EXPLAIN ANALYZE SELECT hashtext(t) FROM randomtext;

I ran 5 times on both old and new code, eliminating the top and bottom
and taking the average of the remaining 3, and I got a 6.9% performance
improvement with the new code.

I tried quickly with a few other data types and got similar results.
It's obviously a small microbenchmark, but that's good enough for me. 

Thanks!

Regards,
Jeff Davis




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-10 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 I ran 5 times on both old and new code, eliminating the top and bottom
 and taking the average of the remaining 3, and I got a 6.9% performance
 improvement with the new code.

The question that has been carefully evaded throughout the discussion
of this patch is whether the randomness of the hash result is decreased,
and if so what is that likely to cost us in performance of the various
hash-dependent algorithms.  I would still like to have an answer to that
before we make a change to gain marginal performance improvement in
the hash function itself (which is already shown to be barely measurable
in the total context of a hash-dependent operation...)

regards, tom lane



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-10 Thread Jeff Davis
On Sat, 2009-01-10 at 13:36 -0500, Tom Lane wrote:
 Jeff Davis pg...@j-davis.com writes:
  I ran 5 times on both old and new code, eliminating the top and bottom
  and taking the average of the remaining 3, and I got a 6.9% performance
  improvement with the new code.
 
 The question that has been carefully evaded throughout the discussion
 of this patch is whether the randomness of the hash result is decreased,
 and if so what is that likely to cost us in performance of the various
 hash-dependent algorithms.  I would still like to have an answer to that
 before we make a change to gain marginal performance improvement in
 the hash function itself (which is already shown to be barely measurable
 in the total context of a hash-dependent operation...)
 

In:
http://archives.postgresql.org/message-id/20081104202655.gp18...@it.is.rice.edu

Ken showed that the number of hash collisions is the same in the old and
the new code for his test data. Not a direct measurement of randomness,
but it's some kind of indication.

I'm not an authority on either hash functions or statistics, so I'll
have to defer to Ken or someone else to actually prove that the
randomness is maintained.

Regards,
Jeff Davis




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-10 Thread Gregory Stark
Tom Lane t...@sss.pgh.pa.us writes:

 Jeff Davis pg...@j-davis.com writes:
 I ran 5 times on both old and new code, eliminating the top and bottom
 and taking the average of the remaining 3, and I got a 6.9% performance
 improvement with the new code.

 The question that has been carefully evaded throughout the discussion
 of this patch is whether the randomness of the hash result is decreased,

In fairness that doesn't seem to be the case. The original patch was posted
with such an analysis using cracklib and raw binary data:

http://article.gmane.org/gmane.comp.db.postgresql.devel.general/105675

  marginal performance improvement in the hash function itself (which is
 already shown to be barely measurable in the total context of a
 hash-dependent operation...)

If it's a 6% gain in the speed of Hash Join or HashAggregate it would be very
interesting. But I gather it's a 6% speedup in the time spent actually in the
hash function. Is that really where much of our time is going? If it's 10% of
the total time to execute one of these nodes then we're talking about a 0.6%
optimization...
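A quick back-of-the-envelope check of Greg's arithmetic (the 10% share of
node runtime is his hypothetical figure, not a measurement):

```python
# Amdahl's-law sanity check of the estimate above.
f = 0.10   # fraction of total node time spent in the hash function (assumed)
s = 1.06   # speedup of the hash function itself (~6%)
overall = 1.0 / ((1.0 - f) + f / s)
print(f"overall speedup: {(overall - 1.0) * 100:.2f}%")  # ~0.57%
```

So even under a generous 10% assumption, the node-level gain stays well
under 1%.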

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-10 Thread Jeff Davis
On Sat, 2009-01-10 at 13:57 -0500, Gregory Stark wrote:
 But I gather it's a 6% speedup in the time spent actually in the
 hash function.

That's correct. I was not able to detect any difference at all between
the old and new code using HashAggregate, which is where most of my
testing effort went:

http://archives.postgresql.org/message-id/1231531455.25019.75.ca...@jdavis

(see test results attached to that message, if you're interested)

I tested with two data distributions and 6 data types.

The 6-10% speedup I saw was a super-micro benchmark testing the hash
function, and I didn't spend much time with that.

Regards,
Jeff Davis




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-10 Thread Kenneth Marshall
On Sat, Jan 10, 2009 at 10:56:15AM -0800, Jeff Davis wrote:
 On Sat, 2009-01-10 at 13:36 -0500, Tom Lane wrote:
  Jeff Davis pg...@j-davis.com writes:
   I ran 5 times on both old and new code, eliminating the top and bottom
   and taking the average of the remaining 3, and I got a 6.9% performance
   improvement with the new code.
  
  The question that has been carefully evaded throughout the discussion
  of this patch is whether the randomness of the hash result is decreased,
  and if so what is that likely to cost us in performance of the various
  hash-dependent algorithms.  I would still like to have an answer to that
  before we make a change to gain marginal performance improvement in
  the hash function itself (which is already shown to be barely measurable
  in the total context of a hash-dependent operation...)
  
 
 In:
 http://archives.postgresql.org/message-id/20081104202655.gp18...@it.is.rice.edu
 
 Ken showed that the number of hash collisions is the same in the old and
 the new code for his test data. Not a direct measurement of randomness,
 but it's some kind of indication.
 
 I'm not an authority on either hash functions or statistics, so I'll
 have to defer to Ken or someone else to actually prove that the
 randomness is maintained.
 
 Regards,
   Jeff Davis
 
First, while I am not an expert by any means, basic statistics will give you the
probability of a collision when packing N items into M buckets chosen at random.
In all of my tests, I used 1.6M items into 2^32 buckets. If the bit mixing is
complete and random, the number of collisions should be close to the same no
matter what arrangement of bits are used for inputs. I think my test cases
indicate quite clearly that the new hash function is as independent of bit-
order as the older functions, in that the number of bucket collisions is
basically the same for all layouts. I have the test harness and can test
any other input that you would like me to. Second, the author of the basic
hash function (old and new) has performed more extensive testing and has
shown that the new functions are better in randomizing bits than his original
function. Tomorrow I will try to run a micro-benchmark comparing my
original March patch with the incremental approach taken in my November
patch.
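As a sketch of the statistics Ken refers to, the expected number of 32-bit
collisions for his test size can be computed directly (my own arithmetic,
not the thread's test harness):

```python
import math

# Expected 32-bit hash collisions for N uniformly hashed keys, using
# Ken's test size (1,648,379 cracklib words into 2^32 buckets).
N = 1_648_379          # distinct input keys
M = 2 ** 32            # number of possible 32-bit hash values

# E[collisions] = N - E[occupied buckets] = N - M*(1 - (1 - 1/M)**N),
# computed with log1p/expm1 to avoid cancellation.
occupied = -M * math.expm1(N * math.log1p(-1.0 / M))
exact = N - occupied

approx = N * N / (2 * M)   # usual birthday approximation for N << M

print(round(exact), round(approx))   # ~316 either way
```

Both figures land near 316, which matches the 287-350 range in the
collision tables posted earlier, i.e. the observed counts are consistent
with a uniformly random hash.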

Cheers,
Ken






Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-10 Thread Kenneth Marshall
On Sat, Jan 10, 2009 at 01:57:27PM -0500, Gregory Stark wrote:
 Tom Lane t...@sss.pgh.pa.us writes:
 
  Jeff Davis pg...@j-davis.com writes:
  I ran 5 times on both old and new code, eliminating the top and bottom
  and taking the average of the remaining 3, and I got a 6.9% performance
  improvement with the new code.
 
  The question that has been carefully evaded throughout the discussion
  of this patch is whether the randomness of the hash result is decreased,
 
 In fairness that doesn't seem to be the case. The original patch was posted
 with such an analysis using cracklib and raw binary data:
 
 http://article.gmane.org/gmane.comp.db.postgresql.devel.general/105675
 
   marginal performance improvement in the hash function itself (which is
  already shown to be barely measurable in the total context of a
  hash-dependent operation...)
 
 If it's a 6% gain in the speed of Hash Join or HashAggregate it would be very
 interesting. But I gather it's a 6% speedup in the time spent actually in the
 hash function. Is that really where much of our time is going? If it's 10% of
 the total time to execute one of these nodes then we're talking about a 0.6%
 optimization...
 

The Greenplum test did show a 5% increase in performance with the replacement
functions in March.

Regards,
Ken



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-09 Thread Jeff Davis
On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote: 
 Dear PostgreSQL developers,
 
 I am re-sending this to keep this last change to the
 internal hash function on the radar.
 

Hi Ken,

A few comments:

1. New patch with very minor changes attached.

2. I reverted the change you made to indices.sgml. We still don't use
WAL for hash indexes, and in my opinion we should continue to discourage
their use until we do use WAL. We can add back in the comment that hash
indexes are suitable for large keys if we have some results to show
that.

3. There was a regression test failure in union.sql because the ordering
of the results was different. I updated the regression test.

4. Hash functions affect a lot more than hash indexes, so I ran through
a variety of tests that use a HashAggregate plan. Test setup and results
are attached. These results show no difference between the old and the
new code (about 0.1% better).

5. The hash index build time shows some improvement. The new code won in
every instance in which there were a lot of duplicates in the table
(100 distinct values, 50K of each) by around 5%.

The new code appeared to be the same or slightly worse in the case of
hash index builds with few duplicates (100 distinct values, 5 of
each). The difference was about 1% worse, which is probably just noise.

Note: I'm no expert on hash functions. Take all of my tests with a grain
of salt.

I would feel a little better if I saw at least one test that showed
better performance of the new code on a reasonable-looking distribution
of data. The hash index build that you showed only took a second or two
-- it would be nice to see a test that lasted at least a minute.

Regards,
Jeff Davis




test_results.tar.gz
Description: application/compressed-tar
diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index 96d5643..8a236b5 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -200,39 +200,94 @@ hashvarlena(PG_FUNCTION_ARGS)
  * hash function, see http://burtleburtle.net/bob/hash/doobs.html,
  * or Bob's article in Dr. Dobb's Journal, Sept. 1997.
  *
- * In the current code, we have adopted an idea from Bob's 2006 update
- * of his hash function, which is to fetch the data a word at a time when
- * it is suitably aligned.  This makes for a useful speedup, at the cost
- * of having to maintain four code paths (aligned vs unaligned, and
- * little-endian vs big-endian).  Note that we have NOT adopted his newer
- * mix() function, which is faster but may sacrifice some randomness.
+ * In the current code, we have adopted Bob's 2006 update of his hash
+ * function, which fetches the data a word at a time when it is suitably
+ * aligned.  This makes for a useful speedup, at the cost of having to
+ * maintain four code paths (aligned vs unaligned, and little-endian vs
+ * big-endian).  It also uses two separate mixing functions, mix() and
+ * final(), instead of a single slower multi-purpose function.
  */
 
 /* Get a bit mask of the bits set in non-uint32 aligned addresses */
 #define UINT32_ALIGN_MASK (sizeof(uint32) - 1)
+#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
 
 /*--
  * mix -- mix 3 32-bit values reversibly.
- * For every delta with one or two bits set, and the deltas of all three
- * high bits or all three low bits, whether the original value of a,b,c
- * is almost all zero or is uniformly distributed,
- * - If mix() is run forward or backward, at least 32 bits in a,b,c
- *	 have at least 1/4 probability of changing.
- * - If mix() is run forward, every bit of c will change between 1/3 and
- *	 2/3 of the time.  (Well, 22/100 and 78/100 for some 2-bit deltas.)
+ *
+ * This is reversible, so any information in (a,b,c) before mix() is
+ * still in (a,b,c) after mix().
+ *
+ * If four pairs of (a,b,c) inputs are run through mix(), or through
+ * mix() in reverse, there are at least 32 bits of the output that
+ * are sometimes the same for one pair and different for another pair.
+ * This was tested for:
+ * * pairs that differed by one bit, by two bits, in any combination
+ *   of top bits of (a,b,c), or in any combination of bottom bits of
+ *   (a,b,c).
+ * * differ is defined as +, -, ^, or ~^.  For + and -, I transformed
+ *   the output delta to a Gray code (a^(a>>1)) so a string of 1's (as
+ *   is commonly produced by subtraction) look like a single 1-bit
+ *   difference.
+ * * the base values were pseudorandom, all zero but one bit set, or
+ *   all zero plus a counter that starts at zero.
+ * 
+ * This does not achieve avalanche.  There are input bits of (a,b,c)
+ * that fail to affect some output bits of (a,b,c), especially of a.  The
+ * most thoroughly mixed value is c, but it doesn't really even achieve
+ * avalanche in c. 
+ * 
+ * This allows some parallelism.  Read-after-writes are good at doubling
+ * the number of bits affected, so the goal of mixing pulls in the opposite
+ * direction as the goal of parallelism.  I did what 
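For readers without lookup3.c at hand, the rot()/mix()/final() trio the
comment above describes can be sketched in Python as follows. The rotation
constants are those of Bob Jenkins' published lookup3.c; this is an
illustration of the mixing scheme, not the patched PostgreSQL code itself:

```python
MASK = 0xFFFFFFFF  # keep all arithmetic in 32 bits, as C uint32 would

def rot(x, k):
    """32-bit left rotation, the rot() macro from lookup3.c."""
    return ((x << k) | (x >> (32 - k))) & MASK

def mix(a, b, c):
    """Reversible mixing step (rotation constants from lookup3.c)."""
    a = (a - c) & MASK; a ^= rot(c, 4);  c = (c + b) & MASK
    b = (b - a) & MASK; b ^= rot(a, 6);  a = (a + c) & MASK
    c = (c - b) & MASK; c ^= rot(b, 8);  b = (b + a) & MASK
    a = (a - c) & MASK; a ^= rot(c, 16); c = (c + b) & MASK
    b = (b - a) & MASK; b ^= rot(a, 19); a = (a + c) & MASK
    c = (c - b) & MASK; c ^= rot(b, 4);  b = (b + a) & MASK
    return a, b, c

def final(a, b, c):
    """Final mixing: drives avalanche into c (constants from lookup3.c)."""
    c ^= b; c = (c - rot(b, 14)) & MASK
    a ^= c; a = (a - rot(c, 11)) & MASK
    b ^= a; b = (b - rot(a, 25)) & MASK
    c ^= b; c = (c - rot(b, 16)) & MASK
    a ^= c; a = (a - rot(c, 4)) & MASK
    b ^= a; b = (b - rot(a, 14)) & MASK
    c ^= b; c = (c - rot(b, 24)) & MASK
    return a, b, c
```

The split lets mix() stay cheap per 12-byte block while final() does the
heavier avalanche work once, at the end, which is where the speedup
discussed in this thread comes from.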

Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-09 Thread Kenneth Marshall
On Fri, Jan 09, 2009 at 12:04:15PM -0800, Jeff Davis wrote:
 On Mon, 2008-12-22 at 13:47 -0600, Kenneth Marshall wrote: 
  Dear PostgreSQL developers,
  
  I am re-sending this to keep this last change to the
  internal hash function on the radar.
  
 
 Hi Ken,
 
 A few comments:
 
 1. New patch with very minor changes attached.
 
 2. I reverted the change you made to indices.sgml. We still don't use
 WAL for hash indexes, and in my opinion we should continue to discourage
 their use until we do use WAL. We can add back in the comment that hash
 indexes are suitable for large keys if we have some results to show
 that.
 
 3. There was a regression test failure in union.sql because the ordering
 of the results was different. I updated the regression test.
 
 4. Hash functions affect a lot more than hash indexes, so I ran through
 a variety of tests that use a HashAggregate plan. Test setup and results
 are attached. These results show no difference between the old and the
 new code (about 0.1% better).
 
 5. The hash index build time shows some improvement. The new code won in
 every instance in which there were a lot of duplicates in the table
 (100 distinct values, 50K of each) by around 5%.
 
 The new code appeared to be the same or slightly worse in the case of
 hash index builds with few duplicates (100 distinct values, 5 of
 each). The difference was about 1% worse, which is probably just noise.
 
 Note: I'm no expert on hash functions. Take all of my tests with a grain
 of salt.
 
 I would feel a little better if I saw at least one test that showed
 better performance of the new code on a reasonable-looking distribution
 of data. The hash index build that you showed only took a second or two
 -- it would be nice to see a test that lasted at least a minute.
 
 Regards,
   Jeff Davis
 
 

Jeff,

Thanks for the review. I would not really expect any differences in hash
index build times other than normal noise variances. The most definitive
benchmark that I have seen was done with my original patch submission
in March posted by Luke of Greenplum:

  We just applied this and saw a 5 percent speedup on a hash aggregation
   query with four columns in a 'group by' clause run against a single
   TPC-H table.

I wonder if they would be willing to re-run their test? Thanks again.

Regards,
Ken




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2009-01-09 Thread Jeff Davis
On Fri, 2009-01-09 at 14:29 -0600, Kenneth Marshall wrote:
 Jeff,
 
 Thanks for the review. I would not really expect any differences in hash
 index build times other than normal noise variances. The most definitive
 benchmark that I have seen was done with my original patch submission
 in March posted by Luke of Greenplum:
 
   We just applied this and saw a 5 percent speedup on a hash aggregation
query with four columns in a 'group by' clause run against a single
TPC-H table.
 
 I wonder if they would be willing to re-run their test? Thanks again.

Separating mix() and final() should have some performance benefit,
right?

Regards,
Jeff Davis




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2008-12-22 Thread Kenneth Marshall
Dear PostgreSQL developers,

I am re-sending this to keep this last change to the
internal hash function on the radar.

Ken

Sorry about the delay for this update to the new hash
index implementation. I was trying to get the WAL logging
in place and forgot to post the actual patch. The WAL
for hash indexes will need to wait for 8.5, but I did
want to add back in the piece of the Bob Jenkins 2006
hash function that was stripped out of the initial
patch on application due to concerns about the randomness
of the resulting hash values. Here is a re-post of my
initial findings comparing the old/new Jenkins hash
from lookup2 and lookup3. I have added a third column
containing the results for the hash_any() resulting
from the attached patch as well as simple timing test
for a DB reindex both before and after patching.

Also attached is a simple documentation patch updating
the note attached to the hash index description.

Regards,
Ken

Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash input                              old   new   newv2
----------                              ---   ---   -----
cracklib                                338   316   338
cracklib x 2 (i.e. clibclib)            305   319   300
cracklib x 3 (clibclibclib)             323   329   315
cracklib x 10                           302   310   329
cracklib x 100                          350   335   298
cracklib x 1000                         314   309   315
cracklib x 100 truncated to char(100)   311   327   320

uint32 from 1-1648379                   309   319   347
(uint32 1-1948379)*256                  309   314   304
(uint32 1-1948379)*16                   310   314   324
auint32 (i.e. a1,a0002...)              320   321   312

uint32uint32 (i.e. uint64)              321   287   309

The different result columns are old = Jenkins 1996 hash
function(lookup2.c), new = Jenkins 2006 hash function
(lookup3.c), and newv2 = adaptation of current hash_any()
to incorporate the separate mix()/final() functions. As
you can see from the results, spliting the mix() and final()
apart does not result in any perceptible loss of randomness
in the hash assignment. I also ran a crude timing for a
reindex of the following database:

CREATE TABLE dict (word text);
CREATE INDEX wordhash ON dict USING hash (word);
INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
INSERT INTO dict SELECT MAX(word)||MAX(word) FROM dict;
... (21 times)

REINDEX TABLE
...

The average time to reindex the table was 1696ms using our current
hash_any() without the separate mix()/final() and 1482ms with the
separate mix()/final() stages, giving almost 13% better performance
for this stupid metric.
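The quoted timings work out as follows (my own arithmetic, added for
clarity about which ratio the "almost 13%" refers to):

```python
# Reindex timings from the message above: 1696 ms without the split
# mix()/final(), 1482 ms with it.
old_ms, new_ms = 1696.0, 1482.0
reduction = (old_ms - new_ms) / old_ms * 100   # % of old runtime saved
speedup = (old_ms / new_ms - 1.0) * 100        # % throughput gain
print(f"{reduction:.1f}% less time, {speedup:.1f}% faster")  # 12.6%, 14.4%
```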
--- indices.sgml	2008-10-13 14:40:06.000000000 -0500
+++ indices.sgml.NEW	2008-11-04 12:42:35.000000000 -0600
@@ -190,13 +190,11 @@
 
   <note>
    <para>
-    Testing has shown <productname>PostgreSQL</productname>'s hash
-    indexes to perform no better than B-tree indexes, and the
-    index size and build time for hash indexes is much worse.
-    Furthermore, hash index operations are not presently WAL-logged,
+    <productname>PostgreSQL</productname>'s hash indexes provide
+    fast O(1) lookups, even for very large objects.
+    Hash index operations are not presently WAL-logged,
     so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash.
-    For these reasons, hash index use is presently discouraged.
+    after a database crash.
    </para>
   </note>
 
--- hashfunc.c.ORIG 2008-09-03 13:07:14.0 -0500
+++ hashfunc.c.NEW  2008-11-04 08:36:16.0 -0600
@@ -200,39 +200,94 @@
  * hash function, see http://burtleburtle.net/bob/hash/doobs.html,
  * or Bob's article in Dr. Dobb's Journal, Sept. 1997.
  *
- * In the current code, we have adopted an idea from Bob's 2006 update
- * of his hash function, which is to fetch the data a word at a time when
- * it is suitably aligned.  This makes for a useful speedup, at the cost
- * of having to maintain four code paths (aligned vs unaligned, and
- * little-endian vs big-endian).  Note that we have NOT adopted his newer
- * mix() function, which is faster but may sacrifice some randomness.
+ * In the current code, we have adopted Bob's 2006 update of his hash
+ * function, which fetches the data a word at a time when it is suitably
+ * aligned.  This makes for a useful speedup, at the cost of having to
+ * maintain four code paths (aligned vs unaligned, and little-endian vs
+ * big-endian).  It also uses two separate mixing functions 

Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2008-12-22 Thread Alvaro Herrera
Kenneth Marshall wrote:
 Dear PostgreSQL developers,
 
 I am re-sending this to keep this last change to the
 internal hash function on the radar.

Please add it to the commitfest wiki page,
http://wiki.postgresql.org/wiki/CommitFest_2008-11

Thanks

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2008-11-04 Thread Oleg Bartunov

Just interested if you repeat your tests not with cracklib-dict,
but using 8-bit words. From our experience we found many hash functions
are optimized for 7-bit words and produce too many collisions
for 8-bit words. That's why we use crc32.

Oleg
On Tue, 4 Nov 2008, Kenneth Marshall wrote:


Sorry about the delay for this update to the new hash
index implementation. I was trying to get the WAL logging
in place and forgot to post the actual patch. The WAL
for hash indexes will need to wait for 8.5, but I did
want to add back in the piece of the Bob Jenkins 2006
hash function that was stripped out of the initial
patch on application due to concerns about the randomness
of the resulting hash values. Here is a re-post of my
initial findings comparing the old/new Jenkins hash
from lookup2 and lookup3. I have added a third column
containing the results for the hash_any() resulting
from the attached patch as well as simple timing test
for a DB reindex both before and after patching.

Also attached is a simple documentation patch updating
the note attached to the hash index description.

Regards,
Ken

Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash input                              old   new   newv2
----------                              ---   ---   -----
cracklib                                338   316   338
cracklib x 2 (i.e. clibclib)            305   319   300
cracklib x 3 (clibclibclib)             323   329   315
cracklib x 10                           302   310   329
cracklib x 100                          350   335   298
cracklib x 1000                         314   309   315
cracklib x 100 truncated to char(100)   311   327   320

uint32 from 1-1648379                   309   319   347
(uint32 1-1948379)*256                  309   314   304
(uint32 1-1948379)*16                   310   314   324
auint32 (i.e. a1,a0002...)              320   321   312

uint32uint32 (i.e. uint64)              321   287   309

The different result columns are old = Jenkins 1996 hash
function(lookup2.c), new = Jenkins 2006 hash function
(lookup3.c), and newv2 = adaptation of current hash_any()
to incorporate the separate mix()/final() functions. As
you can see from the results, splitting the mix() and final()
apart does not result in any perceptible loss of randomness
in the hash assignment. I also ran a crude timing for a
reindex of the following database:

CREATE TABLE dict (word text);
CREATE INDEX wordhash ON dict USING hash (word);
INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
INSERT INTO dict SELECT MAX(word)||MAX(word) FROM dict;
... (21 times)

REINDEX TABLE
...

The average time to reindex the table using our current
hash_any() without the separate mix()/final() was 1696ms
and 1482ms with the separate mix()/final() stages giving
almost 13% better performance for this stupid metric.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2008-11-04 Thread Kenneth Marshall
On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote:
 Just interested if you repeat your tests not with cracklib-dict,
 but using 8-bit words. From our experience we found many hash functions
 are optimized for 7-bit words and produce too many collisions
 for 8-bit words. That's why we use crc32.

 Oleg

I think that the lines:

uint32 from 1-1648379                   309   319   347
(uint32 1-1948379)*256                  309   314   304
(uint32 1-1948379)*16                   310   314   324
auint32 (i.e. a1,a0002...)              320   321   312

uint32uint32 (i.e. uint64)              321   287   309

can count as 8-bit words if taken a byte at a time. In fact
that is how hash_any() treats them, as a character string
and a length. One of the design goals of the original 1997
hash function in lookup2 and the 2006 update in lookup3 is
to support keys of arbitrary arrangements of bits. I can run
any additional checks that you want since the test harness
is perl with Inline::C. If you are using crc32, Bob's article in
Dr. Dobb's shows that CRC has a 2-into-1 funnel-15 and an
11-into-10 funnel-100 unless you are using a generalized
CRC. Also, unless you can inline your CRC, the Jenkins lookup3
hash costs about 5n+20 where CRC costs 9n+3.
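Ken's Perl/Inline::C harness is not included in the thread, but the
collision counting it performs can be sketched like this; zlib.crc32 and a
truncated SHA-256 are stand-ins for the lookup2/lookup3 functions actually
under test, not the thread's code:

```python
import hashlib
import zlib
from collections import Counter

def count_collisions(keys, hashfn):
    """Extra keys landing in already-occupied 32-bit buckets."""
    buckets = Counter(hashfn(k) for k in keys)
    return sum(n - 1 for n in buckets.values() if n > 1)

# Stand-in 32-bit hash functions (NOT lookup2/lookup3):
crc = lambda k: zlib.crc32(k) & 0xFFFFFFFF
sha32 = lambda k: int.from_bytes(hashlib.sha256(k).digest()[:4], "big")

keys = [b"word%07d" % i for i in range(100_000)]
print(count_collisions(keys, crc), count_collisions(keys, sha32))
```

Swapping in any candidate hash for the lambdas reproduces the kind of
old/new/newv2 comparison shown in the collision tables.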

Regards,
Ken

 On Tue, 4 Nov 2008, Kenneth Marshall wrote:

 Sorry about the delay for this update to the new hash
 index implementation. I was trying to get the WAL logging
 in place and forgot to post the actual patch. The WAL
 for hash indexes will need to wait for 8.5, but I did
 want to add back in the piece of the Bob Jenkins 2006
 hash function that was stripped out of the initial
 patch on application due to concerns about the randomness
 of the resulting hash values. Here is a re-post of my
 initial findings comparing the old/new Jenkins hash
 from lookup2 and lookup3. I have added a third column
 containing the results for the hash_any() resulting
 from the attached patch as well as simple timing test
 for a DB reindex both before and after patching.

 Also attached is a simple documentation patch updating
 the note attached to the hash index description.

 Regards,
 Ken
 
 Hi,

 I have finally had a chance to do some investigation on
 the performance of the old hash mix() function versus
 the updated mix()/final() in the new hash function. Here
 is a table of my current results for both the old and the
 new hash function. In this case cracklib refers to the
 cracklib-dict containing 1648379 unique words massaged
 in various ways to generate input strings for the hash
 functions. The result is the number of collisions in the
 hash values generated.

 hash input                              old   new   newv2
 ----------                              ---   ---   -----
 cracklib                                338   316   338
 cracklib x 2 (i.e. clibclib)            305   319   300
 cracklib x 3 (clibclibclib)             323   329   315
 cracklib x 10                           302   310   329
 cracklib x 100                          350   335   298
 cracklib x 1000                         314   309   315
 cracklib x 100 truncated to char(100)   311   327   320

 uint32 from 1-1648379                   309   319   347
 (uint32 1-1948379)*256                  309   314   304
 (uint32 1-1948379)*16                   310   314   324
 auint32 (i.e. a1,a0002...)              320   321   312

 uint32uint32 (i.e. uint64)              321   287   309

 The different result columns are old = Jenkins 1996 hash
 function(lookup2.c), new = Jenkins 2006 hash function
 (lookup3.c), and newv2 = adaptation of current hash_any()
 to incorporate the separate mix()/final() functions. As
 you can see from the results, splitting the mix() and final()
 apart does not result in any perceptible loss of randomness
 in the hash assignment. I also ran a crude timing for a
 reindex of the following database:

 CREATE TABLE dict (word text);
 CREATE INDEX wordhash ON dict USING hash (word);
 INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
 INSERT INTO dict SELECT MAX(word)||MAX(word) FROM dict;
 ... (21 times)

 REINDEX TABLE
 ...

 The average time to reindex the table using our current
 hash_any() without the separate mix()/final() was 1696ms
 and 1482ms with the separate mix()/final() stages giving
 almost 13% better performance for this stupid metric.


   Regards,
   Oleg
 _
 Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
 Sternberg Astronomical Institute, Moscow University, Russia
 Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
 phone: +007(495)939-16-83, +007(495)939-23-83




Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2008-11-04 Thread Kenneth Marshall
Oleg,

Here is a little more information on the use of CRC32 as
a hash function, with some warning caveats:

http://home.comcast.net/~bretm/hash/8.html

Regards,
Ken

On Tue, Nov 04, 2008 at 03:15:44PM -0600, Kenneth Marshall wrote:
 On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote:
  Just interested if you repeat your tests not with cracklib-dict,
  but using 8-bit words. From our experience we found many hash functions
  are optimized for 7-bit words and produce too many collisions
  for 8-bit words. That's why we use crc32.
 
  Oleg
 
 I think that the lines:
 
 uint32 from 1-1648379                   309   319   347
 (uint32 1-1948379)*256                  309   314   304
 (uint32 1-1948379)*16                   310   314   324
 auint32 (i.e. a1,a0002...)              320   321   312

 uint32uint32 (i.e. uint64)              321   287   309
 
 can count as 8-bit words if taken a byte at a time. In fact
 that is how hash_any() treats them, as a character string
 and a length. One of the design goals of the original 1997
 hash function in lookup2 and the 2006 update in lookup3 is
 to support keys of arbitrary arrangements of bits. I can run
 any additional checks that you want since the test harness
 is perl with Inline::C. If you are using crc32 his article in
 Dr. Dobbs shows that CRC has a 2 into 1 funnel-15 and an
 11 into 10 funnel-100 unless you are using a generalized
 CRC. Also, unless you can inline your CRC the Jenkins lookup3
 is 5n+20 where CRC is 9n+3.
 
 Regards,
 Ken



Re: [HACKERS] [PATCHES] updated hash functions for postgresql v1

2008-11-04 Thread Kenneth Marshall
Sorry about the delay for this update to the new hash
index implementation. I was trying to get the WAL logging
in place and forgot to post the actual patch. The WAL
for hash indexes will need to wait for 8.5, but I did
want to add back in the piece of the Bob Jenkins 2006
hash function that was stripped out of the initial
patch on application due to concerns about the randomness
of the resulting hash values. Here is a re-post of my
initial findings comparing the old/new Jenkins hash
from lookup2 and lookup3. I have added a third column
containing the results for the hash_any() resulting
from the attached patch as well as simple timing test
for a DB reindex both before and after patching.

Also attached is a simple documentation patch updating
the note attached to the hash index description.

Regards,
Ken

Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash input                              old   new   newv2
----------                              ---   ---   -----
cracklib                                338   316   338
cracklib x 2 (i.e. clibclib)            305   319   300
cracklib x 3 (clibclibclib)             323   329   315
cracklib x 10                           302   310   329
cracklib x 100                          350   335   298
cracklib x 1000                         314   309   315
cracklib x 100 truncated to char(100)   311   327   320

uint32 from 1-1648379                   309   319   347
(uint32 1-1948379)*256                  309   314   304
(uint32 1-1948379)*16                   310   314   324
auint32 (i.e. a1,a0002...)              320   321   312

uint32uint32 (i.e. uint64)              321   287   309

The different result columns are old = Jenkins 1996 hash
function(lookup2.c), new = Jenkins 2006 hash function
(lookup3.c), and newv2 = adaptation of current hash_any()
to incorporate the separate mix()/final() functions. As
you can see from the results, splitting the mix() and final()
apart does not result in any perceptible loss of randomness
in the hash assignment. I also ran a crude timing for a
reindex of the following database:

CREATE TABLE dict (word text);
CREATE INDEX wordhash ON dict USING hash (word);
INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
INSERT INTO dict SELECT max(word) || max(word) FROM dict;
... (21 times)

REINDEX TABLE
...

The average time to reindex the table using our current hash_any()
without the separate mix()/final() was 1696ms, versus 1482ms with the
separate mix()/final() stages, giving almost 13% better performance
for this stupid metric.
--- indices.sgml2008-10-13 14:40:06.0 -0500
+++ indices.sgml.NEW2008-11-04 12:42:35.0 -0600
@@ -190,13 +190,11 @@
 
   <note>
    <para>
-    Testing has shown <productname>PostgreSQL</productname>'s hash
-    indexes to perform no better than B-tree indexes, and the
-    index size and build time for hash indexes is much worse.
-    Furthermore, hash index operations are not presently WAL-logged,
+    <productname>PostgreSQL</productname>'s hash indexes provide
+    fast O(1) lookups, even for very large objects.
+    Hash index operations are not presently WAL-logged,
     so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash.
-    For these reasons, hash index use is presently discouraged.
+    after a database crash.
    </para>
   </note>
 
--- hashfunc.c.ORIG 2008-09-03 13:07:14.0 -0500
+++ hashfunc.c.NEW  2008-11-04 08:36:16.0 -0600
@@ -200,39 +200,94 @@
  * hash function, see http://burtleburtle.net/bob/hash/doobs.html,
  * or Bob's article in Dr. Dobb's Journal, Sept. 1997.
  *
- * In the current code, we have adopted an idea from Bob's 2006 update
- * of his hash function, which is to fetch the data a word at a time when
- * it is suitably aligned.  This makes for a useful speedup, at the cost
- * of having to maintain four code paths (aligned vs unaligned, and
- * little-endian vs big-endian).  Note that we have NOT adopted his newer
- * mix() function, which is faster but may sacrifice some randomness.
+ * In the current code, we have adopted Bob's 2006 update of his hash
+ * function, which fetches the data a word at a time when it is suitably
+ * aligned.  This makes for a useful speedup, at the cost of having to
+ * maintain four code paths (aligned vs unaligned, and little-endian vs
+ * big-endian).  It also uses two separate mixing functions, mix() and
+ * final(), instead of a single multi-purpose function, which is slower.
  */
 
 /* Get a bit mask of the bits set in