I wrote:

> +	# XXX This one neither, but if I add it to @skip, PerfectHash will fail. (???)
> +	#FIXME: AttributeRelationId
>
> I took a quick look at this, and attached is the least invasive way to get it
> working for now, which is to bump the table size slightly. The comment says
> doing this isn't reliable, but it happens to work in this case. Playing around
> with the functions is hit-or-miss, and when that fails, somehow the larger
> table saves the day.
To while away a rainy day, I poked at this a bit more and found the input is pathological with our current methods. Even with a large-ish exhaustive search, the two successes are strange in that they only succeeded by accidentally bumping the table size up to where I got it to work before (77): with multipliers (5, 19), it recognizes that the initial table size (75) is a multiple of 5, so it increases the table size to 76, which is a multiple of 19, so it increases it again to 77 and succeeds. Same with (3, 76): 75 is a multiple of 3, so up to 76, which 76 of course divides, so it bumps to 77 likewise.

Turning the loop into

	a = (a ^ c) * 257;
	b = (b ^ c) * X;

...seems to work very well. In fact, now trying some powers-of-two for X before the primes works most of the time with our inputs, even for some unicode normalization functions, on the first seed iteration. That likely won't make any difference in practice, but it's an interesting demo.

I've attached these two draft ideas as text.

--
John Naylor
EDB: http://www.enterprisedb.com
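To make that accidental cascade concrete, here is a toy sketch (Python for brevity; PerfectHash.pm itself is Perl, and the function name here is made up) of the existing table-size adjustment rule:

```python
def adjust_nverts(nverts, mult1, mult2):
    # The existing rule: keep bumping the table size while it is a
    # multiple of either hash multiplier (nverts == mult counts too,
    # since any number is a multiple of itself).
    while nverts % mult1 == 0 or nverts % mult2 == 0:
        nverts += 1
    return nverts

# Multipliers (5, 19): 75 is a multiple of 5 -> 76, a multiple of 19 -> 77.
print(adjust_nverts(75, 5, 19))  # 77
# Multipliers (3, 76): 75 is a multiple of 3 -> 76, equal to 76 -> 77.
print(adjust_nverts(75, 3, 76))  # 77
```

Both "successful" parameter pairs land on 77 purely through this side channel, not because the multipliers themselves were any good for the input.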
From c72fa746108ed853f0219b10e7368b6e833e5fc5 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sat, 2 Sep 2023 16:25:23 +0700
Subject: [PATCH v301 1/2] Use XOR for combining and do it before mixing

Previously, we mixed first and then combined the input character via
addition. This failed on a small set of OIDs, per report from Peter
Eisentraut.
---
 src/tools/PerfectHash.pm | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index e54905a3ef..0d6826141f 100644
--- a/src/tools/PerfectHash.pm
+++ b/src/tools/PerfectHash.pm
@@ -144,8 +144,8 @@ sub generate_hash_function
 		$f .= sprintf "\t\tunsigned char c = *k++";
 		$f .= sprintf " | 0x20" if $case_fold;    # see comment below
 		$f .= sprintf ";\n\n";
-		$f .= sprintf "\t\ta = a * %d + c;\n", $hash_mult1;
-		$f .= sprintf "\t\tb = b * %d + c;\n", $hash_mult2;
+		$f .= sprintf "\t\ta = (a ^ c) * %d;\n", $hash_mult1;
+		$f .= sprintf "\t\tb = (b ^ c) * %d;\n", $hash_mult2;
 		$f .= sprintf "\t}\n";
 		$f .= sprintf "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash;
 		$f .= sprintf "}\n";
@@ -171,7 +171,7 @@ sub _calc_hash
 	{
 		my $cn = ord($c);
 		$cn |= 0x20 if $case_fold;
-		$result = ($result * $mult + $cn) % 4294967296;
+		$result = (($result ^ $cn) * $mult) % 4294967296;
 	}
 	return $result;
 }
@@ -203,8 +203,8 @@ sub _construct_hash_table
 	my $nverts = 2 * $nedges + 1;    # number of vertices
 
 	# However, it would be very bad if $nverts were exactly equal to either
-	# $hash_mult1 or $hash_mult2: effectively, that hash function would be
-	# sensitive to only the last byte of each key.  Cases where $nverts is a
+	# $hash_mult1 or $hash_mult2: the corresponding hash function would
+	# always have a modulus of zero.  Cases where $nverts is a
 	# multiple of either multiplier likewise lose information.  (But $nverts
 	# can't actually divide them, if they've been intelligently chosen as
 	# primes.)  We can avoid such problems by adjusting the table size.
-- 
2.41.0
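A toy Python illustration of the comment change in this patch (the multiplier 31 and the keys are invented for the demo, and values are kept small enough that the 32-bit truncation of the real code never fires; truncation would perturb these exact identities for long keys):

```python
MULT = 31  # stand-in hash multiplier, chosen only for the demo

def old_mix(key, seed=0):
    # multiply-then-add: modulo MULT, only the last byte survives,
    # since every earlier contribution has been multiplied by MULT
    h = seed
    for c in key.encode():
        h = h * MULT + c
    return h

def new_mix(key, seed=0):
    # XOR-then-multiply: the final operation is a multiply by MULT,
    # so the result modulo MULT is identically zero
    h = seed
    for c in key.encode():
        h = (h ^ c) * MULT
    return h

for key in ("cat", "hat", "mat", "rat"):
    assert old_mix(key) % MULT == ord(key[-1]) % MULT  # all collide on 't'
    assert new_mix(key) % MULT == 0
```

Either way, a table size equal to a multiplier throws away information; the failure mode just changes from "last byte only" to "constant zero", which is what the reworded comment records.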
From a407ee7138c597403be5e0521416e4bf5a9c2e46 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sat, 2 Sep 2023 16:07:53 +0700
Subject: [PATCH v301 2/2] Use powers of two when choosing multipliers for
 perfect hash functions

---
 src/tools/PerfectHash.pm | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index 0d6826141f..dc5a9b4427 100644
--- a/src/tools/PerfectHash.pm
+++ b/src/tools/PerfectHash.pm
@@ -77,8 +77,10 @@ sub generate_hash_function
 	$case_fold = $options{case_fold} || 0;
 
 	# Try different hash function parameters until we find a set that works
-	# for these keys.  The multipliers are chosen to be primes that are cheap
-	# to calculate via shift-and-add, so don't change them without care.
+	# for these keys.  The multipliers are chosen to be numbers that are cheap
+	# to calculate, so don't change them without care.
+	# Currently they are either powers-of-two (which reduce to a shift),
+	# or adjacent primes (which reduce to shift-and-add).
 	# (Commonly, random seeds are tried, but we want reproducible results
 	# from this program so we don't do that.)
 	my $hash_mult1 = 257;
@@ -89,12 +91,12 @@ sub generate_hash_function
 
   FIND_PARAMS:
 	for ($hash_seed1 = 0; $hash_seed1 < 10; $hash_seed1++)
 	{
-		for ($hash_seed2 = 0; $hash_seed2 < 10; $hash_seed2++)
 		{
-			foreach (17, 31, 127, 8191)
+			foreach (16, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 17, 31, 127, 8191)
 			{
 				$hash_mult2 = $_;    # "foreach $hash_mult2" doesn't work
+
 				@subresult = _construct_hash_table(
 					$keys_ref, $hash_mult1, $hash_mult2,
 					$hash_seed1, $hash_seed2);
@@ -205,11 +207,12 @@ sub _construct_hash_table
 	# However, it would be very bad if $nverts were exactly equal to either
 	# $hash_mult1 or $hash_mult2: the corresponding hash function would
 	# always have a modulus of zero.  Cases where $nverts is a
-	# multiple of either multiplier likewise lose information.  (But $nverts
-	# can't actually divide them, if they've been intelligently chosen as
-	# primes.)  We can avoid such problems by adjusting the table size.
+	# multiple or divisor of either multiplier likewise lose information.
+	# We can avoid such problems by adjusting the table size.
 	while ($nverts % $hash_mult1 == 0
-		   || $nverts % $hash_mult2 == 0)
+		   || $nverts % $hash_mult2 == 0
+		   || $hash_mult1 % $nverts == 0
+		   || $hash_mult2 % $nverts == 0)
 	{
 		$nverts++;
 	}
-- 
2.41.0