I wrote:

> +	# XXX This one neither, but if I add it to @skip, PerfectHash will fail. (???)
> +	#FIXME: AttributeRelationId
>
> I took a quick look at this, and attached is the least invasive way to get it
> working for now, which is to bump the table size slightly. The comment says
> doing this isn't reliable, but it happens to work in this case. Playing around
> with the functions is hit-or-miss, and when that fails, somehow the larger
> table saves the day.
To while away a rainy day, I poked at this a bit more and found the input is pathological with our current methods. Even with a large-ish exhaustive search, the two successes are strange in that they only succeeded by accidentally bumping the table size up to where I got it to work before (77): with multipliers (5, 19), it recognizes that the initial table size (75) is a multiple of 5, so it increases the table size to 76, which is a multiple of 19, so it increases it again to 77 and succeeds. Same with (3, 76): 75 is a multiple of 3, so up to 76, which 76 of course divides, so it bumps to 77 likewise.

Turning the loop into

	a = (a ^ c) * 257;
	b = (b ^ c) * X;

...seems to work very well. In fact, now trying some powers-of-two for X before the primes works most of the time with our inputs, even for some unicode normalization functions, on the first seed iteration. That likely won't make any difference in practice, but it's an interesting demo.

I've attached these two draft ideas as text.

--
John Naylor
EDB: http://www.enterprisedb.com
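To make that accidental cascade concrete, here is a toy sketch (Python for brevity; PerfectHash.pm itself is Perl, and the function name here is made up) of the existing table-size adjustment rule:

```python
def adjust_nverts(nverts, mult1, mult2):
    # The existing rule: keep bumping the table size while it is a
    # multiple of either hash multiplier (nverts == mult counts too,
    # since any number is a multiple of itself).
    while nverts % mult1 == 0 or nverts % mult2 == 0:
        nverts += 1
    return nverts

# Multipliers (5, 19): 75 is a multiple of 5 -> 76, a multiple of 19 -> 77.
print(adjust_nverts(75, 5, 19))  # 77
# Multipliers (3, 76): 75 is a multiple of 3 -> 76, equal to 76 -> 77.
print(adjust_nverts(75, 3, 76))  # 77
```

Both "successful" parameter pairs land on 77 purely through this side channel, not because the multipliers themselves were any good for the input.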
From c72fa746108ed853f0219b10e7368b6e833e5fc5 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sat, 2 Sep 2023 16:25:23 +0700
Subject: [PATCH v301 1/2] Use XOR for combining and do it before mixing

Previously, we mixed first and then combined the input character via
addition. This failed on a small set of OIDs, per report from Peter
Eisentraut.
---
 src/tools/PerfectHash.pm | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index e54905a3ef..0d6826141f 100644
--- a/src/tools/PerfectHash.pm
+++ b/src/tools/PerfectHash.pm
@@ -144,8 +144,8 @@ sub generate_hash_function
 		$f .= sprintf "\t\tunsigned char c = *k++";
 		$f .= sprintf " | 0x20" if $case_fold;    # see comment below
 		$f .= sprintf ";\n\n";
-		$f .= sprintf "\t\ta = a * %d + c;\n", $hash_mult1;
-		$f .= sprintf "\t\tb = b * %d + c;\n", $hash_mult2;
+		$f .= sprintf "\t\ta = (a ^ c) * %d;\n", $hash_mult1;
+		$f .= sprintf "\t\tb = (b ^ c) * %d;\n", $hash_mult2;
 		$f .= sprintf "\t}\n";
 		$f .= sprintf "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash;
 		$f .= sprintf "}\n";
@@ -171,7 +171,7 @@ sub _calc_hash
 	{
 		my $cn = ord($c);
 		$cn |= 0x20 if $case_fold;
-		$result = ($result * $mult + $cn) % 4294967296;
+		$result = (($result ^ $cn) * $mult) % 4294967296;
 	}
 	return $result;
 }
@@ -203,8 +203,8 @@ sub _construct_hash_table
 	my $nverts = 2 * $nedges + 1;    # number of vertices
 
 	# However, it would be very bad if $nverts were exactly equal to either
-	# $hash_mult1 or $hash_mult2: effectively, that hash function would be
-	# sensitive to only the last byte of each key.  Cases where $nverts is a
+	# $hash_mult1 or $hash_mult2: the corresponding hash function would
+	# always have a modulus of zero.  Cases where $nverts is a
 	# multiple of either multiplier likewise lose information.  (But $nverts
 	# can't actually divide them, if they've been intelligently chosen as
 	# primes.)  We can avoid such problems by adjusting the table size.
-- 
2.41.0
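A toy Python illustration of the comment change in this patch (the multiplier 31 and the keys are invented for the demo, and values are kept small enough that the 32-bit truncation of the real code never fires; truncation would perturb these exact identities for long keys):

```python
MULT = 31  # stand-in hash multiplier, chosen only for the demo

def old_mix(key, seed=0):
    # multiply-then-add: modulo MULT, only the last byte survives,
    # since every earlier contribution has been multiplied by MULT
    h = seed
    for c in key.encode():
        h = h * MULT + c
    return h

def new_mix(key, seed=0):
    # XOR-then-multiply: the final operation is a multiply by MULT,
    # so the result modulo MULT is identically zero
    h = seed
    for c in key.encode():
        h = (h ^ c) * MULT
    return h

for key in ("cat", "hat", "mat", "rat"):
    assert old_mix(key) % MULT == ord(key[-1]) % MULT  # all collide on 't'
    assert new_mix(key) % MULT == 0
```

Either way, a table size equal to a multiplier throws away information; the failure mode just changes from "last byte only" to "constant zero", which is what the reworded comment records.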
From a407ee7138c597403be5e0521416e4bf5a9c2e46 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sat, 2 Sep 2023 16:07:53 +0700
Subject: [PATCH v301 2/2] Use powers of two when choosing multipliers for
 perfect hash functions

---
 src/tools/PerfectHash.pm | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index 0d6826141f..dc5a9b4427 100644
--- a/src/tools/PerfectHash.pm
+++ b/src/tools/PerfectHash.pm
@@ -77,8 +77,10 @@ sub generate_hash_function
 	$case_fold = $options{case_fold} || 0;
 
 	# Try different hash function parameters until we find a set that works
-	# for these keys.  The multipliers are chosen to be primes that are cheap
-	# to calculate via shift-and-add, so don't change them without care.
+	# for these keys.  The multipliers are chosen to be numbers that are cheap
+	# to calculate, so don't change them without care.
+	# Currently they are either powers-of-two (which reduce to a shift),
+	# or adjacent primes (which reduce to shift-and-add).
 	# (Commonly, random seeds are tried, but we want reproducible results
 	# from this program so we don't do that.)
 	my $hash_mult1 = 257;
@@ -89,12 +91,12 @@ sub generate_hash_function
 
   FIND_PARAMS:
 	for ($hash_seed1 = 0; $hash_seed1 < 10; $hash_seed1++)
 	{
-		for ($hash_seed2 = 0; $hash_seed2 < 10; $hash_seed2++)
 		{
-			foreach (17, 31, 127, 8191)
+			foreach (16, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 17, 31, 127, 8191)
 			{
 				$hash_mult2 = $_;    # "foreach $hash_mult2" doesn't work
+
 				@subresult = _construct_hash_table(
 					$keys_ref, $hash_mult1, $hash_mult2,
 					$hash_seed1, $hash_seed2);
@@ -205,11 +207,12 @@ sub _construct_hash_table
 	# However, it would be very bad if $nverts were exactly equal to either
 	# $hash_mult1 or $hash_mult2: the corresponding hash function would
 	# always have a modulus of zero.  Cases where $nverts is a
-	# multiple of either multiplier likewise lose information.  (But $nverts
-	# can't actually divide them, if they've been intelligently chosen as
-	# primes.)  We can avoid such problems by adjusting the table size.
+	# multiple or divisor of either multiplier likewise lose information.
+	# We can avoid such problems by adjusting the table size.
 	while ($nverts % $hash_mult1 == 0
-		   || $nverts % $hash_mult2 == 0)
+		   || $nverts % $hash_mult2 == 0
+		   || $hash_mult1 % $nverts == 0
+		   || $hash_mult2 % $nverts == 0)
 	{
 		$nverts++;
 	}
-- 
2.41.0