Re: Use simplehash.h instead of dynahash in SMgr

Yura Sokolov Sat, 24 Apr 2021 15:27:47 -0700

David Rowley писал 2021-04-24 18:58:

Hackers,


Last year, when working on making compactify_tuples() go faster for
19c60ad69, I did quite a bit of benchmarking of the recovery process.
The next thing that was slow after compactify_tuples() was the hash
lookups done in smgropen().

Currently, we use dynahash hash tables to store the SMgrRelation so we
can perform fast lookups by RelFileNodeBackend. However, I had in mind
that a simplehash table might perform better. So I tried it...

The attached converts the hash table lookups done in smgr.c to use
simplehash instead of dynahash.

This does require a few changes in simplehash.h to make it work.  The
reason being is that RelationData.rd_smgr points directly into the
hash table entries.  This works ok for dynahash as that hash table
implementation does not do any reallocations of existing items or move
any items around in the table, however, simplehash moves entries
around all the time, so we can't point any pointers directly at the
hash entries and expect them to be valid after adding or removing
anything else from the table.

To work around that, I've just made an additional type that serves as
the hash entry type that has a pointer to the SMgrRelationData along
with the hash status and hash value. It's just 16 bytes (or 12 on
32-bit machines).  I opted to keep the hash key in the
SMgrRelationData rather than duplicating it as it keeps the SMgrEntry
struct nice and small.  We only need to dereference the SMgrRelation
pointer when we find an entry with the same hash value.  The chances
are quite good that an entry with the same hash value is the one that
we want, so any additional dereferences to compare the key are not
going to happen very often.

I did experiment with putting the hash key in SMgrEntry and found it
to be quite a bit slower.  I also did try to use hash_bytes() but
found building a hash function that uses murmurhash32 to be quite a
bit faster.

Benchmarking
===========

I did some of that.  It made my test case about 10% faster.

The test case was basically inserting 100 million rows one at a time
into a hash partitioned table with 1000 partitions and 2 int columns
and a primary key on one of those columns. It was about 12GB of WAL. I
used a hash partitioned table in the hope to create a fairly
random-looking SMgr hash table access pattern.  Hopefully something
similar to what might happen in the real world.

Over 10 runs of recovery, master took an average of 124.89 seconds.
The patched version took 113.59 seconds. About 10% faster.

I bumped shared_buffers up to 10GB, max_wal_size to 20GB and
checkpoint_timeout to 60 mins.

To make the benchmark more easily to repeat I patched with the
attached recovery_panic.patch.txt.  This just PANICs at the end of
recovery so that the database shuts down before performing the end of
recovery checkpoint.  Just start the database up again to do another
run.

I did 10 runs.  The end of recovery log message reported:

master (aa271209f)
CPU: user: 117.89 s, system: 5.70 s, elapsed: 123.65 s
CPU: user: 117.81 s, system: 5.74 s, elapsed: 123.62 s
CPU: user: 119.39 s, system: 5.75 s, elapsed: 125.20 s
CPU: user: 117.98 s, system: 4.39 s, elapsed: 122.41 s
CPU: user: 117.92 s, system: 4.79 s, elapsed: 122.76 s
CPU: user: 119.84 s, system: 4.75 s, elapsed: 124.64 s
CPU: user: 120.60 s, system: 5.82 s, elapsed: 126.49 s
CPU: user: 118.74 s, system: 5.71 s, elapsed: 124.51 s
CPU: user: 124.29 s, system: 6.79 s, elapsed: 131.14 s
CPU: user: 118.73 s, system: 5.67 s, elapsed: 124.47 s

master + v1 patch
CPU: user: 106.90 s, system: 4.45 s, elapsed: 111.39 s
CPU: user: 107.31 s, system: 5.98 s, elapsed: 113.35 s
CPU: user: 107.14 s, system: 5.58 s, elapsed: 112.77 s
CPU: user: 105.79 s, system: 5.64 s, elapsed: 111.48 s
CPU: user: 105.78 s, system: 5.80 s, elapsed: 111.63 s
CPU: user: 113.18 s, system: 6.21 s, elapsed: 119.45 s
CPU: user: 107.74 s, system: 4.57 s, elapsed: 112.36 s
CPU: user: 107.42 s, system: 4.62 s, elapsed: 112.09 s
CPU: user: 106.54 s, system: 4.65 s, elapsed: 111.24 s
CPU: user: 113.24 s, system: 6.86 s, elapsed: 120.16 s

I wrote this patch a few days ago. I'm only posting it now as I know a
couple of other people have expressed an interest in working on this.
I didn't really want any duplicate efforts, so thought I'd better post
it now before someone else goes and writes a similar patch.

I'll park this here and have another look at it when the PG15 branch opens.


David


Hi, David

It is quite interesting result. Simplehash being open-addressing with
linear probing is friendly for cpu cache. I'd recommend to define
SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
suitable most for such kind of hash table.

+       /* rotate hashkey left 1 bit at each step */
+       hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+       hashkey ^= murmurhash32((uint32) rnode->node.dbNode);

Why do you use so strange rotation expression? I know compillers are able

to translage `h = (h << 1) | (h >> 31)` to single rotate instruction.
Do they recognize construction in your code as well?

Your construction looks more like "multiplate-modulo" operation in 32bit
Galois field . It is widely used operation in cryptographic, but it is
used modulo some primitive polynomial, and 0x100000001 is not such
polynomial. 0x1000000c5 is, therefore it should be:

    hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0);
or
    hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5);

But why don't just use hash_combine(uint32 a, uint32 b) instead (defined
in hashfn.h)? Yep, it could be a bit slower, but is it critical?

- *     smgrclose() -- Close and delete an SMgrRelation object.
+ *     smgrclose() -- Close and delete an SMgrRelation object but don't
+ *     remove from the SMgrRelationHash table.


I believe `smgrclose_internal()` should be in this comment.

Still I don't believe it worth to separate smgrclose_internal from
smgrclose. Is there measurable performance improvement from this
change? Even if there is, it will be lesser with SH_FILLFACTOR 0.75 .

As well I don't support modification simplehash.h for SH_ENTRY_INITIALIZER,

SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably
live in smgropen and the cleanup in smgrclose. And then SH_TRUNCATE
doesn't mean much.

Summary:

regards,
Yura Sokolov

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index adfc6f67e2..aa7accbe1b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7616,6 +7616,7 @@ StartupXLOG(void)
                                                (errmsg("last completed 
transaction was at log time %s",
                                                                
timestamptz_to_str(xtime))));
 
+                       elog(PANIC, "recovery PANIC");
                        InRedo = false;
                }
                else

v1-0001-Use-simplehash.h-hashtables-in-SMgr.patch
Description: Binary data

Re: Use simplehash.h instead of dynahash in SMgr

Reply via email to