David Rowley писал 2021-04-24 18:58:
Hackers,Last year, when working on making compactify_tuples() go faster for 19c60ad69, I did quite a bit of benchmarking of the recovery process. The next thing that was slow after compactify_tuples() was the hash lookups done in smgropen(). Currently, we use dynahash hash tables to store the SMgrRelation so we can perform fast lookups by RelFileNodeBackend. However, I had in mind that a simplehash table might perform better. So I tried it... The attached converts the hash table lookups done in smgr.c to use simplehash instead of dynahash. This does require a few changes in simplehash.h to make it work. The reason being is that RelationData.rd_smgr points directly into the hash table entries. This works ok for dynahash as that hash table implementation does not do any reallocations of existing items or move any items around in the table, however, simplehash moves entries around all the time, so we can't point any pointers directly at the hash entries and expect them to be valid after adding or removing anything else from the table. To work around that, I've just made an additional type that serves as the hash entry type that has a pointer to the SMgrRelationData along with the hash status and hash value. It's just 16 bytes (or 12 on 32-bit machines). I opted to keep the hash key in the SMgrRelationData rather than duplicating it as it keeps the SMgrEntry struct nice and small. We only need to dereference the SMgrRelation pointer when we find an entry with the same hash value. The chances are quite good that an entry with the same hash value is the one that we want, so any additional dereferences to compare the key are not going to happen very often. I did experiment with putting the hash key in SMgrEntry and found it to be quite a bit slower. I also did try to use hash_bytes() but found building a hash function that uses murmurhash32 to be quite a bit faster. Benchmarking =========== I did some of that. It made my test case about 10% faster. The test case was basically inserting 100 million rows one at a time into a hash partitioned table with 1000 partitions and 2 int columns and a primary key on one of those columns. It was about 12GB of WAL. I used a hash partitioned table in the hope to create a fairly random-looking SMgr hash table access pattern. Hopefully something similar to what might happen in the real world. Over 10 runs of recovery, master took an average of 124.89 seconds. The patched version took 113.59 seconds. About 10% faster. I bumped shared_buffers up to 10GB, max_wal_size to 20GB and checkpoint_timeout to 60 mins. To make the benchmark more easily to repeat I patched with the attached recovery_panic.patch.txt. This just PANICs at the end of recovery so that the database shuts down before performing the end of recovery checkpoint. Just start the database up again to do another run. I did 10 runs. The end of recovery log message reported: master (aa271209f) CPU: user: 117.89 s, system: 5.70 s, elapsed: 123.65 s CPU: user: 117.81 s, system: 5.74 s, elapsed: 123.62 s CPU: user: 119.39 s, system: 5.75 s, elapsed: 125.20 s CPU: user: 117.98 s, system: 4.39 s, elapsed: 122.41 s CPU: user: 117.92 s, system: 4.79 s, elapsed: 122.76 s CPU: user: 119.84 s, system: 4.75 s, elapsed: 124.64 s CPU: user: 120.60 s, system: 5.82 s, elapsed: 126.49 s CPU: user: 118.74 s, system: 5.71 s, elapsed: 124.51 s CPU: user: 124.29 s, system: 6.79 s, elapsed: 131.14 s CPU: user: 118.73 s, system: 5.67 s, elapsed: 124.47 s master + v1 patch CPU: user: 106.90 s, system: 4.45 s, elapsed: 111.39 s CPU: user: 107.31 s, system: 5.98 s, elapsed: 113.35 s CPU: user: 107.14 s, system: 5.58 s, elapsed: 112.77 s CPU: user: 105.79 s, system: 5.64 s, elapsed: 111.48 s CPU: user: 105.78 s, system: 5.80 s, elapsed: 111.63 s CPU: user: 113.18 s, system: 6.21 s, elapsed: 119.45 s CPU: user: 107.74 s, system: 4.57 s, elapsed: 112.36 s CPU: user: 107.42 s, system: 4.62 s, elapsed: 112.09 s CPU: user: 106.54 s, system: 4.65 s, elapsed: 111.24 s CPU: user: 113.24 s, system: 6.86 s, elapsed: 120.16 s I wrote this patch a few days ago. I'm only posting it now as I know a couple of other people have expressed an interest in working on this. I didn't really want any duplicate efforts, so thought I'd better post it now before someone else goes and writes a similar patch.I'll park this here and have another look at it when the PG15 branch opens.David
Hi, David It is quite interesting result. Simplehash being open-addressing with linear probing is friendly for cpu cache. I'd recommend to define SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is suitable most for such kind of hash table.
+ /* rotate hashkey left 1 bit at each step */ + hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0); + hashkey ^= murmurhash32((uint32) rnode->node.dbNode);
Why do you use so strange rotation expression? I know compillers are able
to translage `h = (h << 1) | (h >> 31)` to single rotate instruction. Do they recognize construction in your code as well? Your construction looks more like "multiplate-modulo" operation in 32bit Galois field . It is widely used operation in cryptographic, but it is used modulo some primitive polynomial, and 0x100000001 is not such polynomial. 0x1000000c5 is, therefore it should be: hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 0xc5 : 0); or hashkey = (hashkey << 1) | ((uint32)((int32)hashkey >> 31) & 0xc5); But why don't just use hash_combine(uint32 a, uint32 b) instead (defined in hashfn.h)? Yep, it could be a bit slower, but is it critical?
- * smgrclose() -- Close and delete an SMgrRelation object. + * smgrclose() -- Close and delete an SMgrRelation object but don't + * remove from the SMgrRelationHash table.
I believe `smgrclose_internal()` should be in this comment. Still I don't believe it worth to separate smgrclose_internal from smgrclose. Is there measurable performance improvement from this change? Even if there is, it will be lesser with SH_FILLFACTOR 0.75 .As well I don't support modification simplehash.h for SH_ENTRY_INITIALIZER,
SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably live in smgropen and the cleanup in smgrclose. And then SH_TRUNCATE doesn't mean much. Summary: regards, Yura Sokolov
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index adfc6f67e2..aa7accbe1b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7616,6 +7616,7 @@ StartupXLOG(void) (errmsg("last completed transaction was at log time %s", timestamptz_to_str(xtime)))); + elog(PANIC, "recovery PANIC"); InRedo = false; } else
v1-0001-Use-simplehash.h-hashtables-in-SMgr.patch
Description: Binary data