David Rowley wrote 2021-04-25 05:23:
Thanks for having a look at this.

"On Sun, 25 Apr 2021 at 10:27, Yura Sokolov <y.soko...@postgrespro.ru> wrote:

It is quite interesting result. Simplehash being open-addressing with
linear probing is friendly for cpu cache. I'd recommend to define
SH_FILLFACTOR with value lower than default (0.9). I believe 0.75 is
suitable most for such kind of hash table.

You might be right there, although, with the particular benchmark I'm
using, the size of the table does not change as a result of that. I'd
need to experiment with varying numbers of relations to see if
dropping the fillfactor helps or hinders performance.

FWIW, the hash stats at the end of recovery are:

LOG:  redo done at 3/C6E34F0 system usage: CPU: user: 107.00 s,
system: 5.61 s, elapsed: 112.67 s
LOG:  size: 4096, members: 2032, filled: 0.496094, total chain: 997,
max chain: 5, avg chain: 0.490650, total_collisions: 422,
max_collisions: 3, avg_collisions: 0.207677

Perhaps if I try using a number of relations somewhere between 2048 *
0.75 and 2048 * 0.9 then I might see some gains.  Because I have 2032,
the hash table grew up to 4096 buckets.

I did a quick test dropping the fillfactor down to 0.4.  The aim there
was just to see if having 8192 buckets in this test would make it
faster or slower:

LOG:  redo done at 3/C6E34F0 system usage: CPU: user: 109.61 s,
system: 4.28 s, elapsed: 113.93 s
LOG:  size: 8192, members: 2032, filled: 0.248047, total chain: 303,
max chain: 2, avg chain: 0.149114, total_collisions: 209,
max_collisions: 2, avg_collisions: 0.102854

It was slightly slower.

Certainly. That is because in the unmodified case you get a fill factor of
0.49, since the table has just grown. Below somewhere around 0.6 there is
no gain from a lower fill factor. But if you test it closer to the upper
bound, you will notice the difference. Try testing with 3600 relations, for
example, if going down to 1800 is not possible.
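
To make the growth arithmetic concrete, here is a tiny standalone model of
simplehash's doubling rule (a simplification for illustration, not the real
code):

    #include <stdio.h>

    #define SH_FILLFACTOR 0.9   /* simplehash's default */

    int
    main(void)
    {
        unsigned int size = 2048;
        unsigned int members = 2032;

        /* simplified: the table doubles once members reach
         * size * SH_FILLFACTOR */
        while (members >= (unsigned int) (size * SH_FILLFACTOR))
            size *= 2;

        /* prints size=4096 filled=0.496094, matching the stats above */
        printf("size=%u filled=%f\n", size, (double) members / size);
        return 0;
    }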

> +     /* rotate hashkey left 1 bit at each step */
> +     hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
> +     hashkey ^= murmurhash32((uint32) rnode->node.dbNode);

Why do you use such a strange rotation expression? I know compilers are
able to translate `h = (h << 1) | (h >> 31)` into a single rotate
instruction. Do they recognize the construction in your code as well?

Not sure about all compilers, I only checked the earliest version of
clang and gcc at godbolt.org and they both use a single "rol"
instruction. https://godbolt.org/z/1GqdE6T3q

Yep, it looks like all compilers recognize this construction, with the single
exception of the old icc compiler (both 13.0.1 and 16.0.3): https://godbolt.org/z/qsrjY5Eof
And all compilers handle `(h << 1) | (h >> 31)` well.
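
For reference, the two spellings side by side; both compute the same
rotate-left-by-one, only the form differs:

    #include <stdint.h>

    /* the patch's form: not every compiler maps it onto a rotate */
    static inline uint32_t
    rotl1_mask(uint32_t h)
    {
        return (h << 1) | ((h & 0x80000000) ? 1 : 0);
    }

    /* the classic form: recognized as a single "rol" essentially
     * everywhere, including the old icc mentioned above */
    static inline uint32_t
    rotl1_shift(uint32_t h)
    {
        return (h << 1) | (h >> 31);
    }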

Your construction looks more like a multiply-modulo operation in the 32-bit
Galois field. It is a widely used operation in cryptography, but there it is
done modulo some primitive polynomial, and 0x100000001 is not such a
polynomial. 0x1000000c5 is, so it would have to be:

    hashkey = (hashkey << 1) ^ ((hashkey & 0x80000000) ? 0xc5 : 0);
or
    hashkey = (hashkey << 1) ^ ((uint32)((int32)hashkey >> 31) & 0xc5);

That does not really make sense to me.  If you're shifting a 32-bit
variable right 31 places then why would you AND with 0xc5? The only
possible result is 1 or 0 depending on whether the most significant bit is
on or off.

That is why there is a cast to signed int before the shift:
`(int32)hashkey >> 31`. The shift is then also signed, i.e. arithmetic, and
the result is 0 or 0xffffffff.
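
A minimal sketch to spell the trick out:

    #include <stdint.h>

    /* The cast makes the shift arithmetic (implementation-defined in C,
     * but arithmetic on every compiler we care about), so the sign bit
     * smears across the whole word: 0x00000000 or 0xffffffff.  The AND
     * then selects 0 or 0xc5 without a branch. */
    static inline uint32_t
    top_bit_selects_c5(uint32_t h)
    {
        uint32_t mask = (uint32_t) ((int32_t) h >> 31);
        return mask & 0xc5;
    }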

But why don't just use hash_combine(uint32 a, uint32 b) instead (defined
in hashfn.h)? Yep, it could be a bit slower, but is it critical?

I had that function in the corner of my eye when writing this, but
TBH, the hash function performance was just too big a factor to slow
it down any further by using the more expensive hash_combine()
function. I saw pretty good performance gains from writing my own hash
function rather than using hash_bytes(). I didn't want to detract from
that by using hash_combine().  Rotating the bits left 1 slot seems
good enough for hash join and hash aggregate, so I don't have any
reason to believe it's a bad way to combine the hash values.  Do you?

Well, thinking about it a bit more, these hash values could be combined
using just addition: `hash(a) + hash(b) + hash(c)`.
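
For comparison, hash_combine() in hashfn.h is the boost-style combiner
(quoting from memory, so check the header):

    static inline uint32
    hash_combine(uint32 a, uint32 b)
    {
        a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
        return a;
    }

The extra add and two shifts are what make it costlier than a single
rotate-and-xor, and plain addition is cheaper still.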

I was thinking more about consistency in the codebase. But it looks like
both ways (`hash_combine(a,b)` and `rotl(a,1)^b`) are used in the code:
- hash_combine: one place (three lines), in hashTupleDesc at tupledesc.c
- rotl+xor: six places:
-- three times (three lines) in execGrouping.c, with a construction like yours
-- three times in jsonb_util.c, multirangetypes.c and rangetypes.c, with
   `(h << 1) | (h >> 31)`.
Therefore I withdraw my recommendation on this point.

Looks like there is an opportunity for a micropatch to unify hash combining :-)


If you grep the source for "hashkey = (hashkey << 1) | ((hashkey &
0x80000000) ? 1 : 0);", then you'll see where else we do the same
rotate left trick.

> - *   smgrclose() -- Close and delete an SMgrRelation object.
> + *   smgrclose() -- Close and delete an SMgrRelation object but don't
> + *   remove from the SMgrRelationHash table.

I believe `smgrclose_internal()` should be in this comment.

Oops. Yeah, that's a mistake.

Still, I don't believe it is worth separating smgrclose_internal from
smgrclose. Is there a measurable performance improvement from this
change? Even if there is, it will be smaller with SH_FILLFACTOR 0.75.

The reason I did that is due to the fact that smgrcloseall() loops
over the entire hash table and removes each entry one by one.  The
problem is that if I do a smgrtable_delete or smgrtable_delete_item in
that loop then I'd need to restart the loop each time.  Be aware that
a simplehash delete can move entries earlier in the table, so it might
cause us to miss entries during the loop.  Restarting the loop each
iteration is not going to be very efficient, so instead, I opted to
make a version of smgrclose() that does not remove from the table so
that I can just wipe out all table entries at the end of the loop.  I
called that smgrclose_internal().

If you read the comments in SH_START_ITERATE, you'll see:

 * Search for the first empty element. As deletions during iterations
 * are supported, we want to start/end at an element that cannot be
 * affected by elements being shifted.

 * Iterate backwards, that allows the current element to be deleted, even
 * if there are backward shifts.

Therefore, it is safe to delete during iteration, and it neither causes nor
requires a loop restart.
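
So smgrcloseall() could simply do something like this (a sketch assuming the
patch's simplehash instantiation: the smgrtable_ prefix, SMgrEntry elements
with a ->data pointer, and the SMgrRelationHash table; exact names may
differ):

    static void
    smgrcloseall_sketch(void)
    {
        smgrtable_iterator it;
        SMgrEntry  *entry;

        smgrtable_start_iterate(SMgrRelationHash, &it);
        while ((entry = smgrtable_iterate(SMgrRelationHash, &it)) != NULL)
            smgrclose(entry->data); /* deletes the entry too; the backward
                                     * iteration tolerates the shifts */
    }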


An additional small benefit is that smgrclosenode() can get away with
a single hashtable lookup rather than having to lookup the entry again
with smgrtable_delete().  Using smgrtable_delete_item() deletes by
bucket rather than key value which should be a good bit faster in many
cases.  I think the SH_ENTRY_CLEANUP macro is quite useful here as I
don't need to worry about NULLing out the smgr_owner in yet another
location where I do a hash delete.

I doubt that makes sense, since smgrclosenode is called only from
LocalExecuteInvalidationMessage, i.e. when another backend drops some
relation. There is no useful performance gain to be had there.


Likewise, I don't support modifying simplehash.h for SH_ENTRY_INITIALIZER,
SH_ENTRY_CLEANUP and SH_TRUNCATE. The initialization could comfortably
live in smgropen and the cleanup in smgrclose, and then SH_TRUNCATE
doesn't mean much.

Can you share what you've got in mind here?

The problem I'm solving with SH_ENTRY_INITIALIZER is the fact that in
SH_INSERT_HASH_INTERNAL(), when we add a new item, we do entry->SH_KEY
= key; to set the new entry's key.  Since I have SH_KEY defined as:

#define SH_KEY data->smgr_rnode

then I need some way to allocate the memory for ->data before the key
is set. Doing that in smgropen() is too late; we've already crashed by
then for referencing uninitialised memory.
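
A minimal sketch of that arrangement (the status/hash layout and the memory
context are assumptions for illustration; only SH_KEY and the ->data
indirection come from the patch):

    /* 16-byte entry: the key lives behind the pointer, not in the entry */
    typedef struct SMgrEntry
    {
        uint32          status;     /* simplehash entry state (assumed) */
        uint32          hash;       /* cached hash value (assumed) */
        SMgrRelation    data;       /* palloc'd SMgrRelationData, holds the key */
    } SMgrEntry;

    #define SH_KEY data->smgr_rnode

    /* runs inside SH_INSERT_HASH_INTERNAL() before "entry->SH_KEY = key",
     * so ->data must point at valid memory by the time the key is set */
    #define SH_ENTRY_INITIALIZER(entry) \
        ((entry)->data = MemoryContextAlloc(TopMemoryContext, \
                                            sizeof(SMgrRelationData)))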

Oh, now I see.
I could suggest a workaround: use entry->hash as the whole key value and
resolve hash collisions manually with chaining. But that looks ugly: using
a hash table and still resolving collisions by hand.

Therefore perhaps SH_ENTRY_INITIALIZER makes sense.

But SH_ENTRY_CLEANUP is abused in the patch: it is not symmetric to
SH_ENTRY_INITIALIZER, which smells bad. `smgr_owner` is better cleaned the
way it is cleaned now, in smgrclose, because that is less obscure; then
SH_ENTRY_CLEANUP could be just `pfree(a->data)`.

And there is still no reason to have SH_TRUNCATE.

I did try putting the key in SMgrEntry but found the performance to be
quite a bit worse than keeping the SMgrEntry down to 16 bytes.  That
makes sense to me as we only need to compare the key when we find an
entry with the same hash value as the one we're looking for. There's a
pretty high chance of that being the entry we want. If I got my hash
function right then the odds are about 1 in 4 billion of it not being
the one we want.   The only additional price we pay when we get two
entries with the same hash value is an additional pointer dereference
and a key comparison.

That makes sense: the whole benefit of simplehash is cache locality, and
that is exactly what the smaller entry buys.

regards,
Yura Sokolov

