https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65641

--- Comment #6 from Jakub Lopuszanski <jakub.lopuszanski at oracle dot com> ---
I was bitten by this recently and tracked the issue down to the fragment of
logic which verifies that an element in the chain belongs to the bucket being
iterated. Because ALL elements form a single list, the stop condition for
searching through a bucket requires rechecking, for each item, whether its
(cached) hash falls into the same bucket as the item sought.
```
    if (!__p->_M_nxt || _M_bucket_index(*__p->_M_next()) != __bkt)
        break;
```
This is what causes most of the div operations, and even more of them when
chains in buckets are long (say, due to a poor hash function).

I see three ways to solve this:

A) Use the lowest bit of the pointer to indicate that it points to the first
item in its bucket.

Then the check wouldn't have to use modulo, just &1.
Addresses are aligned to at least a multiple of the word length, so the lowest
bit is always zero and can be repurposed for this.
And the implementation of _Hashtable already knows very well which of the items
are heads of buckets, as there is a significant amount of logic to maintain the
invariant that they are the ones stored directly in the array.

B) Check whether the address is within the bounds of the buckets array

As the design dictates that pointers to the first items are stored directly in
the array, while all other pointers lie outside it, it should be possible to
identify that a given pointer is the first in its bucket by comparing its
address to the range of addresses occupied by the array.

C) Let users provide a custom _FastMod_range_hashing which caches the
reciprocal of the number of buckets, performing the equivalent of the modulo
operation using multiplications and shifts

This approach has the benefit of handling not only traversal of a bucket's
chain but also identifying the bucket to look into.
The main challenge here is that the current implementation of _Hash_code_base
creates a fresh temporary instance of _RangeHash for each computation:
```
   std::size_t
   _M_bucket_index(__hash_code __c, std::size_t __bkt_count) const
   { return _RangeHash{}(__c, __bkt_count); }
```
So it is impossible to amortize the cost of computing the reciprocal by caching
it.

I was able to modify this to use EBO for the _RangeHash object and to add
hooks for updating it whenever the number of buckets changes.
The code is PoC grade and probably misses a lot of nuance.
My main goal was to check whether it would even compile, pass my tests, and
run faster.
And it does run faster.
On my laptop:

 * `uint64_t % uint64_t` takes around 15ns
 * `fast_modulo_old_t.compute(uint64_t)` (inspired by MySQL InnoDB's
implementation) takes around 2ns

(The fast_modulo_t and fast_modulo_carry_t variants are buggy; they were my
attempts at making it even faster by removing a conditional move and reducing
the number of multiplications, but this doesn't seem to matter much for
performance: either way they execute in the 2-3ns range on my machine. I
include them to keep the source as it was during testing.)

I got the following results for a test which `insert`s 1'000'000 random
numbers from [0,DOMAIN],
then tries to `find` 1'000'000 random numbers from the same range.
The size of the DOMAIN seems to matter in the following ways:

 * it changes the number of collisions
 * it changes the number of unique items, which in turn has an impact on
fitting into the caching hierarchy
 * it (indirectly; the hash function also matters) changes the size of the
buckets array

I've tested two different hash functions: the default std::hash<Key> and a
Shitty_hash<Key> which is just v&0xFFF.
The test runner runs each configuration 10 times and reports the AVG and MIN,
over the ten runs, of the average duration of a single operation.

Shitty_hash && fast_modulo_old_t:
DOMAIN     1'000
    AVG: Find: 6 vs 20 Insert: 7 vs 19 [ns]
    MIN: Find: 7 vs 19 Insert: 7 vs 18 [ns]
DOMAIN    10'000
    AVG: Find: 36 vs 59 Insert: 36 vs 59 [ns]
    MIN: Find: 38 vs 59 Insert: 35 vs 57 [ns]
DOMAIN   100'000
    AVG: Find: 450 vs 533 Insert: 502 vs 547 [ns]
    MIN: Find: 451 vs 468 Insert: 477 vs 496 [ns]
DOMAIN 1'000'000
    AVG: Find: 5955 vs 6389 Insert: 10184 vs 10219 [ns]
    MIN: Find: 6251 vs 6301 Insert: 9967 vs 10074 [ns]

std::hash && fast_modulo_old_t:
DOMAIN     1'000
    AVG: Find: 7 vs 20 Insert: 6 vs 18 [ns]
    MIN: Find: 7 vs 19 Insert: 6 vs 18 [ns]
DOMAIN    10'000
    AVG: Find: 11 vs 22 Insert: 10 vs 20 [ns]
    MIN: Find: 11 vs 22 Insert: 10 vs 19 [ns]
DOMAIN   100'000
    AVG: Find: 46 vs 72 Insert: 37 vs 55 [ns]
    MIN: Find: 45 vs 66 Insert: 32 vs 52 [ns]
DOMAIN 1'000'000
    AVG: Find: 216 vs 264 Insert: 71 vs 112 [ns]
    MIN: Find: 215 vs 244 Insert: 67 vs 106 [ns]
