The following is not quite a full review, but has plenty to think about.
There is too much to cover at once, and I have to start somewhere...

My main concerns are that internal APIs:

1. are difficult to follow
2. lead to poor branch prediction and too many function calls

Some of the measurements are picking on the SIMD search code, but I go into
details in order to demonstrate how a regression there can go completely
unnoticed. Hopefully the broader themes are informative.

On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.m...@gmail.com>
wrote:
> [fixed benchmarks]

Thanks for that! Now I can show clear results on some aspects in a simple
way. The attached patches (apply on top of v6) are not intended to be
incorporated as-is quite yet, but do point the way to some reorganization
that I think is necessary. I've done some testing on loading, but will
leave it out for now in the interest of length.


0001-0003 are your performance test fix and and some small conveniences for
testing. Binary search is turned off, for example, because we know it
already. And the sleep call is so I can run perf in a different shell
session, on only the search portion.

Note the v6 test loads all block numbers in the range. Since the test item
ids are all below 64 (reasonable), there are always 32 leaf chunks, so all
the leaves are node32 and completely full. This had the effect of never
taking the byte-wise loop in the proposed pg_lsearch function. These two
aspects make this an easy case for the branch predictor:

john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |        167 |
  0 |          822 |               0

     1,470,141,841      branches:u

            63,693      branch-misses:u           #    0.00% of all
branches

john=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE:  num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         10199040 |           180000000 |        168 |
  0 |         2174 |               0

     1,470,142,569      branches:u

        15,023,983      branch-misses:u           #    1.02% of all branches


0004 randomizes block selection in the load part of the search test so that
each block has a 50% chance of being loaded.  Note that now we have many
node16s where we had none before. Although node 16 and node32 appear to
share the same path in the switch statement of rt_node_search(), the chunk
comparison and node_get_values() calls each must go through different
branches. The shuffle case is most affected, but even the sequential case
slows down. (The leaves are less full -> there are more of them, so memory
use is larger, but it shouldn't matter much, in the sequential case at
least)

john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |        173 |
0 |          907 |               0

     1,684,114,926      branches:u

         1,989,901      branch-misses:u           #    0.12% of all branches

john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |        173 |
0 |         2890 |               0

     1,684,115,844      branches:u

        34,215,740      branch-misses:u           #    2.03% of all branches


0005 replaces pg_lsearch with a branch-free SIMD search. Note that it
retains full portability and gains predictable performance. For
demonstration, it's used on all three linear-search types. Although I'm
sure it'd be way too slow for node4, this benchmark hardly has any so it's
ok.

john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |        176 |
0 |          867 |               0

     1,469,540,357      branches:u

            96,678      branch-misses:u           #    0.01% of all
branches

john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         14893056 |           179937720 |        171 |
0 |         2530 |               0

     1,469,540,533      branches:u

        15,019,975      branch-misses:u           #    1.02% of all branches


0006 removes node16, and 0007 avoids a function call to introspect node
type. 0006 is really to make 0007 simpler to code. The crucial point here
is that calling out to rt_node_get_values/children() to figure out what
type we are is costly. With these patches, searching an unevenly populated
load is the same or faster than the original sequential load, despite
taking twice as much memory. (And, as I've noted before, decoupling size
class from node kind would win the memory back.)

john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256
= 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         20381696 |           179937720 |        171 |
0 |          717 |               0

     1,349,614,294      branches:u

             1,313      branch-misses:u           #    0.00% of all
branches

john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256
= 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         20381696 |           179937720 |        172 |
0 |         2202 |               0

     1,349,614,741      branches:u

            30,592      branch-misses:u           #    0.00% of all
branches

Expanding this point, once a path branches based on node kind, there should
be no reason to ever forget the kind. Ther abstractions in v6 have
disadvantages. I understand the reasoning -- to reduce duplication of code.
However, done this way, less code in the text editor leads to *more* code
(i.e. costly function calls and branches) on the machine level.

I haven't looked at insert/load performance carefully, but it's clear it
suffers from the same amnesia. prepare_node_for_insert() branches based on
the kind. If it must call rt_node_grow(), that function has no idea where
it came from and must branch again. When prepare_node_for_insert() returns
we again have no idea what the kind is, so must branch again. And if we are
one of the three linear-search nodes, we later do another function call,
where we encounter a 5-way jump table because the caller could be anything
at all.

Some of this could be worked around with always-inline functions to which
we pass a const node kind, and let the compiler get rid of the branches
etc. But many cases are probably not even worth doing that. For example, I
don't think prepare_node_for_insert() is a useful abstraction to begin
with. It returns an index, but only for linear nodes. Lookup nodes get a
return value of zero. There is not enough commonality here.

Along the same lines, there are a number of places that have branches as a
consequence of treating inner nodes and leaves with the same api:

rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search

I'm leaning towards splitting these out into specialized functions for each
inner and leaf. This is a bit painful for the last one, but perhaps if we
are resigned to templating the shared-mem case, maybe we can template some
of the inner/leaf stuff. Something to think about for later, but for now I
believe we have to accept some code duplication as a prerequisite for
decent performance as well as readability.

For the next steps, we need to proceed cautiously because there is a lot in
the air at the moment. Here are some aspects I would find desirable. If
there are impracticalities I haven't thought of, we can discuss further. I
don't pretend to know the practical consequences of every change I mention.

- If you have started coding the shared memory case, I'd advise to continue
so we can see what that looks like. If that has not gotten beyond the
design stage, I'd like to first see an attempt at tearing down some of the
clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as
rt_node_get_children/values(). We should ideally always know what kind we
are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline
functions can help hide details. But at the same time, trying to treat them
the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple
the size classes as a separate concept. I'm willing to put serious time
into that once the broad details are right. I will also investigate pointer
tagging if we can confirm that can work similarly for dsa pointers.

Regarding size class decoupling, I'll respond to a point made earlier:

On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.m...@gmail.com>
wrote:
> With this idea, we can just repalloc() to grow to the larger size in a
> pair but I'm slightly concerned that the more size class we use, the
> more frequent the node needs to grow.

Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine
that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then
the nodes would grow at *exactly* the same frequency as they do today. I
listed many ways a size class could fit into a power-of-two (and there are
more), but we have a choice in how many to actually use. It's a trade off
between memory usage and complexity.

> If we want to support node
> shrink, the deletion is also affected.

Not necessarily. We don't have to shrink at the same granularity as
growing. My evidence is simple: we don't shrink at all now. :-)

--
John Naylor
EDB: http://www.enterprisedb.com

Reply via email to