Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-28 Thread John Naylor
> The fix is easy enough -- set the child pointer to null upon deletion,
but I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM

Oops. I meant to finish with "Since VACUUM doesn't perform deletion we
didn't have an opportunity to detect this during that operation."

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-28 Thread John Naylor
While creating a benchmark for inserting into node128-inner, I found a bug.
If a caller deletes from a node128, the slot index is set to invalid, but
the child pointer is still valid. Do that a few times, and every child
pointer is valid, even if no slot index points to it. When the next
inserter comes along, something surprising happens. This function:

/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
  int slotpos = 0;

  Assert(!NODE_IS_LEAF(node));
  while (node_inner_128_is_slot_used(node, slotpos))
      slotpos++;

  return slotpos;
}

...passes an integer to this function, whose parameter is a uint8:

/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
  Assert(!NODE_IS_LEAF(node));
  return (node->children[slot] != NULL);
}

...so instead of growing the node unnecessarily or segfaulting, it enters
an infinite loop doing this:

add eax, 1
movzx   ecx, al
cmp QWORD PTR [rbx+264+rcx*8], 0
jne .L147

The fix is easy enough -- set the child pointer to null upon deletion, but
I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-25 Thread John Naylor
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada 
wrote:
>
> [v11]

There is one more thing that just now occurred to me: expanding the use of
size classes makes rebasing and reworking the shared memory piece more work
than it should be. That's important because there are still some open
questions about the design around shared memory. To keep unnecessary churn
to a minimum, perhaps we should limit size class expansion to just one more
(for 5 total size classes) for the near future?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-25 Thread John Naylor
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada 
wrote:
>
> So it seems that there are two candidates of rt_node structure: (1)
> all nodes except for node256 are variable-size nodes and use pointer
> tagging, and (2) node32 and node128 are variable-sized nodes and do
> not use pointer tagging (fanout is in part of only these two nodes).
> rt_node can be 5 bytes in both cases. But before going to this step, I
> started to verify the idea of variable-size nodes by using 6-bytes
> rt_node. We can adjust the node kinds and node classes later.

First, I'm glad you picked up the size class concept and expanded it. (I
have some comments about some internal APIs below.)

Let's leave the pointer tagging piece out until the main functionality is
committed. We have all the prerequisites in place, except for a benchmark
random enough to demonstrate benefit. I'm still not quite satisfied with
how the shared memory coding looked, and that is the only sticky problem we
still have, IMO. The rest is "just work".

That said, (1) and (2) above are still relevant -- variable sizing any
given node is optional, and we can refine as needed.

> Overall, the idea of variable-sized nodes is good, smaller size
> without losing search performance.

Good.

> I'm going to check the load
> performance as well.

Part of that is this, which gets called a lot more now, when node1 expands:

+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);

Since memset for expanding size class is now handled separately, these can
use the non-zeroing versions. When compiling MemoryContextAllocZero, the
compiler has no idea how big the size is, so it assumes the worst and
optimizes for large sizes. On x86-64, that means using "rep stos",
which calls microcode found in the CPU's ROM. This is slow for small sizes.
The "init" function should be always inline with const parameters where
possible. That way, memset can compile to a single instruction for the
smallest node kind. (More on alloc/init below)

Note, there is a wrinkle: As currently written, inner_node128 searches the
child pointers for NULL when inserting, so when expanding from partial to
full size class, the new node must be zeroed. (Worth fixing in the short
term; I thought of this while writing the proof-of-concept for size
classes, but didn't mention it.) Medium term, rather than special-casing
this, I actually want to rewrite the inner-node128 to be more similar to
the leaf, with an "isset" array, but accessed and tested differently. I
guarantee it's *really* slow now to load (maybe somewhat true even for
leaves), but I'll leave the details for later. Regarding node128 leaf, note
that it's slightly larger than a DSA size class, and we can trim it to fit:

node61:  6 + 256+(2) +16 +  61*8 =  768
node125: 6 + 256+(2) +16 + 125*8 = 1280

> I've attached the patches I used for the verification. I don't include
> patches for pointer tagging, DSA support, and vacuum integration since
> I'm investigating the issue on cfbot that Andres reported. Also, I've
> modified tests to improve the test coverage.

Sounds good. For v12, I think size classes have proven themselves, so v11's
0002/4/5 can be squashed. Plus, some additional comments:

+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk,
bool inner)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, kind, inner);
+ rt_init_node(newnode, kind, shift, chunk, inner);
+
+ return newnode;
+}

I don't see the point of a function that just calls two functions.

+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node*newnode;
+
+ newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}

This, in turn, just calls a function that does _almost_ everything, and
additionally must set one member. This function should really be alloc-node
+ init-node + copy-common, where copy-common is like in the prototype:
+ newnode->node_shift = oldnode->node_shift;
+ newnode->node_chunk = oldnode->node_chunk;
+ newnode->count = oldnode->count;

And init-node should really be just memset + set kind + set initial fanout.
It has no business touching "shift" and "chunk". The callers rt_new_root,
rt_set_extend, and rt_extend set some values of their own anyway, so let
them set those, too -- it might even improve readability.

-   if (n32->base.n.fanout ==
rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+   if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))

This macro doesn't really improve readability -- it obscures what is being
tested, and the name implies the "else" branch m

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-22 Thread Andres Freund
On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
> Sure. I've attached the v10 patches. 0004 is the pure refactoring
> patch and 0005 patch introduces the pointer tagging.

This failed on cfbot, with so many crashes that the VM ran out of disk for
core dumps. The failures occurred during testing with 32-bit builds, so
there's probably something broken around that.

https://cirrus-ci.com/task/4635135954386944

A failure is e.g. at: 
https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log

performing post-bootstrap initialization ... 
../src/backend/lib/radixtree.c:1696:21: runtime error: member access within 
misaligned address 0x590faf74 for type 'struct radix_tree_control', which 
requires 8 byte alignment
0x590faf74: note: pointer points here
  90 11 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 
00 00 00 00 00 00 00
  ^
==55813==Using libbacktrace symbolizer.
#0 0x56dcc274 in rt_create ../src/backend/lib/radixtree.c:1696
#1 0x56953d1b in tidstore_create ../src/backend/access/common/tidstore.c:57
#2 0x56a1ca4f in dead_items_alloc 
../src/backend/access/heap/vacuumlazy.c:3109
#3 0x56a2219f in heap_vacuum_rel ../src/backend/access/heap/vacuumlazy.c:539
#4 0x56cb77ed in table_relation_vacuum ../src/include/access/tableam.h:1681
#5 0x56cb77ed in vacuum_rel ../src/backend/commands/vacuum.c:2062
#6 0x56cb9a16 in vacuum ../src/backend/commands/vacuum.c:472
#7 0x56cba904 in ExecVacuum ../src/backend/commands/vacuum.c:272
#8 0x5711b6d0 in standard_ProcessUtility ../src/backend/tcop/utility.c:866
#9 0x5711bdeb in ProcessUtility ../src/backend/tcop/utility.c:530
#10 0x5711759f in PortalRunUtility ../src/backend/tcop/pquery.c:1158
#11 0x57117cb8 in PortalRunMulti ../src/backend/tcop/pquery.c:1315
#12 0x571183d2 in PortalRun ../src/backend/tcop/pquery.c:791
#13 0x57111049 in exec_simple_query ../src/backend/tcop/postgres.c:1238
#14 0x57113f9c in PostgresMain ../src/backend/tcop/postgres.c:4551
#15 0x5711463d in PostgresSingleUserMain ../src/backend/tcop/postgres.c:4028
#16 0x56df4672 in main ../src/backend/main/main.c:197
#17 0xf6ad8e45 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x1ae45)
#18 0x5691d0f0 in _start 
(/tmp/cirrus-ci-build/build-32/tmp_install/usr/local/pgsql/bin/postgres+0x3040f0)

Aborted (core dumped)
child process exited with exit code 134
initdb: data directory 
"/tmp/cirrus-ci-build/build-32/testrun/adminpack/regress/tmp_check/data" not 
removed at user's request




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-21 Thread John Naylor
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada 
wrote:
>
> On Mon, Nov 21, 2022 at 4:20 PM John Naylor
>  wrote:

> > Assuming the smallest node is fixed size (i.e. fanout/capacity member
not part of the common set, so only part of variable-sized nodes), 3 has a
nice property: no wasted padding space:
> >
> > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > node3: 5 + 3 + 3*8 = 32
>
> IIUC if we store the fanout member only in variable-sized nodes,
> rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> size doesn't change but there is 1 byte padding space.

I forgot to mention I'm assuming no pointer-tagging for this exercise.
You've demonstrated it can be done in a small amount of code, and I hope we
can demonstrate a speedup in search. Just in case there is some issue with
portability, valgrind, or some other obstacle, I'm being pessimistic in my
calculations.

> Also, even if we have the node3 a variable-sized node, size class 1
> for node3 could be a good choice since it also doesn't need padding
> space and could be a good alternative to path compression.
>
> node3 :  5 + 3 + 3*8 = 32 bytes
> size class 1 : 5 + 3 + 1*8 = 16 bytes

Precisely! I have that scenario in my notes as well -- it's quite
compelling.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-21 Thread Masahiko Sawada
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
 wrote:
>
>
> On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> > One issue with this patch: The "fanout" member is a uint8, so it can't hold 
> > 256 for the largest node kind. That's not an issue in practice, since we 
> > never need to grow it, and we only compare that value with the count in an 
> > Assert(), so I just set it to zero. That does break an invariant, so it's 
> > not great. We could use 2 bytes to be strictly correct in all cases, but 
> > that limits what we can do with the smallest node kind.
>
> Thinking about this part, there's an easy resolution -- use a different macro 
> for fixed- and variable-sized node kinds to determine if there is a free slot.
>
> Also, I wanted to share some results of adjusting the boundary between the 
> two smallest node kinds. In the hackish attached patch, I modified the fixed 
> height search benchmark to search a small (within L1 cache) tree thousands of 
> times. For the first set I modified node4's maximum fanout and filled it up. 
> For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 
> (actually the partially-filled node15 size class as demoed earlier).
>
> node4:
>
> NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       2 |    16 |            16520 |          0 |            3

> NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       3 |    81 |            16456 |          0 |           17

> NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       4 |   256 |            16456 |          0 |           89

> NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       5 |   625 |            16488 |          0 |          327


> node32:

> NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       2 |    16 |            16488 |          0 |            5
> (1 row)

> NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       3 |    81 |            16520 |          0 |           28

> NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       4 |   256 |            16408 |          0 |           79

> NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
>  fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
> --------+-------+------------------+------------+--------------
>       5 |   625 |            24616 |          0 |          199
>
> In this test, node32 seems slightly faster than node4 with 4 elements, at the 
> cost of more memory.
>
> Assuming the smallest node is fixed size (i.e. fanout/capacity member not 
> part of the common set, so only part of variable-sized nodes), 3 has a nice 
> property: no wasted padding space:
>
> node4: 5 + 4+(7) + 4*8 = 48 bytes
> node3: 5 + 3 + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.

Also, even if we have the node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.

node3 :  5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-20 Thread John Naylor
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> One issue with this patch: The "fanout" member is a uint8, so it can't
hold 256 for the largest node kind. That's not an issue in practice, since
we never need to grow it, and we only compare that value with the count in
an Assert(), so I just set it to zero. That does break an invariant, so
it's not great. We could use 2 bytes to be strictly correct in all cases,
but that limits what we can do with the smallest node kind.

Thinking about this part, there's an easy resolution -- use a different
macro for fixed- and variable-sized node kinds to determine if there is a
free slot.

Also, I wanted to share some results of adjusting the boundary between the
two smallest node kinds. In the hackish attached patch, I modified the
fixed height search benchmark to search a small (within L1 cache) tree
thousands of times. For the first set I modified node4's maximum fanout and
filled it up. For the second, I set node4's fanout to 1, which causes 2+ to
spill to node32 (actually the partially-filled node15 size class
as demoed earlier).

node4:

NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16520 |          0 |            3

NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16456 |          0 |           17

NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16456 |          0 |           89

NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            16488 |          0 |          327


node32:

NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16488 |          0 |            5
(1 row)

NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16520 |          0 |           28

NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16408 |          0 |           79

NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            24616 |          0 |          199

In this test, node32 seems slightly faster than node4 with 4 elements, at
the cost of more memory.

Assuming the smallest node is fixed size (i.e. fanout/capacity member not
part of the common set, so only part of variable-sized nodes), 3 has a nice
property: no wasted padding space:

node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-20 Thread John Naylor
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada 
wrote:
>
> On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada 
wrote:
> >
> > On Wed, Nov 16, 2022 at 4:39 PM John Naylor
> >  wrote:

> > > That means my idea for the pointer struct might have some problems,
at least as currently implemented. Maybe in the course of separating out
and polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.

Digging a bit deeper, I see a flaw in my benchmark: Even though the total
distribution of node kinds is decently even, the pattern that the benchmark
sees is not terribly random:

     3,343,352  branch-misses:u           #    0.85% of all branches
   393,204,959  branches:u

Recall a previous benchmark [1] where the leaf node was about half node16
and half node32. Randomizing the leaf node between the two caused branch
misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this
new benchmark, each level has a skewed distribution of nodes, giving a
smart branch predictor something to work with. We will need a way to
efficiently generate keys that lead to a relatively unpredictable
distribution of node kinds, as seen by a searcher. Especially in the leaves
(or just above the leaves), since those are less likely to be cached.

> > I'll also run the test on my environment and do the investigation
tomorrow.
> >
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.

Would you be able to share the refactoring patch? And a fix for the failing
tests? I'm thinking I want to try the templating approach fairly soon.

[1]
https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-18 Thread John Naylor
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada 
wrote:
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.
>
> v9 0003 patch: 1113 1114 1114
> introduce rt_node_ptr: 1127 1128 1128
> pointer tagging  : 1085 1087 1086 (equivalent to 0004 patch)
>
> In my environment, rt_node_ptr seemed to lead some overhead but
> pointer tagging had performance benefits. I'm not sure the reason why
> the results are different from yours. The radix tree stats shows the
> same as your tests.

There is less than 2% difference from the median set of results, so it's
hard to distinguish from noise. I did a fresh rebuild and retested with the
same results: about 15% slowdown in v9 0004. That's strange.

On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada 
wrote:

> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).

Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a
straightforward explanation.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-18 Thread Masahiko Sawada
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada  wrote:
>
> On Wed, Nov 16, 2022 at 4:39 PM John Naylor
>  wrote:
> >
> >
> > On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> > >  wrote:
> > > >
> > > >
> > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
> > > >  wrote:
> > > > > Thanks! Please let me know if there is something I can help with.
> > > >
> > > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > > >
> > > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > > > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> > >
> > > Which tests do you use to get this assertion failure? I've confirmed
> > > there is a bug in 0005 patch but without it, "make check-world"
> > > passed.
> >
> > Hmm, I started over and rebuilt and it didn't reproduce. Not sure what 
> > happened, sorry for the noise.
>
> Good to know. No problem.
>
> > I'm attaching a test I wrote to stress test branch prediction in search, 
> > and while trying it out I found two possible issues.
>
> Thank you for testing!
>
> >
> > It's based on the random int load test, but tests search speed. Run like 
> > this:
> >
> > select * from bench_search_random_nodes(10 * 1000 * 1000)
> >
> > It also takes some care to include all the different node kinds, 
> > restricting the possible keys by AND-ing with a filter. Here's a simple 
> > demo:
> >
> > filter = ((uint64)1<<40)-1;
> > LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 
> > 62663, n256 = 3130
> >
> > Just using random integers leads to >99% using the smallest node. I wanted 
> > to get close to having the same number of each, but that's difficult while 
> > still using random inputs. I ended up using
> >
> > filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
> >
> > which gives
> >
> > LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 
> > 182670, n256 = 1024
> >
> > Which seems okay for the task. One puzzling thing I found while trying 
> > various filters is that sometimes the reported tree height would change. 
> > For example:
> >
> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 
> > 62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't 
> > expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).
>
> >
> > 2) It seems that 0004 actually causes a significant slowdown in this test 
> > (as in the attached, using the second filter above and with turboboost 
> > disabled):
> >
> > v9 0003: 2062 2051 2050
> > v9 0004: 2346 2316 2321
> >
> > That means my idea for the pointer struct might have some problems, at 
> > least as currently implemented. Maybe in the course of separating out and 
> > polishing that piece, an inefficiency will fall out. Or, it might be 
> > another reason to template local and shared separately. Not sure yet. I 
> > also haven't tried to adjust this test for the shared memory case.
>
> I'll also run the test on my environment and do the investigation tomorrow.
>

FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.

v9 0003 patch: 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging  : 1085 1087 1086 (equivalent to 0004 patch)

In my environment, rt_node_ptr seemed to add some overhead, but
pointer tagging had performance benefits. I'm not sure why the
results are different from yours. The radix tree stats show the
same as your tests.

=# select * from bench_search_random_nodes(10 * 1000 * 1000);
2022-11-18 22:18:21.608 JST [3913544] LOG:  num_keys = 9291812, height
= 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-17 Thread John Naylor
On Wed, Sep 28, 2022 at 1:18 PM I wrote:

> Along those lines, one thing I've been thinking about is the number of
size classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size.

While the most important challenge right now is how to best represent and
organize the shared memory case, I wanted to get the above idea working and
out of the way, to be saved for a future time. I've attached a rough
implementation (applies on top of v9 0003) that splits node32 into 2 size
classes. They both share the exact same base data type and hence the same
search/set code, so the number of "kind"s is still four, but here there are
five "size classes", so a new case in the "unlikely" node-growing path. The
smaller instance of node32 is a "node15", because that's currently 160
bytes, corresponding to one of the DSA size classes. This idea can be
applied to any other node except the max size, as we see fit. (Adding a
singleton size class would bring it back in line with the prototype, at
least as far as memory consumption.)

One issue with this patch: The "fanout" member is a uint8, so it can't hold
256 for the largest node kind. That's not an issue in practice, since we
never need to grow it, and we only compare that value with the count in an
Assert(), so I just set it to zero. That does break an invariant, so it's
not great. We could use 2 bytes to be strictly correct in all cases, but
that limits what we can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's
impossible to repalloc in slab, we have to do alloc/copy/free ourselves.
That's fine, but the current coding makes too many assumptions about the
use cases: rt_alloc_node and rt_copy_node are too entangled with each other
and do too much work unrelated to what the names imply. I seem to remember
an earlier version had something like rt_node_copy_common that did
only...copying. That was much easier to reason about. In 0002 I resorted to
doing my own allocation to show what I really want to do, because the new
use case doesn't need zeroing and setting values. It only needs
to...allocate (and increase the stats counter if built that way).

Future optimization work while I'm thinking of it: rt_alloc_node should be
always-inlined and the memset done separately (i.e. not *AllocZero). That
way the compiler should be able to generate more efficient zeroing code for
smaller nodes. I'll test the numbers on this sometime in the future.

--
John Naylor
EDB: http://www.enterprisedb.com
From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized

Add a size class for 15 elements, which corresponds to 160 bytes,
an allocation size used by DSA. When a 16th element is to be
inserted, allocte a larger area and memcpy the entire old node
to it.

NB: Zeroing the new area is only necessary if it's for an
inner node128, since insert logic must check for null child
pointers.

This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind
in a pointer tag
---
 src/backend/lib/radixtree.c | 141 +++-
 1 file changed, 108 insertions(+), 33 deletions(-)

diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
 typedef enum rt_size_class
 {
RT_CLASS_4_FULL = 0,
+   RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_128_FULL,
RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
uint16  count;
 
/* Max number of children. We can use uint8 because we never need to 
store 256 */
+   /* WIP: if we don't have a variable sized node4, this should instead be 
in the base
+   types as needed, since saving every byte is crucial for the smallest 
node kind */
uint8   fanout;
 
/*
@@ -166,6 +169,8 @@ typedef struct rt_node
((node)->base.n.count < (node)->base.n.fanout)
 
 /* Base type of each node kinds for leaf and inner nodes */
+/* The base ty

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-16 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
 wrote:
>
>
> On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Nov 16, 2022 at 1:46 PM John Naylor
> >  wrote:
> > >
> > >
> > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> > > wrote:
> > > > Thanks! Please let me know if there is something I can help with.
> > >
> > > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> > >
> > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
> >
> > Which tests do you use to get this assertion failure? I've confirmed
> > there is a bug in 0005 patch but without it, "make check-world"
> > passed.
>
> Hmm, I started over and rebuilt and it didn't reproduce. Not sure what 
> happened, sorry for the noise.

Good to know. No problem.

> I'm attaching a test I wrote to stress test branch prediction in search, and 
> while trying it out I found two possible issues.

Thank you for testing!

>
> It's based on the random int load test, but tests search speed. Run like this:
>
> select * from bench_search_random_nodes(10 * 1000 * 1000)
>
> It also takes some care to include all the different node kinds, restricting 
> the possible keys by AND-ing with a filter. Here's a simple demo:
>
> filter = ((uint64)1<<40)-1;
> LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 
> 62663, n256 = 3130
>
> Just using random integers leads to >99% using the smallest node. I wanted to 
> get close to having the same number of each, but that's difficult while still 
> using random inputs. I ended up using
>
> filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
>
> which gives
>
> LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 
> 182670, n256 = 1024
>
> Which seems okay for the task. One puzzling thing I found while trying 
> various filters is that sometimes the reported tree height would change. For 
> example:
>
> filter = (((uint64) 1<<32) | (0xFF<<24));
> LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 
> 62632, n256 = 3161
>
> 1) Any idea why the tree height would be reported as 7 here? I didn't expect 
> that.

In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000 once widened to uint64, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
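
For anyone who hits this pattern elsewhere, the promotion rule is easy to
reproduce in isolation (the function names below are invented for the demo):

```c
#include <stdint.h>

/* The mistake in isolation: 0xFF << 24 is computed in (signed) int, which
 * yields a negative value on two's-complement machines, and that value
 * sign-extends when widened to 64 bits. */
static uint64_t
filter_buggy(void)
{
	int32_t		shifted = (int32_t) (0xFFu << 24);	/* -16777216 */

	return ((uint64_t) 1 << 32) | (uint64_t) shifted;
}

/* Widening before the shift keeps the intended value. */
static uint64_t
filter_fixed(void)
{
	return ((uint64_t) 1 << 32) | ((uint64_t) 0xFF << 24);
}
```

The buggy version ends up with the top 40 bits of the filter set, which
would explain the surprisingly large reported tree height.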

>
> 2) It seems that 0004 actually causes a significant slowdown in this test (as 
> in the attached, using the second filter above and with turboboost disabled):
>
> v9 0003: 2062 2051 2050
> v9 0004: 2346 2316 2321
>
> That means my idea for the pointer struct might have some problems, at least 
> as currently implemented. Maybe in the course of separating out and polishing 
> that piece, an inefficiency will fall out. Or, it might be another reason to 
> template local and shared separately. Not sure yet. I also haven't tried to 
> adjust this test for the shared memory case.

I'll also run the test on my environment and do the investigation tomorrow.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada 
wrote:
>
> On Wed, Nov 16, 2022 at 1:46 PM John Naylor
>  wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Which tests do you use to get this assertion failure? I've confirmed
> there is a bug in 0005 patch but without it, "make check-world"
> passed.

Hmm, I started over and rebuilt and it didn't reproduce. Not sure what
happened, sorry for the noise.

I'm attaching a test I wrote to stress test branch prediction in search,
and while trying it out I found two possible issues.

It's based on the random int load test, but tests search speed. Run like
this:

select * from bench_search_random_nodes(10 * 1000 * 1000)

It also takes some care to include all the different node kinds,
restricting the possible keys by AND-ing with a filter. Here's a simple
demo:

filter = ((uint64)1<<40)-1;
LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 =
62663, n256 = 3130

Just using random integers leads to >99% using the smallest node. I wanted
to get close to having the same number of each, but that's difficult while
still using random inputs. I ended up using

filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)

which gives

LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 =
182670, n256 = 1024

Which seems okay for the task. One puzzling thing I found while trying
various filters is that sometimes the reported tree height would change.
For example:

filter = (((uint64) 1<<32) | (0xFF<<24));
LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161

1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.

2) It seems that 0004 actually causes a significant slowdown in this test
(as in the attached, using the second filter above and with turboboost
disabled):

v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321

That means my idea for the pointer struct might have some problems, at
least as currently implemented. Maybe in the course of separating out and
polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..e0205b364e 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -43,6 +43,14 @@ returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
 
+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
 create function bench_fixed_height_search(
 fanout int4,
 OUT fanout int4,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..a43fc61c2d 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -29,6 +29,7 @@ PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
 
 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -347,6 +348,77 @@ bench_load_random_int(PG_FUNCTION_ARGS)
	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
 
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+   x ^= x >> 30;
+   x *= UINT64CONST(0xbf58476d1ce4e5b9);
+   x ^= x >> 27;
+   x *= UINT64CONST(0x94d049bb133111eb);
+   x ^= x >> 31;
+   return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+   uint64  cnt = (uint64) PG_GETARG_INT64(0);
+   radix_tree *rt;
+   TupleDesc   tupdesc;
+   TimestampTz start_time,
+   end_time;
+   longsecs;
+   int usecs;
+   int64   search_time_ms;
+   Datum   values[2] = {0};
+   boolnulls[2] = {0};
+   /* from trial and error */
+   const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+   /* Build a tuple descriptor for our result type */
+   if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+   elog(ERROR, "return type must be a row type"

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 2:17 PM John Naylor
 wrote:
>
>
>
> On Wed, Nov 16, 2022 at 11:46 AM John Naylor  
> wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> > wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> > "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Actually I do want to offer some general advice. Upthread I recommended a 
> purely refactoring patch that added the node-pointer struct but did nothing 
> else, so that the DSA changes would be smaller. 0004 attempted pointer 
> tagging in the same commit, which makes it no longer a purely refactoring 
> patch, so that 1) makes it harder to tell what part caused the bug and 2) 
> obscures what is necessary for DSA pointers and what was additionally 
> necessary for pointer tagging. Shared memory support is a prerequisite for a 
> shippable feature, but pointer tagging is (hopefully) a performance 
> optimization. Let's keep them separate.

Totally agreed. I'll separate them in the next version patch. Thank
you for your advice.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread Masahiko Sawada
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
 wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada  
> wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: 
> "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Wed, Nov 16, 2022 at 11:46 AM John Naylor 
wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Actually I do want to offer some general advice. Upthread I recommended a
purely refactoring patch that added the node-pointer struct but did nothing
else, so that the DSA changes would be smaller. 0004 attempted pointer
tagging in the same commit, which makes it no longer a purely refactoring
patch, so that 1) makes it harder to tell what part caused the bug and 2)
obscures what is necessary for DSA pointers and what was additionally
necessary for pointer tagging. Shared memory support is a prerequisite for
a shippable feature, but pointer tagging is (hopefully) a performance
optimization. Let's keep them separate.
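
For reference, the tagging idea itself is mechanical and can be prototyped
separately from the tree. A toy sketch, assuming 8-byte-aligned node
allocations (the names and mask here are illustrative, not from any patch):

```c
#include <stdint.h>

/* With 8-byte-aligned allocations the low 3 bits of a pointer are always
 * zero, so a small node-kind tag can be stored there and masked off on
 * every access. */
#define TAG_MASK	((uintptr_t) 0x7)

static uintptr_t
tag_ptr(void *node, uint8_t kind)
{
	/* valid only for sufficiently aligned pointers and kind < 8 */
	return (uintptr_t) node | ((uintptr_t) kind & TAG_MASK);
}

static void *
untag_ptr(uintptr_t tagged)
{
	return (void *) (tagged & ~TAG_MASK);
}

static uint8_t
ptr_kind(uintptr_t tagged)
{
	return (uint8_t) (tagged & TAG_MASK);
}
```

DSA pointers complicate this because dsa_pointer has its own encoding,
which is another reason to treat tagging as a follow-on optimization.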

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> Thanks! Please let me know if there is something I can help with.

I didn't get very far because the tests fail on 0004 in rt_verify_node:

TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-14 Thread Masahiko Sawada
On Mon, Nov 14, 2022 at 10:00 PM John Naylor
 wrote:
>
> On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada  wrote:
> >
> > 0004 patch is a new patch supporting a pointer tagging of the node
> > kind. Also, it introduces rt_node_ptr we discussed so that internal
> > functions use it rather than having two arguments for encoded and
> > decoded pointers. With this intermediate patch, the DSA support patch
> > became more readable and understandable. Probably we can make it
> > smaller further if we move the change of separating the control object
> > from radix_tree to the main patch (0002). The patch still needs to be
> > polished but I'd like to check if this idea is worthwhile. If we agree
> > on this direction, this patch will be merged into the main radix tree
> > implementation patch.
>
> Thanks for the new patch set. I've taken a very brief look at 0004 and I 
> think the broad outlines are okay. As you say it needs polish, but before 
> going further, I'd like to do some experiments of my own as I mentioned 
> earlier:
>
> - See how much performance we actually gain from tagging the node kind.
> - Try additional size classes while keeping the node kinds to only four.
> - Optimize node128 insert.
> - Try templating out the differences between local and shared memory. With 
> local memory, the node-pointer struct would be a union, for example. 
> Templating would also reduce branches and re-simplify some internal APIs, but 
> it's likely that would also make the TID store and/or vacuum more complex, 
> because at least some external functions would be duplicated.

Thanks! Please let me know if there is something I can help with.

In the meanwhile, I'd like to make some progress on the vacuum
integration and improving the test coverages.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-14 Thread John Naylor
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada 
wrote:
>
> 0004 patch is a new patch supporting a pointer tagging of the node
> kind. Also, it introduces rt_node_ptr we discussed so that internal
> functions use it rather than having two arguments for encoded and
> decoded pointers. With this intermediate patch, the DSA support patch
> became more readable and understandable. Probably we can make it
> smaller further if we move the change of separating the control object
> from radix_tree to the main patch (0002). The patch still needs to be
> polished but I'd like to check if this idea is worthwhile. If we agree
> on this direction, this patch will be merged into the main radix tree
> implementation patch.

Thanks for the new patch set. I've taken a very brief look at 0004 and I
think the broad outlines are okay. As you say it needs polish, but before
going further, I'd like to do some experiments of my own as I mentioned
earlier:

- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With
local memory, the node-pointer struct would be a union, for example.
Templating would also reduce branches and re-simplify some internal APIs,
but it's likely that would also make the TID store and/or vacuum more
complex, because at least some external functions would be duplicated.

I'll set the patch to "waiting on author", but in this case the author is
me.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-08 Thread Peter Geoghegan
On Fri, Nov 4, 2022 at 8:25 AM Masahiko Sawada  wrote:
> For parallel heap pruning, multiple workers will insert key-value
> pairs to the radix tree concurrently. The simplest solution would be a
> single lock to protect writes but the performance will not be good.
> Another solution would be that we can divide the tables into multiple
> ranges so that keys derived from TIDs are not conflicted with each
> other and have parallel workers process one or more ranges. That way,
> parallel vacuum workers can build *sub-trees* and the leader process
> can merge them. In use cases of lazy vacuum, since the write phase and
> read phase are separated the readers don't need to worry about
> concurrent updates.

I think that the VM snapshot concept can eventually be used to
implement parallel heap pruning. Since every page that will become a
scanned_pages is known right from the start with VM snapshots, it will
be relatively straightforward to partition these pages into distinct
ranges with an equal number of pages, one per worker planned. The VM
snapshot structure can also be used for I/O prefetching, which will be
more important with parallel heap pruning (and with aio).

Working off of an immutable structure that describes which pages to
process right from the start is naturally easy to work with, in
general. We can "reorder work" flexibly (i.e. process individual
scanned_pages in any order that is convenient). Another example is
"changing our mind" about advancing relfrozenxid when it turns out
that we maybe should have decided to do that at the start of VACUUM
[1]. Maybe the specific "changing our mind" idea will turn out to not
be a very useful idea, but it is at least an interesting and thought
provoking concept.

[1] 
https://postgr.es/m/CAH2-WzkQ86yf==mgAF=cq0qelrwkx3htlw9qo+qx3zbwjjk...@mail.gmail.com
-- 
Peter Geoghegan




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-08 Thread Masahiko Sawada
On Sat, Nov 5, 2022 at 6:23 PM John Naylor  wrote:
>
> On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada  wrote:
> >
> > For parallel heap pruning, multiple workers will insert key-value
> > pairs to the radix tree concurrently. The simplest solution would be a
> > single lock to protect writes but the performance will not be good.
> > Another solution would be that we can divide the tables into multiple
> > ranges so that keys derived from TIDs are not conflicted with each
> > other and have parallel workers process one or more ranges. That way,
> > parallel vacuum workers can build *sub-trees* and the leader process
> > can merge them. In use cases of lazy vacuum, since the write phase and
> > read phase are separated the readers don't need to worry about
> > concurrent updates.
>
> It's a good idea to use ranges for a different reason -- readahead. See 
> commit 56788d2156fc3, which aimed to improve readahead for sequential scans. 
> It might work to use that as a model: Each worker prunes a range of 64 pages, 
> keeping the dead tids in a local array. At the end of the range: lock the tid 
> store, enter the tids into the store, unlock, free the local array, and get 
> the next range from the leader. It's possible contention won't be too bad, 
> and I suspect using small local arrays as-we-go would be faster and use less 
> memory than merging multiple sub-trees at the end.

Seems a promising idea. I think it might work well even in the current
parallel vacuum (i.e., single writer). I mean, I think we can have a
single lwlock for shared cases in the first version. If the overhead
of acquiring the lwlock per insertion of key-value is not negligible,
we might want to try this idea.

Apart from that, I'm going to incorporate the comments on 0004 patch
and try a pointer tagging.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-05 Thread John Naylor
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada 
wrote:
>
> For parallel heap pruning, multiple workers will insert key-value
> pairs to the radix tree concurrently. The simplest solution would be a
> single lock to protect writes but the performance will not be good.
> Another solution would be that we can divide the tables into multiple
> ranges so that keys derived from TIDs are not conflicted with each
> other and have parallel workers process one or more ranges. That way,
> parallel vacuum workers can build *sub-trees* and the leader process
> can merge them. In use cases of lazy vacuum, since the write phase and
> read phase are separated the readers don't need to worry about
> concurrent updates.

It's a good idea to use ranges for a different reason -- readahead. See
commit 56788d2156fc3, which aimed to improve readahead for sequential
scans. It might work to use that as a model: Each worker prunes a range of
64 pages, keeping the dead tids in a local array. At the end of the range:
lock the tid store, enter the tids into the store, unlock, free the local
array, and get the next range from the leader. It's possible contention
won't be too bad, and I suspect using small local arrays as-we-go would be
faster and use less memory than merging multiple sub-trees at the end.
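
A toy, single-process simulation of that flow (every name here is invented
for illustration; a real version would use LWLocks and the TID store API):

```c
#include <stdint.h>

#define RANGE_SIZE	64

/* stand-in for the shared tid store */
static uint64_t shared_store[4096];
static int	shared_count = 0;

/* stand-ins for LWLockAcquire/LWLockRelease on the shared store */
static void lock_store(void)   { }
static void unlock_store(void) { }

/* pretend pruning: offsets 0 and 16 are dead on every page */
static int
prune_page(uint64_t pageno, uint64_t *dead, int ndead)
{
	dead[ndead++] = (pageno << 16);
	dead[ndead++] = (pageno << 16) | 16;
	return ndead;
}

/* one worker iteration: prune a 64-page range into a local array, then
 * flush the whole batch into the shared store in one critical section */
static void
prune_range(uint64_t start)
{
	uint64_t	local[RANGE_SIZE * 2];
	int			nlocal = 0;

	for (uint64_t page = start; page < start + RANGE_SIZE; page++)
		nlocal = prune_page(page, local, nlocal);

	lock_store();
	for (int i = 0; i < nlocal; i++)
		shared_store[shared_count++] = local[i];
	unlock_store();
}
```

The point of the batching is that the lock is taken once per range, not
once per dead tuple.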

> I've attached a draft patch for lazy vacuum integration that can be
> applied on top of v8 patches. The patch adds a new module called
> TIDStore, an efficient storage for TID backed by radix tree. Lazy
> vacuum and parallel vacuum use it instead of a TID array. The patch
> also introduces rt_detach() that was missed in 0002 patch. It's a very
> rough patch but I hope it helps in considering lazy vacuum
> integration, radix tree APIs, and shared radix tree functionality.

It does help, good to see this.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-04 Thread Masahiko Sawada
On Thu, Nov 3, 2022 at 1:59 PM John Naylor  wrote:
>
> On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada  
> wrote:
> >
> > I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> > the comments I got so far. 0004 patch is a DSA support patch for PoC.
>
> Thanks for the new patchset. This is not a full review, but I have some 
> comments:
>
> 0001 and 0002 look okay on a quick scan -- I will use this as a base for 
> further work that we discussed. However, before I do so I'd like to request 
> another revision regarding the following:
>
> > In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
> > to point its children, and we use rt_node_ptr as either rt_node* or
> > dsa_pointer depending on whether the radix tree is shared or not (ie,
> > by checking radix_tree->dsa == NULL).
>

Thank you for the comments!

> 0004: Looks like a good start, but this patch has a large number of changes 
> like these, making it hard to read:
>
> - if (found && child_p)
> - *child_p = child;
> + if (found && childp_p)
> + *childp_p = childp;
> ...
>   rt_node_inner_32 *new32;
> + rt_node_ptr new32p;
>
>   /* grow node from 4 to 32 */
> - new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
> -  RT_NODE_KIND_32);
> + new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
> + new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
>
> It's difficult to keep in my head what all the variables refer to. I thought 
> a bit about how to split this patch up to make this easier to read. Here's 
> what I came up with:
>
> typedef struct rt_node_ptr
> {
>   uintptr_t encoded;
>   rt_node * decoded;
> }
>
> Note that there is nothing about "dsa or local". That's deliberate. That way, 
> we can use the "encoded" field for a tagged pointer as well, as I hope we can 
> do (at least for local pointers) in the future. So an intermediate patch 
> would have "static inline void" functions  node_ptr_encode() and  
> node_ptr_decode(), which would only copy from one member to another. I 
> suspect that: 1. The actual DSA changes will be *much* smaller and easier to 
> reason about. 2. Experimenting with tagged pointers will be easier.

Good idea. Will try in the next version patch.

>
> Also, quick question: 0004 has a new function rt_node_update_inner() -- is 
> that necessary because of DSA?, or does this ideally belong in 0002? What's 
> the reason for it?

Oh, this was needed at one point while I was initially writing DSA
support, but thinking about it again now, I think we can remove it and
use rt_node_insert_inner() with parent = NULL instead.

>
> Regarding the performance, I've
> > added another boolean argument to bench_seq/shuffle_search(),
> > specifying whether to use the shared radix tree or not. Here are
> > benchmark results in my environment,
>
> > [...]
>
> > In non-shared radix tree cases (the fourth argument is false), I don't
> > see a visible performance degradation. On the other hand, in shared
> > radix tree cases (the fourth argument is true), I see visible overheads
> > because of dsa_get_address().
>
> Thanks, this is useful.
>
> > Please note that the current shared radix tree implementation doesn't
> > support any locking, so it cannot be read while written by someone.
>
> I think at the very least we need a global lock to enforce this.
>
> > Also, only one process can iterate over the shared radix tree. When it
> > comes to parallel vacuum, these don't become restriction as the leader
> > process writes the radix tree while scanning heap and the radix tree
> > is read by multiple processes while vacuuming indexes. And only the
> > leader process can do heap vacuum by iterating the key-value pairs in
> > the radix tree. If we want to use it for other cases too, we would
> > need to support locking, RCU or something.
>
> A useful exercise here is to think about what we'd need to do parallel heap 
> pruning. We don't need to go that far for v16 of course, but what's the 
> simplest thing we can do to make that possible? Other use cases can change to 
> more sophisticated schemes if need be.

For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.

I've attached a draft patch for lazy vacuum integration that can be
applied on top of v8 patches. The patch adds a new module called
TIDStore, an efficient storage for TID backed by radix tree. Lazy
vacuum and parallel vacuum use it instead of a TID array. The patch
al

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-02 Thread John Naylor
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada 
wrote:
>
> I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> the comments I got so far. 0004 patch is a DSA support patch for PoC.

Thanks for the new patchset. This is not a full review, but I have some
comments:

0001 and 0002 look okay on a quick scan -- I will use this as a base for
further work that we discussed. However, before I do so I'd like to request
another revision regarding the following:

> In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
> to point its children, and we use rt_node_ptr as either rt_node* or
> dsa_pointer depending on whether the radix tree is shared or not (ie,
> by checking radix_tree->dsa == NULL).

0004: Looks like a good start, but this patch has a large number of changes
like these, making it hard to read:

- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
...
  rt_node_inner_32 *new32;
+ rt_node_ptr new32p;

  /* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
-  RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);

It's difficult to keep in my head what all the variables refer to. I
thought a bit about how to split this patch up to make this easier to read.
Here's what I came up with:

typedef struct rt_node_ptr
{
  uintptr_t encoded;
  rt_node * decoded;
}

Note that there is nothing about "dsa or local". That's deliberate. That
way, we can use the "encoded" field for a tagged pointer as well, as I hope
we can do (at least for local pointers) in the future. So an intermediate
patch would have "static inline void" functions  node_ptr_encode() and
 node_ptr_decode(), which would only copy from one member to another. I
suspect that: 1. The actual DSA changes will be *much* smaller and easier
to reason about. 2. Experimenting with tagged pointers will be easier.
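
Sketched out, that intermediate refactoring might look like this (rt_node
is stubbed here for illustration; in the real patch it is the existing
struct in radixtree.c):

```c
#include <stdint.h>

/* stub standing in for the real node type */
typedef struct rt_node
{
	uint16_t	count;
} rt_node;

typedef struct rt_node_ptr
{
	uintptr_t	encoded;	/* later: dsa_pointer or tagged pointer */
	rt_node    *decoded;	/* always a usable local pointer */
} rt_node_ptr;

/* In the refactoring-only patch these just copy one member to the other;
 * DSA support or pointer tagging would then change only these two
 * functions, not every call site. */
static inline rt_node_ptr
node_ptr_encode(rt_node *node)
{
	rt_node_ptr p;

	p.encoded = (uintptr_t) node;
	p.decoded = node;
	return p;
}

static inline rt_node *
node_ptr_decode(rt_node_ptr p)
{
	return (rt_node *) p.encoded;
}
```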

Also, quick question: 0004 has a new function rt_node_update_inner() -- is
that necessary because of DSA?, or does this ideally belong in 0002? What's
the reason for it?

Regarding the performance, I've
> added another boolean argument to bench_seq/shuffle_search(),
> specifying whether to use the shared radix tree or not. Here are
> benchmark results in my environment,

> [...]

> In non-shared radix tree cases (the fourth argument is false), I don't
> see a visible performance degradation. On the other hand, in shared
> radix tree cases (the fourth argument is true), I see visible overheads
> because of dsa_get_address().

Thanks, this is useful.

> Please note that the current shared radix tree implementation doesn't
> support any locking, so it cannot be read while written by someone.

I think at the very least we need a global lock to enforce this.

> Also, only one process can iterate over the shared radix tree. When it
> comes to parallel vacuum, these don't become restriction as the leader
> process writes the radix tree while scanning heap and the radix tree
> is read by multiple processes while vacuuming indexes. And only the
> leader process can do heap vacuum by iterating the key-value pairs in
> the radix tree. If we want to use it for other cases too, we would
> need to support locking, RCU or something.

A useful exercise here is to think about what we'd need to do parallel heap
pruning. We don't need to go that far for v16 of course, but what's the
simplest thing we can do to make that possible? Other use cases can change
to more sophisticated schemes if need be.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread John Naylor
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada 
wrote:
>
> True. I'm going to start with 6 bytes and will consider reducing it to
> 5 bytes.

Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As
discussed, I will attempt the size class decoupling after v8 and see how it
goes.

> Encoding the kind in a pointer tag could be tricky given DSA

If it turns out to be unworkable, that's life. If it's just tricky, that
can certainly be put off for future work. I hope to at least test it out
with local memory.

> support so currently I'm thinking to pack the node kind and node
> capacity classes to uint8.

That won't work, if we need 128 for capacity, leaving no bits left. I want
the capacity to be a number we can directly compare with the count (we
won't ever need to store 256 because that node will never grow). Also,
further to my last message, we need to access the kind quickly, without
more cycles.
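
To spell out the arithmetic: four kinds need 2 bits, and a capacity value
of 128 needs 8 bits on its own, so the two can't share a uint8. A
throwaway helper makes that concrete:

```c
/* bits required to represent values 0..maxval; e.g. a capacity of 128
 * alone consumes all 8 bits of a uint8 */
static int
bits_needed(unsigned int maxval)
{
	int			n = 0;

	while (maxval > 0)
	{
		n++;
		maxval >>= 1;
	}
	return n;
}
```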

> I've made some progress on investigating DSA support. I've written
> draft patch for that and regression tests passed. I'll share it as a
> separate patch for discussion with v8 radix tree patch.

Great!

> While implementing DSA support, I realized that we may not need to use
> pointer tagging to distinguish between backend-local address or
> dsa_pointer. In order to get a backend-local address from dsa_pointer,
> we need to pass dsa_area like:

I was not clear -- when I see how much code changes to accommodate DSA
pointers, I imagine I will pretty much know the places that would be
affected by tagging the pointer with the node kind.

Speaking of tests, there is currently no Meson support, but tests pass
because this library is not used anywhere in the backend yet, and
apparently the CI Meson builds don't know to run the regression test? That
will need to be done too. However, it's okay to keep the benchmarking
module in autoconf, since it won't be committed.

> > +static inline void
> > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > + uint8 *dst_chunks, rt_node **dst_children, int count)
> > +{
> > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > +}
> >
> > gcc generates better code with something like this (but not hard-coded)
at the top:
> >
> > if (count > 4)
> > pg_unreachable();

Actually it just now occurred to me there's a bigger issue here: *We* know
this code can only get here iff count==4, so why doesn't the compiler know
that? I believe it boils down to

static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {

In the assembly, I see it checks if there is room in the node by doing a
runtime lookup in this array, which is not constant. This might not be
important just yet, because I want to base the check on the proposed node
capacity instead, but I mention it as a reminder to us to make sure we take
all opportunities for the compiler to propagate constants.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread Masahiko Sawada
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
 wrote:
>
> On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada  
> wrote:
>
> > I've attached updated PoC patches for discussion and cfbot. From the
> > previous version, I mainly changed the following things:
> >

Thank you for the comments!

> > * Separate treatment of inner and leaf nodes
>
> Overall, this looks much better!
>
> > * Pack both the node kind and node count into a uint16 value.
>
> For this, I did mention a bitfield earlier as something we "could" do, but it 
> wasn't clear we should. After looking again at the node types, I must not 
> have thought through this at all. Storing one byte instead of four for the 
> full enum is a good step, but saving one more byte usually doesn't buy 
> anything because of padding, with a few exceptions like this example:
>
> node4:   4 +  4   +  4*8 =   40
> node4:   5 +  4+(7)   +  4*8 =   48 bytes
>
> Even there, I'd rather not spend the extra cycles to access the members. And 
> with my idea of decoupling size classes from kind, the variable-sized kinds 
> will require another byte to store "capacity". Then, even if the kind gets 
> encoded in a pointer tag, we'll still have 5 bytes in the base type. So I 
> think we should assume 5 bytes from the start. (Might be 6 temporarily if I 
> work on size decoupling first).

True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support, so currently I'm thinking of packing the node kind and node
capacity class into a uint8.

>
> (Side note, if you have occasion to use bitfields again in the future, C99 
> has syntactic support for them, so no need to write your own shifting/masking 
> code).

Thanks!

>
> > I've not done SIMD part seriously yet. But overall the performance
> > seems good so far. If we agree with the current approach, I think we
> > can proceed with the verification of decoupling node sizes from node
> > kind. And I'll investigate DSA support.
>
> Sounds good. I have some additional comments about v7, and after these are 
> addressed, we can proceed independently with the above two items. Seeing the 
> DSA work will also inform me how invasive pointer tagging will be. There will 
> still be some performance tuning and cosmetic work, but it's getting closer.
>

I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.

While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:

node = dsa_get_address(tree->dsa, node_dp);

As shown above, the dsa area used by the shared radix tree is stored
in the radix_tree struct, so we can tell whether the radix tree is
shared by checking (tree->dsa == NULL). That is, if it's shared we
treat the stored value as a dsa_pointer, and if not we treat it as a
backend-local pointer. We don't need to encode anything in the pointer.

> -
> 0001:
>
> +#ifndef USE_NO_SIMD
> +#include "port/pg_bitutils.h"
> +#endif
>
> Leftover from an earlier version?
>
> +static inline int vector8_find(const Vector8 v, const uint8 c);
> +static inline int vector8_find_ge(const Vector8 v, const uint8 c);
>
> Leftovers, causing compiler warnings. (Also see new variable shadow warning)

Will fix.

>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
> +
> + return r;
> +#endif
>
> As I mentioned a couple versions ago, this style is really awkward, and 
> potential non-SIMD callers will be better off writing their own byte-wise 
> loop rather than using this API. Especially since the "min" function exists 
> only as a workaround for lack of unsigned comparison in (at least) SSE2. 
> There is one existing function in this file with that idiom for non-assert 
> code (for completeness), but even there, inputs of current interest to us use 
> the uint64 algorithm.

Agreed. Will remove non-SIMD code.

>
> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << 
> sizeof(Vector8));
>
> Hmm?

It's my outdated memo, will remove.

>
> +/*
> + * Return index of the first element in chunks in the given node that is 
> greater
> + * than or equal to 'key'.  Return -1 if there is no such element.
> + */
> +static inline int
> +node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
>
> The caller must now have logic for inserting at the end:
>
> + int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
> + int16 count = NODE_GET_COUNT(n32);
> +
> + if (insertpos < 0)
+ insertpos = count; /* insert to the tail */

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread John Naylor
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada 
wrote:

> I've attached updated PoC patches for discussion and cfbot. From the
> previous version, I mainly changed the following things:
>
> * Separate treatment of inner and leaf nodes

Overall, this looks much better!

> * Pack both the node kind and node count into a uint16 value.

For this, I did mention a bitfield earlier as something we "could" do, but
it wasn't clear we should. After looking again at the node types, I must
not have thought through this at all. Storing one byte instead of four for
the full enum is a good step, but saving one more byte usually doesn't buy
anything because of padding, with a few exceptions like this example:

node4:   4 +  4   +  4*8 =   40
node4:   5 +  4+(7)   +  4*8 =   48 bytes

Even there, I'd rather not spend the extra cycles to access the members.
And with my idea of decoupling size classes from kind, the variable-sized
kinds will require another byte to store "capacity". Then, even if the kind
gets encoded in a pointer tag, we'll still have 5 bytes in the base type.
So I think we should assume 5 bytes from the start. (Might be 6 temporarily
if I work on size decoupling first).

(Side note, if you have occasion to use bitfields again in the future, C99
has syntactic support for them, so no need to write your own
shifting/masking code).

> I've not done SIMD part seriously yet. But overall the performance
> seems good so far. If we agree with the current approach, I think we
> can proceed with the verification of decoupling node sizes from node
> kind. And I'll investigate DSA support.

Sounds good. I have some additional comments about v7, and after these are
addressed, we can proceed independently with the above two items. Seeing
the DSA work will also inform me how invasive pointer tagging will be.
There will still be some performance tuning and cosmetic work, but it's
getting closer.

-
0001:

+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif

Leftover from an earlier version?

+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);

Leftovers, causing compiler warnings. (Also see new variable shadow warning)

+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif

As I mentioned a couple versions ago, this style is really awkward, and
potential non-SIMD callers will be better off writing their own byte-wise
loop rather than using this API. Especially since the "min" function exists
only as a workaround for lack of unsigned comparison in (at least) SSE2.
There is one existing function in this file with that idiom for non-assert
code (for completeness), but even there, inputs of current interest to us
use the uint64 algorithm.

0002:

+ /* XXX: should not to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) <<
sizeof(Vector8));

Hmm?

+/*
+ * Return index of the first element in chunks in the given node that is
greater
+ * than or equal to 'key'.  Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)

The caller must now have logic for inserting at the end:

+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */

It would be a bit more clear if node_*_search_ge() always returns the
position we need (see the prototype for example). In fact, these functions
are probably better named node*_get_insertpos().

+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */

We want all the node-growing code to be pushed down to the bottom so that
all branches of the hot path are close together. This provides better
locality for the CPU frontend. Looking at the assembly, the above doesn't
have the desired effect, so we need to write like this (also see prototype):

if (unlikely( ! has-free-slot))
  grow-node;
else
{
  ...;
  break;
}
/* FALLTHROUGH */

+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+   rt_node*child;
+
+   if (NODE_IS_LEAF(node))
+ break;
+
+   if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+   Assert(child);
+
+   parent = node;
+   node = child;
+   shift -= RT_NODE_SPAN;
+ }

Note that if we have to call rt_node_add_new_child(), each successive loop
iteration must search it and find nothing there (the prototype had a
separate function to handle this). Maybe it's not that critical yet, but
it's something to keep in mind as we proceed. Maybe add a comment about it
to remind us.

+ /* there is no key to delete */
+ if (!rt_node_search_

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-14 Thread Masahiko Sawada
Hi,

On Mon, Oct 10, 2022 at 2:16 PM John Naylor
 wrote:
>
> The following is not quite a full review, but has plenty to think about. 
> There is too much to cover at once, and I have to start somewhere...
>
> My main concerns are that internal APIs:
>
> 1. are difficult to follow
> 2. lead to poor branch prediction and too many function calls
>
> Some of the measurements are picking on the SIMD search code, but I go into 
> details in order to demonstrate how a regression there can go completely 
> unnoticed. Hopefully the broader themes are informative.
>
> On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada  wrote:
> > [fixed benchmarks]
>
> Thanks for that! Now I can show clear results on some aspects in a simple 
> way. The attached patches (apply on top of v6) are not intended to be 
> incorporated as-is quite yet, but do point the way to some reorganization 
> that I think is necessary. I've done some testing on loading, but will leave 
> it out for now in the interest of length.
>
>
> 0001-0003 are your performance test fix and some small conveniences for 
> testing. Binary search is turned off, for example, because we know it 
> already. And the sleep call is so I can run perf in a different shell 
> session, on only the search portion.
>
> Note the v6 test loads all block numbers in the range. Since the test item 
> ids are all below 64 (reasonable), there are always 32 leaf chunks, so all 
> the leaves are node32 and completely full. This had the effect of never 
> taking the byte-wise loop in the proposed pg_lsearch function. These two 
> aspects make this an easy case for the branch predictor:
>
> john=# select * from bench_seq_search(0, 1*1000*1000);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |167 | 
> 0 |  822 |   0
>
>  1,470,141,841  branches:u
> 63,693  branch-misses:u   #0.00% of all branches
>
> john=# select * from bench_shuffle_search(0, 1*1000*1000);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
> NOTICE:  sleeping for 2 seconds...
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |168 | 
> 0 | 2174 |   0
>
>  1,470,142,569  branches:u
> 15,023,983  branch-misses:u   #1.02% of all branches
>
>
> 0004 randomizes block selection in the load part of the search test so that 
> each block has a 50% chance of being loaded.  Note that now we have many 
> node16s where we had none before. Although node 16 and node32 appear to share 
> the same path in the switch statement of rt_node_search(), the chunk 
> comparison and node_get_values() calls each must go through different 
> branches. The shuffle case is most affected, but even the sequential case 
> slows down. (The leaves are less full -> there are more of them, so memory 
> use is larger, but it shouldn't matter much, in the sequential case at least)
>
> john=# select * from bench_seq_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, 
> n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms 
> | rt_search_ms | array_serach_ms
> +--+-++---+--+-
>  999654 | 14893056 |   179937720 |173 | 0 
> |  907 |   0
>
>  1,684,114,926  branches:u
>  1,989,901  branch-misses:u   #0.12% of all branches
>
> john=# select * from bench_shuffle_search(0, 2*1000*1000);
> NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, 
> n128 = 1, n256 = 245
> NOTICE:  sleeping for 2 seconds...
>  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms 
> | rt_search_ms | array_serach_ms
> +--+-++---+--+-
>  999654 | 14893056 |   179937720 |173 | 0 
> | 2890 |   0
>
>  1,684,115,844  branches:u
> 34,215,740  branch-misses:u   #2.03% of all branches
>
>
> 0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains 
> full portab

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-09 Thread John Naylor
On Mon, Oct 10, 2022 at 12:16 PM John Naylor 
wrote:
> Thanks for that! Now I can show clear results on some aspects in a simple
way. The attached patches (apply on top of v6)

Forgot the patchset...

--
John Naylor
EDB: http://www.enterprisedb.com


radix-v6-addendum-jcn1.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-09 Thread John Naylor
The following is not quite a full review, but has plenty to think about.
There is too much to cover at once, and I have to start somewhere...

My main concerns are that internal APIs:

1. are difficult to follow
2. lead to poor branch prediction and too many function calls

Some of the measurements are picking on the SIMD search code, but I go into
details in order to demonstrate how a regression there can go completely
unnoticed. Hopefully the broader themes are informative.

On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada 
wrote:
> [fixed benchmarks]

Thanks for that! Now I can show clear results on some aspects in a simple
way. The attached patches (apply on top of v6) are not intended to be
incorporated as-is quite yet, but do point the way to some reorganization
that I think is necessary. I've done some testing on loading, but will
leave it out for now in the interest of length.


0001-0003 are your performance test fix and some small conveniences for
testing. Binary search is turned off, for example, because we know it
already. And the sleep call is so I can run perf in a different shell
session, on only the search portion.

Note the v6 test loads all block numbers in the range. Since the test item
ids are all below 64 (reasonable), there are always 32 leaf chunks, so all
the leaves are node32 and completely full. This had the effect of never
taking the byte-wise loop in the proposed pg_lsearch function. These two
aspects make this an easy case for the branch predictor:

john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
-+--+-++---+--+-
 100 | 10199040 |   18000 |167 |
  0 |  822 |   0

 1,470,141,841  branches:u

63,693  branch-misses:u   #0.00% of all
branches

john=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
NOTICE:  sleeping for 2 seconds...
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
-+--+-++---+--+-
 100 | 10199040 |   18000 |168 |
  0 | 2174 |   0

 1,470,142,569  branches:u

15,023,983  branch-misses:u   #1.02% of all branches


0004 randomizes block selection in the load part of the search test so that
each block has a 50% chance of being loaded.  Note that now we have many
node16s where we had none before. Although node 16 and node32 appear to
share the same path in the switch statement of rt_node_search(), the chunk
comparison and node_get_values() calls each must go through different
branches. The shuffle case is most affected, but even the sequential case
slows down. (The leaves are less full -> there are more of them, so memory
use is larger, but it shouldn't matter much, in the sequential case at
least)

john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
+--+-++---+--+-
 999654 | 14893056 |   179937720 |173 |
0 |  907 |   0

 1,684,114,926  branches:u

 1,989,901  branch-misses:u   #0.12% of all branches

john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE:  sleeping for 2 seconds...
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
+--+-++---+--+-
 999654 | 14893056 |   179937720 |173 |
0 | 2890 |   0

 1,684,115,844  branches:u

34,215,740  branch-misses:u   #2.03% of all branches


0005 replaces pg_lsearch with a branch-free SIMD search. Note that it
retains full portability and gains predictable performance. For
demonstration, it's used on all three linear-search types. Although I'm
sure it'd be way too slow for node4, this benchmark hardly has any so it's
ok.

john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE:  num_keys = 999654, height = 2, n4 = 1, n16 = 356

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-07 Thread Masahiko Sawada
On Fri, Oct 7, 2022 at 2:29 PM John Naylor  wrote:
>
> On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada  wrote:
> > In addition to two patches, I've attached the third patch. It's not
> > part of radix tree implementation but introduces a contrib module
> > bench_radix_tree, a tool for radix tree performance benchmarking. It
> > measures loading and lookup performance of both the radix tree and a
> > flat array.
>
> Hi Masahiko, I've been using these benchmarks, along with my own variations, 
> to try various things that I've mentioned. I'm long overdue for an update, 
> but the picture is not yet complete.

Thanks!

> For now, I have two questions that I can't figure out on my own:
>
> 1. There seems to be some non-obvious limit on the number of keys that are 
> loaded (or at least what the numbers report). This is independent of the 
> number of tids per block. Example below:
>
> john=# select * from bench_shuffle_search(0, 8*1000*1000);
> NOTICE:  num_keys = 800, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 
> 25, n256 = 981
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  800 |268435456 |4800 |661 |
> 29 |  276 | 389
>
> john=# select * from bench_shuffle_search(0, 9*1000*1000);
> NOTICE:  num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 
> 262144, n256 = 1028
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  8388608 |276824064 |5400 |718 |
> 33 |  311 | 446
>
> The array is the right size, but nkeys hasn't kept pace. Can you reproduce 
> this? Attached is the patch I'm using to show the stats when running the 
> test. (Side note: The numbers look unfavorable for radix tree because I'm 
> using 1 tid per block here.)

Yes, I can reproduce this. In tid_to_key_off() we need to cast to
uint64 when packing offset number and block number:

   tid_i = ItemPointerGetOffsetNumber(tid);
   tid_i |= ItemPointerGetBlockNumber(tid) << shift;

>
> 2. I found that bench_shuffle_search() is much *faster* for traditional 
> binary search on an array than bench_seq_search(). I've found this to be true 
> in every case. This seems counterintuitive to me -- any idea why this is? 
> Example:
>
> john=# select * from bench_seq_search(0, 100);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |168 |   
> 106 |  827 |3348
>
> john=# select * from bench_shuffle_search(0, 100);
> NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 
> 1, n256 = 122
>   nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | 
> array_load_ms | rt_search_ms | array_serach_ms
> -+--+-++---+--+-
>  100 | 10199040 |   18000 |171 |   
> 107 |  827 |1400
>

Ugh, in shuffle_itemptrs(), we shuffled itemptrs instead of itemptr:

for (int i = 0; i < nitems - 1; i++)
{
    int j = shuffle_randrange(&state, i, nitems - 1);
    ItemPointerData t = itemptrs[j];

    itemptrs[j] = itemptrs[i];
    itemptrs[i] = t;
}

With the fix, the results on my environment were:

postgres(1:4093192)=# select * from bench_seq_search(0, 1000);
2022-10-07 16:57:03.124 JST [4093192] LOG:  num_keys = 1000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
  nkeys   | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--+--+-++---+--+-
 1000 |101826560 |  18 |846 |
 486 | 6096 |   21128
(1 row)

Time: 28975.566 ms (00:28.976)
postgres(1:4093192)=# select * from bench_shuffle_search(0, 1000);
2022-10-07 16:57:37.476 JST [4093192] LOG:  num_keys = 1000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
  nkeys   | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--+--+-++---+--+-
 1000 |101826

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-06 Thread John Naylor
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada 
wrote:
> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.

Hi Masahiko, I've been using these benchmarks, along with my own
variations, to try various things that I've mentioned. I'm long overdue for
an update, but the picture is not yet complete.

For now, I have two questions that I can't figure out on my own:

1. There seems to be some non-obvious limit on the number of keys that are
loaded (or at least what the numbers report). This is independent of the
number of tids per block. Example below:

john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE:  num_keys = 800, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 =
25, n256 = 981
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
-+--+-++---+--+-
 800 |268435456 |4800 |661 |
 29 |  276 | 389

john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE:  num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 =
262144, n256 = 1028
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
-+--+-++---+--+-
 8388608 |276824064 |5400 |718 |
 33 |  311 | 446

The array is the right size, but nkeys hasn't kept pace. Can you reproduce
this? Attached is the patch I'm using to show the stats when running the
test. (Side note: The numbers look unfavorable for radix tree because I'm
using 1 tid per block here.)

2. I found that bench_shuffle_search() is much *faster* for traditional
binary search on an array than bench_seq_search(). I've found this to be
true in every case. This seems counterintuitive to me -- any idea why this
is? Example:

john=# select * from bench_seq_search(0, 100);
NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
-+--+-++---+--+-
 100 | 10199040 |   18000 |168 |
106 |  827 |3348

john=# select * from bench_shuffle_search(0, 100);
NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
-+--+-++---+--+-
 100 | 10199040 |   18000 |171 |
107 |  827 |1400

--
John Naylor
EDB: http://www.enterprisedb.com
From 43a50a385930ee340d0a3b003910c704a0ff342c Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 6 Oct 2022 09:07:41 +0700
Subject: [PATCH v65 1/5] Turn on per-node counts in benchmark

Also add gitignore, fix whitespace, and change to NOTICE
---
 contrib/bench_radix_tree/.gitignore | 3 +++
 contrib/bench_radix_tree/bench_radix_tree.c | 5 +
 src/backend/lib/radixtree.c | 2 +-
 src/include/lib/radixtree.h | 2 +-
 4 files changed, 10 insertions(+), 2 deletions(-)
 create mode 100644 contrib/bench_radix_tree/.gitignore

diff --git a/contrib/bench_radix_tree/.gitignore b/contrib/bench_radix_tree/.gitignore
new file mode 100644
index 00..8830f5460d
--- /dev/null
+++ b/contrib/bench_radix_tree/.gitignore
@@ -0,0 +1,3 @@
+*data
+log/*
+results/*
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..36c5218ae7 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "lib/radixtree.h"
+#include 
 #include "miscadmin.h"
 #include "utils/timestamp.h"
 
@@ -183,6 +184,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
 	TimestampDifference(start_time, end_time, &secs, &usecs);
 	rt_load_ms = secs * 1000 + usecs / 1000;
 
+	rt_stats(rt);
+
 	/* measure the load time of the array */
 	itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
 	  sizeof(ItemPointerData) * ntids);
@@ -292,6 +295,8 @@ bench_load_random_int(PG_FUNCTION_ARGS)
 	TimestampDifference(start_time, end_time, &secs, &usecs);
 	load_time_ms = secs * 1000 + usecs / 1000;
 
+	rt_stats(rt);
+
 	MemSet(nulls, false, sizeof(nulls))

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-06 Thread John Naylor
On Thu, Oct 6, 2022 at 2:53 PM Masahiko Sawada 
wrote:
>
> On Wed, Oct 5, 2022 at 6:40 PM John Naylor 
wrote:
> >
> > This wasn't the focus of your current email, but while experimenting
with v6 I had another thought about local allocation: If we use the default
slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit,
right? If so, since aset and DSA also waste at least a few hundred bytes,
we could store a useless 256-byte slot array within node256. That way,
node128 and node256 share the same start of pointers/values array, so there
would be one less branch for getting that address. In v6,
rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses
a jump table for 5 kinds but not for 4), but possibly should be, and the
smaller the better.
>
> It would be good for performance, but I'm a bit concerned that it's
> highly optimized to the design of aset and DSA. Since size 2088 is
> currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
> we introduce a new class of 2304 (= 2048 + 256) bytes, we could no
> longer store the useless 256-byte slot array, and the assumption
> would be broken.

A new DSA class is hypothetical. A better argument against my idea is that
SLAB_DEFAULT_BLOCK_SIZE is arbitrary. FWIW, I looked at the prototype just
now and the slab block sizes are:

Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024)

...which would be 128kB for nodemax. I'm curious about the difference.

> > One concern is, handling both local and dsa cases in the same code
requires more (predictable) branches and reduces code density. That might
be a reason in favor of templating to handle each case in its own
translation unit.
>
> Right. We also need to support locking for shared radix tree, which
> would require more branches.

Hmm, now it seems we'll likely want to template local vs. shared as a later
step...

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-06 Thread Masahiko Sawada
On Wed, Oct 5, 2022 at 6:40 PM John Naylor  wrote:
>
>
> On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada  wrote:
> >
> > On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> > >  wrote:
> > > Yeah, node31 and node256 are bloated.  We probably could use slab for
> > > node256 independently. It's worth trying a benchmark to see how it
> > > affects the performance and the tree size.
>
> This wasn't the focus of your current email, but while experimenting with v6 
> I had another thought about local allocation: If we use the default slab 
> block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If 
> so, since aset and DSA also waste at least a few hundred bytes, we could 
> store a useless 256-byte slot array within node256. That way, node128 and 
> node256 share the same start of pointers/values array, so there would be one 
> less branch for getting that address. In v6, rt_node_get_values and 
> rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds 
> but not for 4), but possibly should be, and the smaller the better.

It would be good for performance, but I'm a bit concerned that it's
highly optimized to the design of aset and DSA. Since size 2088 is
currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
we introduce a new class of 2304 (= 2048 + 256) bytes, we could no
longer store the useless 256-byte slot array, and the assumption
would be broken.

>
> > Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> > to point to its child nodes, instead of C pointers (i.e., backend-local
> > address). I'm thinking of a straightforward approach as the first
> > step; inner nodes have a union of rt_node* and dsa_pointer and we
> > choose either one based on whether the radix tree is shared or not. We
> > allocate and free the shared memory for individual nodes by
> > dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> > a C pointer from dsa_pointer by using dsa_get_address() while
> > descending the tree. I'm a bit concerned that calling
> > dsa_get_address() for every descent could be performance overhead but
> > I'm going to measure it anyway.
>
> Are dsa pointers aligned the same as pointers to locally allocated memory? 
> Meaning, is the offset portion always a multiple of 4 (or 8)?

I think so.

> It seems that way from a glance, but I can't say for sure. If the lower 2 
> bits of a DSA pointer are never set, we can tag them the same way as a 
> regular pointer. That same technique could help hide the latency of 
> converting the pointer, by the same way it would hide the latency of loading 
> parts of a node into CPU registers.
>
> One concern is, handling both local and dsa cases in the same code requires 
> more (predictable) branches and reduces code density. That might be a reason 
> in favor of templating to handle each case in its own translation unit.

Right. We also need to support locking for shared radix tree, which
would require more branches.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-05 Thread John Naylor
On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada 
wrote:
>
> On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada 
wrote:
> >
> > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> >  wrote:
> > Yeah, node31 and node256 are bloated.  We probably could use slab for
> > node256 independently. It's worth trying a benchmark to see how it
> > affects the performance and the tree size.

This wasn't the focus of your current email, but while experimenting with
v6 I had another thought about local allocation: If we use the default slab
block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right?
If so, since aset and DSA also waste at least a few hundred bytes, we could
store a useless 256-byte slot array within node256. That way, node128 and
node256 share the same start of pointers/values array, so there would be
one less branch for getting that address. In v6, rt_node_get_values and
rt_node_get_children are not inlined (aside: gcc uses a jump table for 5
kinds but not for 4), but possibly should be, and the smaller the better.

> Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> to point to its child nodes, instead of C pointers (i.e., backend-local
> address). I'm thinking of a straightforward approach as the first
> step; inner nodes have a union of rt_node* and dsa_pointer and we
> choose either one based on whether the radix tree is shared or not. We
> allocate and free the shared memory for individual nodes by
> dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> a C pointer from dsa_pointer by using dsa_get_address() while
> descending the tree. I'm a bit concerned that calling
> dsa_get_address() for every descent could be performance overhead but
> I'm going to measure it anyway.

Are dsa pointers aligned the same as pointers to locally allocated memory?
Meaning, is the offset portion always a multiple of 4 (or 8)? It seems that
way from a glance, but I can't say for sure. If the lower 2 bits of a DSA
pointer are never set, we can tag them the same way as a regular pointer.
That same technique could help hide the latency of converting the pointer,
by the same way it would hide the latency of loading parts of a node into
CPU registers.

One concern is, handling both local and dsa cases in the same code requires
more (predictable) branches and reduces code density. That might be a
reason in favor of templating to handle each case in its own translation
unit. But that might be overkill.
--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-04 Thread Masahiko Sawada
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada  wrote:
>
> On Fri, Sep 23, 2022 at 12:11 AM John Naylor
>  wrote:
> >
> >
> > On Thu, Sep 22, 2022 at 11:46 AM John Naylor  
> > wrote:
> > > One thing I want to try soon is storing fewer than 16/32 etc entries, so 
> > > that the whole node fits comfortably inside a power-of-two allocation. 
> > > That would allow us to use aset without wasting space for the smaller 
> > > nodes, which would be faster and possibly would solve the fragmentation 
> > > problem Andres referred to in
> >
> > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
> >
> > While calculating node sizes that fit within a power-of-two size, I noticed 
> > the current base node is a bit wasteful, taking up 8 bytes. The node kind 
> > only has a small number of values, so it doesn't really make sense to use 
> > an enum here in the struct (in fact, Andres' prototype used a uint8 for 
> > node_kind). We could use a bitfield for the count and kind:
> >
> > uint16 -- kind and count bitfield
> > uint8 shift;
> > uint8 chunk;
> >
> > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, 
> > the bitfield can just go back to being count only.
>
> Good point, agreed.
>
> >
> > Here are the v6 node kinds:
> >
> > node4:   8 +   4 +(4)+   4*8 =   48 bytes
> > node16:  8 +  16 +  16*8 =  152
> > node32:  8 +  32 +  32*8 =  296
> > node128: 8 + 256 + 128/8 + 128*8 = 1304
> > node256: 8   + 256/8 + 256*8 = 2088
> >
> > And here are the possible ways we could optimize nodes for space using aset 
> > allocation. Parentheses are padding bytes. Even if my math has mistakes, 
> > the numbers shouldn't be too far off:
> >
> > node3:   4 +   3 +(1)+   3*8 =   32 bytes
> > node6:   4 +   6 +(6)+   6*8 =   64
> > node13:  4 +  13 +(7)+  13*8 =  128
> > node28:  4 +  28 +  28*8 =  256
> > node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
> > node94:  4 + 256 +  96/8 +  94*8 = 1024
> > node220: 4 + 256 + 224/8 + 220*8 = 2048
> > node256: = 4096
> >
> > The main disadvantage is that node256 would balloon in size.
>
> Yeah, node31 and node256 are bloated.  We probably could use slab for
> node256 independently. It's worth trying a benchmark to see how it
> affects the performance and the tree size.
>
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> sizeof(dsa_area_span), 0,   /* special size classes */
> 8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
> 80, 96, 112, 128,   /* 4 classes separated by 16 bytes */
> 160, 192, 224, 256, /* 4 classes separated by 32 bytes */
> 320, 384, 448, 512, /* 4 classes separated by 64 bytes */
> 640, 768, 896, 1024,/* 4 classes separated by 128 bytes */
> 1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
> 2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
> 5456, 6552, 7280, 8192  /* 4 classes separated by ~1024 bytes */
> };
>
> node256 will be classed as 2616, which is still not good.
>
> Anyway, I'll implement DSA support for radix tree.
>

Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-02 Thread Masahiko Sawada
On Mon, Oct 3, 2022 at 2:04 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
> > I've updated the radix tree patch. It's now separated into two patches.
>
> cfbot notices a compiler warning:
> https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446
>
> [11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
> [11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized 
> in this function [-Werror=maybe-uninitialized]
> [11:03:05.343]  1758 |*value_p = *((uint64 *) slot);
> [11:03:05.343]   |   ^~
>

Thanks, I'll fix it in the next version patch.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-02 Thread Andres Freund
Hi,

On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
> I've updated the radix tree patch. It's now separated into two patches.

cfbot notices a compiler warning:
https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446

[11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
[11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in 
this function [-Werror=maybe-uninitialized]
[11:03:05.343]  1758 |*value_p = *((uint64 *) slot);
[11:03:05.343]   |   ^~

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-30 Thread Masahiko Sawada
On Wed, Sep 28, 2022 at 3:18 PM John Naylor
 wrote:
>
> On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada  
> wrote:
>
> > BTW We need to consider not only aset/slab but also DSA since we
> > allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> > the following size classes:
> >
> > static const uint16 dsa_size_classes[] = {
> > [...]
>
> Thanks for that info -- I wasn't familiar with the details of DSA. For the 
> non-parallel case, I plan to at least benchmark using aset because I gather 
> it's the most heavily optimized. I'm thinking that will allow other problem 
> areas to be more prominent. I'll also want to compare total context size 
> compared to slab to see if possibly less fragmentation makes up for other 
> wastage.

Thanks!

>
> Along those lines, one thing I've been thinking about is the number of size 
> classes. There is a tradeoff between memory efficiency and number of branches 
> when searching/inserting. My current thinking is there is too much coupling 
> between size class and data type. Each size class currently uses a different 
> data type and a different algorithm to search and set it, which in turn 
> requires another branch. We've found that a larger number of size classes 
> leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers, and 
> keep the rest of the control data in the struct the same. That way, we never 
> have more than 4 actual "kinds" to code and branch on. As a bonus, when 
> migrating a node to a larger size class of the same kind, we can simply 
> repalloc() to the next size.

Interesting idea. Using flexible array members for values would be
good also for the case in the future where we want to support other
value types than uint64.

With this idea, we can just repalloc() to grow to the larger size class
within a pair, but I'm slightly concerned that the more size classes we
use, the more frequently nodes need to grow. If we want to support node
shrinking, deletion is affected as well.

> To show what I mean, consider this new table:
>
> node2:   5 +  6   +(5)+  2*8 =   32 bytes
> node6:   5 +  6   +(5)+  6*8 =   64
>
> node12:  5 + 27   + 12*8 =  128
> node27:  5 + 27   + 27*8 =  248(->256)
>
> node91:  5 + 256 + 28 +(7)+ 91*8 = 1024
> node219: 5 + 256 + 28 +(7)+219*8 = 2048
>
> node256: 5 + 32   +(3)+256*8 = 2088(->4096)
>
> Seven size classes are grouped into the four kinds.
>
> The common base at the front is here 5 bytes because there is a new uint8 
> field for "capacity", which we can ignore for node256 since we assume we can 
> always insert/update that node. The control data is the same in each pair, 
> and so the offset to the pointer/value array is the same. Thus, migration 
> would look something like:

I think we can use a bitfield for capacity. That way, we can pack
count (9 bits), kind (2 bits), and capacity (4 bits) into a uint16.

> Somewhat unrelated, we could still implement Andres' idea [1] to dispense 
> with the isset array in inner nodes of the indirect array type (now node128), 
> since we can just test if the pointer is null.

Right. I didn't do that to use the common logic for inner node128 and
leaf node128.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-28 Thread John Naylor
On Wed, Sep 28, 2022 at 1:18 PM John Naylor 
wrote:
> [stuff about size classes]

I kind of buried the lede here on one thing: If we only have 4 kinds
regardless of the number of size classes, we can use 2 bits of the pointer
for dispatch, which would only require 4-byte alignment. That should make
that technique more portable.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-27 Thread John Naylor
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada 
wrote:

> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> [...]

Thanks for that info -- I wasn't familiar with the details of DSA. For the
non-parallel case, I plan to at least benchmark using aset because I gather
it's the most heavily optimized. I'm thinking that will allow other problem
areas to be more prominent. I'll also want to compare total context size
compared to slab to see if possibly less fragmentation makes up for other
wastage.

Along those lines, one thing I've been thinking about is the number of size
classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.

I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size. To show what I mean, consider this new
table:

node2:   5 +  6   +(5)+  2*8 =   32 bytes
node6:   5 +  6   +(5)+  6*8 =   64

node12:  5 + 27   + 12*8 =  128
node27:  5 + 27   + 27*8 =  248(->256)

node91:  5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048

node256: 5 + 32   +(3)+256*8 = 2088(->4096)

Seven size classes are grouped into the four kinds.

The common base at the front is here 5 bytes because there is a new uint8
field for "capacity", which we can ignore for node256 since we assume we
can always insert/update that node. The control data is the same in each
pair, and so the offset to the pointer/value array is the same. Thus,
migration would look something like:

case FOO_KIND:
if (unlikely(count == capacity))
{
  if (capacity == XYZ) /* for smaller size class of the pair */
  {
;
capacity = next-higher-capacity;
goto do_insert;
  }
  else
;
}
else
{
do_insert:
  <...>;
  break;
}
/* FALLTHROUGH */
...

One disadvantage is that this wastes some space by reserving the full set
of control data in the smaller size class of the pair, but it's usually
small compared to array size. Somewhat unrelated, we could still implement
Andres' idea [1] to dispense with the isset array in inner nodes of the
indirect array type (now node128), since we can just test if the pointer is
null.

[1]
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-27 Thread Masahiko Sawada
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
 wrote:
>
>
> On Thu, Sep 22, 2022 at 11:46 AM John Naylor  
> wrote:
> > One thing I want to try soon is storing fewer than 16/32 etc entries, so 
> > that the whole node fits comfortably inside a power-of-two allocation. That 
> > would allow us to use aset without wasting space for the smaller nodes, 
> > which would be faster and possibly would solve the fragmentation problem 
> > Andres referred to in
>
> > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
>
> While calculating node sizes that fit within a power-of-two size, I noticed 
> the current base node is a bit wasteful, taking up 8 bytes. The node kind 
> only has a small number of values, so it doesn't really make sense to use an 
> enum here in the struct (in fact, Andres' prototype used a uint8 for 
> node_kind). We could use a bitfield for the count and kind:
>
> uint16 -- kind and count bitfield
> uint8 shift;
> uint8 chunk;
>
> That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the 
> bitfield can just go back to being count only.

Good point, agreed.

>
> Here are the v6 node kinds:
>
> node4:   8 +   4 +(4)+   4*8 =   48 bytes
> node16:  8 +  16 +  16*8 =  152
> node32:  8 +  32 +  32*8 =  296
> node128: 8 + 256 + 128/8 + 128*8 = 1304
> node256: 8   + 256/8 + 256*8 = 2088
>
> And here are the possible ways we could optimize nodes for space using aset 
> allocation. Parentheses are padding bytes. Even if my math has mistakes, the 
> numbers shouldn't be too far off:
>
> node3:   4 +   3 +(1)+   3*8 =   32 bytes
> node6:   4 +   6 +(6)+   6*8 =   64
> node13:  4 +  13 +(7)+  13*8 =  128
> node28:  4 +  28 +  28*8 =  256
> node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
> node94:  4 + 256 +  96/8 +  94*8 = 1024
> node220: 4 + 256 + 224/8 + 220*8 = 2048
> node256: = 4096
>
> The main disadvantage is that node256 would balloon in size.

Yeah, node31 and node256 are bloated.  We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.

BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:

static const uint16 dsa_size_classes[] = {
sizeof(dsa_area_span), 0,   /* special size classes */
8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
80, 96, 112, 128,   /* 4 classes separated by 16 bytes */
160, 192, 224, 256, /* 4 classes separated by 32 bytes */
320, 384, 448, 512, /* 4 classes separated by 64 bytes */
640, 768, 896, 1024,/* 4 classes separated by 128 bytes */
1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
5456, 6552, 7280, 8192  /* 4 classes separated by ~1024 bytes */
};

node256 will be classed as 2616, which is still not good.

Anyway, I'll implement DSA support for radix tree.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-22 Thread John Naylor
On Thu, Sep 22, 2022 at 11:46 AM John Naylor 
wrote:
> One thing I want to try soon is storing fewer than 16/32 etc entries, so
that the whole node fits comfortably inside a power-of-two allocation. That
would allow us to use aset without wasting space for the smaller nodes,
which would be faster and possibly would solve the fragmentation problem
Andres referred to in

>
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

While calculating node sizes that fit within a power-of-two size, I noticed
the current base node is a bit wasteful, taking up 8 bytes. The node kind
only has a small number of values, so it doesn't really make sense to use
an enum here in the struct (in fact, Andres' prototype used a uint8 for
node_kind). We could use a bitfield for the count and kind:

uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;

That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag,
the bitfield can just go back to being count only.

Here are the v6 node kinds:

node4:   8 +   4 +(4)+   4*8 =   48 bytes
node16:  8 +  16 +  16*8 =  152
node32:  8 +  32 +  32*8 =  296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8   + 256/8 + 256*8 = 2088

And here are the possible ways we could optimize nodes for space using aset
allocation. Parentheses are padding bytes. Even if my math has mistakes,
the numbers shouldn't be too far off:

node3:   4 +   3 +(1)+   3*8 =   32 bytes
node6:   4 +   6 +(6)+   6*8 =   64
node13:  4 +  13 +(7)+  13*8 =  128
node28:  4 +  28 +  28*8 =  256
node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
node94:  4 + 256 +  96/8 +  94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096

The main disadvantage is that node256 would balloon in size.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-22 Thread John Naylor
On Thu, Sep 22, 2022 at 7:52 PM John Naylor 
wrote:
>
>
> On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada 
wrote:
> > Good point. While keeping the chunks in the small nodes in sorted
> > order is useful for visiting all keys in sorted order, additional
> > branches and memmove calls could be slow.
>
> Right, the ordering is a property that some users will need, so best to
keep it. Although the node128 doesn't have that property -- too slow to do
so, I think.

Never mind, I must have been mixing up keys and values there...

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-22 Thread John Naylor
On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada 
wrote:
>
> On Thu, Sep 22, 2022 at 1:46 PM John Naylor
>  wrote:
> > While on the subject, I wonder how important it is to keep the chunks
in the small nodes in sorted order. That adds branches and memmove calls,
and is the whole reason for the recent "pg_lfind_ge" function.
>
> Good point. While keeping the chunks in the small nodes in sorted
> order is useful for visiting all keys in sorted order, additional
> branches and memmove calls could be slow.

Right, the ordering is a property that some users will need, so best to
keep it. Although the node128 doesn't have that property -- too slow to do
so, I think.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-21 Thread Masahiko Sawada
On Thu, Sep 22, 2022 at 1:46 PM John Naylor
 wrote:
>
>
> On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart  
> wrote:
> >
> > On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
> >
> > > In short, this code needs to be lower level so that we still have full
> > > control while being portable. I will work on this, and also the related
> > > code for node dispatch.
> >
> > Is it possible to use approach #2 here, too?  AFAICT space is allocated for
> > all of the chunks, so there wouldn't be any danger in searching all of them
> > and discarding any results >= node->count.
>
> Sure, the caller could pass the maximum node capacity, and then check if the 
> returned index is within the range of the node count.
>
> > Granted, we're depending on the
> > number of chunks always being a multiple of elements-per-vector in order to
> > avoid the tail path, but that seems like a reasonably safe assumption that
> > can be covered with comments.
>
> Actually, we don't need to depend on that at all. When I said "junk" above, 
> that can be any bytes, as long as we're not reading off the end of allocated 
> memory. We'll never do that here, since the child pointers/values follow. In 
> that case, the caller can hard-code the size (it would even happen to work 
> now to multiply rt_node_kind by 16, to be sneaky). One thing I want to try 
> soon is storing fewer than 16/32 etc entries, so that the whole node fits 
> comfortably inside a power-of-two allocation. That would allow us to use aset 
> without wasting space for the smaller nodes, which would be faster and 
> possibly would solve the fragmentation problem Andres referred to in
>
> https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
>
> While on the subject, I wonder how important it is to keep the chunks in the 
> small nodes in sorted order. That adds branches and memmove calls, and is the 
> whole reason for the recent "pg_lfind_ge" function.

Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-21 Thread John Naylor
On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart 
wrote:
>
> On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
>
> > In short, this code needs to be lower level so that we still have full
> > control while being portable. I will work on this, and also the related
> > code for node dispatch.
>
> Is it possible to use approach #2 here, too?  AFAICT space is allocated
for
> all of the chunks, so there wouldn't be any danger in searching all of them
> and discarding any results >= node->count.

Sure, the caller could pass the maximum node capacity, and then check if
the returned index is within the range of the node count.

> Granted, we're depending on the
> number of chunks always being a multiple of elements-per-vector in order
to
> avoid the tail path, but that seems like a reasonably safe assumption that
> can be covered with comments.

Actually, we don't need to depend on that at all. When I said "junk" above,
that can be any bytes, as long as we're not reading off the end of
allocated memory. We'll never do that here, since the child pointers/values
follow. In that case, the caller can hard-code the size (it would even
happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing
I want to try soon is storing fewer than 16/32 etc entries, so that the
whole node fits comfortably inside a power-of-two allocation. That would
allow us to use aset without wasting space for the smaller nodes, which
would be faster and possibly would solve the fragmentation problem Andres
referred to in

https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

While on the subject, I wonder how important it is to keep the chunks in
the small nodes in sorted order. That adds branches and memmove calls, and
is the whole reason for the recent "pg_lfind_ge" function.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-21 Thread Nathan Bossart
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
> In trying to wrap the SIMD code behind layers of abstraction, the latest
> patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
> we need to talk about how vectorized code deals with the "tail" that is too
> small for the register:
> 
> 1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
> 2. Read some junk into the register and mask off false positives from the
> result.
> 
> There are advantages to both depending on the situation.
> 
> Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
> elements or less, it will iterate over them one-by-one exactly like a
> node4. Only when full with 16 will the vector path be taken. When another
> entry is added, the elements are copied to the next bigger node, so there's
> a *small* window where it's fast.
> 
> In short, this code needs to be lower level so that we still have full
> control while being portable. I will work on this, and also the related
> code for node dispatch.

Is it possible to use approach #2 here, too?  AFAICT space is allocated for
all of the chunks, so there wouldn't be any danger in searching all of them
and discarding any results >= node->count.  Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-20 Thread John Naylor
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada 
wrote:
>
> On Fri, Sep 16, 2022 at 4:54 PM John Naylor
>  wrote:

> > Here again, I'd rather put this off and focus on getting the "large
> > details" in good enough shape so we can got towards integrating with
> > vacuum.
>
> Thank you for the comments! These above comments are addressed by
> Nathan in a newly derived thread. I'll work on the patch.

I still seem to be out-voted on when to tackle this particular
optimization, so I've extended the v6 benchmark code with a hackish
function that populates a fixed number of keys, but with different fanouts.
(diff attached as a text file)

I didn't take particular care to make this scientific, but the following
seems pretty reproducible. Note what happens to load and search performance
when node16 has 15 entries versus 16:

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
++--++--
 15 | 327680 |  3776512 | 39 |   20
(1 row)
num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256
= 0

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
++--++--
 16 | 327680 |  3514368 | 25 |   11
(1 row)
num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256
= 0

In trying to wrap the SIMD code behind layers of abstraction, the latest
patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
we need to talk about how vectorized code deals with the "tail" that is too
small for the register:

1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the
result.

There are advantages to both depending on the situation.

Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
elements or less, it will iterate over them one-by-one exactly like a
node4. Only when full with 16 will the vector path be taken. When another
entry is added, the elements are copied to the next bigger node, so there's
a *small* window where it's fast.

In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.

Since v6 has some good infrastructure to do low-level benchmarking, I also
want to do some experiments with memory management.

(I have further comments about the code, but I will put that off until
later)

> I'll consider how to integrate with vacuum as the next step. One
> concern for me is how to limit the memory usage to
> maintenance_work_mem. Unlike using a flat array, memory space for
> adding one TID varies depending on the situation. If we want strictly
> not to allow using memory more than maintenance_work_mem, probably we
> need to estimate the memory consumption in a conservative way.

+1

--
John Naylor
EDB: http://www.enterprisedb.com
commit 18407962e96ccec6c9aeeba97412edd762a5a4fe
Author: John Naylor 
Date:   Wed Sep 21 11:44:43 2022 +0700

Add special benchmark function to test effect of fanout

diff --git a/contrib/bench_radix_tree/Makefile 
b/contrib/bench_radix_tree/Makefile
index b8f70e12d1..952bb0ceae 100644
--- a/contrib/bench_radix_tree/Makefile
+++ b/contrib/bench_radix_tree/Makefile
@@ -7,7 +7,7 @@ OBJS = \
 EXTENSION = bench_radix_tree
 DATA = bench_radix_tree--1.0.sql
 
-REGRESS = bench
+REGRESS = bench_fixed_height
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql 
b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 6663abe6a4..f2fee15b17 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -40,3 +40,15 @@ OUT load_ms int8)
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c 
b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..0778da2d7b 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "lib/radixtree.h"
+#include 
 #include "miscadmin.h"
 #include "utils/timestamp.h"
 
@@ -24,6 +25,7 @@ PG_MODULE_MAGIC;
 PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
 
 static radix_tree *rt = NULL;
 static ItemPointer itemptrs = NULL;
@@ -299,3 +301,108 @@ bench_load_random_int(PG_FUNCTION_ARGS)
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(h

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-20 Thread Masahiko Sawada
On Fri, Sep 16, 2022 at 4:54 PM John Naylor
 wrote:
>
> On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada  wrote:
> >
> > On Mon, Aug 15, 2022 at 10:39 PM John Naylor
> >  wrote:
> > >
> > > bool, both = and <=. Should be pretty close. Also, I believe if you
> > > left this for last as a possible refactoring, it might save some work.
>
> v6 demonstrates why this should have been put off towards the end. (more 
> below)
>
> > > In any case, I'll take a look at the latest patch next month.
>
> Since the CF entry said "Needs Review", I began looking at v5 again
> this week. Hopefully not too much has changed, but in the future I
> strongly recommend setting to "Waiting on Author" if a new version is
> forthcoming. I realize many here share updated patches at any time,
> but I'd like to discourage the practice especially for large patches.

Understood. Sorry for the inconvenience.

>
> > I've updated the radix tree patch. It's now separated into two patches.
> >
> > 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
> > better names) that are similar to the pg_lfind8() family but they
> > return the index of the key in the vector instead of true/false. The
> > patch includes regression tests.
>
> I don't want to do a full review of this just yet, but I'll just point
> out some problems from a quick glance.
>
> +/*
> + * Return the index of the first element in the vector that is greater than
> + * or eual to the given scalar. Return sizeof(Vector8) if there is no such
> + * element.
>
> That's a bizarre API to indicate non-existence.
>
> + *
> + * Note that this function assumes the elements in the vector are sorted.
> + */
>
> That is *completely* unacceptable for a general-purpose function.
>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
>
> I don't think we should try to force the non-simd case to adopt the
> special semantics of vector comparisons. It's much easier to just use
> the same logic as the assert builds.
>
> +#ifdef USE_SSE2
> + return (uint32) _mm_movemask_epi8(v);
> +#elif defined(USE_NEON)
> + static const uint8 mask[16] = {
> +1 << 0, 1 << 1, 1 << 2, 1 << 3,
> +1 << 4, 1 << 5, 1 << 6, 1 << 7,
> +1 << 0, 1 << 1, 1 << 2, 1 << 3,
> +1 << 4, 1 << 5, 1 << 6, 1 << 7,
> +  };
> +
> +uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
> vshrq_n_s8(v, 7));
> +uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
> +
> +return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
>
> For Arm, we need to be careful here. This article goes into a lot of
> detail for this situation:
>
> https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon
>
> Here again, I'd rather put this off and focus on getting the "large
> details" in good enough shape so we can go towards integrating with
> vacuum.

Thank you for the comments! The comments above are addressed by
Nathan in a newly spun-off thread. I'll work on the patch.

I'll consider how to integrate with vacuum as the next step. One
concern for me is how to limit the memory usage to
maintenance_work_mem. Unlike using a flat array, memory space for
adding one TID varies depending on the situation. If we want strictly
not to allow using memory more than maintenance_work_mem, probably we
need to estimate the memory consumption in a conservative way.
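
One possible shape for such a conservative estimate, sketched with entirely hypothetical names: track the bytes the tree has allocated (not just the bytes in use), and only report room for another insertion if the worst-case cost of one insert still fits under the budget. This is a sketch of the accounting idea only, not an API the patch defines:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting state for a TID store */
typedef struct
{
	size_t		allocated;		/* bytes the tree has allocated so far */
	size_t		budget;			/* maintenance_work_mem, in bytes */
	size_t		max_insert_cost;	/* worst-case allocation for one insert,
								 * e.g. growing to the largest node kind */
} tid_store_budget;

/*
 * Conservative check: would the worst-case next insert still fit?
 * Vacuum could switch to an index-cleanup pass when this returns false.
 */
static bool
tid_store_has_room(const tid_store_budget *b)
{
	return b->allocated + b->max_insert_cost <= b->budget;
}
```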


Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-17 Thread Nathan Bossart
On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote:
> Here again, I'd rather put this off and focus on getting the "large
> details" in good enough shape so we can go towards integrating with
> vacuum.

I started a new thread for the SIMD patch [0] so that this thread can
remain focused on the radix tree stuff.

[0] https://www.postgresql.org/message-id/20220917052903.GA3172400%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-16 Thread John Naylor
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada  wrote:
>
> On Mon, Aug 15, 2022 at 10:39 PM John Naylor
>  wrote:
> >
> > bool, both = and <=. Should be pretty close. Also, I believe if you
> > left this for last as a possible refactoring, it might save some work.

v6 demonstrates why this should have been put off towards the end. (more below)

> > In any case, I'll take a look at the latest patch next month.

Since the CF entry said "Needs Review", I began looking at v5 again
this week. Hopefully not too much has changed, but in the future I
strongly recommend setting to "Waiting on Author" if a new version is
forthcoming. I realize many here share updated patches at any time,
but I'd like to discourage the practice especially for large patches.

> I've updated the radix tree patch. It's now separated into two patches.
>
> 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
> better names) that are similar to the pg_lfind8() family but they
> return the index of the key in the vector instead of true/false. The
> patch includes regression tests.

I don't want to do a full review of this just yet, but I'll just point
out some problems from a quick glance.

+/*
+ * Return the index of the first element in the vector that is greater than
+ * or eual to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.

That's a bizarre API to indicate non-existence.

+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */

That is *completely* unacceptable for a general-purpose function.

+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;

I don't think we should try to force the non-simd case to adopt the
special semantics of vector comparisons. It's much easier to just use
the same logic as the assert builds.

+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+1 << 0, 1 << 1, 1 << 2, 1 << 3,
+1 << 4, 1 << 5, 1 << 6, 1 << 7,
+1 << 0, 1 << 1, 1 << 2, 1 << 3,
+1 << 4, 1 << 5, 1 << 6, 1 << 7,
+  };
+
+uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
vshrq_n_s8(v, 7));
+uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));

For Arm, we need to be careful here. This article goes into a lot of
detail for this situation:

https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon

Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can go towards integrating with
vacuum.

> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.

Excellent! This was high on my wish list.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-15 Thread Masahiko Sawada
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
 wrote:
>
> On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada  
> wrote:
> >
> > On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor
> > >  wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada  
> > > > wrote:
> > > >
> > > > > I’d like to keep the first version simple. We can improve it and add
> > > > > more optimizations later. Using radix tree for vacuum TID storage
> > > > > would still be a big win comparing to using a flat array, even without
> > > > > all these optimizations. In terms of single-value leaves method, I'm
> > > > > also concerned about an extra pointer traversal and extra memory
> > > > > allocation. It's most flexible but multi-value leaves method is also
> > > > > flexible enough for many use cases. Using the single-value method
> > > > > seems to be too much as the first step for me.
> > > > >
> > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > > > choice for me as the first step . It can cover wider use cases
> > > > > including vacuum TID use cases. And possibly it can cover use cases by
> > > > > combining a hash table or using tree of tree, for example.
> > > >
> > > > These two aspects would also bring it closer to Andres' prototype, 
> > > > which 1) makes review easier and 2) easier to preserve optimization 
> > > > work already done, so +1 from me.
> > >
> > > Thanks.
> > >
> > > I've updated the patch. It now implements 64-bit keys, 64-bit values,
> > > and the multi-value leaves method. I've tried to remove duplicated
> > > codes but we might find a better way to do that.
> > >
> >
> > With the recent changes related to simd, I'm going to split the patch
> > into at least two parts: introduce other simd optimized functions used
> > by the radix tree and the radix tree implementation. Particularly we
> > need two functions for radix tree: a function like pg_lfind32 but for
> > 8 bits integers and return the index, and a function that returns the
> > index of the first element that is >= key.
>
> I recommend looking at
>
> https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com
>
> since I did the work just now for searching bytes and returning a
> bool, both = and <=. Should be pretty close. Also, I believe if you
> left this for last as a possible refactoring, it might save some work.
> In any case, I'll take a look at the latest patch next month.

I've updated the radix tree patch. It's now separated into two patches.

0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.
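
For reference, the described semantics can be sketched with plain scalar loops. This is just the behavioral contract as described above, not the patch's SIMD implementation; the sorted-input assumption for the "_ge" variant and the -1 not-found sentinel are choices made for this illustration:

```c
#include <stdint.h>

/* Index of the first element equal to 'key', or -1 if absent. */
static int
lsearch8_scalar(uint8_t key, const uint8_t *base, uint32_t nelem)
{
	for (uint32_t i = 0; i < nelem; i++)
		if (base[i] == key)
			return (int) i;
	return -1;
}

/* Index of the first element >= 'key' in an ascending-sorted array, or -1. */
static int
lsearch8_ge_scalar(uint8_t key, const uint8_t *base, uint32_t nelem)
{
	for (uint32_t i = 0; i < nelem; i++)
		if (base[i] >= key)
			return (int) i;
	return -1;
}
```

In a radix tree, the equality variant serves point lookups in a node's chunk array, while the greater-or-equal variant finds the insertion position that keeps the chunks sorted.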

0002 patch is the main radix tree implementation. I've removed some
duplicated codes of node manipulation. For instance, since node-4,
node-16, and node-32 have a similar structure with different fanouts,
I introduced the common function for them.

In addition to two patches, I've attached the third patch. It's not
part of radix tree implementation but introduces a contrib module
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 5d0115b068ecb01d791eab5f8a78a6d25b9cf45c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Wed, 14 Sep 2022 12:38:01 +
Subject: [PATCH v6 1/3] Support pg_lsearch8_eq and pg_lsearch8_ge

---
 src/include/port/pg_lfind.h   |  71 
 src/include/port/simd.h   | 155 +-
 .../test_lfind/expected/test_lfind.out|  12 ++
 .../modules/test_lfind/sql/test_lfind.sql |   2 +
 .../modules/test_lfind/test_lfind--1.0.sql|   8 +
 src/test/modules/test_lfind/test_lfind.c  | 139 
 6 files changed, 378 insertions(+), 9 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 0625cac6b5..583f204763 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,77 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+/*
+ * pg_lsearch8
+ *
+ * Return the index of the element in 'base' that equals to 'key', otherwise return
+ * -1.
+ */
+static inline int
+pg_lsearch8(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		int idx;
+
+		vector8_load(&chunk, &base[i]);
+		if ((idx = vector8_search_eq(chunk, key)) != -1)
+			return i + idx;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nele

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-08-15 Thread John Naylor
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada  wrote:
>
> On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jul 19, 2022 at 1:30 PM John Naylor
> >  wrote:
> > >
> > >
> > >
> > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada  
> > > wrote:
> > >
> > > > I’d like to keep the first version simple. We can improve it and add
> > > > more optimizations later. Using radix tree for vacuum TID storage
> > > > would still be a big win comparing to using a flat array, even without
> > > > all these optimizations. In terms of single-value leaves method, I'm
> > > > also concerned about an extra pointer traversal and extra memory
> > > > allocation. It's most flexible but multi-value leaves method is also
> > > > flexible enough for many use cases. Using the single-value method
> > > > seems to be too much as the first step for me.
> > > >
> > > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > > choice for me as the first step . It can cover wider use cases
> > > > including vacuum TID use cases. And possibly it can cover use cases by
> > > > combining a hash table or using tree of tree, for example.
> > >
> > > These two aspects would also bring it closer to Andres' prototype, which 
> > > 1) makes review easier and 2) easier to preserve optimization work 
> > > already done, so +1 from me.
> >
> > Thanks.
> >
> > I've updated the patch. It now implements 64-bit keys, 64-bit values,
> > and the multi-value leaves method. I've tried to remove duplicated
> > codes but we might find a better way to do that.
> >
>
> With the recent changes related to simd, I'm going to split the patch
> into at least two parts: introduce other simd optimized functions used
> by the radix tree and the radix tree implementation. Particularly we
> need two functions for radix tree: a function like pg_lfind32 but for
> 8 bits integers and return the index, and a function that returns the
> index of the first element that is >= key.

I recommend looking at

https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com

since I did the work just now for searching bytes and returning a
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
In any case, I'll take a look at the latest patch next month.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-08-14 Thread Masahiko Sawada
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada  wrote:
>
> On Tue, Jul 19, 2022 at 1:30 PM John Naylor
>  wrote:
> >
> >
> >
> > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada  
> > wrote:
> >
> > > I’d like to keep the first version simple. We can improve it and add
> > > more optimizations later. Using radix tree for vacuum TID storage
> > > would still be a big win comparing to using a flat array, even without
> > > all these optimizations. In terms of single-value leaves method, I'm
> > > also concerned about an extra pointer traversal and extra memory
> > > allocation. It's most flexible but multi-value leaves method is also
> > > flexible enough for many use cases. Using the single-value method
> > > seems to be too much as the first step for me.
> > >
> > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > choice for me as the first step . It can cover wider use cases
> > > including vacuum TID use cases. And possibly it can cover use cases by
> > > combining a hash table or using tree of tree, for example.
> >
> > These two aspects would also bring it closer to Andres' prototype, which 1) 
> > makes review easier and 2) easier to preserve optimization work already 
> > done, so +1 from me.
>
> Thanks.
>
> I've updated the patch. It now implements 64-bit keys, 64-bit values,
> and the multi-value leaves method. I've tried to remove duplicated
> codes but we might find a better way to do that.
>

With the recent changes related to simd, I'm going to split the patch
into at least two parts: introduce other simd optimized functions used
by the radix tree and the radix tree implementation. Particularly we
need two functions for radix tree: a function like pg_lfind32 but for
8 bits integers and return the index, and a function that returns the
index of the first element that is >= key.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-21 Thread Masahiko Sawada
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
 wrote:
>
>
>
> On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada  wrote:
>
> > I’d like to keep the first version simple. We can improve it and add
> > more optimizations later. Using radix tree for vacuum TID storage
> > would still be a big win comparing to using a flat array, even without
> > all these optimizations. In terms of single-value leaves method, I'm
> > also concerned about an extra pointer traversal and extra memory
> > allocation. It's most flexible but multi-value leaves method is also
> > flexible enough for many use cases. Using the single-value method
> > seems to be too much as the first step for me.
> >
> > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > choice for me as the first step . It can cover wider use cases
> > including vacuum TID use cases. And possibly it can cover use cases by
> > combining a hash table or using tree of tree, for example.
>
> These two aspects would also bring it closer to Andres' prototype, which 1) 
> makes review easier and 2) easier to preserve optimization work already done, 
> so +1 from me.

Thanks.

I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
codes but we might find a better way to do that.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


radixtree_v5.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-18 Thread John Naylor
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada 
wrote:

> I’d like to keep the first version simple. We can improve it and add
> more optimizations later. Using radix tree for vacuum TID storage
> would still be a big win comparing to using a flat array, even without
> all these optimizations. In terms of single-value leaves method, I'm
> also concerned about an extra pointer traversal and extra memory
> allocation. It's most flexible but multi-value leaves method is also
> flexible enough for many use cases. Using the single-value method
> seems to be too much as the first step for me.
>
> Overall, using 64-bit keys and 64-bit values would be a reasonable
> choice for me as the first step . It can cover wider use cases
> including vacuum TID use cases. And possibly it can cover use cases by
> combining a hash table or using tree of tree, for example.

These two aspects would also bring it closer to Andres' prototype, which 1)
makes review easier and 2) easier to preserve optimization work already
done, so +1 from me.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-18 Thread Peter Geoghegan
On Mon, Jul 18, 2022 at 9:10 PM John Naylor
 wrote:
> On Tue, Jul 19, 2022 at 9:24 AM Andres Freund  wrote:
> > FWIW, I think the best path forward would be to do something similar to the
> > simplehash.h approach, so it can be customized to the specific user.
>
> I figured that would come up at some point. It may be worth doing in the 
> future, but I think it's way too much to ask for the first use case.

I have a prototype patch that creates a read-only snapshot of the
visibility map, and has vacuumlazy.c work off of that when determining
which pages to skip. The patch also gets rid of the
SKIP_PAGES_THRESHOLD stuff. This is very effective with TPC-C,
principally because it really cuts down on the number of scanned_pages
that are scanned only because the VM bit is unset concurrently by DML.
The window for this is very large when the table is large (and
naturally takes a long time to scan), resulting in many more "dead but
not yet removable" tuples being encountered than necessary. Which
itself causes bogus information in the FSM -- information about the
space that VACUUM could free from the page, which is often highly
misleading.

There are remaining questions about how to do this properly. Right now
I'm just copying pages from the VM into local memory, right after
OldestXmin is first acquired -- we "lock in" a snapshot of the VM at
the earliest opportunity, which is what lazy_scan_skip() actually
works off now. There needs to be some consideration given to the
resource management aspects of this -- it needs to use memory
sensibly, which the current prototype patch doesn't do at all. I'm
probably going to seriously pursue this as a project soon, and will
probably need some kind of data structure for the local copy. The raw
pages are usually quite space inefficient, considering we only need an
immutable snapshot of the VM.

I wonder if it makes sense to use this as part of this project. It
will be possible to know the exact heap pages that will become
scanned_pages before scanning even one page with this design (perhaps
with caveats about low memory conditions). It could also be very
effective as a way of speeding up TID lookups in the reasonably common
case where most scanned_pages don't have any LP_DEAD items -- just
look it up in our local/materialized copy of the VM first. But even
when LP_DEAD items are spread fairly evenly, it could still give us
reliable information about the distribution of LP_DEAD items very
early on.

Maybe the two data structures could even be combined in some way? You
can use more memory for the local copy of the VM if you know that you
won't need the memory for dead_items. It's kinda the same problem, in
a way.

-- 
Peter Geoghegan




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-18 Thread John Naylor
On Tue, Jul 19, 2022 at 9:24 AM Andres Freund  wrote:
> FWIW, I think the best path forward would be to do something similar to
the
> simplehash.h approach, so it can be customized to the specific user.

I figured that would come up at some point. It may be worth doing in the
future, but I think it's way too much to ask for the first use case.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-18 Thread Andres Freund
Hi,

On 2022-07-08 11:09:44 +0900, Masahiko Sawada wrote:
> I think that at this stage it's better to define the design first. For
> example, key size and value size, and these sizes are fixed or can be
> set the arbitrary size? Given the use case of buffer mapping, we would
> need a wider key to store RelFileNode, ForkNumber, and BlockNumber. On
> the other hand, limiting the key size is 64 bit integer makes the
> logic simple, and possibly it could still be used in buffer mapping
> cases by using a tree of a tree. For value size, if we support
> different value sizes specified by the user, we can either embed
> multiple values in the leaf node (called Multi-value leaves in ART
> paper) or introduce a leaf node that stores one value (called
> Single-value leaves).

FWIW, I think the best path forward would be to do something similar to the
simplehash.h approach, so it can be customized to the specific user.
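
The simplehash.h approach referenced here generates a specialized copy of the data structure per user by token-pasting a caller-supplied prefix into every generated symbol. Below is a toy illustration of that mechanism only; the macro names are invented for this example (simplehash.h itself is driven by defining SH_PREFIX and related parameters before including the header):

```c
#include <stdint.h>

/*
 * Token-paste a per-user prefix into each generated symbol, so several
 * specializations can coexist in one translation unit.
 */
#define TMPL_CONCAT_(a, b)	a##_##b
#define TMPL_CONCAT(a, b)	TMPL_CONCAT_(a, b)

/* A trivial "template": one function parameterized by prefix and type */
#define DEFINE_NEXT(PREFIX, TYPE) \
	static TYPE TMPL_CONCAT(PREFIX, next)(TYPE x) { return x + 1; }

/* Two "instantiations" with different element types */
DEFINE_NEXT(u32, uint32_t)
DEFINE_NEXT(u64, uint64_t)
```

A templated radix tree would do the same at much larger scale: the user defines the prefix, key/value types, and allocation hooks, then includes the header to stamp out a customized implementation.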

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-18 Thread Masahiko Sawada
On Thu, Jul 14, 2022 at 1:17 PM John Naylor
 wrote:
>
> On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada  wrote:
>
> > > > I think that at this stage it's better to define the design first. For
> > > > example, key size and value size, and these sizes are fixed or can be
> > > > set the arbitrary size?
> > >
> > > I don't think we need to start over. Andres' prototype had certain
> > > design decisions built in for the intended use case (although maybe
> > > not clearly documented as such). Subsequent patches in this thread
> > > substantially changed many design aspects. If there were any changes
> > > that made things wonderful for vacuum, it wasn't explained, but Andres
> > > did explain how some of these changes were not good for other uses.
> > > Going to fixed 64-bit keys and values should still allow many future
> > > applications, so let's do that if there's no reason not to.
> >
> > I thought Andres pointed out that given that we store BufferTag (or
> > part of that) into the key, the fixed 64-bit keys might not be enough
> > for buffer mapping use cases. If we want to use wider keys more than
> > 64-bit, we would need to consider it.
>
> It sounds like you've answered your own question, then. If so, I'm
> curious what your current thinking is.
>
> If we *did* want to have maximum flexibility, then "single-value
> leaves" method would be the way to go, since it seems to be the
> easiest way to have variable-length both keys and values. I do have a
> concern that the extra pointer traversal would be a drag on
> performance, and also require lots of small memory allocations.

Agreed.

> I also have some concerns about simultaneously trying to design
> for use with buffer mappings. I certainly want to make this good
> for as many future uses as possible, and I'd really like to preserve
> any optimizations already fought for. However, to make concrete
> progress on the thread subject, I also don't think it's the most
> productive use of time to get tied up about the fine details of
> something that will not likely happen for several years at the
> earliest.

I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using a radix tree for vacuum TID storage
would still be a big win compared to using a flat array, even without
all these optimizations. In terms of the single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's the most flexible, but the multi-value leaves method is
also flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.

Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover more use cases
by combining it with a hash table or using a tree of trees, for example.
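
As one illustration of how vacuum TIDs could fit the 64-bit key / 64-bit value shape, here is a hedged sketch that packs the block number into the upper key bits and uses the value as a bitmap of offsets. This encoding is an assumption made for the example, not a scheme the thread has settled on:

```c
#include <stdint.h>

#define BITS_PER_VALUE 64

/*
 * Example key layout: block number in the upper bits, offset / 64 in
 * the low 3 bits (heap offset numbers stay well under 512 for 8kB
 * pages).  The 64-bit value is a bitmap with bit (offset % 64) set,
 * so up to 64 dead items of one block share a single key/value pair.
 */
static inline uint64_t
tid_to_key(uint32_t blockno, uint16_t offnum, uint64_t *value_bit)
{
	*value_bit = UINT64_C(1) << (offnum % BITS_PER_VALUE);
	return ((uint64_t) blockno << 3) | (offnum / BITS_PER_VALUE);
}
```

With multi-value leaves, consecutive keys of the same block would then land in the same leaf node, keeping the per-TID overhead low for the dense-garbage case.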

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-13 Thread John Naylor
On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada  wrote:

> > > I think that at this stage it's better to define the design first. For
> > > example, key size and value size, and these sizes are fixed or can be
> > > set the arbitrary size?
> >
> > I don't think we need to start over. Andres' prototype had certain
> > design decisions built in for the intended use case (although maybe
> > not clearly documented as such). Subsequent patches in this thread
> > substantially changed many design aspects. If there were any changes
> > that made things wonderful for vacuum, it wasn't explained, but Andres
> > did explain how some of these changes were not good for other uses.
> > Going to fixed 64-bit keys and values should still allow many future
> > applications, so let's do that if there's no reason not to.
>
> I thought Andres pointed out that given that we store BufferTag (or
> part of that) into the key, the fixed 64-bit keys might not be enough
> for buffer mapping use cases. If we want to use wider keys more than
> 64-bit, we would need to consider it.

It sounds like you've answered your own question, then. If so, I'm
curious what your current thinking is.

If we *did* want to have maximum flexibility, then "single-value
leaves" method would be the way to go, since it seems to be the
easiest way to have variable-length both keys and values. I do have a
concern that the extra pointer traversal would be a drag on
performance, and also require lots of small memory allocations. If we
happened to go that route, your idea upthread of using a bitmapset of
item offsets in the leaves sounds like a good fit for that.

I also have some concerns about simultaneously trying to design
for use with buffer mappings. I certainly want to make this good
for as many future uses as possible, and I'd really like to preserve
any optimizations already fought for. However, to make concrete
progress on the thread subject, I also don't think it's the most
productive use of time to get tied up about the fine details of
something that will not likely happen for several years at the
earliest.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-11 Thread Masahiko Sawada
On Fri, Jul 8, 2022 at 3:43 PM John Naylor  wrote:
>
> On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada  wrote:
>
> > I guess that the tree height is affected by where the garbage is, right?
> > For example, even if all garbage in the table is concentrated in
> > 0.5GB, if it lies between block 2^17 and 2^18 we use the first
> > byte of blockhi. If the table is larger than 128GB, the second byte of
> > blockhi could be used depending on where the garbage is.
>
> Right.
>
> > Another variation of how to store TID would be that we use the block
> > number as a key and store a bitmap of the offset as a value. We can
> > use Bitmapset for example,
>
> I like the idea of using existing code to set/check a bitmap if it's
> convenient. But (in case that was implied here) I'd really like to
> stay away from variable-length values, which would require
> "Single-value leaves" (slow). I also think it's fine to treat the
> key/value as just bits, and not care where exactly they came from, as
> we've been talking about.
>
> > or an approach like Roaring bitmap.
>
> This would require two new data structures instead of one. That
> doesn't seem like a path to success.

Agreed.

>
> > I think that at this stage it's better to define the design first. For
> > example, the key size and value size, and whether these sizes are fixed
> > or can be set to an arbitrary size?
>
> I don't think we need to start over. Andres' prototype had certain
> design decisions built in for the intended use case (although maybe
> not clearly documented as such). Subsequent patches in this thread
> substantially changed many design aspects. If there were any changes
> that made things wonderful for vacuum, it wasn't explained, but Andres
> did explain how some of these changes were not good for other uses.
> Going to fixed 64-bit keys and values should still allow many future
> applications, so let's do that if there's no reason not to.

I thought Andres pointed out that given that we store BufferTag (or
part of that) into the key, the fixed 64-bit keys might not be enough
for buffer mapping use cases. If we want to use wider keys more than
64-bit, we would need to consider it.

>
> > For value size, if we support
> > different value sizes specified by the user, we can either embed
> > multiple values in the leaf node (called Multi-value leaves in ART
> > paper)
>
> I don't think "Multi-value leaves" allow for variable-length values,
> FWIW. And now I see I also used this term wrong in my earlier review
> comment -- v3/4 don't actually use "multi-value leaves", but Andres'
> does (going by the multiple leaf types). From the paper: "Multi-value
> leaves: The values are stored in one of four different leaf node
> types, which mirror the structure of inner nodes, but contain values
> instead of pointers."

Right, but sorry, I meant that the user specifies an arbitrary fixed-size
value length at creation time, as we do in dynahash.c.

>
> (It seems v3/v4 could be called a variation of "Combined pointer/value
> slots: If values fit into pointers, no separate node types are
> necessary. Instead, each pointer storage location in an inner node can
> either store a pointer or a value." But without the advantage of
> variable length keys).

Agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-07 Thread John Naylor
On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada  wrote:

> I guess that the tree height is affected by where the garbage is, right?
> For example, even if all garbage in the table is concentrated in
> 0.5GB, if it lies between block 2^17 and 2^18 we use the first
> byte of blockhi. If the table is larger than 128GB, the second byte of
> blockhi could be used depending on where the garbage is.

Right.

> Another variation of how to store TID would be that we use the block
> number as a key and store a bitmap of the offset as a value. We can
> use Bitmapset for example,

I like the idea of using existing code to set/check a bitmap if it's
convenient. But (in case that was implied here) I'd really like to
stay away from variable-length values, which would require
"Single-value leaves" (slow). I also think it's fine to treat the
key/value as just bits, and not care where exactly they came from, as
we've been talking about.

> or an approach like Roaring bitmap.

This would require two new data structures instead of one. That
doesn't seem like a path to success.

> I think that at this stage it's better to define the design first. For
> example, the key size and value size, and whether these sizes are fixed
> or can be set to an arbitrary size?

I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.

> For value size, if we support
> different value sizes specified by the user, we can either embed
> multiple values in the leaf node (called Multi-value leaves in ART
> paper)

I don't think "Multi-value leaves" allow for variable-length values,
FWIW. And now I see I also used this term wrong in my earlier review
comment -- v3/4 don't actually use "multi-value leaves", but Andres'
does (going by the multiple leaf types). From the paper: "Multi-value
leaves: The values are stored in one of four different leaf node
types, which mirror the structure of inner nodes, but contain values
instead of pointers."

(It seems v3/v4 could be called a variation of "Combined pointer/value
slots: If values fit into pointers, no separate node types are
necessary. Instead, each pointer storage location in an inner node can
either store a pointer or a value." But without the advantage of
variable length keys).

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-07 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 5:49 PM John Naylor  wrote:
>
> On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada  wrote:
>
> > > Looking at the node stats, and then your benchmark code, I think key
> > > construction is a major influence, maybe more than node type. The
> > > key/value scheme tested now makes sense:
> > >
> > > blockhi || blocklo || 9 bits of item offset
> > >
> > > (with the leaf nodes containing a bit map of the lowest few bits of
> > > this whole thing)
> > >
> > > We want the lower fanout nodes at the top of the tree and higher
> > > fanout ones at the bottom.
> >
> > So more inner nodes can fit in CPU cache, right?
>
> My thinking is, on average, there will be more dense space utilization
> in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
> cache, since with my idea a search might have to visit more nodes to
> get the common negative result (indexed tid not found in vacuum's
> list).
>
> > > Note some consequences: If the table has enough columns such that much
> > > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > > dense case the nodes above the leaves will have lower fanout (maybe
> > > they will fit in a node32). Also, the bitmap values in the leaves will
> > > be more empty. In other words, many tables in the wild *resemble* the
> > > sparse case a bit, even if truly all tuples on the page are dead.
> > >
> > > Note also that the dense case in the benchmark above has ~4500 times
> > > more keys than the sparse case, and uses about ~1000 times more
> > > memory. But the runtime is only 2-3 times longer. That's interesting
> > > to me.
> > >
> > > To optimize for the sparse case, it seems to me that the key/value would be
> > >
> > > blockhi || 9 bits of item offset || blocklo
> > >
> > > I believe that would make the leaf nodes more dense, with fewer inner
> > > nodes, and could drastically speed up the sparse case, and maybe many
> > > realistic dense cases.
> >
> > Does it have an effect on the number of inner nodes?
> >
> > >  I'm curious to hear your thoughts.
> >
> > Thank you for your analysis. It's worth trying. We use 9 bits for item
> > offset but most pages don't use all bits in practice. So probably it
> > might be better to move the most significant bit of item offset to the
> > left of blockhi. Or more simply:
> >
> > 9 bits of item offset || blockhi || blocklo
>
> A concern here is most tids won't use many bits in blockhi either,
> most often far fewer, so this would make the tree higher, I think.
> Each value of blockhi represents 0.5GB of heap (32TB max). Even with
> very large tables I'm guessing most pages of interest to vacuum are
> concentrated in a few of these 0.5GB "segments".

Right.

I guess that the tree height is affected by where the garbage is, right?
For example, even if all garbage in the table is concentrated in
0.5GB, if it lies between block 2^17 and 2^18 we use the first byte of
blockhi. If the table is larger than 128GB, the second byte of blockhi
could be used depending on where the garbage is.

Another variation for storing TIDs would be to use the block number as
the key and store a bitmap of the offsets as the value. We can use
Bitmapset for example, or an approach like a Roaring bitmap.

I think that at this stage it's better to define the design first. For
example, the key size and value size, and whether these sizes are fixed
or can be set to an arbitrary size. Given the buffer mapping use case,
we would need a wider key to store RelFileNode, ForkNumber, and
BlockNumber. On the other hand, limiting the key to a 64-bit integer
keeps the logic simple, and it could possibly still be used in buffer
mapping cases by using a tree of trees. As for the value size, if we
support different value sizes specified by the user, we can either
embed multiple values in the leaf node (called Multi-value leaves in
the ART paper) or introduce a leaf node that stores one value (called
Single-value leaves).

> And it's possible path compression would change the tradeoffs here.

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-06 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 5:09 PM Andres Freund  wrote:
>
> Hi,
>
> On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
> > On Tue, Jul 5, 2022 at 6:18 AM Andres Freund  wrote:
> > A datum value is convenient to represent both a pointer and a value so
> > I used it to avoid defining node types for inner and leaf nodes
> > separately.
>
> I'm not convinced that's a good goal. I think we're going to want to have
> different key and value types, and trying to unify leaf and inner nodes is
> going to make that impossible.
>
> Consider e.g. using it for something like a buffer mapping table - your key
> might be way too wide to fit it sensibly into 64bit.

Right. It seems better to have an interface so that the user of the
radix tree can specify an arbitrary key size (and perhaps value size
too?) at creation time. And we can have separate leaf node types that
store values instead of pointers. If the value size is no larger than a
pointer, we can store values within the leaf nodes; if it's bigger, the
leaf node can instead hold pointers to memory where the values are
stored.

>
>
> > Since a datum could be 4 bytes or 8 bytes depending on the platform, it
> > might not be good for some platforms.
>
> Right - that's another good reason why it's problematic. A lot of key types
> aren't going to be 4/8 bytes depending on 32/64-bit, but either one or the other.
>
>
> > > > +void
> > > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> > > > +{
> > > > + int shift;
> > > > + bool replaced;
> > > > + radix_tree_node *node;
> > > > + radix_tree_node *parent = tree->root;
> > > > +
> > > > + /* Empty tree, create the root */
> > > > + if (!tree->root)
> > > > + radix_tree_new_root(tree, key, val);
> > > > +
> > > > + /* Extend the tree if necessary */
> > > > + if (key > tree->max_val)
> > > > + radix_tree_extend(tree, key);
> > >
> > > FWIW, the reason I used separate functions for these in the prototype is that
> > > it turns out to generate a lot better code, because it allows non-inlined
> > > function calls to be sibling calls - thereby avoiding the need for a dedicated
> > > stack frame. That's not possible once you need a palloc or such, so splitting
> > > off those call paths into dedicated functions is useful.
> >
> > Thank you for the info. How much does using sibling call optimization
> > help the performance in this case? I think that these two cases are
> > used only a limited number of times: inserting the first key and
> > extending the tree.
>
> It's not that it helps in the cases moved into separate functions - it's that
> not having that code in the "normal" paths keeps the normal path faster.

Thanks, understood.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-05 Thread John Naylor
On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada  wrote:

> > Looking at the node stats, and then your benchmark code, I think key
> > construction is a major influence, maybe more than node type. The
> > key/value scheme tested now makes sense:
> >
> > blockhi || blocklo || 9 bits of item offset
> >
> > (with the leaf nodes containing a bit map of the lowest few bits of
> > this whole thing)
> >
> > We want the lower fanout nodes at the top of the tree and higher
> > fanout ones at the bottom.
>
> So more inner nodes can fit in CPU cache, right?

My thinking is, on average, there will be more dense space utilization
in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
cache, since with my idea a search might have to visit more nodes to
get the common negative result (indexed tid not found in vacuum's
list).

> > Note some consequences: If the table has enough columns such that much
> > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > dense case the nodes above the leaves will have lower fanout (maybe
> > they will fit in a node32). Also, the bitmap values in the leaves will
> > be more empty. In other words, many tables in the wild *resemble* the
> > sparse case a bit, even if truly all tuples on the page are dead.
> >
> > Note also that the dense case in the benchmark above has ~4500 times
> > more keys than the sparse case, and uses about ~1000 times more
> > memory. But the runtime is only 2-3 times longer. That's interesting
> > to me.
> >
> > To optimize for the sparse case, it seems to me that the key/value would be
> >
> > blockhi || 9 bits of item offset || blocklo
> >
> > I believe that would make the leaf nodes more dense, with fewer inner
> > nodes, and could drastically speed up the sparse case, and maybe many
> > realistic dense cases.
>
> Does it have an effect on the number of inner nodes?
>
> >  I'm curious to hear your thoughts.
>
> Thank you for your analysis. It's worth trying. We use 9 bits for item
> offset but most pages don't use all bits in practice. So probably it
> might be better to move the most significant bit of item offset to the
> left of blockhi. Or more simply:
>
> 9 bits of item offset || blockhi || blocklo

A concern here is most tids won't use many bits in blockhi either,
most often far fewer, so this would make the tree higher, I think.
Each value of blockhi represents 0.5GB of heap (32TB max). Even with
very large tables I'm guessing most pages of interest to vacuum are
concentrated in a few of these 0.5GB "segments".

And it's possible path compression would change the tradeoffs here.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-05 Thread Andres Freund
Hi,

On 2022-07-05 16:33:29 +0900, Masahiko Sawada wrote:
> > One thing I was wondering about is trying to choose node types in
> > roughly-power-of-two struct sizes. It's pretty easy to end up with significant
> > fragmentation in the slabs right now when inserting as you go, because some of
> > the smaller node types will be freed but not enough to actually free blocks of
> > memory. If we instead have ~power-of-two sizes we could just use a single slab
> > of the max size, and carve out the smaller node types out of that largest
> > allocation.
> 
> You mean we would manage memory allocation (and freeing) for the smaller
> node types ourselves?

For all of them basically. Using a single slab allocator and then subdividing
the "common block size" into however many chunks that fit into a single node
type.

> How about using a different block size for each node type?

Not following...


Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-05 Thread Andres Freund
Hi,

On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
> On Tue, Jul 5, 2022 at 6:18 AM Andres Freund  wrote:
> A datum value is convenient to represent both a pointer and a value so
> I used it to avoid defining node types for inner and leaf nodes
> separately.

I'm not convinced that's a good goal. I think we're going to want to have
different key and value types, and trying to unify leaf and inner nodes is
going to make that impossible.

Consider e.g. using it for something like a buffer mapping table - your key
might be way too wide to fit it sensibly into 64bit.


> Since a datum could be 4 bytes or 8 bytes depending on the platform, it
> might not be good for some platforms.

Right - that's another good reason why it's problematic. A lot of key types
aren't going to be 4/8 bytes depending on 32/64-bit, but either one or the other.


> > > +void
> > > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> > > +{
> > > + int shift;
> > > + bool replaced;
> > > + radix_tree_node *node;
> > > + radix_tree_node *parent = tree->root;
> > > +
> > > + /* Empty tree, create the root */
> > > + if (!tree->root)
> > > + radix_tree_new_root(tree, key, val);
> > > +
> > > + /* Extend the tree if necessary */
> > > + if (key > tree->max_val)
> > > + radix_tree_extend(tree, key);
> >
> > FWIW, the reason I used separate functions for these in the prototype is that
> > it turns out to generate a lot better code, because it allows non-inlined
> > function calls to be sibling calls - thereby avoiding the need for a dedicated
> > stack frame. That's not possible once you need a palloc or such, so splitting
> > off those call paths into dedicated functions is useful.
> 
> Thank you for the info. How much does using sibling call optimization
> help the performance in this case? I think that these two cases are
> used only a limited number of times: inserting the first key and
> extending the tree.

It's not that it helps in the cases moved into separate functions - it's that
not having that code in the "normal" paths keeps the normal path faster.

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-05 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 7:00 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
> > In both test cases, there is not much difference between using AVX2
> > and SSE2. The more node types, the more time it takes to load the
> > data (see sse2_4_16_32_128_256).
>
> Yea, at some point the compiler starts using a jump table instead of branches,
> and that turns out to be a good bit more expensive. And even with branches, it
> obviously adds hard to predict branches. IIRC I fought a bit with the compiler
> to avoid some of that cost, it's possible that got "lost" in Sawada-san's
> patch.
>
>
> Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
> one is not unimportant until we have path compression.

I wanted to start with a smaller number of node types for simplicity.
A 16-element node type has been added to the v4 patch I submitted[1].
I think it's a trade-off between better memory usage and the overhead
of growing (and shrinking) between node types. I'm going to add more
node types once benchmarks show that doing so is beneficial.

>
> Right now the node struct sizes are:
> 4 - 48 bytes
> 32 - 296 bytes
> 128 - 1304 bytes
> 256 - 2088 bytes
>
> I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
> bytes, but needing that separate isset array somehow is sad :/. I wonder if a
> smaller "free index" would do the trick? Point to the element + 1 where we
> searched last and start a plain loop there. Particularly in an insert-only
> workload that'll always work, and in other cases it'll still often work I
> think.

radix_tree_node_128->isset is used to distinguish between a null
pointer in inner nodes and the value 0 in leaf nodes. So I guess we
could have a flag indicating whether a node is a leaf or an inner node,
so that (Datum) 0 can be interpreted as either a null pointer or 0. Or,
if we define different data types for inner and leaf nodes, we probably
don't need it.


> One thing I was wondering about is trying to choose node types in
> roughly-power-of-two struct sizes. It's pretty easy to end up with significant
> fragmentation in the slabs right now when inserting as you go, because some of
> the smaller node types will be freed but not enough to actually free blocks of
> memory. If we instead have ~power-of-two sizes we could just use a single slab
> of the max size, and carve out the smaller node types out of that largest
> allocation.

You mean we would manage memory allocation (and freeing) for the
smaller node types ourselves?

How about using a different block size for each node type?

>
> Btw, that fragmentation is another reason why I think it's better to track
> memory usage via memory contexts, rather than doing so based on
> GetMemoryChunkSpace().

Agreed.

>
>
> > > Ideally, node16 and node32 would have the same code with a different
> > > loop count (1 or 2). More generally, there is too much duplication of
> > > code (noted by Andres in his PoC), and there are many variable names
> > > with the node size embedded. This is a bit tricky to make more
> > > general, so we don't need to try it yet, but ideally we would have
> > > something similar to:
> > >
> > > switch (node->kind) // todo: inspect tagged pointer
> > > {
> > >   case RADIX_TREE_NODE_KIND_4:
> > >idx = node_search_eq(node, chunk, 4);
> > >do_action(node, idx, 4, ...);
> > >break;
> > >   case RADIX_TREE_NODE_KIND_32:
> > >idx = node_search_eq(node, chunk, 32);
> > >do_action(node, idx, 32, ...);
> > >   ...
> > > }
>
> FWIW, that should be doable with an inline function, if you pass it the memory
> to the "array" rather than the node directly. Not so sure it's a good idea to
> do dispatch between node types / search methods inside the helper, as you
> suggest below:
>
>
> > > static pg_alwaysinline void
> > > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> > > {
> > > if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
> > >   // do simple loop with (node_simple *) node;
> > > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
> > >   // do vectorized loop where available with (node_vec *) node;
> > > ...
> > > }

Yeah, it's worth trying at some point.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-05 Thread Masahiko Sawada
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> > new file mode 100644
> > index 00..bf87f932fd
> > --- /dev/null
> > +++ b/src/backend/lib/radixtree.c
> > @@ -0,0 +1,1763 @@
> > +/*-
> > + *
> > + * radixtree.c
> > + *   Implementation for adaptive radix tree.
> > + *
> > + * This module employs the idea from the paper "The Adaptive Radix Tree:
> > + * ARTful Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper,
> > + * and Thomas Neumann, 2013.
> > + *
> > + * There are some differences from the proposed implementation.  For instance,
> > + * this radix tree module utilizes AVX2 instructions, enabling us to use
> > + * 256-bit-wide SIMD vectors, whereas 128-bit-wide SIMD vectors are used in
> > + * the paper.  Also, there is no support for path compression and lazy path
> > + * expansion.  The radix tree supports a fixed key length, so we don't expect
> > + * the tree level to be high.
>
> I think we're going to need path compression at some point, fwiw. I'd bet on
> it being beneficial even for the tid case.
>
>
> > + * The key is a 64-bit unsigned integer and the value is a Datum.
>
> I don't think it's a good idea to define the value type to be a datum.

A datum value is convenient for representing both a pointer and a
value, so I used it to avoid defining separate node types for inner and
leaf nodes. Since a Datum could be 4 bytes or 8 bytes depending on the
platform, it might not be good for some platforms. But what aspects of
using a Datum do you dislike?

>
>
> > +/*
> > + * As we descend a radix tree, we push the node to the stack. The stack is 
> > used
> > + * at deletion.
> > + */
> > +typedef struct radix_tree_stack_data
> > +{
> > + radix_tree_node *node;
> > + struct radix_tree_stack_data *parent;
> > +} radix_tree_stack_data;
> > +typedef radix_tree_stack_data *radix_tree_stack;
>
> I think it's a very bad idea for traversal to need allocations. I really want
> to eventually use this for shared structures (eventually with lock-free
> searches at least), and needing to do allocations while traversing the tree is
> a no-go for that.
>
> Particularly given that the tree currently has a fixed depth, can't you just
> allocate this on the stack once?

Yes, we can do that.

>
> > +/*
> > + * Allocate a new node with the given node kind.
> > + */
> > +static radix_tree_node *
> > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> > +{
> > + radix_tree_node *newnode;
> > +
> > + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> > +   radix_tree_node_info[kind].size);
> > + newnode->kind = kind;
> > +
> > + /* update the statistics */
> > + tree->mem_used += GetMemoryChunkSpace(newnode);
> > + tree->cnt[kind]++;
> > +
> > + return newnode;
> > +}
>
> Why are you tracking the memory usage at this level of detail? It's *much*
> cheaper to track memory usage via the memory contexts? Since they're dedicated
> for the radix tree, that ought to be sufficient?

Indeed. I'll use MemoryContextMemAllocated instead.

>
>
> > + else if (idx != n4->n.count)
> > + {
> > + /*
> > +  * the key needs to be inserted in the middle of the
> > +  * array, make space for the new key.
> > +  */
> > + memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> > + sizeof(uint8) * (n4->n.count - idx));
> > + memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> > + sizeof(radix_tree_node *) * (n4->n.count - idx));
> > + }
>
> Maybe we could add a static inline helper for these memmoves? Both because
> it's repetitive (for different node types) and because the last time I looked
> gcc was generating quite bad code for this. And having to put workarounds into
> multiple places is obviously worse than having to do it in one place.

Agreed, I'll update it.

>
>
> > +/*
> > + * Insert the key with the val.
> > + *
> > + * found_p is set to true if the key already present, otherwise false, if
> > + * it's not NULL.
> > + *
> > + * XXX: do we need to support update_if_exists behavior?
> > + */
>
> Yes, I think that's needed - hence using bfm_set() instead

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-04 Thread Masahiko Sawada
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada  wrote:
>
> On Tue, Jun 28, 2022 at 10:10 PM John Naylor
>  wrote:
> >
> > On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada  
> > wrote:
> > >
> > > > I
> > > > suspect other optimizations would be worth a lot more than using AVX2:
> > > > - collapsing inner nodes
> > > > - taking care when constructing the key (more on this when we
> > > > integrate with VACUUM)
> > > > ...and a couple Andres mentioned:
> > > > - memory management: in
> > > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > > - node dispatch:
> > > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > > >
> > > > Therefore, I would suggest that we use SSE2 only, because:
> > > > - portability is very easy
> > > > - to avoid a performance hit from indirecting through a function pointer
> > >
> > > Okay, I'll try these optimizations and see if the performance becomes better.
> >
> > FWIW, I think it's fine if we delay these until after committing a
> > good-enough version. The exception is key construction and I think
> > that deserves some attention now (more on this below).
>
> Agreed.
>
> >
> > > I've done benchmark tests while changing the node types. The code base
> > > is v3 patch that doesn't have the optimization you mentioned below
> > > (memory management and node dispatch) but I added the code to use SSE2
> > > for node-16 and node-32.
> >
> > Great, this is helpful to visualize what's going on!
> >
> > > * sse2_4_16_48_256
> > > * nkeys = 9091, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> > > * nkeys = 2, height = 3, n4 = 2, n16 = 0, n48 = 207, n256 = 50
> > >
> > > * sse2_4_32_128_256
> > > * nkeys = 9091, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> > > * nkeys = 2, height = 3, n4 = 2, n32 = 48, n128 = 208, n256 = 1
> >
> > > Observations are:
> > >
> > > In both test cases, there is not much difference between using AVX2
> > > and SSE2. The more node types, the more time it takes to load the
> > > data (see sse2_4_16_32_128_256).
> >
> > Good to know. And as Andres mentioned in his PoC, more node types
> > would be a barrier for pointer tagging, since 32-bit platforms only
> > have two spare bits in the pointer.
> >
> > > In dense case, since most nodes have around 100 children, the radix
> > > tree that has node-128 had a good figure in terms of memory usage. On
> >
> > Looking at the node stats, and then your benchmark code, I think key
> > construction is a major influence, maybe more than node type. The
> > key/value scheme tested now makes sense:
> >
> > blockhi || blocklo || 9 bits of item offset
> >
> > (with the leaf nodes containing a bit map of the lowest few bits of
> > this whole thing)
> >
> > We want the lower fanout nodes at the top of the tree and higher
> > fanout ones at the bottom.
>
> So more inner nodes can fit in CPU cache, right?
>
> >
> > Note some consequences: If the table has enough columns such that much
> > fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> > dense case the nodes above the leaves will have lower fanout (maybe
> > they will fit in a node32). Also, the bitmap values in the leaves will
> > be more empty. In other words, many tables in the wild *resemble* the
> > sparse case a bit, even if truly all tuples on the page are dead.
> >
> > Note also that the dense case in the benchmark above has ~4500 times
> > more keys than the sparse case, and uses about ~1000 times more
> > memory. But the runtime is only 2-3 times longer. That's interesting
> > to me.
> >
> > To optimize for the sparse case, it seems to me that the key/value would be
> >
> > blockhi || 9 bits of item offset || blocklo
> >
> > I believe that would make the leaf nodes more dense, with fewer inner
> > nodes, and could drastically speed up the sparse case, and maybe many
> > realistic dense cases.
>
> Does it have an effect on the number of inner nodes?
>
> >  I'm curious to hear your thoughts.
>
> Thank you for your analysis. It's worth trying. We use 9 bits for item
> offset but most pages don't use all bits in practice. So probably it
> might be better to move the most significant bit of item offset to the
> left of blockhi. Or more simply:
>
> 9 bits of item offset || blockhi || blocklo
>
> >
> > > the other hand, the radix tree that doesn't have node-128 has a better
> > > number in terms of insertion performance. This is probably because we
> > > need to iterate over 'isset' flags from the beginning of the array in
> > > order to find an empty slot when inserting new data. We do the same
> > > thing also for node-48 but it was better than node-128 as it's up to
> > > 48.
> >
> > I mentioned in my diff, but for those following along, I think we can
> > improve that by iterating over the bytes and if it's 0xFF all 8 bits
> > are set already so keep looking...
>
> Right

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-04 Thread Andres Freund
Hi,

On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
> In both test cases, there is not much difference between using AVX2
> and SSE2. The more node types, the more time it takes to load the
> data (see sse2_4_16_32_128_256).

Yea, at some point the compiler starts using a jump table instead of branches,
and that turns out to be a good bit more expensive. And even with branches, it
obviously adds hard to predict branches. IIRC I fought a bit with the compiler
to avoid some of that cost, it's possible that got "lost" in Sawada-san's
patch.


Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
one is not unimportant until we have path compression.

Right now the node struct sizes are:
4 - 48 bytes
32 - 296 bytes
128 - 1304 bytes
256 - 2088 bytes

I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
bytes, but needing that separate isset array somehow is sad :/. I wonder if a
smaller "free index" would do the trick? Point to the element + 1 where we
searched last and start a plain loop there. Particularly in an insert-only
workload that'll always work, and in other cases it'll still often work I
think.


One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.

Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().


> > Ideally, node16 and node32 would have the same code with a different
> > loop count (1 or 2). More generally, there is too much duplication of
> > code (noted by Andres in his PoC), and there are many variable names
> > with the node size embedded. This is a bit tricky to make more
> > general, so we don't need to try it yet, but ideally we would have
> > something similar to:
> >
> > switch (node->kind) // todo: inspect tagged pointer
> > {
> >   case RADIX_TREE_NODE_KIND_4:
> >idx = node_search_eq(node, chunk, 4);
> >do_action(node, idx, 4, ...);
> >break;
> >   case RADIX_TREE_NODE_KIND_32:
> >idx = node_search_eq(node, chunk, 32);
> >do_action(node, idx, 32, ...);
> >   ...
> > }

FWIW, that should be doable with an inline function, if you pass it the memory
to the "array" rather than the node directly. Not so sure it's a good idea to
do dispatch between node types / search methods inside the helper, as you
suggest below:


> > static pg_alwaysinline void
> > node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
> > {
> > if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
> >   // do simple loop with (node_simple *) node;
> > else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
> >   // do vectorized loop where available with (node_vec *) node;
> > ...
> > }

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-04 Thread Andres Freund
Hi,

On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> new file mode 100644
> index 00..bf87f932fd
> --- /dev/null
> +++ b/src/backend/lib/radixtree.c
> @@ -0,0 +1,1763 @@
> +/*-
> + *
> + * radixtree.c
> + *   Implementation for adaptive radix tree.
> + *
> + * This module employs the idea from the paper "The Adaptive Radix Tree:
> + * ARTful Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper,
> + * and Thomas Neumann, 2013.
> + *
> + * There are some differences from the proposed implementation.  For
> + * instance, this radix tree module utilizes AVX2 instructions, enabling us
> + * to use 256-bit wide SIMD vectors, whereas 128-bit wide SIMD vectors are
> + * used in the paper.  Also, there is no support for path compression and
> + * lazy path expansion.  The radix tree supports a fixed key length, so we
> + * don't expect the tree to be deep.

I think we're going to need path compression at some point, fwiw. I'd bet on
it being beneficial even for the tid case.


> + * The key is a 64-bit unsigned integer and the value is a Datum.

I don't think it's a good idea to define the value type to be a datum.


> +/*
> + * As we descend a radix tree, we push the node to the stack. The stack is 
> used
> + * at deletion.
> + */
> +typedef struct radix_tree_stack_data
> +{
> + radix_tree_node *node;
> + struct radix_tree_stack_data *parent;
> +} radix_tree_stack_data;
> +typedef radix_tree_stack_data *radix_tree_stack;

I think it's a very bad idea for traversal to need allocations. I really want
to eventually use this for shared structures (eventually with lock-free
searches at least), and needing to do allocations while traversing the tree is
a no-go for that.

Particularly given that the tree currently has a fixed depth, can't you just
allocate this on the stack once?
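
For those following along, a minimal C sketch of what such an allocation-free stack could look like, assuming the fixed maximum depth that a 64-bit key with 8-bit chunks implies (the names and the depth constant are illustrative, not taken from the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * With a fixed 64-bit key and 8-bit chunks, the tree is at most 8 levels
 * deep, so the deletion path can record visited nodes in a small array
 * on the C stack instead of a heap-allocated linked list.
 */
#define RT_MAX_LEVEL 8

typedef struct rt_node rt_node;		/* opaque for this sketch */

typedef struct rt_node_stack
{
	int			depth;
	rt_node	   *nodes[RT_MAX_LEVEL];
} rt_node_stack;

static inline void
rt_stack_push(rt_node_stack *stack, rt_node *node)
{
	assert(stack->depth < RT_MAX_LEVEL);
	stack->nodes[stack->depth++] = node;
}

static inline rt_node *
rt_stack_pop(rt_node_stack *stack)
{
	assert(stack->depth > 0);
	return stack->nodes[--stack->depth];
}
```

Being a plain local array, this also keeps the door open for shared/lock-free use, since traversal then needs no allocator at all.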

> +/*
> + * Allocate a new node with the given node kind.
> + */
> +static radix_tree_node *
> +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> +{
> + radix_tree_node *newnode;
> +
> + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> +  radix_tree_node_info[kind].size);
> + newnode->kind = kind;
> +
> + /* update the statistics */
> + tree->mem_used += GetMemoryChunkSpace(newnode);
> + tree->cnt[kind]++;
> +
> + return newnode;
> +}

Why are you tracking the memory usage at this level of detail? It's *much*
cheaper to track memory usage via the memory contexts? Since they're dedicated
for the radix tree, that ought to be sufficient?


> + else if (idx != n4->n.count)
> + {
> + /*
> +  * the key needs to be inserted in the middle of the
> +  * array, make space for the new key.
> +  */
> + memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> + sizeof(uint8) * (n4->n.count - idx));
> + memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> + sizeof(radix_tree_node *) * (n4->n.count - idx));
> + }

Maybe we could add a static inline helper for these memmoves? Both because
it's repetitive (for different node types) and because the last time I looked
gcc was generating quite bad code for this. And having to put workarounds into
multiple places is obviously worse than having to do it in one place.
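
Such a helper might look roughly like the following; the signature and names are guesses for illustration, not a proposal for the final shape:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Shift chunks[idx .. count-1] and slots[idx .. count-1] one place to
 * the right to open a hole at idx.  A single inline helper keeps the
 * easy-to-get-wrong size arithmetic in one place for every node type,
 * and gives one spot to apply any codegen workaround.
 */
static inline void
chunk_slot_shift_right(uint8_t *chunks, void **slots, int count, int idx)
{
	memmove(&chunks[idx + 1], &chunks[idx], sizeof(uint8_t) * (count - idx));
	memmove(&slots[idx + 1], &slots[idx], sizeof(void *) * (count - idx));
}
```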


> +/*
> + * Insert the key with the val.
> + *
> + * found_p is set to true if the key already present, otherwise false, if
> + * it's not NULL.
> + *
> + * XXX: do we need to support update_if_exists behavior?
> + */

Yes, I think that's needed - hence using bfm_set() instead of insert() in the
prototype.


> +void
> +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> +{
> + int shift;
> + boolreplaced;
> + radix_tree_node *node;
> + radix_tree_node *parent = tree->root;
> +
> + /* Empty tree, create the root */
> + if (!tree->root)
> + radix_tree_new_root(tree, key, val);
> +
> + /* Extend the tree if necessary */
> + if (key > tree->max_val)
> + radix_tree_extend(tree, key);

FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-04 Thread Andres Freund
Hi,

I just noticed that I had a reply forgotten in drafts...

On 2022-05-10 10:51:46 +0900, Masahiko Sawada wrote:
> To move this project forward, I've implemented radix tree
> implementation from scratch while studying Andres's implementation. It
> supports insertion, search, and iteration but not deletion yet. In my
> implementation, I use Datum as the value so internal and leaf nodes
> have the same data structure, simplifying the implementation. The
> iteration on the radix tree returns keys with the value in ascending
> order of the key. The patch has regression tests for radix tree but is
> still in PoC state: many debugging codes are left, SSE2 SIMD
> instructions are not supported, and the -mavx2 flag is hard-coded.

Very cool - thanks for picking this up.

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-07-03 Thread Masahiko Sawada
On Tue, Jun 28, 2022 at 10:10 PM John Naylor
 wrote:
>
> On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada  wrote:
> >
> > > I
> > > suspect other optimizations would be worth a lot more than using AVX2:
> > > - collapsing inner nodes
> > > - taking care when constructing the key (more on this when we
> > > integrate with VACUUM)
> > > ...and a couple Andres mentioned:
> > > - memory management: in
> > > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > > - node dispatch:
> > > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> > >
> > > Therefore, I would suggest that we use SSE2 only, because:
> > > - portability is very easy
> > > - to avoid a performance hit from indirecting through a function pointer
> >
> > Okay, I'll try these optimizations and see if the performance becomes 
> > better.
>
> FWIW, I think it's fine if we delay these until after committing a
> good-enough version. The exception is key construction and I think
> that deserves some attention now (more on this below).

Agreed.

>
> > I've done benchmark tests while changing the node types. The code base
> > is v3 patch that doesn't have the optimization you mentioned below
> > (memory management and node dispatch) but I added the code to use SSE2
> > for node-16 and node-32.
>
> Great, this is helpful to visualize what's going on!
>
> > * sse2_4_16_48_256
> > * nkeys = 9091, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> > * nkeys = 2, height = 3, n4 = 2, n16 = 0, n48 = 207, n256 = 50
> >
> > * sse2_4_32_128_256
> > * nkeys = 9091, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> > * nkeys = 2, height = 3, n4 = 2, n32 = 48, n128 = 208, n256 = 1
>
> > Observations are:
> >
> > In both test cases, there is not much difference between using AVX2
> > and SSE2. The more node types, the more time it takes to load the
> > data (see sse2_4_16_32_128_256).
>
> Good to know. And as Andres mentioned in his PoC, more node types
> would be a barrier for pointer tagging, since 32-bit platforms only
> have two spare bits in the pointer.
>
> > In the dense case, since most nodes have around 100 children, the radix
> > tree that has node-128 had a good figure in terms of memory usage. On
>
> Looking at the node stats, and then your benchmark code, I think key
> construction is a major influence, maybe more than node type. The
> key/value scheme tested now makes sense:
>
> blockhi || blocklo || 9 bits of item offset
>
> (with the leaf nodes containing a bit map of the lowest few bits of
> this whole thing)
>
> We want the lower fanout nodes at the top of the tree and higher
> fanout ones at the bottom.

So more inner nodes can fit in CPU cache, right?

>
> Note some consequences: If the table has enough columns such that much
> fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
> dense case the nodes above the leaves will have lower fanout (maybe
> they will fit in a node32). Also, the bitmap values in the leaves will
> be more empty. In other words, many tables in the wild *resemble* the
> sparse case a bit, even if truly all tuples on the page are dead.
>
> Note also that the dense case in the benchmark above has ~4500 times
> more keys than the sparse case, and uses about ~1000 times more
> memory. But the runtime is only 2-3 times longer. That's interesting
> to me.
>
> To optimize for the sparse case, it seems to me that the key/value would be
>
> blockhi || 9 bits of item offset || blocklo
>
> I believe that would make the leaf nodes more dense, with fewer inner
> nodes, and could drastically speed up the sparse case, and maybe many
> realistic dense cases.

Does it have an effect on the number of inner nodes?

>  I'm curious to hear your thoughts.

Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So probably it
might be better to move the most significant bit of item offset to the
left of blockhi. Or more simply:

9 bits of item offset || blockhi || blocklo
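
To make that layout concrete, here is a hedged C sketch of packing a block number and item offset this way (the function names and use of plain stdint types are illustrative; the real code would use BlockNumber/OffsetNumber):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Proposed key layout: the 9 bits of item offset sit above the whole
 * 32-bit block number, so TIDs that differ only in block number stay
 * close together in key space, which should make leaf nodes denser in
 * the sparse case.
 */
#define OFFSET_BITS 9

static inline uint64_t
tid_to_key(uint32_t blockno, uint16_t offset)
{
	return ((uint64_t) (offset & ((1u << OFFSET_BITS) - 1)) << 32) | blockno;
}

static inline uint32_t
key_to_blockno(uint64_t key)
{
	return (uint32_t) key;
}

static inline uint16_t
key_to_offset(uint64_t key)
{
	return (uint16_t) (key >> 32);
}
```

Note the consequence: all keys with offset 0 sort below all keys with offset 1, regardless of block number, which is exactly what clusters the low-offset entries of a sparse table together.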

>
> > the other hand, the radix tree that doesn't have node-128 has a better
> > number in terms of insertion performance. This is probably because we
> > need to iterate over 'isset' flags from the beginning of the array in
> > order to find an empty slot when inserting new data. We do the same
> > thing also for node-48 but it was better than node-128 as it's up to
> > 48.
>
> I mentioned in my diff, but for those following along, I think we can
> improve that by iterating over the bytes and if it's 0xFF all 8 bits
> are set already so keep looking...

Right. Using 0xFF also makes the code readable so I'll change that.

>
> > In terms of lookup performance, the results vary but I could not find
> > any common pattern that makes the performance better or worse. Getting
> > more statistics such as the number of each node type per tree level
> > might 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-28 Thread John Naylor
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada  wrote:
>
> > I
> > suspect other optimizations would be worth a lot more than using AVX2:
> > - collapsing inner nodes
> > - taking care when constructing the key (more on this when we
> > integrate with VACUUM)
> > ...and a couple Andres mentioned:
> > - memory management: in
> > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > - node dispatch:
> > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> >
> > Therefore, I would suggest that we use SSE2 only, because:
> > - portability is very easy
> > - to avoid a performance hit from indirecting through a function pointer
>
> Okay, I'll try these optimizations and see if the performance becomes better.

FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).

> I've done benchmark tests while changing the node types. The code base
> is v3 patch that doesn't have the optimization you mentioned below
> (memory management and node dispatch) but I added the code to use SSE2
> for node-16 and node-32.

Great, this is helpful to visualize what's going on!

> * sse2_4_16_48_256
> * nkeys = 9091, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> * nkeys = 2, height = 3, n4 = 2, n16 = 0, n48 = 207, n256 = 50
>
> * sse2_4_32_128_256
> * nkeys = 9091, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> * nkeys = 2, height = 3, n4 = 2, n32 = 48, n128 = 208, n256 = 1

> Observations are:
>
> In both test cases, there is not much difference between using AVX2
> and SSE2. The more node types, the more time it takes to load the
> data (see sse2_4_16_32_128_256).

Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.

> In the dense case, since most nodes have around 100 children, the radix
> tree that has node-128 had a good figure in terms of memory usage. On

Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:

blockhi || blocklo || 9 bits of item offset

(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)

We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.

Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.

Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.

To optimize for the sparse case, it seems to me that the key/value would be

blockhi || 9 bits of item offset || blocklo

I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases. I'm curious to hear your thoughts.

> the other hand, the radix tree that doesn't have node-128 has a better
> number in terms of insertion performance. This is probably because we
> need to iterate over 'isset' flags from the beginning of the array in
> order to find an empty slot when inserting new data. We do the same
> thing also for node-48 but it was better than node-128 as it's up to
> 48.

I mentioned in my diff, but for those following along, I think we can
improve that by iterating over the bytes and if it's 0xFF all 8 bits
are set already so keep looking...
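
A rough sketch of that scan, skipping fully-set bytes (sizes and names are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Find the first clear bit in an 'isset' bitmap.  Bytes equal to 0xFF
 * mean all eight slots in that byte are used, so we can skip them
 * without testing individual bits.  Returns -1 if every slot is used.
 */
static int
find_unset_bit(const uint8_t *isset, int nbytes)
{
	for (int i = 0; i < nbytes; i++)
	{
		if (isset[i] == 0xFF)
			continue;			/* all 8 slots used, keep looking */

		for (int bit = 0; bit < 8; bit++)
		{
			if ((isset[i] & (1 << bit)) == 0)
				return i * 8 + bit;
		}
	}
	return -1;					/* node is full */
}
```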

> In terms of lookup performance, the results vary but I could not find
> any common pattern that makes the performance better or worse. Getting
> more statistics such as the number of each node type per tree level
> might help me.

I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-27 Thread Masahiko Sawada
Hi,

On Mon, Jun 27, 2022 at 8:12 PM John Naylor
 wrote:
>
> On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada  wrote:
>
> [v3 patch]
>
> Hi Masahiko,
>
> Since there are new files, and they are pretty large, I've attached
> most specific review comments and questions as a diff rather than in
> the email body. This is not a full review, which will take more time
> -- this is a first pass mostly to aid my understanding, and discuss
> some of the design and performance implications.
>
> I tend to think it's a good idea to avoid most cosmetic review until
> it's close to commit, but I did mention a couple things that might
> enhance readability during review.

Thank you for reviewing the patch!

>
> As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:
>
> > On Thu, Jun 16, 2022 at 4:30 PM John Naylor
> >  wrote:
> > >
> > > For now, though, I'd like to question
> > > why we even need to use 32-byte registers in the first place. For one,
> > > the paper referenced has 16-pointer nodes, but none for 32 (next level
> > > is 48 and uses a different method to find the index of the next
> > > pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> > > of his patch a couple weeks ago I don't recall a reason mentioned for
> > > it.
> >
> > I might be wrong but since AVX2 instruction set is introduced in
> > Haswell microarchitecture in 2013 and the referenced paper is
> > published in the same year, the art didn't use AVX2 instruction set.
>
> Sure, but with a bit of work the same technique could be done on that
> node size with two 16-byte registers.
>
> > 32-pointer nodes are better from a memory perspective as you
> > mentioned. Andres' prototype supports both 16-pointer nodes and
> > 32-pointer nodes (out of 6 node types). This would provide better
> > memory usage but on the other hand, it would also bring overhead of
> > switching the node type.
>
> Right, using more node types provides smaller increments of node size.
> Just changing node type can be better or worse, depending on the
> input.
>
> > Anyway, it's an important design decision to
> > support which size of node to support. It should be done based on
> > experiment results and documented.
>
> Agreed. I would add that in the first step, we want something
> straightforward to read and easy to integrate into our codebase.

Agreed.



> I
> suspect other optimizations would be worth a lot more than using AVX2:
> - collapsing inner nodes
> - taking care when constructing the key (more on this when we
> integrate with VACUUM)
> ...and a couple Andres mentioned:
> - memory management: in
> https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> - node dispatch:
> https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
>
> Therefore, I would suggest that we use SSE2 only, because:
> - portability is very easy
> - to avoid a performance hit from indirecting through a function pointer

Okay, I'll try these optimizations and see if the performance becomes better.

>
> When the PG16 cycle opens, I will work separately on ensuring the
> portability of using SSE2, so you can focus on other aspects.

Thanks!

> I think it would be a good idea to have both node16 and node32 for testing.
> During benchmarking we can delete one or the other and play with the
> other thresholds a bit.

I've done benchmark tests while changing the node types. The code base
is v3 patch that doesn't have the optimization you mentioned below
(memory management and node dispatch) but I added the code to use SSE2
for node-16 and node-32. The 'name' in the below result indicates the
kind of instruction set (AVX2 or SSE2) and the node types used. For
instance, sse2_4_32_48_256 means the radix tree has four node types,
holding 4, 32, 48, and 256 pointers respectively, and uses the SSE2
instruction set.

* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)
select prepare(
100, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);

  name                   size     attach       lookup
 avx2_4_32_128_256       1154 MB  6742.53 ms   47765.63 ms
 avx2_4_32_48_256        1839 MB  4239.35 ms   40528.39 ms
 sse2_4_16_128_256       1154 MB  6994.43 ms   40383.85 ms
 sse2_4_16_32_128_256    1154 MB  7239.35 ms   43542.39 ms
 sse2_4_16_48_256        1839 MB  4404.63 ms   36048.96 ms
 sse2_4_32_128_256       1154 MB  6688.50 ms   44902.64 ms

* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)
select prepare(
1000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);

  

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-27 Thread Andres Freund
Hi,

On 2022-06-28 11:17:42 +0700, John Naylor wrote:
> On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing  wrote:
> >
> > > Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> > > "simple loop", and looping over up to 32 elements is not great
> > > (although possibly okay). We could do binary search, but that has bad
> > > branch prediction.
> >
> > I am not sure that for relevant non-x86 platforms SIMD / vector
> > instructions would not be used (though it would be a good idea to
> > verify)
> 
> By that logic, we can also dispense with intrinsics on x86 because the
> compiler will autovectorize there too (if I understand your claim
> correctly). I'm not quite convinced of that in this case.

Last time I checked (maybe a year ago?) none of the popular compilers could
autovectorize that code pattern.

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-27 Thread John Naylor
On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing  wrote:
>
> > Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> > "simple loop", and looping over up to 32 elements is not great
> > (although possibly okay). We could do binary search, but that has bad
> > branch prediction.
>
> I am not sure that for relevant non-x86 platforms SIMD / vector
> instructions would not be used (though it would be a good idea to
> verify)

By that logic, we can also dispense with intrinsics on x86 because the
compiler will autovectorize there too (if I understand your claim
correctly). I'm not quite convinced of that in this case.

> I would definitely test before assuming binary search is better.

I wasn't very clear in my language, but I did reject binary search as
having bad branch prediction.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-27 Thread Andres Freund
Hi,

On 2022-06-27 18:12:13 +0700, John Naylor wrote:
> Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> "simple loop", and looping over up to 32 elements is not great
> (although possibly okay). We could do binary search, but that has bad
> branch prediction.

I'd be quite quite surprised if binary search were cheaper. Particularly on
less fancy platforms.

- Andres




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-27 Thread Hannu Krosing
> Another thought: for non-x86 platforms, the SIMD nodes degenerate to
> "simple loop", and looping over up to 32 elements is not great
> (although possibly okay). We could do binary search, but that has bad
> branch prediction.

I am not sure that for relevant non-x86 platforms SIMD / vector
instructions would not be used (though it would be a good idea to
verify)
Do you know any modern platforms that do not have SIMD ?

I would definitely test before assuming binary search is better.

Often other approaches, like a counting search over such small vectors,
are much better when the vector fits in cache (or even a cache line) and
you always visit all items, as this completely avoids branch
mispredictions and allows the compiler to vectorize and/or unroll the
loop as needed.
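
A minimal sketch of such a counting search (illustrative only, not code from the patch): every element is visited, and the comparison result is accumulated instead of branched on, so the loop has no data-dependent branches for the predictor to miss.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Counting search over a small sorted array: sum how many elements are
 * below the needle.  The result is the index of the needle if present,
 * or the position where it would be inserted.  No early exit means the
 * compiler is free to unroll and/or vectorize the loop.
 */
static int
count_search(const uint8_t *chunks, int count, uint8_t needle)
{
	int			index = 0;

	for (int i = 0; i < count; i++)
		index += (chunks[i] < needle);	/* branch-free accumulate */
	return index;
}
```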

Cheers
Hannu




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-27 Thread John Naylor
On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada  wrote:

[v3 patch]

Hi Masahiko,

Since there are new files, and they are pretty large, I've attached
most specific review comments and questions as a diff rather than in
the email body. This is not a full review, which will take more time
-- this is a first pass mostly to aid my understanding, and discuss
some of the design and performance implications.

I tend to think it's a good idea to avoid most cosmetic review until
it's close to commit, but I did mention a couple things that might
enhance readability during review.

As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:

> On Thu, Jun 16, 2022 at 4:30 PM John Naylor
>  wrote:
> >
> > For now, though, I'd like to question
> > why we even need to use 32-byte registers in the first place. For one,
> > the paper referenced has 16-pointer nodes, but none for 32 (next level
> > is 48 and uses a different method to find the index of the next
> > pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> > of his patch a couple weeks ago I don't recall a reason mentioned for
> > it.
>
> I might be wrong but since AVX2 instruction set is introduced in
> Haswell microarchitecture in 2013 and the referenced paper is
> published in the same year, the art didn't use AVX2 instruction set.

Sure, but with a bit of work the same technique could be done on that
node size with two 16-byte registers.

> 32-pointer nodes are better from a memory perspective as you
> mentioned. Andres' prototype supports both 16-pointer nodes and
> 32-pointer nodes (out of 6 node types). This would provide better
> memory usage but on the other hand, it would also bring overhead of
> switching the node type.

Right, using more node types provides smaller increments of node size.
Just changing node type can be better or worse, depending on the
input.

> Anyway, it's an important design decision to
> support which size of node to support. It should be done based on
> experiment results and documented.

Agreed. I would add that in the first step, we want something
straightforward to read and easy to integrate into our codebase. I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
- node dispatch:
https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de

Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer

When the PG16 cycle opens, I will work separately on ensuring the
portability of using SSE2, so you can focus on other aspects. I think
it would be a good idea to have both node16 and node32 for testing.
During benchmarking we can delete one or the other and play with the
other thresholds a bit.

Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:

switch (node->kind) // todo: inspect tagged pointer
{
  case RADIX_TREE_NODE_KIND_4:
   idx = node_search_eq(node, chunk, 4);
   do_action(node, idx, 4, ...);
   break;
  case RADIX_TREE_NODE_KIND_32:
   idx = node_search_eq(node, chunk, 32);
   do_action(node, idx, 32, ...);
  ...
}

static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
  // do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
  // do vectorized loop where available with (node_vec *) node;
...
}

...and let the compiler do loop unrolling and branch removal. Not sure
how difficult this is to do, but something to think about.
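
To flesh out the sketch above a bit: because the fanout arrives as a compile-time constant at every call site, after inlining the compiler can drop the dead branch and fully unroll the surviving loop. The threshold value below is a placeholder, and the vectorized path is written as a plain loop to keep the example portable:

```c
#include <assert.h>
#include <stdint.h>

#define SIMPLE_LOOP_THRESHOLD 8		/* placeholder, not from the patch */

/*
 * Search the chunk array of a node for an equal chunk.  'node_fanout'
 * is the node type's capacity (a constant at each call site); 'count'
 * is the number of entries currently in use.  Returns the index of the
 * match, or -1 if not found.
 */
static inline int
node_search_eq(const uint8_t *chunks, int count, uint8_t chunk,
			   int node_fanout)
{
	if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
	{
		/* simple loop: fully unrollable once node_fanout is constant */
		for (int i = 0; i < count; i++)
			if (chunks[i] == chunk)
				return i;
	}
	else
	{
		/* a real build would use a vectorized comparison here */
		for (int i = 0; i < count; i++)
			if (chunks[i] == chunk)
				return i;
	}
	return -1;
}
```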

Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.

-- 
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bf87f932fd..2bb04eba86 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -16,6 +16,11 @@
  *
  * The key is a 64-bit unsigned integer and the value is a Datum. Both internal
  * nodes and leaf nodes have the identical structure. For internal tree nodes,
+It might be worth mentioning:
+- the paper refers to this technique as "Multi-value leaves"
+- we chose it (I assume) for simplicity and to avoid an additional
+  pointer traversal
+- it is the re

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-19 Thread Masahiko Sawada
Hi,

On Thu, Jun 16, 2022 at 4:30 PM John Naylor
 wrote:
>
> On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada  
> wrote:
> > I've attached an updated version patch that changes the configure
> > script. I'm still studying how to support AVX2 on msvc build. Also,
> > added more regression tests.
>
> Thanks for the update, I will take a closer look at the patch in the
> near future, possibly next week.

Thanks!

> For now, though, I'd like to question
> why we even need to use 32-byte registers in the first place. For one,
> the paper referenced has 16-pointer nodes, but none for 32 (next level
> is 48 and uses a different method to find the index of the next
> pointer). Andres' prototype has 32-pointer nodes, but in a quick read
> of his patch a couple weeks ago I don't recall a reason mentioned for
> it.

I might be wrong, but since the AVX2 instruction set was introduced with
the Haswell microarchitecture in 2013, the same year the referenced
paper was published, the paper could not have used AVX2. 32-pointer
nodes are better from a memory perspective, as you mentioned. Andres'
prototype supports both 16-pointer and 32-pointer nodes (out of 6 node
types). This provides better memory usage, but on the other hand it also
brings the overhead of switching between node types. In any case, which
node sizes to support is an important design decision. It should be made
based on experimental results and documented.

> Even if 32-pointer nodes are better from a memory perspective, I
> imagine it should be possible to use two SSE2 registers to find the
> index. It'd be locally slightly more complex, but not much. It might
> not even cost much more in cycles since AVX2 would require indirecting
> through a function pointer. It's much more convenient if we don't need
> a runtime check.

Right.

> There are also thermal and power disadvantages when
> using AVX2 in some workloads. I'm not sure that's the case here, but
> if it is, we'd better be getting something in return.

Good point.

> One more thing in general: In an earlier version, I noticed that
> Andres used the slab allocator and documented why. The last version of
> your patch that I saw had the same allocator, but not the "why".
> Especially in early stages of review, we want to document design
> decisions so it's more clear for the reader.

Indeed. I'll add comments in the next version patch.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-16 Thread Andrew Dunstan


On 2022-06-16 Th 00:56, Masahiko Sawada wrote:
>
> I've attached an updated version patch that changes the configure
> script. I'm still studying how to support AVX2 on msvc build. Also,
> added more regression tests.


I think you would need to add '/arch:AVX2' to the compiler flags in
MSBuildProject.pm.


See



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-16 Thread John Naylor
On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada  wrote:
> I've attached an updated version patch that changes the configure
> script. I'm still studying how to support AVX2 on msvc build. Also,
> added more regression tests.

Thanks for the update, I will take a closer look at the patch in the
near future, possibly next week. For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it. Even if 32-pointer nodes are better from a memory perspective, I
imagine it should be possible to use two SSE2 registers to find the
index. It'd be locally slightly more complex, but not much. It might
not even cost much more in cycles since AVX2 would require indirecting
through a function pointer. It's much more convenient if we don't need
a runtime check. There are also thermal and power disadvantages when
using AVX2 in some workloads. I'm not sure that's the case here, but
if it is, we'd better be getting something in return.

One more thing in general: In an earlier version, I noticed that
Andres used the slab allocator and documented why. The last version of
your patch that I saw had the same allocator, but not the "why".
Especially in early stages of review, we want to document design
decisions so it's more clear for the reader.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-06-15 Thread Masahiko Sawada
On Wed, May 25, 2022 at 11:48 AM Masahiko Sawada  wrote:
>
> On Tue, May 10, 2022 at 6:58 PM John Naylor
>  wrote:
> >
> > On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada  
> > wrote:
> > >
> > > Overall, radix tree implementations have good numbers. Once we got an
> > > agreement on moving in this direction, I'll start a new thread for
> > > that and move the implementation further; there are many things to do
> > > and discuss: deletion, API design, SIMD support, more tests etc.
> >
> > +1
> >
>
> Thanks!
>
> I've attached an updated version patch. It is still WIP but I've
> implemented deletion and improved test cases and comments.

I've attached an updated version patch that changes the configure
script. I'm still studying how to support AVX2 on msvc build. Also,
added more regression tests.

The integration with lazy vacuum and parallel vacuum is missing for
now. In order to support parallel vacuum, we need to have the radix
tree support to be created on DSA.

Added this item to the next CF.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index d3562d6fee..a56d6e89da 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -676,3 +676,27 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_ARMV8_CRC32C_INTRINSICS
+
+# PGAC_AVX2_INTRINSICS
+# 
+# Check if the compiler supports the Intel AVX2 instructions.
+#
+# If the intrinsics are supported, sets pgac_avx2_intrinsics, and CFLAGS_AVX2.
+AC_DEFUN([PGAC_AVX2_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm256_set1_epi8 _mm256_cmpeq_epi8 _mm256_movemask_epi8 CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+  [__m256i vec = _mm256_set1_epi8(0);
+   __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+   return _mm256_movemask_epi8(cmp) > 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_AVX2="$1"
+  pgac_avx2_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX2_INTRINSICS
diff --git a/configure b/configure
index 7dec6b7bf9..6ebc15a8c1 100755
--- a/configure
+++ b/configure
@@ -645,6 +645,7 @@ XGETTEXT
 MSGMERGE
 MSGFMT_FLAGS
 MSGFMT
+CFLAGS_AVX2
 PG_CRC32C_OBJS
 CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
@@ -18829,6 +18830,82 @@ $as_echo "slicing-by-8" >&6; }
 fi
 
 
+# Check for Intel AVX2 intrinsics.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=" >&5
+$as_echo_n "checking for _mm256i CFLAGS=... " >&6; }
+if ${pgac_cv_avx2_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+   __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+   return _mm256_movemask_epi8(cmp) > 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx2_intrinsics_=yes
+else
+  pgac_cv_avx2_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics_" >&5
+$as_echo "$pgac_cv_avx2_intrinsics_" >&6; }
+if test x"$pgac_cv_avx2_intrinsics_" = x"yes"; then
+  CFLAGS_AVX2=""
+  pgac_avx2_intrinsics=yes
+fi
+
+if test x"$pgac_avx2_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=-mavx2" >&5
+$as_echo_n "checking for _mm256i CFLAGS=-mavx2... " >&6; }
+if ${pgac_cv_avx2_intrinsics__mavx2+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx2"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <immintrin.h>
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+   __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+   return _mm256_movemask_epi8(cmp) > 0;
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_avx2_intrinsics__mavx2=yes
+else
+  pgac_cv_avx2_intrinsics__mavx2=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics__mavx2" >&5
+$as_echo "$pgac_cv_avx2_intrinsics__mavx2" >&6; }
+if test x"$pgac_cv_avx2_intrinsics__mavx2" = x"yes"; then
+  CFLAGS_AVX2="-mavx2"
+  pgac_avx2_intrinsics=yes
+fi
+
+fi
+
 
 # Select semaphore implementation type.
 if test "$PORTNAME" != "win32"; then
diff --git a/configure.ac b/configure.ac
index d093fb88dd..6b6d095306 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2300,6 +2300,12 @@ else
 fi
 AC_SUBST(PG_CRC32C_OBJS)
 
+# Check for Intel AVX2 intrinsics.
+PGAC_AVX2_INTRINSICS([])
+if test x"$pgac_avx2_intrinsics" != x"yes"; then
+  PGAC

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-05-24 Thread Masahiko Sawada
On Tue, May 10, 2022 at 6:58 PM John Naylor
 wrote:
>
> On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada  wrote:
> >
> > Overall, radix tree implementations have good numbers. Once we got an
> > agreement on moving in this direction, I'll start a new thread for
> > that and move the implementation further; there are many things to do
> > and discuss: deletion, API design, SIMD support, more tests etc.
>
> +1
>

Thanks!

I've attached an updated version patch. It is still WIP but I've
implemented deletion and improved test cases and comments.

> (FWIW, I think the current thread is still fine.)

Okay, agreed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


radixtree_wip_v2.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-05-10 Thread John Naylor
On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada  wrote:
>
> Overall, radix tree implementations have good numbers. Once we got an
> agreement on moving in this direction, I'll start a new thread for
> that and move the implementation further; there are many things to do
> and discuss: deletion, API design, SIMD support, more tests etc.

+1

(FWIW, I think the current thread is still fine.)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-05-09 Thread Masahiko Sawada
Hi,

On Sun, Feb 13, 2022 at 12:39 PM Andres Freund  wrote:
>
> On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
> > Actually, I'm working on simplifying and improving radix tree
> > implementation for PG16 dev cycle. From the discussion so far I think
> > it's better to have a data structure that can be used for
> > general-purpose and is also good for storing TID, not very specific to
> > store TID. So I think radix tree would be a potent candidate. I have
> > done the insertion and search implementation.
>
> Awesome!

To move this project forward, I've implemented radix tree
implementation from scratch while studying Andres's implementation. It
supports insertion, search, and iteration but not deletion yet. In my
implementation, I use Datum as the value so internal and leaf nodes
have the same data structure, simplifying the implementation. The
iteration on the radix tree returns keys with their values in ascending
order of the key. The patch has regression tests for the radix tree but
is still in a PoC state: a lot of debugging code is left in, SSE2 SIMD
instructions are not yet supported, and the added -mavx2 flag is
hard-coded.

I've measured the size and loading and lookup performance for each
candidate data structure with two test cases: dense and sparse, by
using the test tool[1]. Here are the results:

* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)
select prepare(
100, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within  a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);

name        size     attach        lookup
array       520 MB   248.60 ms     89891.92 ms
hash        3188 MB  28029.59 ms   50850.32 ms
intset      85 MB    644.96 ms     39801.17 ms
tbm         96 MB    474.06 ms     6641.38 ms
radix       37 MB    173.03 ms     9145.97 ms
radix_tree  36 MB    184.51 ms     9729.94 ms

* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)
select prepare(
1000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within  a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);

name        size     attach   lookup
array       125 kB   0.53 ms  82183.61 ms
hash        1032 kB  1.31 ms  28128.33 ms
intset      222 kB   0.51 ms  87775.68 ms
tbm         768 MB   1.24 ms  98674.60 ms
radix       1080 kB  1.66 ms  20698.07 ms
radix_tree  949 kB   1.50 ms  21465.23 ms

Each test virtually generates TIDs and loads them to the data
structure, and then searches for virtual index TIDs.
'array' is a sorted array which is the current method, 'hash' is HTAB,
'intset' is IntegerSet, and 'tbm' is TIDBitmap. The last two results
are radix tree implementations: 'radix' is Andres's radix tree
implementation and 'radix_tree' is my radix tree implementation. In
both radix tree tests, I converted TIDs into an int64 and store the
lower 6 bits in the value part of the radix tree.

Overall, radix tree implementations have good numbers. Once we got an
agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests etc.

Regards,

[1] https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
[2] 
https://www.postgresql.org/message-id/CAFiTN-visUO9VTz2%2Bh224z5QeUjKhKNdSfjaCucPhYJdbzxx0g%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


radixtree.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-02-12 Thread Andres Freund
On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
> Actually, I'm working on simplifying and improving radix tree
> implementation for PG16 dev cycle. From the discussion so far I think
> it's better to have a data structure that can be used for
> general-purpose and is also good for storing TID, not very specific to
> store TID. So I think radix tree would be a potent candidate. I have
> done the insertion and search implementation.

Awesome!




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-02-12 Thread Masahiko Sawada
On Sun, Feb 13, 2022 at 11:02 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote:
> > Today I noticed the inefficiencies of our dead tuple storage once
> > again, and started theorizing about a better storage method; which is
> > when I remembered that this thread exists, and that this thread
> > already has amazing results.
> >
> > Are there any plans to get the results of this thread from PoC to 
> > committable?
>
> I'm not currently planning to work on it personally. It would be awesome if
> somebody did...

Actually, I'm working on simplifying and improving radix tree
implementation for PG16 dev cycle. From the discussion so far I think
it's better to have a data structure that can be used for
general-purpose and is also good for storing TID, not very specific to
store TID. So I think radix tree would be a potent candidate. I have
done the insertion and search implementation.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-02-12 Thread Andres Freund
Hi,

On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote:
> Today I noticed the inefficiencies of our dead tuple storage once
> again, and started theorizing about a better storage method; which is
> when I remembered that this thread exists, and that this thread
> already has amazing results.
> 
> Are there any plans to get the results of this thread from PoC to committable?

I'm not currently planning to work on it personally. It would be awesome if
somebody did...

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-02-11 Thread Matthias van de Meent
Hi,

Today I noticed the inefficiencies of our dead tuple storage once
again, and started theorizing about a better storage method; which is
when I remembered that this thread exists, and that this thread
already has amazing results.

Are there any plans to get the results of this thread from PoC to committable?

Kind regards,

Matthias van de Meent




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-30 Thread Robert Haas
On Fri, Jul 30, 2021 at 3:34 PM Andres Freund  wrote:
> The lower memory usage also often will result in a better cache
> utilization - which is a crucial factor for index vacuuming when the
> index order isn't correlated with the heap order. Cache misses really
> are a crucial performance factor there.

Fair enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 15:13:49 -0400, Robert Haas wrote:
> On Thu, Jul 29, 2021 at 3:14 PM Andres Freund  wrote:
> > I think those advantages are far outstripped by the big disadvantage of
> > needing to either size the array accurately from the start, or to
> > reallocate the whole array.  Our current pre-allocation behaviour is
> > very wasteful for most vacuums but doesn't handle large work_mem at all,
> > causing unnecessary index scans.
> 
> I agree that the current pre-allocation behavior is bad, but I don't
> really see that as an issue with my idea. Fixing that would require
> allocating the array in chunks, but that doesn't really affect the
> core of the idea much, at least as I see it.

Well, then it'd not really be the "simple array approach" anymore :)


> But I accept that Yura has a very good point about the memory usage of
> what I was proposing.

The lower memory usage also often will result in a better cache
utilization - which is a crucial factor for index vacuuming when the
index order isn't correlated with the heap order. Cache misses really
are a crucial performance factor there.

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-30 Thread Robert Haas
On Thu, Jul 29, 2021 at 3:14 PM Andres Freund  wrote:
> I think those advantages are far outstripped by the big disadvantage of
> needing to either size the array accurately from the start, or to
> reallocate the whole array.  Our current pre-allocation behaviour is
> very wasteful for most vacuums but doesn't handle large work_mem at all,
> causing unnecessary index scans.

I agree that the current pre-allocation behavior is bad, but I don't
really see that as an issue with my idea. Fixing that would require
allocating the array in chunks, but that doesn't really affect the
core of the idea much, at least as I see it.

But I accept that Yura has a very good point about the memory usage of
what I was proposing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-29 Thread Andres Freund
Hi,

On 2021-07-29 13:15:53 -0400, Robert Haas wrote:
> I don't know if this is better, but I do kind of like the fact that
> the basic representation is just an array. It makes it really easy to
> predict how much memory will be needed for a given number of dead
> TIDs, and it's very DSM-friendly as well.

I think those advantages are far outstripped by the big disadvantage of
needing to either size the array accurately from the start, or to
reallocate the whole array.  Our current pre-allocation behaviour is
very wasteful for most vacuums but doesn't handle large work_mem at all,
causing unnecessary index scans.

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-29 Thread Yura Sokolov

Robert Haas wrote 2021-07-29 20:15:
> On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada  wrote:
> > Indeed. Given that the radix tree itself has other use cases, I have
> > no concern about using radix tree for vacuum's dead tuples storage. It
> > will be better to have one that can be generally used and has some
> > optimizations that are helpful also for vacuum's use case, rather than
> > having one that is very optimized only for vacuum's use case.
> 
> What I'm about to say might be a really stupid idea, especially since
> I haven't looked at any of the code already posted, but what I'm
> wondering about is whether we need a full radix tree or maybe just a
> radix-like lookup aid. For example, suppose that for a relation <= 8MB
> in size, we create an array of 1024 elements indexed by block number.
> Each element of the array stores an offset into the dead TID array.
> When you need to probe for a TID, you look up blkno and blkno + 1 in
> the array and then bsearch only between those two offsets. For bigger
> relations, a two or three level structure could be built, or it could
> always be 3 levels. This could even be done on demand, so you
> initialize all of the elements to some special value that means "not
> computed yet" and then fill them the first time they're needed,
> perhaps with another special value that means "no TIDs in that block".


8MB relation is not a problem, imo. There is no need to do anything to
handle 8MB relation.

The problem is a 2TB relation. It has 256M pages and, let's suppose, 3G
dead tuples.

Then offset array will be 2GB and tuple offset array will be 6GB (2 byte
offset per tuple). 8GB in total.

We can make offset array only for higher 3 bytes of block number.
We then will have 1M offset array weighted 8MB, and there will be array
of 3byte tuple pointers (1 remaining byte from block number, and 2 bytes
from Tuple) weighted 9GB.

But using per-batch compression schemes, there could be an amortized
4 bytes per page and 1 byte per tuple: 1GB + 3GB = 4GB memory.
Yes, it is not as guaranteed as in the array approach. But 95% of the
time it is this low or even lower. And better: the more tuples are dead,
the better the compression works. A page with all tuples dead could be
encoded in as little as 5 bytes. Therefore, overall memory consumption
is more stable and predictable.

Lower memory consumption of tuple storage means there is less chance
indexes should be scanned twice or more times. It gives more
predictability in user experience.


> I don't know if this is better, but I do kind of like the fact that
> the basic representation is just an array. It makes it really easy to
> predict how much memory will be needed for a given number of dead
> TIDs, and it's very DSM-friendly as well.


The whole thing could be encoded in one single array of bytes. Just give
"pointer-to-array" + "array-size" to the constructor, and use a "bump
allocator" inside. A complex logical structure doesn't imply
"DSM-unfriendliness", I mean, if it is suitably designed.

In fact, my code uses a bump allocator internally to avoid the
"per-allocation overhead" of "aset", "slab" or "generational". And the
IntegerSet2 version even uses it for all allocations since it has no
reallocatable parts.

Well, if a data structure has reallocatable parts, it could be less
friendly to DSM.

regards,

---
Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com



