Re: [PoC] Improve dead tuple storage for lazy vacuum

John Naylor Wed, 26 Oct 2022 04:07:10 -0700

On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <[email protected]>
wrote:


> I've attached updated PoC patches for discussion and cfbot. From the
> previous version, I mainly changed the following things:
>
> * Separate treatment of inner and leaf nodes

Overall, this looks much better!

> * Pack both the node kind and node count to an uint16 value.

For this, I did mention a bitfield earlier as something we "could" do, but
it wasn't clear we should. After looking again at the node types, I must
not have thought through this at all. Storing one byte instead of four for
the full enum is a good step, but saving one more byte usually doesn't buy
anything because of padding, with a few exceptions like this example:

node4:   4 +  4           +  4*8 =   40
node4:   5 +  4+(7)       +  4*8 =   48 bytes

Even there, I'd rather not spend the extra cycles to access the members.
And with my idea of decoupling size classes from kind, the variable-sized
kinds will require another byte to store "capacity". Then, even if the kind
gets encoded in a pointer tag, we'll still have 5 bytes in the base type.
So I think we should assume 5 bytes from the start. (Might be 6 temporarily
if I work on size decoupling first).

(Side note, if you have occasion to use bitfields again in the future, C99
has syntactic support for them, so no need to write your own
shifting/masking code).

> I've not done SIMD part seriously yet. But overall the performance
> seems good so far. If we agree with the current approach, I think we
> can proceed with the verification of decoupling node sizes from node
> kind. And I'll investigate DSA support.

Sounds good. I have some additional comments about v7, and after these are
addressed, we can proceed independently with the above two items. Seeing
the DSA work will also inform me how invasive pointer tagging will be.
There will still be some performance tuning and cosmetic work, but it's
getting closer.

-------------------------
0001:

+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif

Leftover from an earlier version?

+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);

Leftovers, causing compiler warnings. (Also see new variable shadow warning)

+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif

As I mentioned a couple versions ago, this style is really awkward, and
potential non-SIMD callers will be better off writing their own byte-wise
loop rather than using this API. Especially since the "min" function exists
only as a workaround for lack of unsigned comparison in (at least) SSE2.
There is one existing function in this file with that idiom for non-assert
code (for completeness), but even there, inputs of current interest to us
use the uint64 algorithm.

0002:

+ /* XXX: should not to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) <<
sizeof(Vector8));

Hmm?

+/*
+ * Return index of the first element in chunks in the given node that is
greater
+ * than or equal to 'key'.  Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)

The caller must now have logic for inserting at the end:

+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */

It would be a bit more clear if node_*_search_ge() always returns the
position we need (see the prototype for example). In fact, these functions
are probably better named node*_get_insertpos().

+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */

We want all the node-growing code to be pushed down to the bottom so that
all branches of the hot path are close together. This provides better
locality for the CPU frontend. Looking at the assembly, the above doesn't
have the desired effect, so we need to write like this (also see prototype):

if (unlikely( ! has-free-slot))
  grow-node;
else
{
  ...;
  break;
}
/* FALLTHROUGH */

+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+   rt_node    *child;
+
+   if (NODE_IS_LEAF(node))
+     break;
+
+   if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+     child = rt_node_add_new_child(tree, parent, node, key);
+
+   Assert(child);
+
+   parent = node;
+   node = child;
+   shift -= RT_NODE_SPAN;
+ }

Note that if we have to call rt_node_add_new_child(), each successive loop
iteration must search it and find nothing there (the prototype had a
separate function to handle this). Maybe it's not that critical yet, but
something to keep in mind as we proceed. Maybe a comment about it to remind
us.

+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+   return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+  * Delete the key from the leaf node and recursively delete the key in
+  * inner nodes if necessary.
+  */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+   rt_node    *node = stack[level--];
+
+   if (NODE_IS_LEAF(node))
+     rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+   else
+     rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+   /* If the node didn't become empty, we stop deleting the key */
+   if (!NODE_IS_EMPTY(node))
+     break;
+
+   /* The node became empty */
+   rt_free_node(tree, node);
+ }

Here we call rt_node_search_leaf() twice -- once to check for existence,
and once to delete. All three search calls are inlined, so this wastes
space. Let's try to delete the leaf, return if not found, otherwise handle
the leaf bookkeepping and loop over the inner nodes. This might require
some duplication of code.

+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)

Spelling

+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+             uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

gcc generates better code with something like this (but not hard-coded) at
the top:

    if (count > 4)
        pg_unreachable();

This would have to change when we implement shrinking of nodes, but might
still be useful.

+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+   return false;
+
+ return true;

Maybe just "return rt_node_search_leaf(...)" ?

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

Reply via email to