On Mon, Aug 15, 2022 at 10:39 PM John Naylor <john.nay...@enterprisedb.com> wrote: > > On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.m...@gmail.com> > wrote: > > > > On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.m...@gmail.com> > > wrote: > > > > > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor > > > <john.nay...@enterprisedb.com> wrote: > > > > > > > > > > > > > > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.m...@gmail.com> > > > > wrote: > > > > > > > > > I’d like to keep the first version simple. We can improve it and add > > > > > more optimizations later. Using radix tree for vacuum TID storage > > > > > would still be a big win comparing to using a flat array, even without > > > > > all these optimizations. In terms of single-value leaves method, I'm > > > > > also concerned about an extra pointer traversal and extra memory > > > > > allocation. It's most flexible but multi-value leaves method is also > > > > > flexible enough for many use cases. Using the single-value method > > > > > seems to be too much as the first step for me. > > > > > > > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable > > > > > choice for me as the first step . It can cover wider use cases > > > > > including vacuum TID use cases. And possibly it can cover use cases by > > > > > combining a hash table or using tree of tree, for example. > > > > > > > > These two aspects would also bring it closer to Andres' prototype, > > > > which 1) makes review easier and 2) easier to preserve optimization > > > > work already done, so +1 from me. > > > > > > Thanks. > > > > > > I've updated the patch. It now implements 64-bit keys, 64-bit values, > > > and the multi-value leaves method. I've tried to remove duplicated > > > codes but we might find a better way to do that. 
> > > > > > > With the recent changes related to simd, I'm going to split the patch > > into at least two parts: introduce other simd optimized functions used > > by the radix tree and the radix tree implementation. Particularly we > > need two functions for radix tree: a function like pg_lfind32 but for > > 8 bits integers and return the index, and a function that returns the > > index of the first element that is >= key. > > I recommend looking at > > https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com > > since I did the work just now for searching bytes and returning a > bool, buth = and <=. Should be pretty close. Also, i believe if you > left this for last as a possible refactoring, it might save some work. > In any case, I'll take a look at the latest patch next month.
I've updated the radix tree patch. It's now separated into two patches. The 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find better names), which are similar to the pg_lfind8() family but return the index of the key in the vector instead of true/false. The patch includes regression tests. The 0002 patch is the main radix tree implementation. I've removed some duplicated node-manipulation code. For instance, since node-4, node-16, and node-32 have a similar structure with different fanouts, I introduced common functions for them. In addition to these two patches, I've attached a third patch. It's not part of the radix tree implementation but introduces a contrib module, bench_radix_tree, a tool for benchmarking radix tree performance. It measures the loading and lookup performance of both the radix tree and a flat array. Regards, -- Masahiko Sawada PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
From 5d0115b068ecb01d791eab5f8a78a6d25b9cf45c Mon Sep 17 00:00:00 2001 From: Masahiko Sawada <sawada.mshk@gmail.com> Date: Wed, 14 Sep 2022 12:38:01 +0000 Subject: [PATCH v6 1/3] Support pg_lsearch8_eq and pg_lsearch8_ge --- src/include/port/pg_lfind.h | 71 ++++++++ src/include/port/simd.h | 155 +++++++++++++++++- .../test_lfind/expected/test_lfind.out | 12 ++ .../modules/test_lfind/sql/test_lfind.sql | 2 + .../modules/test_lfind/test_lfind--1.0.sql | 8 + src/test/modules/test_lfind/test_lfind.c | 139 ++++++++++++++++ 6 files changed, 378 insertions(+), 9 deletions(-) diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h index 0625cac6b5..583f204763 100644 --- a/src/include/port/pg_lfind.h +++ b/src/include/port/pg_lfind.h @@ -80,6 +80,77 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem) return false; } +/* + * pg_lsearch8 + * + * Return the index of the element in 'base' that equals 'key'; otherwise return + * -1. + */ +static inline int +pg_lsearch8(uint8 key, uint8 *base, uint32 nelem) +{ + uint32 i; + + /* round down to multiple of vector length */ + uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1); + Vector8 chunk; + + for (i = 0; i < tail_idx; i += sizeof(Vector8)) + { + int idx; + + vector8_load(&chunk, &base[i]); + if ((idx = vector8_search_eq(chunk, key)) != -1) + return i + idx; + } + + /* Process the remaining elements one at a time. */ + for (; i < nelem; i++) + { + if (key == base[i]) + return i; + } + + return -1; +} + + +/* + * pg_lsearch8_ge + * + * Return the index of the first element in 'base' that is greater than or equal to + * 'key'. Return nelem if there is no such element. + * + * Note that this function assumes the elements in 'base' are sorted. 
+ */ +static inline int +pg_lsearch8_ge(uint8 key, uint8 *base, uint32 nelem) +{ + uint32 i; + + /* round down to multiple of vector length */ + uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1); + Vector8 chunk; + + for (i = 0; i < tail_idx; i += sizeof(Vector8)) + { + int idx; + + vector8_load(&chunk, &base[i]); + if ((idx = vector8_search_ge(chunk, key)) != sizeof(Vector8)) + return i + idx; + } + + /* Process the remaining elements one at a time. */ + for (; i < nelem; i++) + { + if (base[i] >= key) + break; + } + + return i; +} + /* * pg_lfind32 * diff --git a/src/include/port/simd.h b/src/include/port/simd.h index 61ae4ecf60..e2a99578a5 100644 --- a/src/include/port/simd.h +++ b/src/include/port/simd.h @@ -18,6 +18,8 @@ #ifndef SIMD_H #define SIMD_H +#include "port/pg_bitutils.h" + #if (defined(__x86_64__) || defined(_M_AMD64)) /* * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume @@ -88,14 +90,9 @@ static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2); static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2); #endif -/* - * comparisons between vectors - * - * Note: These return a vector rather than boolean, which is why we don't - * have non-SIMD implementations. - */ -#ifndef USE_NO_SIMD +/* comparisons between vectors */ static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2); +#ifndef USE_NO_SIMD static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2); #endif @@ -277,6 +274,140 @@ vector8_is_highbit_set(const Vector8 v) #endif } +/* + * Return the bitmask of the high bit of each element. 
+ */ +static inline uint32 +vector8_highbit_mask(const Vector8 v) +{ +#ifdef USE_SSE2 + return (uint32) _mm_movemask_epi8(v); +#elif defined(USE_NEON) + static const uint8 mask[16] = { + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + 1 << 0, 1 << 1, 1 << 2, 1 << 3, + 1 << 4, 1 << 5, 1 << 6, 1 << 7, + }; + + uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); + uint8x16_t maskedhi = vextq_u8(masked, masked, 8); + + return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi)); +#else + uint32 mask = 0; + + for (Size i = 0; i < sizeof(Vector8); i++) + mask |= (((const uint8 *) &v)[i] >> 7) << i; + + return mask; +#endif +} + +/* + * Compare the given vectors and return the vector of minimum elements. + */ +static inline Vector8 +vector8_min(const Vector8 v1, const Vector8 v2) +{ +#ifdef USE_SSE2 + return _mm_min_epu8(v1, v2); +#elif defined(USE_NEON) + return vminq_u8(v1, v2); +#else /* USE_NO_SIMD */ + Vector8 r = 0; + uint8 *rp = (uint8 *) &r; + + for (Size i = 0; i < sizeof(Vector8); i++) + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]); + + return r; +#endif +} + +/* + * Return the index of the element in the vector that is equal to the given + * scalar. Otherwise, return -1. + */ +static inline int +vector8_search_eq(const Vector8 v, const uint8 c) +{ + Vector8 keys = vector8_broadcast(c); + Vector8 cmp; + uint32 mask; + int result; + +#ifdef USE_ASSERT_CHECKING + int assert_result = -1; + + for (Size i = 0; i < sizeof(Vector8); i++) + { + if (((const uint8 *) &v)[i] == c) + { + assert_result = i; + break; + } + } +#endif /* USE_ASSERT_CHECKING */ + + cmp = vector8_eq(keys, v); + mask = vector8_highbit_mask(cmp); + + if (mask) + result = pg_rightmost_one_pos32(mask); + else + result = -1; + + Assert(assert_result == result); + return result; +} + +/* + * Return the index of the first element in the vector that is greater than + * or equal to the given scalar. 
Return sizeof(Vector8) if there is no such + * element. + * + * Note that this function assumes the elements in the vector are sorted. + */ +static inline int +vector8_search_ge(const Vector8 v, const uint8 c) +{ + Vector8 keys = vector8_broadcast(c); + Vector8 min; + Vector8 cmp; + uint32 mask; + int result; + +#ifdef USE_ASSERT_CHECKING + int assert_result = -1; + Size i; + + for (i = 0; i < sizeof(Vector8); i++) + { + if (((const uint8 *) &v)[i] >= c) + break; + } + assert_result = i; +#endif /* USE_ASSERT_CHECKING */ + + /* + * This is a bit more complicated than vector8_search_eq(), because no + * unsigned uint8 comparison instruction is available. Therefore, we + * use vector8_min() to effectively get the <= elements. + */ + min = vector8_min(v, keys); + cmp = vector8_eq(keys, min); + mask = vector8_highbit_mask(cmp); + + if (mask) + result = pg_rightmost_one_pos32(mask); + else + result = sizeof(Vector8); + + Assert(assert_result == result); + return result; +} + /* * Exactly like vector8_is_highbit_set except for the input type, so it * looks at each byte separately. @@ -348,7 +479,6 @@ vector8_ssub(const Vector8 v1, const Vector8 v2) * Return a vector with all bits set in each lane where the the corresponding * lanes in the inputs are equal. */ -#ifndef USE_NO_SIMD static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2) { @@ -356,9 +486,16 @@ vector8_eq(const Vector8 v1, const Vector8 v2) return _mm_cmpeq_epi8(v1, v2); #elif defined(USE_NEON) return vceqq_u8(v1, v2); +#else /* USE_NO_SIMD */ + Vector8 r = 0; + uint8 *rp = (uint8 *) &r; + + for (Size i = 0; i < sizeof(Vector8); i++) + rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0; + + return r; #endif } -#endif /* ! 
USE_NO_SIMD */ #ifndef USE_NO_SIMD static inline Vector32 diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out index 1d4b14e703..9416161955 100644 --- a/src/test/modules/test_lfind/expected/test_lfind.out +++ b/src/test/modules/test_lfind/expected/test_lfind.out @@ -22,3 +22,15 @@ SELECT test_lfind32(); (1 row) +SELECT test_lsearch8(); + test_lsearch8 +--------------- + +(1 row) + +SELECT test_lsearch8_ge(); + test_lsearch8_ge +------------------ + +(1 row) + diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql index 766c640831..d0dbb142ec 100644 --- a/src/test/modules/test_lfind/sql/test_lfind.sql +++ b/src/test/modules/test_lfind/sql/test_lfind.sql @@ -8,3 +8,5 @@ CREATE EXTENSION test_lfind; SELECT test_lfind8(); SELECT test_lfind8_le(); SELECT test_lfind32(); +SELECT test_lsearch8(); +SELECT test_lsearch8_ge(); diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql index 81801926ae..13857cec3b 100644 --- a/src/test/modules/test_lfind/test_lfind--1.0.sql +++ b/src/test/modules/test_lfind/test_lfind--1.0.sql @@ -14,3 +14,11 @@ CREATE FUNCTION test_lfind8() CREATE FUNCTION test_lfind8_le() RETURNS pg_catalog.void AS 'MODULE_PATHNAME' LANGUAGE C; + +CREATE FUNCTION test_lsearch8() + RETURNS pg_catalog.void + AS 'MODULE_PATHNAME' LANGUAGE C; + +CREATE FUNCTION test_lsearch8_ge() + RETURNS pg_catalog.void + AS 'MODULE_PATHNAME' LANGUAGE C; diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c index 82673d54c6..c494c27436 100644 --- a/src/test/modules/test_lfind/test_lfind.c +++ b/src/test/modules/test_lfind/test_lfind.c @@ -14,6 +14,7 @@ #include "postgres.h" #include "fmgr.h" +#include "lib/stringinfo.h" #include "port/pg_lfind.h" /* @@ -115,6 +116,144 @@ test_lfind8_le(PG_FUNCTION_ARGS) PG_RETURN_VOID(); } +static void +test_lsearch8_internal(uint8 
key) +{ + uint8 charbuf[LEN_WITH_TAIL(Vector8)]; + const int len_no_tail = LEN_NO_TAIL(Vector8); + const int len_with_tail = LEN_WITH_TAIL(Vector8); + int keypos; + + memset(charbuf, 0xFF, len_with_tail); + /* search tail to test one-byte-at-a-time path */ + keypos = len_with_tail - 1; + charbuf[keypos] = key; + if (key > 0x00 && (pg_lsearch8(key - 1, charbuf, len_with_tail) != -1)) + elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key - 1); + if (key < 0xFF && (pg_lsearch8(key, charbuf, len_with_tail) != keypos)) + elog(ERROR, "pg_lsearch8() did not find existing element '0x%x'", key); + if (key < 0xFE && (pg_lsearch8(key + 1, charbuf, len_with_tail) != -1)) + elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key + 1); + + memset(charbuf, 0xFF, len_with_tail); + /* search with vector operations */ + keypos = len_no_tail - 1; + charbuf[keypos] = key; + if (key > 0x00 && (pg_lsearch8(key - 1, charbuf, len_no_tail) != -1)) + elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key - 1); + if (key < 0xFF && (pg_lsearch8(key, charbuf, len_no_tail) != keypos)) + elog(ERROR, "pg_lsearch8() did not find existing element '0x%x'", key); + if (key < 0xFE && (pg_lsearch8(key + 1, charbuf, len_no_tail) != -1)) + elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key + 1); +} + +PG_FUNCTION_INFO_V1(test_lsearch8); +Datum +test_lsearch8(PG_FUNCTION_ARGS) +{ + test_lsearch8_internal(0); + test_lsearch8_internal(1); + test_lsearch8_internal(0x7F); + test_lsearch8_internal(0x80); + test_lsearch8_internal(0x81); + test_lsearch8_internal(0xFD); + test_lsearch8_internal(0xFE); + test_lsearch8_internal(0xFF); + + PG_RETURN_VOID(); +} + +static void +report_lsearch8_error(uint8 *buf, int size, uint8 key, int result, int expected) +{ + StringInfoData bufstr; + char *sep = ""; + + initStringInfo(&bufstr); + + for (int i = 0; i < size; i++) + { + appendStringInfo(&bufstr, "%s0x%02x", sep, buf[i]); + sep = ","; + } + + elog(ERROR, + 
"pg_lsearch8_ge returned %d, expected %d, key 0x%02x buffer %s", + result, expected, key, bufstr.data); +} + +/* workhorse for test_lsearch8_ge */ +static void +test_lsearch8_ge_internal(uint8 *buf, uint8 key) +{ + const int len_no_tail = LEN_NO_TAIL(Vector8); + const int len_with_tail = LEN_WITH_TAIL(Vector8); + int expected; + int result; + int i; + + /* search tail to test one-byte-at-a-time path */ + for (i = 0; i < len_with_tail; i++) + { + if (buf[i] >= key) + break; + } + expected = i; + result = pg_lsearch8_ge(key, buf, len_with_tail); + + if (result != expected) + report_lsearch8_error(buf, len_with_tail, key, result, expected); + + /* search with vector operations */ + for (i = 0; i < len_no_tail; i++) + { + if (buf[i] >= key) + break; + } + expected = i; + result = pg_lsearch8_ge(key, buf, len_no_tail); + + if (result != expected) + report_lsearch8_error(buf, len_no_tail, key, result, expected); +} + +static int +cmp(const void *p1, const void *p2) +{ + uint8 v1 = *((const uint8 *) p1); + uint8 v2 = *((const uint8 *) p2); + + if (v1 < v2) + return -1; + if (v1 > v2) + return 1; + return 0; +} + +PG_FUNCTION_INFO_V1(test_lsearch8_ge); +Datum +test_lsearch8_ge(PG_FUNCTION_ARGS) +{ + uint8 charbuf[LEN_WITH_TAIL(Vector8)]; + const int len_with_tail = LEN_WITH_TAIL(Vector8); + + for (int i = 0; i < len_with_tail; i++) + charbuf[i] = (uint8) rand(); + + qsort(charbuf, len_with_tail, sizeof(uint8), cmp); + + test_lsearch8_ge_internal(charbuf, 0); + test_lsearch8_ge_internal(charbuf, 1); + test_lsearch8_ge_internal(charbuf, 0x7F); + test_lsearch8_ge_internal(charbuf, 0x80); + test_lsearch8_ge_internal(charbuf, 0x81); + test_lsearch8_ge_internal(charbuf, 0xFD); + test_lsearch8_ge_internal(charbuf, 0xFE); + test_lsearch8_ge_internal(charbuf, 0xFF); + + PG_RETURN_VOID(); +} + PG_FUNCTION_INFO_V1(test_lfind32); Datum test_lfind32(PG_FUNCTION_ARGS) -- 2.31.1
From f49e91ec2a2dcb19259cbf1bc0fd73f36b29a201 Mon Sep 17 00:00:00 2001 From: Masahiko Sawada <sawada.mshk@gmail.com> Date: Wed, 14 Sep 2022 12:38:51 +0000 Subject: [PATCH v6 2/3] Add radix implementation. --- src/backend/lib/Makefile | 1 + src/backend/lib/radixtree.c | 2225 +++++++++++++++++ src/include/lib/radixtree.h | 42 + src/test/modules/Makefile | 1 + src/test/modules/test_radixtree/.gitignore | 4 + src/test/modules/test_radixtree/Makefile | 23 + src/test/modules/test_radixtree/README | 7 + .../expected/test_radixtree.out | 28 + .../test_radixtree/sql/test_radixtree.sql | 7 + .../test_radixtree/test_radixtree--1.0.sql | 8 + .../modules/test_radixtree/test_radixtree.c | 504 ++++ .../test_radixtree/test_radixtree.control | 4 + 12 files changed, 2854 insertions(+) create mode 100644 src/backend/lib/radixtree.c create mode 100644 src/include/lib/radixtree.h create mode 100644 src/test/modules/test_radixtree/.gitignore create mode 100644 src/test/modules/test_radixtree/Makefile create mode 100644 src/test/modules/test_radixtree/README create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql create mode 100644 src/test/modules/test_radixtree/test_radixtree.c create mode 100644 src/test/modules/test_radixtree/test_radixtree.control diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile index 9dad31398a..4c1db794b6 100644 --- a/src/backend/lib/Makefile +++ b/src/backend/lib/Makefile @@ -22,6 +22,7 @@ OBJS = \ integerset.o \ knapsack.o \ pairingheap.o \ + radixtree.o \ rbtree.o \ include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c new file mode 100644 index 0000000000..b163eac480 --- /dev/null +++ b/src/backend/lib/radixtree.c @@ -0,0 +1,2225 @@ +/*------------------------------------------------------------------------- + * + * 
radixtree.c + * Implementation for adaptive radix tree. + * + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas + * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node + * types, each with a different number of elements. Depending on the number of + * children, the appropriate node type is used. + * + * There are some differences from the proposed implementation. For instance, + * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit + * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper. + * Also, there is no support for path compression and lazy path expansion. The + * radix tree supports only fixed-length keys, so we don't expect the tree to + * be high. + * + * Both the key and the value are 64-bit unsigned integers. The inner nodes and + * the leaf nodes have slightly different structures: inner nodes, with + * shift > 0, store pointers to their child nodes as values. Leaf nodes, with + * shift == 0, store as values the 64-bit unsigned integers specified by the + * user. The paper refers to this technique as "Multi-value leaves". We chose + * it to avoid an additional pointer traversal. It is also the reason this code + * currently does not support variable-length keys. + * + * XXX: radix tree nodes are never shrunk. 
+ * + * Interface + * --------- + * + * rt_create - Create a new, empty radix tree + * rt_free - Free the radix tree + * rt_search - Search a key-value pair + * rt_set - Set a key-value pair + * rt_delete - Delete a key-value pair + * rt_begin_iter - Begin iterating through all key-value pairs + * rt_iter_next - Return next key-value pair, if any + * rt_end_iter - End iteration + * rt_memory_usage - Get the memory usage + * rt_num_entries - Get the number of key-value pairs + * + * rt_create() creates an empty radix tree in the given memory context, along + * with child memory contexts for each kind of radix tree node. + * + * rt_iter_next() returns key-value pairs in ascending order of the key. + * + * Copyright (c) 2022, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/lib/radixtree.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "miscadmin.h" +#include "port/pg_bitutils.h" +#include "port/pg_lfind.h" +#include "utils/memutils.h" +#include "lib/radixtree.h" +#include "lib/stringinfo.h" + +/* The number of bits encoded in one tree level */ +#define RT_NODE_SPAN BITS_PER_BYTE + +/* The maximum number of slots in a node */ +#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN) + +/* + * Return the size in bytes of a bitmap with one bit per each of nslots slots, + * used by nodes indexed by array lookup. 
+ */ +#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE)) + +/* Mask for extracting a chunk from the key */ +#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1) + +/* Maximum shift the radix tree uses */ +#define RT_MAX_SHIFT key_get_shift(UINT64_MAX) + +/* The number of tree levels the radix tree uses */ +#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN) + +/* Invalid index used in node-128 */ +#define RT_NODE_128_INVALID_IDX 0xFF + +/* Get a chunk from the key */ +#define RT_GET_KEY_CHUNK(key, shift) \ + ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK)) + +/* + * Mapping from the value to the bit in the is-set bitmap in node-256. + */ +#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE) +#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN)) + +/* Enum used by rt_node_search() */ +typedef enum +{ + RT_ACTION_FIND = 0, /* find the key-value */ + RT_ACTION_DELETE, /* delete the key-value */ +} rt_action; + +/* + * Supported radix tree nodes. + * + * XXX: These are currently not well chosen. To reduce memory fragmentation, + * each smaller class should optimally fit neatly into the next larger class + * (except perhaps at the lowest end). Right now it's + * 48 -> 152 -> 296 -> 1304 -> 2088 bytes for inner/leaf nodes, leading to + * large amounts of allocator padding with aset.c. Hence the use of slab. + * + * XXX: do we need to have node-1 as long as there is no path compression + * optimization? + * + * XXX: need to explain why we chose these node types based on benchmark + * results etc. + */ +typedef enum rt_node_kind +{ + RT_NODE_KIND_4 = 0, + RT_NODE_KIND_16, + RT_NODE_KIND_32, + RT_NODE_KIND_128, + RT_NODE_KIND_256 +} rt_node_kind; +#define RT_NODE_KIND_COUNT (RT_NODE_KIND_256 + 1) + +/* + * Base type for all node types. + */ +typedef struct rt_node +{ + /* + * Number of children. We use uint16 to be able to indicate 256 children, + * the maximum fanout with an 8-bit span. 
+ */ + uint16 count; + + /* + * Shift indicates which part of the key space is represented by this + * node. That is, the key is shifted by 'shift' and the lowest + * RT_NODE_SPAN bits are then represented in chunk. + */ + uint8 shift; + uint8 chunk; + + /* Size class of the node */ + rt_node_kind kind; +} rt_node; + +/* Macros for radix tree nodes */ +#define IS_LEAF_NODE(n) (((rt_node *) (n))->shift == 0) +#define IS_EMPTY_NODE(n) (((rt_node *) (n))->count == 0) +#define NODE_HAS_FREE_SLOT(n) \ + (((rt_node *) (n))->count < rt_node_info[((rt_node *) (n))->kind].fanout) + +/* + * Definitions of the base types for inner and leaf nodes of each node type. + */ + +/* + * node-4, node-16, and node-32 have a similar structure but different + * fanouts. Their chunk arrays and value arrays (or child pointer arrays in + * inner nodes) have matching lengths. Chunks and values are stored at + * corresponding positions and the chunks are sorted. +*/ +typedef struct rd_node_base_4 +{ + rt_node n; + + /* 4 children, for key chunks */ + uint8 chunks[4]; +} rt_node_base_4; + +typedef struct rd_node_base_16 +{ + rt_node n; + + /* 16 children, for key chunks */ + uint8 chunks[16]; +} rt_node_base_16; + +typedef struct rd_node_base_32 +{ + rt_node n; + + /* 32 children, for key chunks */ + uint8 chunks[32]; +} rt_node_base_32; + +/* + * node-128 uses the slot_idxs array, an array of length RT_NODE_MAX_SLOTS, + * typically 256, to store indexes into a second array that contains up to 128 + * values (or child pointers in inner nodes). + */ +typedef struct rd_node_base_128 +{ + rt_node n; + + /* The index of the slot for each chunk */ + uint8 slot_idxs[RT_NODE_MAX_SLOTS]; + + /* isset is a bitmap to track which slot is in use */ + uint8 isset[RT_NODE_NSLOTS_BITS(128)]; +} rt_node_base_128; + +/* + * node-256 is the largest node type. This node has an array of length + * RT_NODE_MAX_SLOTS for directly storing values (or child pointers in inner + * nodes). 
+ */ +typedef struct rd_node_base_256 +{ + rt_node n; + + /* isset is a bitmap to track which slot is in use */ + uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)]; +} rt_node_base_256; + +/* + * Inner and leaf nodes. + * + * Leaf node size classes are separate from inner node size classes for two + * main reasons: + * + * 1) the value type might be different from something fitting into a pointer + * width type + * 2) Need to represent non-existing values in a key-type independent way. + * + * 1) is clearly worth being concerned about, but it's not clear 2) is as + * good. It might be better to just indicate non-existing entries the same way + * in inner nodes. + */ +typedef struct rt_node_inner_4 +{ + rt_node_base_4 base; + + /* 4 children, for key chunks */ + rt_node *children[4]; +} rt_node_inner_4; + +typedef struct rt_node_leaf_4 +{ + rt_node_base_4 base; + + /* 4 values, for key chunks */ + uint64 values[4]; +} rt_node_leaf_4; + +typedef struct rt_node_inner_16 +{ + rt_node_base_16 base; + + /* 16 children, for key chunks */ + rt_node *children[16]; +} rt_node_inner_16; + +typedef struct rt_node_leaf_16 +{ + rt_node_base_16 base; + + /* 16 values, for key chunks */ + uint64 values[16]; +} rt_node_leaf_16; + +typedef struct rt_node_inner_32 +{ + rt_node_base_32 base; + + /* 32 children, for key chunks */ + rt_node *children[32]; +} rt_node_inner_32; + +typedef struct rt_node_leaf_32 +{ + rt_node_base_32 base; + + /* 32 values, for key chunks */ + uint64 values[32]; +} rt_node_leaf_32; + +typedef struct rt_node_inner_128 +{ + rt_node_base_128 base; + + /* Slots for 128 children */ + rt_node *children[128]; +} rt_node_inner_128; + +typedef struct rt_node_leaf_128 +{ + rt_node_base_128 base; + + /* Slots for 128 values */ + uint64 values[128]; +} rt_node_leaf_128; + +typedef struct rt_node_inner_256 +{ + rt_node_base_256 base; + + /* Slots for 256 children */ + rt_node *children[RT_NODE_MAX_SLOTS]; +} rt_node_inner_256; + +typedef struct rt_node_leaf_256 +{ + rt_node_base_256 base; 
+ + /* Slots for 256 values */ + uint64 values[RT_NODE_MAX_SLOTS]; +} rt_node_leaf_256; + +/* Information about each size class */ +typedef struct rt_node_info_elem +{ + const char *name; + int fanout; + Size inner_size; + Size leaf_size; +} rt_node_info_elem; + +static rt_node_info_elem rt_node_info[RT_NODE_KIND_COUNT] = { + + [RT_NODE_KIND_4] = { + .name = "radix tree node 4", + .fanout = 4, + .inner_size = sizeof(rt_node_inner_4), + .leaf_size = sizeof(rt_node_leaf_4), + }, + [RT_NODE_KIND_16] = { + .name = "radix tree node 16", + .fanout = 16, + .inner_size = sizeof(rt_node_inner_16), + .leaf_size = sizeof(rt_node_leaf_16), + }, + [RT_NODE_KIND_32] = { + .name = "radix tree node 32", + .fanout = 32, + .inner_size = sizeof(rt_node_inner_32), + .leaf_size = sizeof(rt_node_leaf_32), + }, + [RT_NODE_KIND_128] = { + .name = "radix tree node 128", + .fanout = 128, + .inner_size = sizeof(rt_node_inner_128), + .leaf_size = sizeof(rt_node_leaf_128), + }, + [RT_NODE_KIND_256] = { + .name = "radix tree node 256", + .fanout = 256, + .inner_size = sizeof(rt_node_inner_256), + .leaf_size = sizeof(rt_node_leaf_256), + }, +}; + +/* + * Iteration support. + * + * Iterating over the radix tree returns each key-value pair in ascending + * order of the key. To support this, we iterate over the nodes of each level. + * + * The rt_node_iter struct is used to track the iteration within a node. + * + * rt_iter is the struct for iteration over the radix tree, and uses rt_node_iter + * to track the iteration at each level. During the iteration, we also + * construct the key whenever updating the node iteration information, e.g., when + * advancing the current index within the node or when moving to the next node + * at the same level. + */ +typedef struct rt_node_iter +{ + rt_node *node; /* current node being iterated */ + int current_idx; /* current position. 
-1 for initial value */ +} rt_node_iter; + +struct rt_iter +{ + radix_tree *tree; + + /* Track the iteration on nodes of each level */ + rt_node_iter stack[RT_MAX_LEVEL]; + int stack_len; + + /* The key is being constructed during the iteration */ + uint64 key; +}; + +/* A radix tree with nodes */ +struct radix_tree +{ + MemoryContext context; + + rt_node *root; + uint64 max_val; + uint64 num_keys; + + MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT]; + MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT]; + + /* statistics */ +#ifdef RT_DEBUG + int32 cnt[RT_NODE_KIND_COUNT]; +#endif +}; + +static void rt_new_root(radix_tree *tree, uint64 key); +static rt_node *rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner); +static void rt_free_node(radix_tree *tree, rt_node *node); +static void rt_copy_node_common(rt_node *src, rt_node *dst); +static void rt_extend(radix_tree *tree, uint64 key); +static bool rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p); +static bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action, + rt_node **child_p); +static bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, + uint64 *value_p); +static rt_node *rt_node_add_new_child(radix_tree *tree, rt_node *parent, + rt_node *node, uint64 key); +static int rt_node_prepare_insert(radix_tree *tree, rt_node *parent, + rt_node **node_p, uint64 key, + bool *will_replace_p); +static void rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, + uint64 key, rt_node *child, bool *replaced_p); +static void rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node, + uint64 key, uint64 value, bool *replaced_p); +static rt_node *rt_node_grow(radix_tree *tree, rt_node *parent, + rt_node *node, uint64 key); +static void rt_update_iter_stack(rt_iter *iter, int from); +static void *rt_node_iterate_next(rt_iter *iter, rt_node_iter *node_iter, + bool *found_p); +static void rt_update_node_iter(rt_iter *iter, rt_node_iter 
*node_iter, + rt_node *node); +static pg_attribute_always_inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, + uint8 shift); + +/* verification (available only with assertion) */ +static void rt_verify_node(rt_node *node); + +/* Return the array of children in the given inner node */ +static rt_node ** +rt_node_get_children(rt_node *node) +{ + rt_node **children = NULL; + + Assert(!IS_LEAF_NODE(node)); + + switch (node->kind) + { + case RT_NODE_KIND_4: + children = (rt_node **) ((rt_node_inner_4 *) node)->children; + break; + case RT_NODE_KIND_16: + children = (rt_node **) ((rt_node_inner_16 *) node)->children; + break; + case RT_NODE_KIND_32: + children = (rt_node **) ((rt_node_inner_32 *) node)->children; + break; + case RT_NODE_KIND_128: + children = (rt_node **) ((rt_node_inner_128 *) node)->children; + break; + case RT_NODE_KIND_256: + children = (rt_node **) ((rt_node_inner_256 *) node)->children; + break; + default: + elog(ERROR, "unexpected node type %u", node->kind); + } + + return children; +} + +/* Return the array of values in the given leaf node */ +static uint64 * +rt_node_get_values(rt_node *node) +{ + uint64 *values = NULL; + + Assert(IS_LEAF_NODE(node)); + + switch (node->kind) + { + case RT_NODE_KIND_4: + values = ((rt_node_leaf_4 *) node)->values; + break; + case RT_NODE_KIND_16: + values = ((rt_node_leaf_16 *) node)->values; + break; + case RT_NODE_KIND_32: + values = ((rt_node_leaf_32 *) node)->values; + break; + case RT_NODE_KIND_128: + values = ((rt_node_leaf_128 *) node)->values; + break; + case RT_NODE_KIND_256: + values = ((rt_node_leaf_256 *) node)->values; + break; + default: + elog(ERROR, "unexpected node type %u", node->kind); + } + + return values; +} + +/* + * Node support functions for node-4, node-16, and node-32. + * + * These three node types have similar structure -- they have the array of chunks with + * different length and corresponding pointers or values depending on inner nodes or + * leaf nodes. 
+ */
+#define CHECK_CHUNK_ARRAY_NODE(node) \
+	Assert(((((rt_node*) node)->kind) == RT_NODE_KIND_4) || \
+		   ((((rt_node*) node)->kind) == RT_NODE_KIND_16) || \
+		   ((((rt_node*) node)->kind) == RT_NODE_KIND_32))
+
+/* Get the pointer to either the child or the value at 'idx' */
+static void *
+chunk_array_node_get_slot(rt_node *node, int idx)
+{
+	void	   *slot;
+
+	CHECK_CHUNK_ARRAY_NODE(node);
+
+	if (IS_LEAF_NODE(node))
+	{
+		uint64	   *values = rt_node_get_values(node);
+
+		slot = (void *) &(values[idx]);
+	}
+	else
+	{
+		rt_node   **children = rt_node_get_children(node);
+
+		slot = (void *) children[idx];
+	}
+
+	return slot;
+}
+
+/* Return the chunk array in the node */
+static uint8 *
+chunk_array_node_get_chunks(rt_node *node)
+{
+	uint8	   *chunk = NULL;
+
+	CHECK_CHUNK_ARRAY_NODE(node);
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+			chunk = (uint8 *) ((rt_node_base_4 *) node)->chunks;
+			break;
+		case RT_NODE_KIND_16:
+			chunk = (uint8 *) ((rt_node_base_16 *) node)->chunks;
+			break;
+		case RT_NODE_KIND_32:
+			chunk = (uint8 *) ((rt_node_base_32 *) node)->chunks;
+			break;
+		default:
+			/* this function doesn't support node-128 and node-256 */
+			elog(ERROR, "unsupported node type %d", node->kind);
+	}
+
+	return chunk;
+}
+
+/* Copy the contents of the node from 'src' to 'dst' */
+static void
+chunk_array_node_copy_contents(rt_node *src, rt_node *dst)
+{
+	uint8	   *chunks_src,
+			   *chunks_dst;
+
+	CHECK_CHUNK_ARRAY_NODE(src);
+	CHECK_CHUNK_ARRAY_NODE(dst);
+
+	/* Copy base type */
+	rt_copy_node_common(src, dst);
+
+	/* Copy chunk array */
+	chunks_src = chunk_array_node_get_chunks(src);
+	chunks_dst = chunk_array_node_get_chunks(dst);
+	memcpy(chunks_dst, chunks_src, sizeof(uint8) * src->count);
+
+	/* Copy children or values */
+	if (IS_LEAF_NODE(src))
+	{
+		uint64	   *values_src,
+				   *values_dst;
+
+		Assert(IS_LEAF_NODE(dst));
+		values_src = rt_node_get_values(src);
+		values_dst = rt_node_get_values(dst);
+		memcpy(values_dst, values_src, sizeof(uint64) * src->count);
+	}
+	else
+	{
+		rt_node   **children_src,
+				  **children_dst;
+
+		Assert(!IS_LEAF_NODE(dst));
+		children_src = rt_node_get_children(src);
+		children_dst = rt_node_get_children(dst);
+		memcpy(children_dst, children_src, sizeof(rt_node *) * src->count);
+	}
+}
+
+/*
+ * Return the index of the (sorted) chunk array where the chunk is inserted.
+ * Set *found_p to true if the chunk already exists in the array.
+ */
+static int
+chunk_array_node_find_insert_pos(rt_node *node, uint8 chunk, bool *found_p)
+{
+	uint8	   *chunks;
+	int			idx;
+
+	CHECK_CHUNK_ARRAY_NODE(node);
+
+	*found_p = false;
+	chunks = chunk_array_node_get_chunks(node);
+
+	/* Find the insert pos */
+	idx = pg_lsearch8_ge(chunk, chunks, node->count);
+
+	if (idx < node->count && chunks[idx] == chunk)
+		*found_p = true;
+
+	return idx;
+}
+
+/* Delete the chunk at idx */
+static void
+chunk_array_node_delete(rt_node *node, int idx)
+{
+	uint8	   *chunks = chunk_array_node_get_chunks(node);
+
+	/* delete the chunk from the chunk array */
+	memmove(&(chunks[idx]), &(chunks[idx + 1]),
+			sizeof(uint8) * (node->count - idx - 1));
+
+	/* delete either the value or the child as well */
+	if (IS_LEAF_NODE(node))
+	{
+		uint64	   *values = rt_node_get_values(node);
+
+		memmove(&(values[idx]),
+				&(values[idx + 1]),
+				sizeof(uint64) * (node->count - idx - 1));
+	}
+	else
+	{
+		rt_node   **children = rt_node_get_children(node);
+
+		memmove(&(children[idx]),
+				&(children[idx + 1]),
+				sizeof(rt_node *) * (node->count - idx - 1));
+	}
+}
+
+/* Support functions for node-128 */
+
+/* Does the given chunk in the node have a value? */
+static pg_attribute_always_inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+	return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used?
*/ +static pg_attribute_always_inline bool +node_128_is_slot_used(rt_node_base_128 *node, uint8 slot) +{ + return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0); +} + +/* Get the pointer to either the child or the value corresponding to chunk */ +static void * +node_128_get_slot(rt_node_base_128 *node, uint8 chunk) +{ + int slotpos; + void *slot; + + slotpos = node->slot_idxs[chunk]; + Assert(slotpos != RT_NODE_128_INVALID_IDX); + + if (IS_LEAF_NODE(node)) + slot = (void *) &(((rt_node_leaf_128 *) node)->values[slotpos]); + else + slot = (void *) (((rt_node_inner_128 *) node)->children[slotpos]); + + return slot; +} + +/* Delete the chunk in the node */ +static void +node_128_delete(rt_node_base_128 *node, uint8 chunk) +{ + int slotpos = node->slot_idxs[chunk]; + + node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos)); + node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX; +} + +/* Return an unused slot in node-128 */ +static int +node_128_find_unused_slot(rt_node_base_128 *node, uint8 chunk) +{ + int slotpos; + + /* + * Find an unused slot. We iterate over the isset bitmap per byte then + * check each bit. 
+ */ + for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++) + { + if (node->isset[slotpos] < 0xFF) + break; + } + Assert(slotpos < RT_NODE_NSLOTS_BITS(128)); + + slotpos *= BITS_PER_BYTE; + while (node_128_is_slot_used(node, slotpos)) + slotpos++; + + return slotpos; +} + + +/* XXX: duplicate with node_128_set_leaf */ +static void +node_128_set_inner(rt_node_base_128 *node, uint8 chunk, rt_node *child) +{ + int slotpos; + rt_node_inner_128 *n128 = (rt_node_inner_128 *) node; + + /* Overwrite the existing value if exists */ + if (node_128_is_chunk_used(node, chunk)) + { + n128->children[n128->base.slot_idxs[chunk]] = child; + return; + } + + /* find unused slot */ + slotpos = node_128_find_unused_slot(node, chunk); + + n128->base.slot_idxs[chunk] = slotpos; + n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos); + n128->children[slotpos] = child; +} + +/* Set the slot at the corresponding chunk */ +static void +node_128_set_leaf(rt_node_base_128 *node, uint8 chunk, uint64 value) +{ + int slotpos; + rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node; + + /* Overwrite the existing value if exists */ + if (node_128_is_chunk_used(node, chunk)) + { + n128->values[n128->base.slot_idxs[chunk]] = value; + return; + } + + /* find unused slot */ + slotpos = node_128_find_unused_slot(node, chunk); + + n128->base.slot_idxs[chunk] = slotpos; + n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos); + n128->values[slotpos] = value; +} + +/* Return true if the slot corresponding to the given chunk is in use */ +static bool +node_256_is_chunk_used(rt_node_base_256 *node, uint8 chunk) +{ + return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0; +} + +/* Get the pointer to either the child or the value corresponding to chunk */ +static void * +node_256_get_slot(rt_node_base_256 *node, uint8 chunk) +{ + void *slot; + + Assert(node_256_is_chunk_used(node, chunk)); + if (IS_LEAF_NODE(node)) + slot = 
(void *) &(((rt_node_leaf_256 *) node)->values[chunk]);
+	else
+		slot = (void *) (((rt_node_inner_256 *) node)->children[chunk]);
+
+	return slot;
+}
+
+/* Set the child in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_inner(rt_node_base_256 *node, uint8 chunk, rt_node *child)
+{
+	rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+	n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+	n256->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_leaf(rt_node_base_256 *node, uint8 chunk, uint64 value)
+{
+	rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+	n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+	n256->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static pg_attribute_always_inline void
+node_256_delete(rt_node_base_256 *node, uint8 chunk)
+{
+	node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static pg_attribute_always_inline int
+key_get_shift(uint64 key)
+{
+	return (key == 0)
+		? 0
+		: (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+	if (shift == RT_MAX_SHIFT)
+		return UINT64_MAX;
+
+	return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+	int			shift = key_get_shift(key);
+	rt_node    *node;
+
+	node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift > 0);
+	node->shift = shift;
+	tree->max_val = shift_get_max_val(shift);
+	tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner)
+{
+	rt_node    *newnode;
+
+	if (inner)
+		newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+													 rt_node_info[kind].inner_size);
+	else
+		newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+													 rt_node_info[kind].leaf_size);
+
+	newnode->kind = kind;
+
+	/* Initialize slot_idxs to invalid values */
+	if (kind == RT_NODE_KIND_128)
+	{
+		rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+		memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+	}
+
+#ifdef RT_DEBUG
+	/* update the statistics */
+	tree->cnt[kind]++;
+#endif
+
+	return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+	/* If we're deleting the root node, make the tree empty */
+	if (tree->root == node)
+		tree->root = NULL;
+
+#ifdef RT_DEBUG
+	/* update the statistics */
+	tree->cnt[node->kind]--;
+
+	Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+	pfree(node);
+}
+
+/* Copy the common fields except for the node kind */
+static void
+rt_copy_node_common(rt_node *src, rt_node *dst)
+{
+	dst->shift = src->shift;
+	dst->chunk = src->chunk;
+	dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */ +static void +rt_extend(radix_tree *tree, uint64 key) +{ + int target_shift; + int shift = tree->root->shift + RT_NODE_SPAN; + + target_shift = key_get_shift(key); + + /* Grow tree from 'shift' to 'target_shift' */ + while (shift <= target_shift) + { + rt_node_inner_4 *node = + (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4, true); + + node->base.n.count = 1; + node->base.n.shift = shift; + node->base.chunks[0] = 0; + node->children[0] = tree->root; + + tree->root->chunk = 0; + tree->root = (rt_node *) node; + + shift += RT_NODE_SPAN; + } + + tree->max_val = shift_get_max_val(target_shift); +} + +/* + * Search for the given key in the node. Return true if the key is found, otherwise + * return false. On success, we do the specified action for the key, and set the + * pointer to the slot to slot_p. + */ +static bool +rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p) +{ + uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift); + bool found = false; + + switch (node->kind) + { + case RT_NODE_KIND_4: + case RT_NODE_KIND_16: + case RT_NODE_KIND_32: + { + int idx; + uint8 *chunks = chunk_array_node_get_chunks(node); + + idx = pg_lsearch8(chunk, chunks, node->count); + if (idx < 0) + break; + + found = true; + if (action == RT_ACTION_FIND) + *slot_p = chunk_array_node_get_slot(node, idx); + else /* RT_ACTION_DELETE */ + chunk_array_node_delete(node, idx); + + break; + } + case RT_NODE_KIND_128: + { + rt_node_base_128 *n128 = (rt_node_base_128 *) node; + + if (!node_128_is_chunk_used(n128, chunk)) + break; + + found = true; + if (action == RT_ACTION_FIND) + *slot_p = node_128_get_slot(n128, chunk); + else /* RT_ACTION_DELETE */ + node_128_delete(n128, chunk); + + break; + } + case RT_NODE_KIND_256: + { + rt_node_base_256 *n256 = (rt_node_base_256 *) node; + + if (!node_256_is_chunk_used(n256, chunk)) + break; + + found = true; + if (action == RT_ACTION_FIND) + *slot_p = node_256_get_slot(n256, chunk); + else /* RT_ACTION_DELETE */ + 
node_256_delete(n256, chunk);
+
+				break;
+			}
+	}
+
+	/* Update the statistics */
+	if (action == RT_ACTION_DELETE && found)
+		node->count--;
+
+	return found;
+}
+
+/*
+ * Search for the child pointer corresponding to the key in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is set to *child_p.
+ */
+static bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+	rt_node    *child;
+
+	if (!rt_node_search(node, key, action, (void **) &child))
+		return false;
+
+	if (child_p)
+		*child_p = child;
+
+	return true;
+}
+
+/*
+ * Search for the value corresponding to the key in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is set to *value_p.
+ */
+static bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+	uint64	   *value;
+
+	if (!rt_node_search(node, key, action, (void **) &value))
+		return false;
+
+	if (value_p)
+		*value_p = *value;
+
+	return true;
+}
+
+/* Create a new child node and add it to 'node' for the chunk of 'key' */
+static rt_node *
+rt_node_add_new_child(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+	uint8		newshift = node->shift - RT_NODE_SPAN;
+	rt_node    *newchild =
+		(rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, newshift > 0);
+
+	Assert(!IS_LEAF_NODE(node));
+
+	newchild->shift = newshift;
+	newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+	rt_node_insert_inner(tree, parent, node, key, newchild, NULL);
+
+	return (rt_node *) newchild;
+}
+
+/*
+ * For an upcoming insertion, make sure that the node has enough free slots,
+ * growing the node if necessary. *node_p is updated with the grown node. We
+ * set *will_replace_p to true to tell the caller that the given chunk already
+ * exists in the node.
+ *
+ * Return the index in the chunk array where the key can be inserted. We always
+ * return 0 in node-128 and node-256 cases.
+ */
+static int
+rt_node_prepare_insert(radix_tree *tree, rt_node *parent, rt_node **node_p,
+					   uint64 key, bool *will_replace_p)
+{
+	rt_node    *node = *node_p;
+	uint8		chunk = RT_GET_KEY_CHUNK(key, node->shift);
+	bool		will_replace = false;
+	int			idx = 0;
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+		case RT_NODE_KIND_16:
+		case RT_NODE_KIND_32:
+			{
+				bool		can_insert = false;
+
+				while ((node->kind == RT_NODE_KIND_4) ||
+					   (node->kind == RT_NODE_KIND_16) ||
+					   (node->kind == RT_NODE_KIND_32))
+				{
+					/* Find the insert pos */
+					idx = chunk_array_node_find_insert_pos(node, chunk, &will_replace);
+
+					if (will_replace || NODE_HAS_FREE_SLOT(node))
+					{
+						/*
+						 * Found. We can insert a new one or replace the
+						 * existing value.
+						 */
+						can_insert = true;
+						break;
+					}
+
+					node = rt_node_grow(tree, parent, node, key);
+				}
+
+				if (can_insert)
+				{
+					uint8	   *chunks = chunk_array_node_get_chunks(node);
+
+					Assert(idx >= 0);
+
+					/*
+					 * Make the space for the new key if it will be inserted
+					 * in the middle of the array.
+ */ + if (!will_replace && node->count != 0 && idx < node->count) + { + /* shift chunks array */ + memmove(&(chunks[idx + 1]), &(chunks[idx]), + sizeof(uint8) * (node->count - idx)); + + /* shift either the values array or the children array */ + if (IS_LEAF_NODE(node)) + { + uint64 *values = rt_node_get_values(node); + + memmove(&(values[idx + 1]), &(values[idx]), + sizeof(uint64) * (node->count - idx)); + } + else + { + rt_node **children = rt_node_get_children(node); + + memmove(&(children[idx + 1]), &(children[idx]), + sizeof(rt_node *) * (node->count - idx)); + } + } + + break; + } + + Assert(node->kind == RT_NODE_KIND_128); + } + /* FALLTHROUGH */ + case RT_NODE_KIND_128: + { + rt_node_base_128 *n128 = (rt_node_base_128 *) node; + + if (node_128_is_chunk_used(n128, chunk) || NODE_HAS_FREE_SLOT(n128)) + { + if (node_128_is_chunk_used(n128, chunk)) + will_replace = true; + + break; + } + + node = rt_node_grow(tree, parent, node, key); + } + /* FALLTHROUGH */ + case RT_NODE_KIND_256: + { + rt_node_base_256 *n256 = (rt_node_base_256 *) node; + + if (node_256_is_chunk_used(n256, chunk)) + will_replace = true; + + break; + } + } + + *node_p = node; + *will_replace_p = will_replace; + + return idx; +} + +/* Insert the child to the inner node */ +static void +rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, + uint64 key, rt_node *child, bool *replaced_p) +{ + uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift); + int idx; + bool replaced; + + Assert(!IS_LEAF_NODE(node)); + + idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced); + + switch (node->kind) + { + case RT_NODE_KIND_4: + case RT_NODE_KIND_16: + case RT_NODE_KIND_32: + { + uint8 *chunks = chunk_array_node_get_chunks(node); + rt_node **children = rt_node_get_children(node); + + Assert(idx >= 0); + chunks[idx] = chunk; + children[idx] = child; + break; + } + case RT_NODE_KIND_128: + { + node_128_set_inner((rt_node_base_128 *) node, chunk, child); + break; + } + case 
RT_NODE_KIND_256: + { + node_256_set_inner((rt_node_base_256 *) node, chunk, child); + break; + } + } + + /* Update statistics */ + if (!replaced) + node->count++; + + if (replaced_p) + *replaced_p = replaced; + + /* + * Done. Finally, verify the chunk and value is inserted or replaced + * properly in the node. + */ + rt_verify_node(node); +} + +/* Insert the value to the leaf node */ +static void +rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node, + uint64 key, uint64 value, bool *replaced_p) +{ + uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift); + int idx; + bool replaced; + + Assert(IS_LEAF_NODE(node)); + + idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced); + + switch (node->kind) + { + case RT_NODE_KIND_4: + case RT_NODE_KIND_16: + case RT_NODE_KIND_32: + { + uint8 *chunks = chunk_array_node_get_chunks(node); + uint64 *values = rt_node_get_values(node); + + Assert(idx >= 0); + chunks[idx] = chunk; + values[idx] = value; + break; + } + case RT_NODE_KIND_128: + { + node_128_set_leaf((rt_node_base_128 *) node, chunk, value); + break; + } + case RT_NODE_KIND_256: + { + node_256_set_leaf((rt_node_base_256 *) node, chunk, value); + break; + } + } + + /* Update statistics */ + if (!replaced) + node->count++; + + *replaced_p = replaced; + + /* + * Done. Finally, verify if the chunk and value is inserted or replaced + * properly in the node. 
+ */ + rt_verify_node(node); +} + +/* Change the node type to the next larger one */ +static rt_node * +rt_node_grow(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key) +{ + rt_node *newnode = NULL; + + Assert(node->count == rt_node_info[node->kind].fanout); + + switch (node->kind) + { + case RT_NODE_KIND_4: + { + newnode = rt_alloc_node(tree, RT_NODE_KIND_16, + IS_LEAF_NODE(node)); + + /* Copy both chunks and slots to the new node */ + chunk_array_node_copy_contents(node, newnode); + break; + } + case RT_NODE_KIND_16: + { + newnode = rt_alloc_node(tree, RT_NODE_KIND_32, + IS_LEAF_NODE(node)); + + /* Copy both chunks and slots to the new node */ + chunk_array_node_copy_contents(node, newnode); + break; + } + case RT_NODE_KIND_32: + { + newnode = rt_alloc_node(tree, RT_NODE_KIND_128, + IS_LEAF_NODE(node)); + + /* Copy both chunks and slots to the new node */ + rt_copy_node_common(node, newnode); + + if (IS_LEAF_NODE(node)) + { + rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node; + + for (int i = 0; i < node->count; i++) + node_128_set_leaf((rt_node_base_128 *) newnode, + n32->base.chunks[i], n32->values[i]); + } + else + { + rt_node_inner_32 *n32 = (rt_node_inner_32 *) node; + + for (int i = 0; i < node->count; i++) + node_128_set_inner((rt_node_base_128 *) newnode, + n32->base.chunks[i], n32->children[i]); + } + + break; + } + case RT_NODE_KIND_128: + { + rt_node_base_128 *n128 = (rt_node_base_128 *) node; + int cnt = 0; + + newnode = rt_alloc_node(tree, RT_NODE_KIND_256, + IS_LEAF_NODE(node)); + + /* Copy both chunks and slots to the new node */ + rt_copy_node_common(node, newnode); + + for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->n.count; i++) + { + void *slot; + + if (!node_128_is_chunk_used(n128, i)) + continue; + + slot = node_128_get_slot(n128, i); + + if (IS_LEAF_NODE(node)) + node_256_set_leaf((rt_node_base_256 *) newnode, i, + *(uint64 *) slot); + else + node_256_set_inner((rt_node_base_256 *) newnode, i, + (rt_node *) slot); + + cnt++; + } 
+ + break; + } + case RT_NODE_KIND_256: + elog(ERROR, "radix tree node-256 cannot grow"); + break; + } + + if (parent == node) + { + /* Replace the root node with the new large node */ + tree->root = newnode; + } + else + { + /* Set the new node to the parent node */ + rt_node_insert_inner(tree, NULL, parent, key, newnode, NULL); + } + + /* Verify if the node has grown properly */ + rt_verify_node(newnode); + + /* Free the old node */ + rt_free_node(tree, node); + + return newnode; +} + +/* + * Create the radix tree in the given memory context and return it. + */ +radix_tree * +rt_create(MemoryContext ctx) +{ + radix_tree *tree; + MemoryContext old_ctx; + + old_ctx = MemoryContextSwitchTo(ctx); + + tree = palloc(sizeof(radix_tree)); + tree->context = ctx; + tree->root = NULL; + tree->max_val = 0; + tree->num_keys = 0; + + /* Create the slab allocator for each size class */ + for (int i = 0; i < RT_NODE_KIND_COUNT; i++) + { + tree->inner_slabs[i] = SlabContextCreate(ctx, + rt_node_info[i].name, + SLAB_DEFAULT_BLOCK_SIZE, + rt_node_info[i].inner_size); + tree->leaf_slabs[i] = SlabContextCreate(ctx, + rt_node_info[i].name, + SLAB_DEFAULT_BLOCK_SIZE, + rt_node_info[i].leaf_size); +#ifdef RT_DEBUG + tree->cnt[i] = 0; +#endif + } + + MemoryContextSwitchTo(old_ctx); + + return tree; +} + +/* + * Free the given radix tree. + */ +void +rt_free(radix_tree *tree) +{ + for (int i = 0; i < RT_NODE_KIND_COUNT; i++) + { + MemoryContextDelete(tree->inner_slabs[i]); + MemoryContextDelete(tree->leaf_slabs[i]); + } + + pfree(tree); +} + +/* + * Set key to value. If the entry already exists, we update its value to 'value' + * and return true. Returns false if entry doesn't yet exist. 
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+	int			shift;
+	bool		replaced;
+	rt_node    *node;
+	rt_node    *parent;
+
+	/* Empty tree, create the root */
+	if (!tree->root)
+		rt_new_root(tree, key);
+
+	/* Extend the tree if necessary */
+	if (key > tree->max_val)
+		rt_extend(tree, key);
+
+	Assert(tree->root);
+
+	shift = tree->root->shift;
+	node = parent = tree->root;
+
+	/* Descend the tree until we reach a leaf node */
+	while (shift >= 0)
+	{
+		rt_node    *child;
+
+		if (IS_LEAF_NODE(node))
+			break;
+
+		if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+			child = rt_node_add_new_child(tree, parent, node, key);
+
+		Assert(child);
+
+		parent = node;
+		node = child;
+		shift -= RT_NODE_SPAN;
+	}
+
+	/* arrived at a leaf */
+	Assert(IS_LEAF_NODE(node));
+
+	rt_node_insert_leaf(tree, parent, node, key, value, &replaced);
+
+	/* Update the statistics */
+	if (!replaced)
+		tree->num_keys++;
+
+	return replaced;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, we set the value to *value_p,
+ * so it must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+	rt_node    *node;
+	int			shift;
+
+	Assert(value_p != NULL);
+
+	if (!tree->root || key > tree->max_val)
+		return false;
+
+	node = tree->root;
+	shift = tree->root->shift;
+
+	/* Descend the tree until we reach a leaf node */
+	while (shift >= 0)
+	{
+		rt_node    *child;
+
+		if (IS_LEAF_NODE(node))
+			break;
+
+		if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+			return false;
+
+		node = child;
+		shift -= RT_NODE_SPAN;
+	}
+
+	/* We reached a leaf node, so search the corresponding slot */
+	Assert(IS_LEAF_NODE(node));
+	if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+		return false;
+
+	return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found
+ * (and deleted), otherwise do nothing and return false.
+ */ +bool +rt_delete(radix_tree *tree, uint64 key) +{ + rt_node *node; + int shift; + rt_node *stack[RT_MAX_LEVEL] = {0}; + int level; + + if (!tree->root || key > tree->max_val) + return false; + + /* + * Descend the tree to search the key while building a stack of nodes + * we visited. + */ + node = tree->root; + shift = tree->root->shift; + level = 0; + while (shift >= 0) + { + rt_node *child; + + /* Push the current node to the stack */ + stack[level] = node; + + if (IS_LEAF_NODE(node)) + break; + + if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child)) + return false; + + node = child; + shift -= RT_NODE_SPAN; + level++; + } + + Assert(IS_LEAF_NODE(node)); + + /* there is no key to delete */ + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL)) + return false; + + /* Update the statistics */ + tree->num_keys--; + + /* + * Delete the key from the leaf node and recursively delete the key in + * inner nodes if necessary. + */ + Assert(IS_LEAF_NODE(stack[level])); + while (level >= 0) + { + rt_node *node = stack[level--]; + + if (IS_LEAF_NODE(node)) + rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL); + else + rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL); + + /* If the node didn't become empty, we stop deleting the key */ + if (!IS_EMPTY_NODE(node)) + break; + + /* The node became empty */ + rt_free_node(tree, node); + } + + /* + * If we eventually deleted the root node while recursively deleting empty + * nodes, we make the tree empty. 
+ */
+	if (level == 0)
+	{
+		tree->root = NULL;
+		tree->max_val = 0;
+	}
+
+	return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+	MemoryContext old_ctx;
+	rt_iter    *iter;
+	int			top_level;
+
+	old_ctx = MemoryContextSwitchTo(tree->context);
+
+	iter = (rt_iter *) palloc0(sizeof(rt_iter));
+	iter->tree = tree;
+
+	/* empty tree */
+	if (!iter->tree->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+	top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+	iter->stack_len = top_level;
+	iter->stack[top_level].node = iter->tree->root;
+	iter->stack[top_level].current_idx = -1;
+
+	/*
+	 * Descend to the leftmost leaf node from the root. The key is
+	 * constructed while descending to the leaf.
+	 */
+	rt_update_iter_stack(iter, top_level);
+
+	MemoryContextSwitchTo(old_ctx);
+
+	return iter;
+}
+
+/*
+ * Update the stack of the radix tree node while descending to the leaf from
+ * the 'from' level.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+	rt_node    *node = iter->stack[from].node;
+	int			level = from;
+
+	for (;;)
+	{
+		rt_node_iter *node_iter = &(iter->stack[level--]);
+		bool		found;
+
+		/* Set the node to this level */
+		rt_update_node_iter(iter, node_iter, node);
+
+		/* Finish if we reached the leaf node */
+		if (IS_LEAF_NODE(node))
+			break;
+
+		/* Advance to the next slot in the node */
+		node = (rt_node *) rt_node_iterate_next(iter, node_iter, &found);
+
+		/*
+		 * Since we always get the first slot in the node, we must have found
+		 * a slot.
+		 */
+		Assert(found);
+	}
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+	bool		found = false;
+	void	   *slot;
+
+	/* Empty tree */
+	if (!iter->tree->root)
+		return false;
+
+	for (;;)
+	{
+		rt_node    *node;
+		rt_node_iter *node_iter;
+		int			level;
+
+		/*
+		 * Iterate the node at each level from the bottom of the tree, i.e.,
+		 * the leaf node, until we find the next slot.
+		 */
+		for (level = 0; level <= iter->stack_len; level++)
+		{
+			slot = rt_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+			if (found)
+				break;
+		}
+
+		/* We could not find any new key-value pair, the iteration finished */
+		if (!found)
+			break;
+
+		/* found the next slot at the leaf node, return it */
+		if (level == 0)
+		{
+			*key_p = iter->key;
+			*value_p = *((uint64 *) slot);
+			break;
+		}
+
+		/*
+		 * We have advanced the slot in more than one node, including both
+		 * the leaf node and inner nodes, so we update the stack by
+		 * descending to the leftmost leaf node from this level.
+		 */
+		node = (rt_node *) slot;
+		node_iter = &(iter->stack[level - 1]);
+		rt_update_node_iter(iter, node_iter, node);
+		rt_update_iter_stack(iter, level - 1);
+	}
+
+	return found;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+	pfree(iter);
+}
+
+/*
+ * Iterate over the given radix tree node and return the next slot of the
+ * given node, setting *found_p to true if one is found. Otherwise, set
+ * *found_p to false.
+ */ +static void * +rt_node_iterate_next(rt_iter *iter, rt_node_iter *node_iter, bool *found_p) +{ + rt_node *node = node_iter->node; + void *slot = NULL; + + switch (node->kind) + { + case RT_NODE_KIND_4: + case RT_NODE_KIND_16: + case RT_NODE_KIND_32: + { + node_iter->current_idx++; + + if (node_iter->current_idx >= node->count) + goto not_found; + + slot = chunk_array_node_get_slot(node, node_iter->current_idx); + + /* Update the part of the key by the current chunk */ + if (IS_LEAF_NODE(node)) + { + uint8 *chunks = chunk_array_node_get_chunks(node); + + rt_iter_update_key(iter, chunks[node_iter->current_idx], 0); + } + + break; + } + case RT_NODE_KIND_128: + { + rt_node_base_128 *n128 = (rt_node_base_128 *) node; + int i; + + for (i = node_iter->current_idx + 1; i < 256; i++) + { + if (node_128_is_chunk_used(n128, i)) + break; + } + + if (i >= 256) + goto not_found; + + node_iter->current_idx = i; + slot = node_128_get_slot(n128, i); + + /* Update the part of the key */ + if (IS_LEAF_NODE(n128)) + rt_iter_update_key(iter, node_iter->current_idx, 0); + + break; + } + case RT_NODE_KIND_256: + { + rt_node_base_256 *n256 = (rt_node_base_256 *) node; + int i; + + for (i = node_iter->current_idx + 1; i < 256; i++) + { + if (node_256_is_chunk_used(n256, i)) + break; + } + + if (i >= 256) + goto not_found; + + node_iter->current_idx = i; + slot = node_256_get_slot(n256, i); + + /* Update the part of the key */ + if (IS_LEAF_NODE(n256)) + rt_iter_update_key(iter, node_iter->current_idx, 0); + + break; + } + } + + Assert(slot); + *found_p = true; + return slot; + +not_found: + *found_p = false; + return NULL; +} + +/* + * Set the node to the node_iter so we can begin the iteration of the node. + * Also, we update the part of the key by the chunk of the given node. 
+ */ +static void +rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter, + rt_node *node) +{ + node_iter->node = node; + node_iter->current_idx = -1; + + rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN); +} + +static pg_attribute_always_inline void +rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift) +{ + iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift); + iter->key |= (((uint64) chunk) << shift); +} + +/* + * Return the number of keys in the radix tree. + */ +uint64 +rt_num_entries(radix_tree *tree) +{ + return tree->num_keys; +} + +/* + * Return the statistics of the amount of memory used by the radix tree. + */ +uint64 +rt_memory_usage(radix_tree *tree) +{ + Size total = 0; + + for (int i = 0; i < RT_NODE_KIND_COUNT; i++) + { + total += MemoryContextMemAllocated(tree->inner_slabs[i], true); + total += MemoryContextMemAllocated(tree->leaf_slabs[i], true); + } + + return total; +} + +/* + * Verify the radix tree node. + */ +static void +rt_verify_node(rt_node *node) +{ +#ifdef USE_ASSERT_CHECKING + Assert(node->count >= 0); + + switch (node->kind) + { + case RT_NODE_KIND_4: + case RT_NODE_KIND_16: + case RT_NODE_KIND_32: + { + uint8 *chunks = chunk_array_node_get_chunks(node); + + /* Check if the chunks in the node are sorted */ + for (int i = 1; i < node->count; i++) + Assert(chunks[i - 1] < chunks[i]); + + break; + } + case RT_NODE_KIND_128: + { + rt_node_base_128 *n128 = (rt_node_base_128 *) node; + int cnt = 0; + + for (int i = 0; i < RT_NODE_MAX_SLOTS; i++) + { + if (!node_128_is_chunk_used(n128, i)) + continue; + + /* Check if the corresponding slot is used */ + Assert(node_128_is_slot_used(n128, n128->slot_idxs[i])); + + cnt++; + } + + Assert(n128->n.count == cnt); + break; + } + case RT_NODE_KIND_256: + { + rt_node_base_256 *n256 = (rt_node_base_256 *) node; + int cnt = 0; + + for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++) + cnt += pg_popcount32(n256->isset[i]); + + /* Check if the number of used chunk 
matches */ + Assert(n256->n.count == cnt); + + break; + } + } +#endif +} + +/***************** DEBUG FUNCTIONS *****************/ +#ifdef RT_DEBUG +void +rt_stats(radix_tree *tree) +{ + fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u, n32 = %u, n128 = %u, n256 = %u\n", + tree->num_keys, + tree->root->shift / RT_NODE_SPAN, + tree->cnt[0], + tree->cnt[1], + tree->cnt[2], + tree->cnt[3], + tree->cnt[4]); + /* rt_dump(tree); */ +} + +static void +rt_print_slot(StringInfo buf, uint8 chunk, uint64 value, int idx, bool is_leaf, int level) +{ + char space[128] = {0}; + + if (level > 0) + sprintf(space, "%*c", level * 4, ' '); + + if (is_leaf) + appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n", + space, + idx, + chunk, + value); + else + appendStringInfo(buf, "%s[%d] \"0x%X\" -> ", + space, + idx, + chunk); +} + +static void +rt_dump_node(rt_node *node, int level, StringInfo buf, bool recurse) +{ + bool is_leaf = IS_LEAF_NODE(node); + + appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n", + IS_LEAF_NODE(node) ? "LEAF" : "INNR", + (node->kind == RT_NODE_KIND_4) ? 4 : + (node->kind == RT_NODE_KIND_16) ? 16 : + (node->kind == RT_NODE_KIND_32) ? 32 : + (node->kind == RT_NODE_KIND_128) ? 
128 : 256, + node->count, node->shift, node->chunk); + + switch (node->kind) + { + case RT_NODE_KIND_4: + case RT_NODE_KIND_16: + case RT_NODE_KIND_32: + { + uint8 *chunks = chunk_array_node_get_chunks(node); + + for (int i = 0; i < node->count; i++) + { + if (IS_LEAF_NODE(node)) + { + uint64 *values = rt_node_get_values(node); + + rt_print_slot(buf, chunks[i], + values[i], + i, is_leaf, level); + } + else + rt_print_slot(buf, chunks[i], + UINT64_MAX, + i, is_leaf, level); + + if (!is_leaf) + { + if (recurse) + { + rt_node **children = rt_node_get_children(node); + StringInfoData buf2; + + initStringInfo(&buf2); + rt_dump_node(children[i], + level + 1, &buf2, recurse); + appendStringInfo(buf, "%s", buf2.data); + } + else + appendStringInfo(buf, "\n"); + } + } + + break; + } + case RT_NODE_KIND_128: + { + rt_node_base_128 *n128 = (rt_node_base_128 *) node; + uint8 *tmp = (uint8 *) n128->isset; + + appendStringInfo(buf, "slot_idxs:"); + for (int j = 0; j < 256; j++) + { + if (!node_128_is_chunk_used(n128, j)) + continue; + + appendStringInfo(buf, " [%d]=%d, ", j, n128->slot_idxs[j]); + } + appendStringInfo(buf, "\nisset-bitmap:"); + for (int j = 0; j < 16; j++) + { + appendStringInfo(buf, "%X ", (uint8) tmp[j]); + } + appendStringInfo(buf, "\n"); + + for (int i = 0; i < 256; i++) + { + void *slot; + + if (!node_128_is_chunk_used(n128, i)) + continue; + + slot = node_128_get_slot(n128, i); + + if (is_leaf) + rt_print_slot(buf, i, *(uint64 *) slot, + i, is_leaf, level); + else + rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level); + + if (!is_leaf) + { + if (recurse) + { + StringInfoData buf2; + + initStringInfo(&buf2); + rt_dump_node((rt_node *) slot, + level + 1, &buf2, recurse); + appendStringInfo(buf, "%s", buf2.data); + } + else + appendStringInfo(buf, "\n"); + } + } + break; + } + case RT_NODE_KIND_256: + { + rt_node_base_256 *n256 = (rt_node_base_256 *) node; + + for (int i = 0; i < 256; i++) + { + void *slot; + + if (!node_256_is_chunk_used(n256, i)) + 
continue; + + slot = node_256_get_slot(n256, i); + + if (is_leaf) + rt_print_slot(buf, i, *(uint64 *) slot, i, is_leaf, level); + else + rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level); + + if (!is_leaf) + { + if (recurse) + { + StringInfoData buf2; + + initStringInfo(&buf2); + rt_dump_node((rt_node *) slot, level + 1, &buf2, recurse); + appendStringInfo(buf, "%s", buf2.data); + } + else + appendStringInfo(buf, "\n"); + } + } + break; + } + } +} + +void +rt_dump_search(radix_tree *tree, uint64 key) +{ + StringInfoData buf; + rt_node *node; + int shift; + int level = 0; + + elog(NOTICE, "-----------------------------------------------------------"); + elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val); + + if (!tree->root) + { + elog(NOTICE, "tree is empty"); + return; + } + + if (key > tree->max_val) + { + elog(NOTICE, "key %lu (0x%lX) is larger than max val", + key, key); + return; + } + + initStringInfo(&buf); + node = tree->root; + shift = tree->root->shift; + while (shift >= 0) + { + rt_node *child; + + rt_dump_node(node, level, &buf, false); + + if (IS_LEAF_NODE(node)) + { + uint64 dummy; + + /* We reached at a leaf node, find the corresponding slot */ + rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy); + + break; + } + + if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child)) + break; + + node = child; + shift -= RT_NODE_SPAN; + level++; + } + + elog(NOTICE, "\n%s", buf.data); +} + +void +rt_dump(radix_tree *tree) +{ + StringInfoData buf; + + initStringInfo(&buf); + + elog(NOTICE, "-----------------------------------------------------------"); + elog(NOTICE, "max_val = %lu", tree->max_val); + rt_dump_node(tree->root, 0, &buf, true); + elog(NOTICE, "\n%s", buf.data); + elog(NOTICE, "-----------------------------------------------------------"); +} +#endif diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h new file mode 100644 index 0000000000..38cc6abf4c --- /dev/null +++ b/src/include/lib/radixtree.h @@ 
-0,0 +1,42 @@ +/*------------------------------------------------------------------------- + * + * radixtree.h + * Interface for radix tree. + * + * Copyright (c) 2022, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/include/lib/radixtree.h + * + *------------------------------------------------------------------------- + */ +#ifndef RADIXTREE_H +#define RADIXTREE_H + +#include "postgres.h" + +/* #define RT_DEBUG 1 */ + +typedef struct radix_tree radix_tree; +typedef struct rt_iter rt_iter; + +extern radix_tree *rt_create(MemoryContext ctx); +extern void rt_free(radix_tree *tree); +extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p); +extern bool rt_set(radix_tree *tree, uint64 key, uint64 val); +extern rt_iter *rt_begin_iterate(radix_tree *tree); + +extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p); +extern void rt_end_iterate(rt_iter *iter); +extern bool rt_delete(radix_tree *tree, uint64 key); + +extern uint64 rt_memory_usage(radix_tree *tree); +extern uint64 rt_num_entries(radix_tree *tree); + +#ifdef RT_DEBUG +extern void rt_dump(radix_tree *tree); +extern void rt_dump_search(radix_tree *tree, uint64 key); +extern void rt_stats(radix_tree *tree); +#endif + +#endif /* RADIXTREE_H */ diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile index 6c31c8707c..8252ec41c4 100644 --- a/src/test/modules/Makefile +++ b/src/test/modules/Makefile @@ -25,6 +25,7 @@ SUBDIRS = \ test_parser \ test_pg_dump \ test_predtest \ + test_radixtree \ test_rbtree \ test_regex \ test_rls_hooks \ diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore new file mode 100644 index 0000000000..5dcb3ff972 --- /dev/null +++ b/src/test/modules/test_radixtree/.gitignore @@ -0,0 +1,4 @@ +# Generated subdirectories +/log/ +/results/ +/tmp_check/ diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile new file mode 100644 index 
0000000000..da06b93da3 --- /dev/null +++ b/src/test/modules/test_radixtree/Makefile @@ -0,0 +1,23 @@ +# src/test/modules/test_radixtree/Makefile + +MODULE_big = test_radixtree +OBJS = \ + $(WIN32RES) \ + test_radixtree.o +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" + +EXTENSION = test_radixtree +DATA = test_radixtree--1.0.sql + +REGRESS = test_radixtree + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = src/test/modules/test_radixtree +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README new file mode 100644 index 0000000000..a8b271869a --- /dev/null +++ b/src/test/modules/test_radixtree/README @@ -0,0 +1,7 @@ +test_radixtree contains unit tests for testing the radix tree implementation +in src/backend/lib/radixtree.c. + +The tests verify the correctness of the implementation, but they can also be +used as a micro-benchmark. If you set the 'rt_test_stats' flag in +test_radixtree.c, the tests will print extra information about execution time +and memory usage. diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out new file mode 100644 index 0000000000..cc6970c87c --- /dev/null +++ b/src/test/modules/test_radixtree/expected/test_radixtree.out @@ -0,0 +1,28 @@ +CREATE EXTENSION test_radixtree; +-- +-- All the logic is in the test_radixtree() function. It will throw +-- an error if something fails. 
+-- +SELECT test_radixtree(); +NOTICE: testing radix tree node types with shift "0" +NOTICE: testing radix tree node types with shift "8" +NOTICE: testing radix tree node types with shift "16" +NOTICE: testing radix tree node types with shift "24" +NOTICE: testing radix tree node types with shift "32" +NOTICE: testing radix tree node types with shift "40" +NOTICE: testing radix tree node types with shift "48" +NOTICE: testing radix tree node types with shift "56" +NOTICE: testing radix tree with pattern "all ones" +NOTICE: testing radix tree with pattern "alternating bits" +NOTICE: testing radix tree with pattern "clusters of ten" +NOTICE: testing radix tree with pattern "clusters of hundred" +NOTICE: testing radix tree with pattern "one-every-64k" +NOTICE: testing radix tree with pattern "sparse" +NOTICE: testing radix tree with pattern "single values, distance > 2^32" +NOTICE: testing radix tree with pattern "clusters, distance > 2^32" +NOTICE: testing radix tree with pattern "clusters, distance > 2^60" + test_radixtree +---------------- + +(1 row) + diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql new file mode 100644 index 0000000000..41ece5e9f5 --- /dev/null +++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql @@ -0,0 +1,7 @@ +CREATE EXTENSION test_radixtree; + +-- +-- All the logic is in the test_radixtree() function. It will throw +-- an error if something fails. +-- +SELECT test_radixtree(); diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql new file mode 100644 index 0000000000..074a5a7ea7 --- /dev/null +++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql @@ -0,0 +1,8 @@ +/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION test_radixtree" to load this file. 
\quit + +CREATE FUNCTION test_radixtree() +RETURNS pg_catalog.void STRICT +AS 'MODULE_PATHNAME' LANGUAGE C; diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c new file mode 100644 index 0000000000..a4aa80a99c --- /dev/null +++ b/src/test/modules/test_radixtree/test_radixtree.c @@ -0,0 +1,504 @@ +/*-------------------------------------------------------------------------- + * + * test_radixtree.c + * Test radixtree set data structure. + * + * Copyright (c) 2022, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/test/modules/test_radixtree/test_radixtree.c + * + * ------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "common/pg_prng.h" +#include "fmgr.h" +#include "lib/radixtree.h" +#include "miscadmin.h" +#include "nodes/bitmapset.h" +#include "storage/block.h" +#include "storage/itemptr.h" +#include "utils/memutils.h" +#include "utils/timestamp.h" + +#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X" + +/* + * If you enable this, the "pattern" tests will print information about + * how long populating, probing, and iterating the test set takes, and + * how much memory the test set consumed. That can be used as + * micro-benchmark of various operations and input patterns (you might + * want to increase the number of values used in each of the test, if + * you do that, to reduce noise). + * + * The information is printed to the server's stderr, mostly because + * that's where MemoryContextStats() output goes. + */ +static const bool rt_test_stats = false; + +/* The maximum number of entries each node type can have */ +static int rt_node_max_entries[] = { + 4, /* RT_NODE_KIND_4 */ + 16, /* RT_NODE_KIND_16 */ + 32, /* RT_NODE_KIND_32 */ + 128, /* RT_NODE_KIND_128 */ + 256 /* RT_NODE_KIND_256 */ +}; + +/* + * A struct to define a pattern of integers, for use with the test_pattern() + * function. 
+ */ +typedef struct +{ + char *test_name; /* short name of the test, for humans */ + char *pattern_str; /* a bit pattern */ + uint64 spacing; /* pattern repeats at this interval */ + uint64 num_values; /* number of integers to set in total */ +} test_spec; + +/* Test patterns borrowed from test_integerset.c */ +static const test_spec test_specs[] = { + { + "all ones", "1111111111", + 10, 1000000 + }, + { + "alternating bits", "0101010101", + 10, 1000000 + }, + { + "clusters of ten", "1111111111", + 10000, 1000000 + }, + { + "clusters of hundred", + "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111", + 10000, 10000000 + }, + { + "one-every-64k", "1", + 65536, 1000000 + }, + { + "sparse", "100000000000000000000000000000001", + 10000000, 1000000 + }, + { + "single values, distance > 2^32", "1", + UINT64CONST(10000000000), 100000 + }, + { + "clusters, distance > 2^32", "10101010", + UINT64CONST(10000000000), 1000000 + }, + { + "clusters, distance > 2^60", "10101010", + UINT64CONST(2000000000000000000), + 23 /* can't be much higher than this, or we + * overflow uint64 */ + } +}; + +PG_MODULE_MAGIC; + +PG_FUNCTION_INFO_V1(test_radixtree); + +static void +test_empty(void) +{ + radix_tree *radixtree; + uint64 dummy; + + radixtree = rt_create(CurrentMemoryContext); + + if (rt_search(radixtree, 0, &dummy)) + elog(ERROR, "rt_search on empty tree returned true"); + + if (rt_search(radixtree, 1, &dummy)) + elog(ERROR, "rt_search on empty tree returned true"); + + if (rt_search(radixtree, PG_UINT64_MAX, &dummy)) + elog(ERROR, "rt_search on empty tree returned true"); + + if (rt_num_entries(radixtree) != 0) + elog(ERROR, "rt_num_entries on empty tree return non-zero"); + + rt_free(radixtree); +} + +/* + * Check if keys from start to end with the shift exist in the tree. 
+ */ +static void +check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end) +{ + for (int i = start; i < end; i++) + { + uint64 key = ((uint64) i << shift); + uint64 val; + + if (!rt_search(radixtree, key, &val)) + elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d", + key, end); + if (val != key) + elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT, + key, val, key); + } +} + +static void +test_node_types_insert(radix_tree *radixtree, uint8 shift) +{ + uint64 num_entries; + + for (int i = 0; i < 256; i++) + { + uint64 key = ((uint64) i << shift); + bool found; + + found = rt_set(radixtree, key, key); + + if (found) + elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key); + + for (int j = 0; j < lengthof(rt_node_max_entries); j++) + { + /* + * After filling all slots in each node type, check if the values are + * stored properly. + */ + if (i == (rt_node_max_entries[j] - 1)) + { + check_search_on_node(radixtree, shift, + (j == 0) ? 0 : rt_node_max_entries[j - 1], + rt_node_max_entries[j]); + break; + } + } + } + + num_entries = rt_num_entries(radixtree); + + if (num_entries != 256) + elog(ERROR, + "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, + num_entries, UINT64CONST(256)); +} + +static void +test_node_types_delete(radix_tree *radixtree, uint8 shift) +{ + uint64 num_entries; + + for (int i = 0; i < 256; i++) + { + uint64 key = ((uint64) i << shift); + bool found; + + found = rt_delete(radixtree, key); + + if (!found) + elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key); + } + + num_entries = rt_num_entries(radixtree); + + /* The tree must be empty */ + if (num_entries != 0) + elog(ERROR, + "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, + num_entries, UINT64CONST(0)); +} + +/* + * Test for inserting and deleting key-value pairs into each node type at the given shift + * level. 
+ */ +static void +test_node_types(uint8 shift) +{ + radix_tree *radixtree; + + elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift); + + radixtree = rt_create(CurrentMemoryContext); + + /* + * Insert and search entries for every node type at the 'shift' level, + * then delete all entries to make it empty, and insert and search + * entries again. + */ + test_node_types_insert(radixtree, shift); + test_node_types_delete(radixtree, shift); + test_node_types_insert(radixtree, shift); + + rt_free(radixtree); +} + +/* + * Test with a repeating pattern, defined by the 'spec'. + */ +static void +test_pattern(const test_spec *spec) +{ + radix_tree *radixtree; + rt_iter *iter; + MemoryContext radixtree_ctx; + TimestampTz starttime; + TimestampTz endtime; + uint64 n; + uint64 last_int; + uint64 ndeleted; + uint64 nbefore; + uint64 nafter; + int patternlen; + uint64 *pattern_values; + uint64 pattern_num_values; + + elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name); + if (rt_test_stats) + fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name); + + /* Pre-process the pattern, creating an array of integers from it. */ + patternlen = strlen(spec->pattern_str); + pattern_values = palloc(patternlen * sizeof(uint64)); + pattern_num_values = 0; + for (int i = 0; i < patternlen; i++) + { + if (spec->pattern_str[i] == '1') + pattern_values[pattern_num_values++] = i; + } + + /* + * Allocate the radix tree. + * + * Allocate it in a separate memory context, so that we can print its + * memory usage easily. + */ + radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext, + "radixtree test", + ALLOCSET_SMALL_SIZES); + MemoryContextSetIdentifier(radixtree_ctx, spec->test_name); + radixtree = rt_create(radixtree_ctx); + + /* + * Add values to the set. 
+ */ + starttime = GetCurrentTimestamp(); + + n = 0; + last_int = 0; + while (n < spec->num_values) + { + uint64 x = 0; + + for (int i = 0; i < pattern_num_values && n < spec->num_values; i++) + { + bool found; + + x = last_int + pattern_values[i]; + + found = rt_set(radixtree, x, x); + + if (found) + elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x); + + n++; + } + last_int += spec->spacing; + } + + endtime = GetCurrentTimestamp(); + + if (rt_test_stats) + fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n", + spec->num_values, (int) (endtime - starttime) / 1000); + + /* + * Print stats on the amount of memory used. + * + * We print the usage reported by rt_memory_usage(), as well as the + * stats from the memory context. They should be in the same ballpark, + * but it's hard to automate testing that, so if you're making changes to + * the implementation, just observe that manually. + */ + if (rt_test_stats) + { + uint64 mem_usage; + + /* + * Also print memory usage as reported by rt_memory_usage(). It + * should be in the same ballpark as the usage reported by + * MemoryContextStats(). + */ + mem_usage = rt_memory_usage(radixtree); + fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n", + mem_usage, (double) mem_usage / spec->num_values); + + MemoryContextStats(radixtree_ctx); + } + + /* Check that rt_num_entries works */ + n = rt_num_entries(radixtree); + if (n != spec->num_values) + elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values); + + /* + * Test random-access probes with rt_search() + */ + starttime = GetCurrentTimestamp(); + + for (n = 0; n < 100000; n++) + { + bool found; + bool expected; + uint64 x; + uint64 v; + + /* + * Pick next value to probe at random. We limit the probes to the + * last integer that we added to the set, plus an arbitrary constant + * (1000). 
There's no point in probing the whole 0 - 2^64 range, if + * only a small part of the integer space is used. We would very + * rarely hit values that are actually in the set. + */ + x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000); + + /* Do we expect this value to be present in the set? */ + if (x >= last_int) + expected = false; + else + { + uint64 idx = x % spec->spacing; + + if (idx >= patternlen) + expected = false; + else if (spec->pattern_str[idx] == '1') + expected = true; + else + expected = false; + } + + /* Is it present according to rt_search() ? */ + found = rt_search(radixtree, x, &v); + + if (found != expected) + elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected); + if (found && (v != x)) + elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT, + v, x); + } + endtime = GetCurrentTimestamp(); + if (rt_test_stats) + fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n", + n, (int) (endtime - starttime) / 1000); + + /* + * Test iterator + */ + starttime = GetCurrentTimestamp(); + + iter = rt_begin_iterate(radixtree); + n = 0; + last_int = 0; + while (n < spec->num_values) + { + for (int i = 0; i < pattern_num_values && n < spec->num_values; i++) + { + uint64 expected = last_int + pattern_values[i]; + uint64 x; + uint64 val; + + if (!rt_iterate_next(iter, &x, &val)) + break; + + if (x != expected) + elog(ERROR, + "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", + x, expected, i); + if (val != expected) + elog(ERROR, + "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", val, expected, i); + n++; + } + last_int += spec->spacing; + } + endtime = GetCurrentTimestamp(); + if (rt_test_stats) + fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n", + n, (int) (endtime - starttime) / 1000); + + if (n < spec->num_values) + elog(ERROR, "iterator stopped short after " UINT64_FORMAT " 
entries, expected " UINT64_FORMAT, n, spec->num_values); + if (n > spec->num_values) + elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values); + + /* + * Test random-access probes with rt_delete() + */ + starttime = GetCurrentTimestamp(); + + nbefore = rt_num_entries(radixtree); + ndeleted = 0; + for (n = 0; n < 100000; n++) + { + bool found; + uint64 x; + uint64 v; + + /* + * Pick next value to probe at random. We limit the probes to the + * last integer that we added to the set, plus an arbitrary constant + * (1000). There's no point in probing the whole 0 - 2^64 range, if + * only a small part of the integer space is used. We would very + * rarely hit values that are actually in the set. + */ + x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000); + + /* Is it present according to rt_search() ? */ + found = rt_search(radixtree, x, &v); + + if (!found) + continue; + + /* If the key is found, delete it and check again */ + if (!rt_delete(radixtree, x)) + elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x); + if (rt_search(radixtree, x, &v)) + elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x); + if (rt_delete(radixtree, x)) + elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x); + + ndeleted++; + } + endtime = GetCurrentTimestamp(); + if (rt_test_stats) + fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n", + ndeleted, (int) (endtime - starttime) / 1000); + + nafter = rt_num_entries(radixtree); + + /* Check that rt_num_entries works */ + if ((nbefore - ndeleted) != nafter) + elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions", + nafter, (nbefore - ndeleted), ndeleted); + + MemoryContextDelete(radixtree_ctx); +} + +Datum +test_radixtree(PG_FUNCTION_ARGS) +{ + test_empty(); + + for (int shift = 0; shift <= (64 - 8); shift += 8) + test_node_types(shift); + + /* Test different test patterns, with 
lots of entries */ + for (int i = 0; i < lengthof(test_specs); i++) + test_pattern(&test_specs[i]); + + PG_RETURN_VOID(); +} diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control new file mode 100644 index 0000000000..e53f2a3e0c --- /dev/null +++ b/src/test/modules/test_radixtree/test_radixtree.control @@ -0,0 +1,4 @@ +comment = 'Test code for radix tree' +default_version = '1.0' +module_pathname = '$libdir/test_radixtree' +relocatable = true -- 2.31.1
From 39f0019d95eb4808d235a07d107aee2ff46856e2 Mon Sep 17 00:00:00 2001 From: Masahiko Sawada <sawada.mshk@gmail.com> Date: Fri, 16 Sep 2022 11:57:03 +0900 Subject: [PATCH v6 3/3] tool for measuring radix tree performance --- contrib/bench_radix_tree/Makefile | 21 ++ .../bench_radix_tree--1.0.sql | 42 +++ contrib/bench_radix_tree/bench_radix_tree.c | 301 ++++++++++++++++++ .../bench_radix_tree/bench_radix_tree.control | 6 + contrib/bench_radix_tree/expected/bench.out | 13 + contrib/bench_radix_tree/sql/bench.sql | 16 + 6 files changed, 399 insertions(+) create mode 100644 contrib/bench_radix_tree/Makefile create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control create mode 100644 contrib/bench_radix_tree/expected/bench.out create mode 100644 contrib/bench_radix_tree/sql/bench.sql diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile new file mode 100644 index 0000000000..b8f70e12d1 --- /dev/null +++ b/contrib/bench_radix_tree/Makefile @@ -0,0 +1,21 @@ +# contrib/bench_radix_tree/Makefile + +MODULE_big = bench_radix_tree +OBJS = \ + bench_radix_tree.o + +EXTENSION = bench_radix_tree +DATA = bench_radix_tree--1.0.sql + +REGRESS = bench + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/bench_radix_tree +top_builddir = ../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql new file mode 100644 index 0000000000..6663abe6a4 --- /dev/null +++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql @@ -0,0 +1,42 @@ +/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit + +create function bench_shuffle_search( +minblk int4, +maxblk int4, +OUT nkeys int8, +OUT rt_mem_allocated int8, +OUT array_mem_allocated int8, +OUT rt_load_ms int8, +OUT array_load_ms int8, +OUT rt_search_ms int8, +OUT array_search_ms int8 +) +returns record +as 'MODULE_PATHNAME' +LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; + +create function bench_seq_search( +minblk int4, +maxblk int4, +OUT nkeys int8, +OUT rt_mem_allocated int8, +OUT array_mem_allocated int8, +OUT rt_load_ms int8, +OUT array_load_ms int8, +OUT rt_search_ms int8, +OUT array_search_ms int8 +) +returns record +as 'MODULE_PATHNAME' +LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; + +create function bench_load_random_int( +cnt int8, +OUT mem_allocated int8, +OUT load_ms int8) +returns record +as 'MODULE_PATHNAME' +LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c new file mode 100644 index 0000000000..5806ef7519 --- /dev/null +++ b/contrib/bench_radix_tree/bench_radix_tree.c @@ -0,0 +1,301 @@ +/*------------------------------------------------------------------------- + * + * bench_radix_tree.c + * + * Copyright (c) 2016-2022, PostgreSQL Global Development Group + * + * contrib/bench_radix_tree/bench_radix_tree.c + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "common/pg_prng.h" 
+#include "fmgr.h" +#include "funcapi.h" +#include "lib/radixtree.h" +#include "miscadmin.h" +#include "utils/timestamp.h" + +PG_MODULE_MAGIC; + +#define TIDS_PER_BLOCK_FOR_LOAD 30 +#define TIDS_PER_BLOCK_FOR_LOOKUP 50 + +PG_FUNCTION_INFO_V1(bench_seq_search); +PG_FUNCTION_INFO_V1(bench_shuffle_search); +PG_FUNCTION_INFO_V1(bench_load_random_int); + +static radix_tree *rt = NULL; +static ItemPointer itemptrs = NULL; + +static uint64 +tid_to_key_off(ItemPointer tid, uint32 *off) +{ + uint32 upper; + uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage); + int64 tid_i; + + Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage); + + tid_i = ItemPointerGetOffsetNumber(tid); + tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift; + + /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */ + *off = tid_i & ((1 << 6) - 1); + upper = tid_i >> 6; + Assert(*off < (sizeof(uint64) * BITS_PER_BYTE)); + + Assert(*off < 64); + + return upper; +} + +static int +shuffle_randrange(pg_prng_state *state, int lower, int upper) +{ + return (int) floor(pg_prng_double(state) * ((upper-lower)+0.999999)) + lower; +} + +/* Naive Fisher-Yates implementation */ +static void +shuffle_itemptrs(ItemPointer itemptr, uint64 nitems) +{ + /* reproducibility */ + pg_prng_state state; + + pg_prng_seed(&state, 0); + + for (int i = 0; i < nitems - 1; i++) + { + int j = shuffle_randrange(&state, i, nitems - 1); + ItemPointerData t = itemptrs[j]; + + itemptrs[j] = itemptrs[i]; + itemptrs[i] = t; + } +} + +static ItemPointer +generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p) +{ + ItemPointer tids; + uint64 maxitems; + uint64 ntids = 0; + + maxitems = (maxblk - minblk + 1) * ntids_per_blk; + tids = MemoryContextAllocHuge(TopTransactionContext, + sizeof(ItemPointerData) * maxitems); + + for (BlockNumber blk = minblk; blk < maxblk; blk++) + { + for (OffsetNumber off = FirstOffsetNumber; + off <= ntids_per_blk; off++) + { + CHECK_FOR_INTERRUPTS(); + + 
ItemPointerSetBlockNumber(&(tids[ntids]), blk); + ItemPointerSetOffsetNumber(&(tids[ntids]), off); + + ntids++; + } + } + + *ntids_p = ntids; + return tids; +} + +static int +vac_cmp_itemptr(const void *left, const void *right) +{ + BlockNumber lblk, + rblk; + OffsetNumber loff, + roff; + + lblk = ItemPointerGetBlockNumber((ItemPointer) left); + rblk = ItemPointerGetBlockNumber((ItemPointer) right); + + if (lblk < rblk) + return -1; + if (lblk > rblk) + return 1; + + loff = ItemPointerGetOffsetNumber((ItemPointer) left); + roff = ItemPointerGetOffsetNumber((ItemPointer) right); + + if (loff < roff) + return -1; + if (loff > roff) + return 1; + + return 0; +} + +static Datum +bench_search(FunctionCallInfo fcinfo, bool shuffle) +{ + BlockNumber minblk = PG_GETARG_INT32(0); + BlockNumber maxblk = PG_GETARG_INT32(1); + uint64 ntids; + uint64 key; + uint64 last_key = PG_UINT64_MAX; + uint64 val = 0; + ItemPointer tids; + TupleDesc tupdesc; + TimestampTz start_time, end_time; + long secs; + int usecs; + int64 rt_load_ms, rt_search_ms, ar_load_ms, ar_search_ms; + Datum values[7]; + bool nulls[7]; + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids); + + /* measure the load time of the radix tree */ + rt = rt_create(CurrentMemoryContext); + start_time = GetCurrentTimestamp(); + for (int i = 0; i < ntids; i++) + { + ItemPointer tid = &(tids[i]); + uint32 off; + + CHECK_FOR_INTERRUPTS(); + + key = tid_to_key_off(tid, &off); + + if (last_key != PG_UINT64_MAX && last_key != key) + { + rt_set(rt, last_key, val); + val = 0; + } + + last_key = key; + val |= (uint64) 1 << off; + } + if (last_key != PG_UINT64_MAX) + rt_set(rt, last_key, val); + + end_time = GetCurrentTimestamp(); + TimestampDifference(start_time, end_time, &secs, &usecs); + rt_load_ms = secs * 1000 + usecs / 1000; + + 
+	/* measure the load time of the array */
+	itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+									  sizeof(ItemPointerData) * ntids);
+	start_time = GetCurrentTimestamp();
+	for (int i = 0; i < ntids; i++)
+	{
+		ItemPointerSetBlockNumber(&(itemptrs[i]),
+								  ItemPointerGetBlockNumber(&(tids[i])));
+		ItemPointerSetOffsetNumber(&(itemptrs[i]),
+								   ItemPointerGetOffsetNumber(&(tids[i])));
+	}
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	ar_load_ms = secs * 1000 + usecs / 1000;
+
+	if (shuffle)
+		shuffle_itemptrs(tids, ntids);
+
+	/* measure the search time of the radix tree */
+	start_time = GetCurrentTimestamp();
+	for (int i = 0; i < ntids; i++)
+	{
+		ItemPointer	tid = &(tids[i]);
+		uint64		key, val;
+		uint32		off;
+
+		CHECK_FOR_INTERRUPTS();
+
+		key = tid_to_key_off(tid, &off);
+
+		rt_search(rt, key, &val);
+	}
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	rt_search_ms = secs * 1000 + usecs / 1000;
+
+	/* next, measure the search time of the array */
+	start_time = GetCurrentTimestamp();
+	for (int i = 0; i < ntids; i++)
+	{
+		ItemPointer	tid = &(tids[i]);
+
+		bsearch((void *) tid,
+				(void *) itemptrs,
+				ntids,
+				sizeof(ItemPointerData),
+				vac_cmp_itemptr);
+	}
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	ar_search_ms = secs * 1000 + usecs / 1000;
+
+	MemSet(nulls, false, sizeof(nulls));
+	values[0] = Int64GetDatum(rt_num_entries(rt));
+	values[1] = Int64GetDatum(rt_memory_usage(rt));
+	values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+	values[3] = Int64GetDatum(rt_load_ms);
+	values[4] = Int64GetDatum(ar_load_ms);
+	values[5] = Int64GetDatum(rt_search_ms);
+	values[6] = Int64GetDatum(ar_search_ms);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+	return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+	return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+	uint64		cnt = (uint64) PG_GETARG_INT64(0);
+	radix_tree *rt;
+	pg_prng_state state;
+	TupleDesc	tupdesc;
+	TimestampTz	start_time, end_time;
+	long		secs;
+	int			usecs;
+	int64		load_time_ms;
+	Datum		values[2];
+	bool		nulls[2];
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	pg_prng_seed(&state, 0);
+	rt = rt_create(CurrentMemoryContext);
+
+	start_time = GetCurrentTimestamp();
+	for (uint64 i = 0; i < cnt; i++)
+	{
+		uint64		key = pg_prng_uint64(&state);
+
+		rt_set(rt, key, key);
+	}
+	end_time = GetCurrentTimestamp();
+
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	load_time_ms = secs * 1000 + usecs / 1000;
+
+	MemSet(nulls, false, sizeof(nulls));
+	values[0] = Int64GetDatum(rt_memory_usage(rt));
+	values[1] = Int64GetDatum(load_time_ms);
+
+	rt_free(rt);
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
-- 
2.31.1