On Mon, Aug 15, 2022 at 10:39 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
> On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
> >
> > On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.m...@gmail.com> 
> > wrote:
> > >
> > > On Tue, Jul 19, 2022 at 1:30 PM John Naylor
> > > <john.nay...@enterprisedb.com> wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.m...@gmail.com> 
> > > > wrote:
> > > >
> > > > > I’d like to keep the first version simple. We can improve it and add
> > > > > more optimizations later. Using a radix tree for vacuum TID storage
> > > > > would still be a big win compared to using a flat array, even without
> > > > > all these optimizations. As for the single-value leaves method, I'm
> > > > > also concerned about an extra pointer traversal and extra memory
> > > > > allocation. It's the most flexible, but the multi-value leaves method
> > > > > is also flexible enough for many use cases. Using the single-value
> > > > > method seems to be too much as the first step for me.
> > > > >
> > > > > Overall, using 64-bit keys and 64-bit values would be a reasonable
> > > > > choice for me as the first step. It can cover wider use cases,
> > > > > including the vacuum TID use case, and possibly more by combining it
> > > > > with a hash table or using a tree of trees, for example.
> > > >
> > > > These two aspects would also bring it closer to Andres' prototype,
> > > > which 1) makes review easier and 2) makes it easier to preserve
> > > > optimization work already done, so +1 from me.
> > >
> > > Thanks.
> > >
> > > I've updated the patch. It now implements 64-bit keys, 64-bit values,
> > > and the multi-value leaves method. I've tried to remove duplicated
> > > code, but we might find a better way to do that.
> > >
> >
> > With the recent changes related to SIMD, I'm going to split the patch
> > into at least two parts: one introducing the other SIMD-optimized
> > functions used by the radix tree, and one with the radix tree
> > implementation. In particular, we need two functions for the radix tree:
> > a function like pg_lfind32 but for 8-bit integers that returns the
> > index, and a function that returns the index of the first element >= key.
>
> I recommend looking at
>
> https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com
>
> since I did the work just now for searching bytes and returning a
> bool, both = and <=. Should be pretty close. Also, I believe if you
> leave this for last as a possible refactoring, it might save some work.
> In any case, I'll take a look at the latest patch next month.

I've updated the radix tree patch. It's now split into two patches.

0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names), which are similar to the pg_lfind8() family but return
the index of the key in the vector instead of true/false. The patch
includes regression tests.

0002 patch is the main radix tree implementation. I've removed some
duplicated node-manipulation code. For instance, since node-4,
node-16, and node-32 have a similar structure with different fanouts,
I introduced common functions for them.

In addition to these two patches, I've attached a third patch. It's
not part of the radix tree implementation but introduces a contrib
module, bench_radix_tree, a tool for benchmarking radix tree
performance. It measures the loading and lookup performance of both
the radix tree and a flat array.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 5d0115b068ecb01d791eab5f8a78a6d25b9cf45c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:01 +0000
Subject: [PATCH v6 1/3] Support pg_lsearch8_eq and pg_lsearch8_ge

---
 src/include/port/pg_lfind.h                   |  71 ++++++++
 src/include/port/simd.h                       | 155 +++++++++++++++++-
 .../test_lfind/expected/test_lfind.out        |  12 ++
 .../modules/test_lfind/sql/test_lfind.sql     |   2 +
 .../modules/test_lfind/test_lfind--1.0.sql    |   8 +
 src/test/modules/test_lfind/test_lfind.c      | 139 ++++++++++++++++
 6 files changed, 378 insertions(+), 9 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 0625cac6b5..583f204763 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,77 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+/*
+ * pg_lsearch8
+ *
+ * Return the index of the first element in 'base' that equals 'key', or -1
+ * if there is no such element.
+ */
+static inline int
+pg_lsearch8(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		int idx;
+
+		vector8_load(&chunk, &base[i]);
+		if ((idx = vector8_search_eq(chunk, key)) != -1)
+			return i + idx;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return i;
+	}
+
+	return -1;
+}
+
+
+/*
+ * pg_lsearch8_ge
+ *
+ * Return the index of the first element in 'base' that is greater than or equal to
+ * 'key'. Return nelem if there is no such element.
+ *
+ * Note that this function assumes the elements in 'base' are sorted.
+ */
+static inline int
+pg_lsearch8_ge(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		int idx;
+
+		vector8_load(&chunk, &base[i]);
+		if ((idx = vector8_search_ge(chunk, key)) != sizeof(Vector8))
+			return i + idx;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (base[i] >= key)
+			break;
+	}
+
+	return i;
+}
+
 /*
  * pg_lfind32
  *
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..e2a99578a5 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -18,6 +18,8 @@
 #ifndef SIMD_H
 #define SIMD_H
 
+#include "port/pg_bitutils.h"
+
 #if (defined(__x86_64__) || defined(_M_AMD64))
 /*
  * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
@@ -88,14 +90,9 @@ static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
 static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 #endif
 
-/*
- * comparisons between vectors
- *
- * Note: These return a vector rather than boolean, which is why we don't
- * have non-SIMD implementations.
- */
-#ifndef USE_NO_SIMD
+/* comparisons between vectors */
 static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+#ifndef USE_NO_SIMD
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 #endif
 
@@ -277,6 +274,140 @@ vector8_is_highbit_set(const Vector8 v)
 #endif
 }
 
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+	return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+	static const uint8 mask[16] = {
+		1 << 0, 1 << 1, 1 << 2, 1 << 3,
+		1 << 4, 1 << 5, 1 << 6, 1 << 7,
+		1 << 0, 1 << 1, 1 << 2, 1 << 3,
+		1 << 4, 1 << 5, 1 << 6, 1 << 7,
+	};
+
+	uint8x16_t	masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+	uint8x16_t	maskedhi = vextq_u8(masked, masked, 8);
+
+	return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+	uint32 mask = 0;
+
+	for (Size i = 0; i < sizeof(Vector8); i++)
+		mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+	return mask;
+#endif
+}
+
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+	return vminq_u8(v1, v2);
+#else /* USE_NO_SIMD */
+	Vector8 r = 0;
+	uint8 *rp = (uint8 *) &r;
+
+	for (Size i = 0; i < sizeof(Vector8); i++)
+		rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+	return r;
+#endif
+}
+
+/*
+ * Return the index of the first element in the vector that equals the given
+ * scalar, or -1 if there is no such element.
+ */
+static inline int
+vector8_search_eq(const Vector8 v, const uint8 c)
+{
+	Vector8 keys = vector8_broadcast(c);
+	Vector8	cmp;
+	uint32	mask;
+	int		result;
+
+#ifdef USE_ASSERT_CHECKING
+	int		assert_result = -1;
+
+	for (Size i = 0; i < sizeof(Vector8); i++)
+	{
+		if (((const uint8 *) &v)[i] == c)
+		{
+			assert_result = i;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+	cmp = vector8_eq(keys, v);
+	mask = vector8_highbit_mask(cmp);
+
+	if (mask)
+		result = pg_rightmost_one_pos32(mask);
+	else
+		result = -1;
+
+	Assert(assert_result == result);
+	return result;
+}
+
+/*
+ * Return the index of the first element in the vector that is greater than
+ * or equal to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.
+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */
+static inline int
+vector8_search_ge(const Vector8 v, const uint8 c)
+{
+	Vector8 keys = vector8_broadcast(c);
+	Vector8 min;
+	Vector8	cmp;
+	uint32	mask;
+	int		result;
+
+#ifdef USE_ASSERT_CHECKING
+	int		assert_result = -1;
+	Size	i;
+
+	for (i = 0; i < sizeof(Vector8); i++)
+	{
+		if (((const uint8 *) &v)[i] >= c)
+			break;
+	}
+	assert_result = i;
+#endif							/* USE_ASSERT_CHECKING */
+
+	/*
+	 * This is a bit more complicated than vector8_search_eq() because there
+	 * is no unsigned 8-bit comparison instruction in SSE2. Instead, we use
+	 * vector8_min() and an equality check: c == min(v[i], c) iff v[i] >= c.
+	 */
+	min = vector8_min(v, keys);
+	cmp = vector8_eq(keys, min);
+	mask = vector8_highbit_mask(cmp);
+
+	if (mask)
+		result = pg_rightmost_one_pos32(mask);
+	else
+		result = sizeof(Vector8);
+
+	Assert(assert_result == result);
+	return result;
+}
+
 /*
  * Exactly like vector8_is_highbit_set except for the input type, so it
  * looks at each byte separately.
@@ -348,7 +479,6 @@ vector8_ssub(const Vector8 v1, const Vector8 v2)
  * Return a vector with all bits set in each lane where the the corresponding
  * lanes in the inputs are equal.
  */
-#ifndef USE_NO_SIMD
 static inline Vector8
 vector8_eq(const Vector8 v1, const Vector8 v2)
 {
@@ -356,9 +486,16 @@ vector8_eq(const Vector8 v1, const Vector8 v2)
 	return _mm_cmpeq_epi8(v1, v2);
 #elif defined(USE_NEON)
 	return vceqq_u8(v1, v2);
+#else /* USE_NO_SIMD */
+	Vector8 r = 0;
+	uint8 *rp = (uint8 *) &r;
+
+	for (Size i = 0; i < sizeof(Vector8); i++)
+		rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
+
+	return r;
 #endif
 }
-#endif							/* ! USE_NO_SIMD */
 
 #ifndef USE_NO_SIMD
 static inline Vector32
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 1d4b14e703..9416161955 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -22,3 +22,15 @@ SELECT test_lfind32();
  
 (1 row)
 
+SELECT test_lsearch8();
+ test_lsearch8 
+---------------
+ 
+(1 row)
+
+SELECT test_lsearch8_ge();
+ test_lsearch8_ge 
+------------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 766c640831..d0dbb142ec 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -8,3 +8,5 @@ CREATE EXTENSION test_lfind;
 SELECT test_lfind8();
 SELECT test_lfind8_le();
 SELECT test_lfind32();
+SELECT test_lsearch8();
+SELECT test_lsearch8_ge();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index 81801926ae..13857cec3b 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -14,3 +14,11 @@ CREATE FUNCTION test_lfind8()
 CREATE FUNCTION test_lfind8_le()
 	RETURNS pg_catalog.void
 	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lsearch8()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lsearch8_ge()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index 82673d54c6..c494c27436 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "fmgr.h"
+#include "lib/stringinfo.h"
 #include "port/pg_lfind.h"
 
 /*
@@ -115,6 +116,144 @@ test_lfind8_le(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+static void
+test_lsearch8_internal(uint8 key)
+{
+	uint8		charbuf[LEN_WITH_TAIL(Vector8)];
+	const int	len_no_tail = LEN_NO_TAIL(Vector8);
+	const int	len_with_tail = LEN_WITH_TAIL(Vector8);
+	int			keypos;
+
+	memset(charbuf, 0xFF, len_with_tail);
+	/* search tail to test one-byte-at-a-time path */
+	keypos = len_with_tail - 1;
+	charbuf[keypos] = key;
+	if (key > 0x00 && (pg_lsearch8(key - 1, charbuf, len_with_tail) != -1))
+		elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key - 1);
+	if (key < 0xFF && (pg_lsearch8(key, charbuf, len_with_tail) != keypos))
+		elog(ERROR, "pg_lsearch8() did not find existing element '0x%x'", key);
+	if (key < 0xFE && (pg_lsearch8(key + 1, charbuf, len_with_tail) != -1))
+		elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, len_with_tail);
+	/* search with vector operations */
+	keypos = len_no_tail - 1;
+	charbuf[keypos] = key;
+	if (key > 0x00 && (pg_lsearch8(key - 1, charbuf, len_no_tail) != -1))
+		elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key - 1);
+	if (key < 0xFF && (pg_lsearch8(key, charbuf, len_no_tail) != keypos))
+		elog(ERROR, "pg_lsearch8() did not find existing element '0x%x'", key);
+	if (key < 0xFE && (pg_lsearch8(key + 1, charbuf, len_no_tail) != -1))
+		elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lsearch8);
+Datum
+test_lsearch8(PG_FUNCTION_ARGS)
+{
+	test_lsearch8_internal(0);
+	test_lsearch8_internal(1);
+	test_lsearch8_internal(0x7F);
+	test_lsearch8_internal(0x80);
+	test_lsearch8_internal(0x81);
+	test_lsearch8_internal(0xFD);
+	test_lsearch8_internal(0xFE);
+	test_lsearch8_internal(0xFF);
+
+	PG_RETURN_VOID();
+}
+
+static void
+report_lsearch8_error(uint8 *buf, int size, uint8 key, int result, int expected)
+{
+	StringInfoData bufstr;
+	char *sep = "";
+
+	initStringInfo(&bufstr);
+
+	for (int i = 0; i < size; i++)
+	{
+		appendStringInfo(&bufstr, "%s0x%02x", sep, buf[i]);
+		sep = ",";
+	}
+
+	elog(ERROR,
+		 "pg_lsearch8_ge returned %d, expected %d, key 0x%02x buffer %s",
+		 result, expected, key, bufstr.data);
+}
+
+/* workhorse for test_lsearch8_ge */
+static void
+test_lsearch8_ge_internal(uint8 *buf, uint8 key)
+{
+	const int	len_no_tail = LEN_NO_TAIL(Vector8);
+	const int	len_with_tail = LEN_WITH_TAIL(Vector8);
+	int			expected;
+	int			result;
+	int			i;
+
+	/* search tail to test one-byte-at-a-time path */
+	for (i = 0; i < len_with_tail; i++)
+	{
+		if (buf[i] >= key)
+			break;
+	}
+	expected = i;
+	result = pg_lsearch8_ge(key, buf, len_with_tail);
+
+	if (result != expected)
+		report_lsearch8_error(buf, len_with_tail, key, result, expected);
+
+	/* search with vector operations */
+	for (i = 0; i < len_no_tail; i++)
+	{
+		if (buf[i] >= key)
+			break;
+	}
+	expected = i;
+	result = pg_lsearch8_ge(key, buf, len_no_tail);
+
+	if (result != expected)
+		report_lsearch8_error(buf, len_no_tail, key, result, expected);
+}
+
+static int
+cmp(const void *p1, const void *p2)
+{
+	uint8	v1 = *((const uint8 *) p1);
+	uint8	v2 = *((const uint8 *) p2);
+
+	if (v1 < v2)
+		return -1;
+	if (v1 > v2)
+		return 1;
+	return 0;
+}
+
+PG_FUNCTION_INFO_V1(test_lsearch8_ge);
+Datum
+test_lsearch8_ge(PG_FUNCTION_ARGS)
+{
+	uint8		charbuf[LEN_WITH_TAIL(Vector8)];
+	const int	len_with_tail = LEN_WITH_TAIL(Vector8);
+
+	for (int i = 0; i < len_with_tail; i++)
+		charbuf[i] = (uint8) rand();
+
+	qsort(charbuf, len_with_tail, sizeof(uint8), cmp);
+
+	test_lsearch8_ge_internal(charbuf, 0);
+	test_lsearch8_ge_internal(charbuf, 1);
+	test_lsearch8_ge_internal(charbuf, 0x7F);
+	test_lsearch8_ge_internal(charbuf, 0x80);
+	test_lsearch8_ge_internal(charbuf, 0x81);
+	test_lsearch8_ge_internal(charbuf, 0xFD);
+	test_lsearch8_ge_internal(charbuf, 0xFE);
+	test_lsearch8_ge_internal(charbuf, 0xFF);
+
+	PG_RETURN_VOID();
+}
+
 PG_FUNCTION_INFO_V1(test_lfind32);
 Datum
 test_lfind32(PG_FUNCTION_ARGS)
-- 
2.31.1

From f49e91ec2a2dcb19259cbf1bc0fd73f36b29a201 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v6 2/3] Add radix tree implementation.

---
 src/backend/lib/Makefile                      |    1 +
 src/backend/lib/radixtree.c                   | 2225 +++++++++++++++++
 src/include/lib/radixtree.h                   |   42 +
 src/test/modules/Makefile                     |    1 +
 src/test/modules/test_radixtree/.gitignore    |    4 +
 src/test/modules/test_radixtree/Makefile      |   23 +
 src/test/modules/test_radixtree/README        |    7 +
 .../expected/test_radixtree.out               |   28 +
 .../test_radixtree/sql/test_radixtree.sql     |    7 +
 .../test_radixtree/test_radixtree--1.0.sql    |    8 +
 .../modules/test_radixtree/test_radixtree.c   |  504 ++++
 .../test_radixtree/test_radixtree.control     |    4 +
 12 files changed, 2854 insertions(+)
 create mode 100644 src/backend/lib/radixtree.c
 create mode 100644 src/include/lib/radixtree.h
 create mode 100644 src/test/modules/test_radixtree/.gitignore
 create mode 100644 src/test/modules/test_radixtree/Makefile
 create mode 100644 src/test/modules/test_radixtree/README
 create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
 create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
 create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
 create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
 create mode 100644 src/test/modules/test_radixtree/test_radixtree.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	integerset.o \
 	knapsack.o \
 	pairingheap.o \
+	radixtree.o \
 	rbtree.o \
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..b163eac480
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2225 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ *		Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of
+ * node types, each with a different number of elements. Depending on the
+ * number of children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling the use of
+ * 256-bit-wide SIMD vectors, whereas the paper uses 128-bit-wide vectors.
+ * Also, there is no support for path compression or lazy path expansion. The
+ * radix tree supports only fixed-length keys, so we don't expect the tree to
+ * become deep.
+ *
+ * Both the key and the value are 64-bit unsigned integers. Inner nodes and
+ * leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer values specified by the
+ * user. The paper refers to this technique as "Multi-value leaves".  We
+ * choose it to avoid an additional pointer traversal.  It is the reason this
+ * code currently does not support variable-length keys.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create		- Create a new, empty radix tree
+ * rt_free			- Free the radix tree
+ * rt_search		- Search a key-value pair
+ * rt_set			- Set a key-value pair
+ * rt_delete		- Delete a key-value pair
+ * rt_begin_iter	- Begin iterating through all key-value pairs
+ * rt_iter_next		- Return next key-value pair, if any
+ * rt_end_iter		- End iteration
+ * rt_memory_usage	- Get the memory usage
+ * rt_num_entries	- Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iter_next() returns key-value pairs in the ascending order of
+ * the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN	BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap covering nslots slots,
+ * used for nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT	key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL	((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX	0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) \
+	((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+	RT_ACTION_FIND = 0,			/* find the key-value */
+	RT_ACTION_DELETE,			/* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree nodes.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation,
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 48 -> 152 -> 296 -> 1304 -> 2088 bytes for inner/leaf nodes, leading to
+ * large amounts of allocator padding with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need a node-1 as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+typedef enum rt_node_kind
+{
+	RT_NODE_KIND_4 = 0,
+	RT_NODE_KIND_16,
+	RT_NODE_KIND_32,
+	RT_NODE_KIND_128,
+	RT_NODE_KIND_256
+} rt_node_kind;
+#define RT_NODE_KIND_COUNT (RT_NODE_KIND_256 + 1)
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct rt_node
+{
+	/*
+	 * Number of children.  We use uint16 to be able to indicate 256 children
+	 * at the fanout of 8.
+	 */
+	uint16		count;
+
+	/*
+	 * Shift indicates which part of the key space is represented by this
+	 * node. That is, the key is shifted by 'shift' and the lowest
+	 * RT_NODE_SPAN bits are then represented in chunk.
+	 */
+	uint8		shift;
+	uint8		chunk;
+
+	/* Size class of the node */
+	rt_node_kind kind;
+} rt_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((rt_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+	(((rt_node *) (n))->count < rt_node_info[((rt_node *) (n))->kind].fanout)
+
+/*
+ * Definitions of the base types for inner and leaf nodes of each node type.
+ */
+
+/*
+ * node-4, node-16, and node-32 have a similar structure but different
+ * fanouts. They have the same-length arrays of chunks and values (or child
+ * pointers in inner nodes). Values are stored at the positions corresponding
+ * to their chunks, and the chunks are kept sorted.
+*/
+typedef struct rd_node_base_4
+{
+	rt_node		n;
+
+	/* 4 children, for key chunks */
+	uint8		chunks[4];
+} rt_node_base_4;
+
+typedef struct rd_node_base_16
+{
+	rt_node		n;
+
+	/* 16 children, for key chunks */
+	uint8		chunks[16];
+} rt_node_base_16;
+
+typedef struct rd_node_base_32
+{
+	rt_node		n;
+
+	/* 32 children, for key chunks */
+	uint8		chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses the slot_idxs array, an array of RT_NODE_MAX_SLOTS (i.e., 256)
+ * entries, to store indexes into a second array that contains up to 128 values
+ * (or child pointers in inner nodes).
+ */
+typedef struct rd_node_base_128
+{
+	rt_node		n;
+
+	/* Index into the slots array for each possible chunk value */
+	uint8		slot_idxs[RT_NODE_MAX_SLOTS];
+
+	/* isset is a bitmap to track which slot is in use */
+	uint8		isset[RT_NODE_NSLOTS_BITS(128)];
+} rt_node_base_128;
+
+/*
+ * node-256 is the largest node type. It has an array of RT_NODE_MAX_SLOTS
+ * length for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rd_node_base_256
+{
+	rt_node		n;
+
+	/* isset is a bitmap to track which slot is in use */
+	uint8		isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf node size classes are separate from the inner ones for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ *    width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+	rt_node_base_4 base;
+
+	/* 4 children, for key chunks */
+	rt_node    *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+	rt_node_base_4 base;
+
+	/* 4 values, for key chunks */
+	uint64		values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_16
+{
+	rt_node_base_16 base;
+
+	/* 16 children, for key chunks */
+	rt_node    *children[16];
+} rt_node_inner_16;
+
+typedef struct rt_node_leaf_16
+{
+	rt_node_base_16 base;
+
+	/* 16 values, for key chunks */
+	uint64		values[16];
+} rt_node_leaf_16;
+
+typedef struct rt_node_inner_32
+{
+	rt_node_base_32 base;
+
+	/* 32 children, for key chunks */
+	rt_node    *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+	rt_node_base_32 base;
+
+	/* 32 values, for key chunks */
+	uint64		values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+	rt_node_base_128 base;
+
+	/* Slots for 128 children */
+	rt_node    *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+	rt_node_base_128 base;
+
+	/* Slots for 128 values */
+	uint64		values[128];
+} rt_node_leaf_128;
+
+typedef struct rt_node_inner_256
+{
+	rt_node_base_256 base;
+
+	/* Slots for 256 children */
+	rt_node    *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+	rt_node_base_256 base;
+
+	/* Slots for 256 values */
+	uint64		values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information about each size class */
+typedef struct rt_node_info_elem
+{
+	const char *name;
+	int			fanout;
+	Size		inner_size;
+	Size		leaf_size;
+} rt_node_info_elem;
+
+static rt_node_info_elem rt_node_info[RT_NODE_KIND_COUNT] = {
+
+	[RT_NODE_KIND_4] = {
+		.name = "radix tree node 4",
+		.fanout = 4,
+		.inner_size = sizeof(rt_node_inner_4),
+		.leaf_size = sizeof(rt_node_leaf_4),
+	},
+	[RT_NODE_KIND_16] = {
+		.name = "radix tree node 16",
+		.fanout = 16,
+		.inner_size = sizeof(rt_node_inner_16),
+		.leaf_size = sizeof(rt_node_leaf_16),
+	},
+	[RT_NODE_KIND_32] = {
+		.name = "radix tree node 32",
+		.fanout = 32,
+		.inner_size = sizeof(rt_node_inner_32),
+		.leaf_size = sizeof(rt_node_leaf_32),
+	},
+	[RT_NODE_KIND_128] = {
+		.name = "radix tree node 128",
+		.fanout = 128,
+		.inner_size = sizeof(rt_node_inner_128),
+		.leaf_size = sizeof(rt_node_leaf_128),
+	},
+	[RT_NODE_KIND_256] = {
+		.name = "radix tree node 256",
+		.fanout = 256,
+		.inner_size = sizeof(rt_node_inner_256),
+		.leaf_size = sizeof(rt_node_leaf_256),
+	},
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+	rt_node    *node;			/* current node being iterated */
+	int			current_idx;	/* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+	radix_tree *tree;
+
+	/* Track the iteration on nodes of each level */
+	rt_node_iter stack[RT_MAX_LEVEL];
+	int			stack_len;
+
+	/* The key being constructed during the iteration */
+	uint64		key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+	MemoryContext context;
+
+	rt_node    *root;
+	uint64		max_val;
+	uint64		num_keys;
+
+	MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+	MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+	/* statistics */
+#ifdef RT_DEBUG
+	int32		cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_copy_node_common(rt_node *src, rt_node *dst);
+static void rt_extend(radix_tree *tree, uint64 key);
+static bool rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p);
+static bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+								 rt_node **child_p);
+static bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+								uint64 *value_p);
+static rt_node *rt_node_add_new_child(radix_tree *tree, rt_node *parent,
+									  rt_node *node, uint64 key);
+static int	rt_node_prepare_insert(radix_tree *tree, rt_node *parent,
+								   rt_node **node_p, uint64 key,
+								   bool *will_replace_p);
+static void rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+								 uint64 key, rt_node *child, bool *replaced_p);
+static void rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+								uint64 key, uint64 value, bool *replaced_p);
+static rt_node *rt_node_grow(radix_tree *tree, rt_node *parent,
+							 rt_node *node, uint64 key);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+static void *rt_node_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+								  bool *found_p);
+static void rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+								rt_node *node);
+static pg_attribute_always_inline void rt_iter_update_key(rt_iter *iter, uint8 chunk,
+														  uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/* Return the array of children in the given inner node */
+static rt_node **
+rt_node_get_children(rt_node *node)
+{
+	rt_node   **children = NULL;
+
+	Assert(!IS_LEAF_NODE(node));
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+			children = (rt_node **) ((rt_node_inner_4 *) node)->children;
+			break;
+		case RT_NODE_KIND_16:
+			children = (rt_node **) ((rt_node_inner_16 *) node)->children;
+			break;
+		case RT_NODE_KIND_32:
+			children = (rt_node **) ((rt_node_inner_32 *) node)->children;
+			break;
+		case RT_NODE_KIND_128:
+			children = (rt_node **) ((rt_node_inner_128 *) node)->children;
+			break;
+		case RT_NODE_KIND_256:
+			children = (rt_node **) ((rt_node_inner_256 *) node)->children;
+			break;
+		default:
+			elog(ERROR, "unexpected node type %u", node->kind);
+	}
+
+	return children;
+}
+
+/* Return the array of values in the given leaf node */
+static uint64 *
+rt_node_get_values(rt_node *node)
+{
+	uint64	   *values = NULL;
+
+	Assert(IS_LEAF_NODE(node));
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+			values = ((rt_node_leaf_4 *) node)->values;
+			break;
+		case RT_NODE_KIND_16:
+			values = ((rt_node_leaf_16 *) node)->values;
+			break;
+		case RT_NODE_KIND_32:
+			values = ((rt_node_leaf_32 *) node)->values;
+			break;
+		case RT_NODE_KIND_128:
+			values = ((rt_node_leaf_128 *) node)->values;
+			break;
+		case RT_NODE_KIND_256:
+			values = ((rt_node_leaf_256 *) node)->values;
+			break;
+		default:
+			elog(ERROR, "unexpected node type %u", node->kind);
+	}
+
+	return values;
+}
+
+/*
+ * Node support functions for node-4, node-16, and node-32.
+ *
+ * These three node types have a similar structure: a sorted chunk array of
+ * varying length, paired with an array of either child pointers (inner nodes)
+ * or values (leaf nodes).
+ */
+#define CHECK_CHUNK_ARRAY_NODE(node) \
+	Assert(((((rt_node*) node)->kind) == RT_NODE_KIND_4) || \
+		   ((((rt_node*) node)->kind) == RT_NODE_KIND_16) || \
+		   ((((rt_node*) node)->kind) == RT_NODE_KIND_32))
+
+/* Get the pointer to either the child or the value at 'idx' */
+static void *
+chunk_array_node_get_slot(rt_node *node, int idx)
+{
+	void	   *slot;
+
+	CHECK_CHUNK_ARRAY_NODE(node);
+
+	if (IS_LEAF_NODE(node))
+	{
+		uint64	   *values = rt_node_get_values(node);
+
+		slot = (void *) &(values[idx]);
+	}
+	else
+	{
+		rt_node   **children = rt_node_get_children(node);
+
+		slot = (void *) children[idx];
+	}
+
+	return slot;
+}
+
+/* Return the chunk array in the node */
+static uint8 *
+chunk_array_node_get_chunks(rt_node *node)
+{
+	uint8	   *chunk = NULL;
+
+	CHECK_CHUNK_ARRAY_NODE(node);
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+			chunk = (uint8 *) ((rt_node_base_4 *) node)->chunks;
+			break;
+		case RT_NODE_KIND_16:
+			chunk = (uint8 *) ((rt_node_base_16 *) node)->chunks;
+			break;
+		case RT_NODE_KIND_32:
+			chunk = (uint8 *) ((rt_node_base_32 *) node)->chunks;
+			break;
+		default:
+			/* this function doesn't support node-128 and node-256 */
+			elog(ERROR, "unsupported node type %u", node->kind);
+	}
+
+	return chunk;
+}
+
+/* Copy the contents of the node from 'src' to 'dst' */
+static void
+chunk_array_node_copy_contents(rt_node *src, rt_node *dst)
+{
+	uint8	   *chunks_src,
+			   *chunks_dst;
+
+	CHECK_CHUNK_ARRAY_NODE(src);
+	CHECK_CHUNK_ARRAY_NODE(dst);
+
+	/* Copy base type */
+	rt_copy_node_common(src, dst);
+
+	/* Copy chunk array */
+	chunks_src = chunk_array_node_get_chunks(src);
+	chunks_dst = chunk_array_node_get_chunks(dst);
+	memcpy(chunks_dst, chunks_src, sizeof(uint8) * src->count);
+
+	/* Copy children or values */
+	if (IS_LEAF_NODE(src))
+	{
+		uint64	   *values_src,
+				   *values_dst;
+
+		Assert(IS_LEAF_NODE(dst));
+		values_src = rt_node_get_values(src);
+		values_dst = rt_node_get_values(dst);
+		memcpy(values_dst, values_src, sizeof(uint64) * src->count);
+	}
+	else
+	{
+		rt_node   **children_src,
+				  **children_dst;
+
+		Assert(!IS_LEAF_NODE(dst));
+		children_src = rt_node_get_children(src);
+		children_dst = rt_node_get_children(dst);
+		memcpy(children_dst, children_src, sizeof(rt_node *) * src->count);
+	}
+}
+
+/*
+ * Return the index in the (sorted) chunk array at which the chunk should be
+ * inserted. Set *found_p to true if the chunk already exists in the array.
+ */
+static int
+chunk_array_node_find_insert_pos(rt_node *node, uint8 chunk, bool *found_p)
+{
+	uint8	   *chunks;
+	int			idx;
+
+	CHECK_CHUNK_ARRAY_NODE(node);
+
+	*found_p = false;
+	chunks = chunk_array_node_get_chunks(node);
+
+	/* Find the insert pos */
+	idx = pg_lsearch8_ge(chunk, chunks, node->count);
+
+	if (idx < node->count && chunks[idx] == chunk)
+		*found_p = true;
+
+	return idx;
+}
+
+/* Delete the chunk at idx */
+static void
+chunk_array_node_delete(rt_node *node, int idx)
+{
+	uint8	   *chunks = chunk_array_node_get_chunks(node);
+
+	/* delete the chunk from the chunk array */
+	memmove(&(chunks[idx]), &(chunks[idx + 1]),
+			sizeof(uint8) * (node->count - idx - 1));
+
+	/* delete either the value or the child as well */
+	if (IS_LEAF_NODE(node))
+	{
+		uint64	   *values = rt_node_get_values(node);
+
+		memmove(&(values[idx]),
+				&(values[idx + 1]),
+				sizeof(uint64) * (node->count - idx - 1));
+	}
+	else
+	{
+		rt_node   **children = rt_node_get_children(node);
+
+		memmove(&(children[idx]),
+				&(children[idx + 1]),
+				sizeof(rt_node *) * (node->count - idx - 1));
+	}
+}
+
+/* Support functions for node-128 */
+
+/* Is the given chunk in the node in use? */
+static pg_attribute_always_inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+	return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static pg_attribute_always_inline bool
+node_128_is_slot_used(rt_node_base_128 *node, uint8 slot)
+{
+	return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_128_get_slot(rt_node_base_128 *node, uint8 chunk)
+{
+	int			slotpos;
+	void	   *slot;
+
+	slotpos = node->slot_idxs[chunk];
+	Assert(slotpos != RT_NODE_128_INVALID_IDX);
+
+	if (IS_LEAF_NODE(node))
+		slot = (void *) &(((rt_node_leaf_128 *) node)->values[slotpos]);
+	else
+		slot = (void *) (((rt_node_inner_128 *) node)->children[slotpos]);
+
+	return slot;
+}
+
+/* Delete the chunk in the node */
+static void
+node_128_delete(rt_node_base_128 *node, uint8 chunk)
+{
+	int			slotpos = node->slot_idxs[chunk];
+
+	node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+	node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_128_find_unused_slot(rt_node_base_128 *node, uint8 chunk)
+{
+	int			slotpos;
+
+	/*
+	 * Find an unused slot. We iterate over the isset bitmap per byte then
+	 * check each bit.
+	 */
+	for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+	{
+		if (node->isset[slotpos] < 0xFF)
+			break;
+	}
+	Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+	slotpos *= BITS_PER_BYTE;
+	while (node_128_is_slot_used(node, slotpos))
+		slotpos++;
+
+	return slotpos;
+}
+
+
+/* Set the child at the corresponding chunk. XXX: duplicated with node_128_set_leaf */
+static void
+node_128_set_inner(rt_node_base_128 *node, uint8 chunk, rt_node *child)
+{
+	int			slotpos;
+	rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+	/* Overwrite the existing value if exists */
+	if (node_128_is_chunk_used(node, chunk))
+	{
+		n128->children[n128->base.slot_idxs[chunk]] = child;
+		return;
+	}
+
+	/* find unused slot */
+	slotpos = node_128_find_unused_slot(node, chunk);
+
+	n128->base.slot_idxs[chunk] = slotpos;
+	n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+	n128->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static void
+node_128_set_leaf(rt_node_base_128 *node, uint8 chunk, uint64 value)
+{
+	int			slotpos;
+	rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+	/* Overwrite the existing value if exists */
+	if (node_128_is_chunk_used(node, chunk))
+	{
+		n128->values[n128->base.slot_idxs[chunk]] = value;
+		return;
+	}
+
+	/* find unused slot */
+	slotpos = node_128_find_unused_slot(node, chunk);
+
+	n128->base.slot_idxs[chunk] = slotpos;
+	n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+	n128->values[slotpos] = value;
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static bool
+node_256_is_chunk_used(rt_node_base_256 *node, uint8 chunk)
+{
+	return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_256_get_slot(rt_node_base_256 *node, uint8 chunk)
+{
+	void	   *slot;
+
+	Assert(node_256_is_chunk_used(node, chunk));
+	if (IS_LEAF_NODE(node))
+		slot = (void *) &(((rt_node_leaf_256 *) node)->values[chunk]);
+	else
+		slot = (void *) (((rt_node_inner_256 *) node)->children[chunk]);
+
+	return slot;
+}
+
+/* Set the child in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_inner(rt_node_base_256 *node, uint8 chunk, rt_node *child)
+{
+	rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+	n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+	n256->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_leaf(rt_node_base_256 *node, uint8 chunk, uint64 value)
+{
+	rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+	n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+	n256->values[chunk] = value;
+}
+
+/* Delete the slot at the given chunk position */
+static pg_attribute_always_inline void
+node_256_delete(rt_node_base_256 *node, uint8 chunk)
+{
+	node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift required to store the given key.
+ */
+static pg_attribute_always_inline int
+key_get_shift(uint64 key)
+{
+	return (key == 0)
+		? 0
+		: (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a node with the given
+ * shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+	if (shift == RT_MAX_SHIFT)
+		return UINT64_MAX;
+
+	return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+	int			shift = key_get_shift(key);
+	rt_node    *node;
+
+	node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift > 0);
+	node->shift = shift;
+	tree->max_val = shift_get_max_val(shift);
+	tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner)
+{
+	rt_node    *newnode;
+
+	if (inner)
+		newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+													 rt_node_info[kind].inner_size);
+	else
+		newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+													 rt_node_info[kind].leaf_size);
+
+	newnode->kind = kind;
+
+	/* Initialize slot_idxs to invalid values */
+	if (kind == RT_NODE_KIND_128)
+	{
+		rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+		memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+	}
+
+#ifdef RT_DEBUG
+	/* update the statistics */
+	tree->cnt[kind]++;
+#endif
+
+	return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+	/* If we're deleting the root node, make the tree empty */
+	if (tree->root == node)
+		tree->root = NULL;
+
+#ifdef RT_DEBUG
+	/* update the statistics */
+	tree->cnt[node->kind]--;
+
+	Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+	pfree(node);
+}
+
+/* Copy the common fields without the node kind */
+static void
+rt_copy_node_common(rt_node *src, rt_node *dst)
+{
+	dst->shift = src->shift;
+	dst->chunk = src->chunk;
+	dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+	int			target_shift;
+	int			shift = tree->root->shift + RT_NODE_SPAN;
+
+	target_shift = key_get_shift(key);
+
+	/* Grow tree from 'shift' to 'target_shift' */
+	while (shift <= target_shift)
+	{
+		rt_node_inner_4 *node =
+		(rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4, true);
+
+		node->base.n.count = 1;
+		node->base.n.shift = shift;
+		node->base.chunks[0] = 0;
+		node->children[0] = tree->root;
+
+		tree->root->chunk = 0;
+		tree->root = (rt_node *) node;
+
+		shift += RT_NODE_SPAN;
+	}
+
+	tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Search for the given key in the node. Return true if the key is found,
+ * otherwise return false. On success, we perform the specified action for the
+ * key, and for RT_ACTION_FIND we set the pointer to the slot to *slot_p.
+ */
+static bool
+rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p)
+{
+	uint8		chunk = RT_GET_KEY_CHUNK(key, node->shift);
+	bool		found = false;
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+		case RT_NODE_KIND_16:
+		case RT_NODE_KIND_32:
+			{
+				int			idx;
+				uint8	   *chunks = chunk_array_node_get_chunks(node);
+
+				idx = pg_lsearch8(chunk, chunks, node->count);
+				if (idx < 0)
+					break;
+
+				found = true;
+				if (action == RT_ACTION_FIND)
+					*slot_p = chunk_array_node_get_slot(node, idx);
+				else			/* RT_ACTION_DELETE */
+					chunk_array_node_delete(node, idx);
+
+				break;
+			}
+		case RT_NODE_KIND_128:
+			{
+				rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+				if (!node_128_is_chunk_used(n128, chunk))
+					break;
+
+				found = true;
+				if (action == RT_ACTION_FIND)
+					*slot_p = node_128_get_slot(n128, chunk);
+				else		/* RT_ACTION_DELETE */
+					node_128_delete(n128, chunk);
+
+				break;
+			}
+		case RT_NODE_KIND_256:
+			{
+				rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+				if (!node_256_is_chunk_used(n256, chunk))
+					break;
+
+				found = true;
+				if (action == RT_ACTION_FIND)
+					*slot_p = node_256_get_slot(n256, chunk);
+				else		/* RT_ACTION_DELETE */
+					node_256_delete(n256, chunk);
+
+				break;
+			}
+	}
+
+	/* Update the statistics */
+	if (action == RT_ACTION_DELETE && found)
+		node->count--;
+
+	return found;
+}
+
+/*
+ * Search for the child pointer corresponding to the key in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+	rt_node    *child;
+
+	if (!rt_node_search(node, key, action, (void **) &child))
+		return false;
+
+	if (child_p)
+		*child_p = child;
+
+	return true;
+}
+
+/*
+ * Search for the value corresponding to the key in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+	uint64	   *value;
+
+	if (!rt_node_search(node, key, action, (void **) &value))
+		return false;
+
+	if (value_p)
+		*value_p = *value;
+
+	return true;
+}
+
+/* Allocate a new child node of 'node' and insert it for the given key */
+static rt_node *
+rt_node_add_new_child(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+	uint8		newshift = node->shift - RT_NODE_SPAN;
+	rt_node    *newchild =
+	(rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, newshift > 0);
+
+	Assert(!IS_LEAF_NODE(node));
+
+	newchild->shift = newshift;
+	newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+	rt_node_insert_inner(tree, parent, node, key, newchild, NULL);
+
+	return (rt_node *) newchild;
+}
+
+/*
+ * For an upcoming insertion, make sure that the node has enough free slots,
+ * growing the node if necessary. node_p is updated with the grown node. We
+ * set *will_replace_p to true to tell the caller that the given chunk already
+ * exists in the node.
+ *
+ * Return the index in the chunk array where the key can be inserted. We
+ * always return 0 in the node-128 and node-256 cases.
+ */
+static int
+rt_node_prepare_insert(radix_tree *tree, rt_node *parent, rt_node **node_p,
+					   uint64 key, bool *will_replace_p)
+{
+	rt_node    *node = *node_p;
+	uint8		chunk = RT_GET_KEY_CHUNK(key, node->shift);
+	bool		will_replace = false;
+	int			idx = 0;
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+		case RT_NODE_KIND_16:
+		case RT_NODE_KIND_32:
+			{
+				bool		can_insert = false;
+
+				while ((node->kind == RT_NODE_KIND_4) ||
+					   (node->kind == RT_NODE_KIND_16) ||
+					   (node->kind == RT_NODE_KIND_32))
+				{
+					/* Find the insert pos */
+					idx = chunk_array_node_find_insert_pos(node, chunk, &will_replace);
+
+					if (will_replace || NODE_HAS_FREE_SLOT(node))
+					{
+						/*
+						 * Found. We can insert a new one or replace the
+						 * existing value.
+						 */
+						can_insert = true;
+						break;
+					}
+
+					node = rt_node_grow(tree, parent, node, key);
+				}
+
+				if (can_insert)
+				{
+					uint8	   *chunks = chunk_array_node_get_chunks(node);
+
+					Assert(idx >= 0);
+
+					/*
+					 * Make the space for the new key if it will be inserted in
+					 * the middle of the array.
+					 */
+					if (!will_replace && node->count != 0 && idx < node->count)
+					{
+						/* shift chunks array */
+						memmove(&(chunks[idx + 1]), &(chunks[idx]),
+								sizeof(uint8) * (node->count - idx));
+
+						/* shift either the values array or the children array */
+						if (IS_LEAF_NODE(node))
+						{
+							uint64	   *values = rt_node_get_values(node);
+
+							memmove(&(values[idx + 1]), &(values[idx]),
+									sizeof(uint64) * (node->count - idx));
+						}
+						else
+						{
+							rt_node   **children = rt_node_get_children(node);
+
+							memmove(&(children[idx + 1]), &(children[idx]),
+									sizeof(rt_node *) * (node->count - idx));
+						}
+					}
+
+					break;
+				}
+
+				Assert(node->kind == RT_NODE_KIND_128);
+			}
+			/* FALLTHROUGH */
+		case RT_NODE_KIND_128:
+			{
+				rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+				if (node_128_is_chunk_used(n128, chunk) || NODE_HAS_FREE_SLOT(n128))
+				{
+					if (node_128_is_chunk_used(n128, chunk))
+						will_replace = true;
+
+					break;
+				}
+
+				node = rt_node_grow(tree, parent, node, key);
+			}
+			/* FALLTHROUGH */
+		case RT_NODE_KIND_256:
+			{
+				rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+				if (node_256_is_chunk_used(n256, chunk))
+					will_replace = true;
+
+				break;
+			}
+	}
+
+	*node_p = node;
+	*will_replace_p = will_replace;
+
+	return idx;
+}
+
+/* Insert the child to the inner node */
+static void
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+					 uint64 key, rt_node *child, bool *replaced_p)
+{
+	uint8		chunk = RT_GET_KEY_CHUNK(key, node->shift);
+	int			idx;
+	bool		replaced;
+
+	Assert(!IS_LEAF_NODE(node));
+
+	idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+		case RT_NODE_KIND_16:
+		case RT_NODE_KIND_32:
+			{
+				uint8	   *chunks = chunk_array_node_get_chunks(node);
+				rt_node   **children = rt_node_get_children(node);
+
+				Assert(idx >= 0);
+				chunks[idx] = chunk;
+				children[idx] = child;
+				break;
+			}
+		case RT_NODE_KIND_128:
+			{
+				node_128_set_inner((rt_node_base_128 *) node, chunk, child);
+				break;
+			}
+		case RT_NODE_KIND_256:
+			{
+				node_256_set_inner((rt_node_base_256 *) node, chunk, child);
+				break;
+			}
+	}
+
+	/* Update statistics */
+	if (!replaced)
+		node->count++;
+
+	if (replaced_p)
+		*replaced_p = replaced;
+
+	/*
+	 * Done. Finally, verify that the chunk and value are inserted or replaced
+	 * properly in the node.
+	 */
+	rt_verify_node(node);
+}
+
+/* Insert the value to the leaf node */
+static void
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+					uint64 key, uint64 value, bool *replaced_p)
+{
+	uint8		chunk = RT_GET_KEY_CHUNK(key, node->shift);
+	int			idx;
+	bool		replaced;
+
+	Assert(IS_LEAF_NODE(node));
+
+	idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+		case RT_NODE_KIND_16:
+		case RT_NODE_KIND_32:
+			{
+				uint8	   *chunks = chunk_array_node_get_chunks(node);
+				uint64	   *values = rt_node_get_values(node);
+
+				Assert(idx >= 0);
+				chunks[idx] = chunk;
+				values[idx] = value;
+				break;
+			}
+		case RT_NODE_KIND_128:
+			{
+				node_128_set_leaf((rt_node_base_128 *) node, chunk, value);
+				break;
+			}
+		case RT_NODE_KIND_256:
+			{
+				node_256_set_leaf((rt_node_base_256 *) node, chunk, value);
+				break;
+			}
+	}
+
+	/* Update statistics */
+	if (!replaced)
+		node->count++;
+
+	*replaced_p = replaced;
+
+	/*
+	 * Done. Finally, verify that the chunk and value are inserted or replaced
+	 * properly in the node.
+	 */
+	rt_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static rt_node *
+rt_node_grow(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+	rt_node    *newnode = NULL;
+
+	Assert(node->count == rt_node_info[node->kind].fanout);
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+			{
+				newnode = rt_alloc_node(tree, RT_NODE_KIND_16,
+										IS_LEAF_NODE(node));
+
+				/* Copy both chunks and slots to the new node */
+				chunk_array_node_copy_contents(node, newnode);
+				break;
+			}
+		case RT_NODE_KIND_16:
+			{
+				newnode = rt_alloc_node(tree, RT_NODE_KIND_32,
+										IS_LEAF_NODE(node));
+
+				/* Copy both chunks and slots to the new node */
+				chunk_array_node_copy_contents(node, newnode);
+				break;
+			}
+		case RT_NODE_KIND_32:
+			{
+				newnode = rt_alloc_node(tree, RT_NODE_KIND_128,
+										IS_LEAF_NODE(node));
+
+				/* Copy both chunks and slots to the new node */
+				rt_copy_node_common(node, newnode);
+
+				if (IS_LEAF_NODE(node))
+				{
+					rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+					for (int i = 0; i < node->count; i++)
+						node_128_set_leaf((rt_node_base_128 *) newnode,
+										  n32->base.chunks[i], n32->values[i]);
+				}
+				else
+				{
+					rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+					for (int i = 0; i < node->count; i++)
+						node_128_set_inner((rt_node_base_128 *) newnode,
+										   n32->base.chunks[i], n32->children[i]);
+				}
+
+				break;
+			}
+		case RT_NODE_KIND_128:
+			{
+				rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+				int			cnt = 0;
+
+				newnode = rt_alloc_node(tree, RT_NODE_KIND_256,
+										IS_LEAF_NODE(node));
+
+				/* Copy both chunks and slots to the new node */
+				rt_copy_node_common(node, newnode);
+
+				for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+				{
+					void	   *slot;
+
+					if (!node_128_is_chunk_used(n128, i))
+						continue;
+
+					slot = node_128_get_slot(n128, i);
+
+					if (IS_LEAF_NODE(node))
+						node_256_set_leaf((rt_node_base_256 *) newnode, i,
+										  *(uint64 *) slot);
+					else
+						node_256_set_inner((rt_node_base_256 *) newnode, i,
+										   (rt_node *) slot);
+
+					cnt++;
+				}
+
+				break;
+			}
+		case RT_NODE_KIND_256:
+			elog(ERROR, "radix tree node-256 cannot grow");
+			break;
+	}
+
+	if (parent == node)
+	{
+		/* Replace the root node with the new large node */
+		tree->root = newnode;
+	}
+	else
+	{
+		/* Set the new node to the parent node */
+		rt_node_insert_inner(tree, NULL, parent, key, newnode, NULL);
+	}
+
+	/* Verify if the node has grown properly */
+	rt_verify_node(newnode);
+
+	/* Free the old node */
+	rt_free_node(tree, node);
+
+	return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+	radix_tree *tree;
+	MemoryContext old_ctx;
+
+	old_ctx = MemoryContextSwitchTo(ctx);
+
+	tree = palloc(sizeof(radix_tree));
+	tree->context = ctx;
+	tree->root = NULL;
+	tree->max_val = 0;
+	tree->num_keys = 0;
+
+	/* Create the slab allocator for each size class */
+	for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+	{
+		tree->inner_slabs[i] = SlabContextCreate(ctx,
+												 rt_node_info[i].name,
+												 SLAB_DEFAULT_BLOCK_SIZE,
+												 rt_node_info[i].inner_size);
+		tree->leaf_slabs[i] = SlabContextCreate(ctx,
+												rt_node_info[i].name,
+												SLAB_DEFAULT_BLOCK_SIZE,
+												rt_node_info[i].leaf_size);
+#ifdef RT_DEBUG
+		tree->cnt[i] = 0;
+#endif
+	}
+
+	MemoryContextSwitchTo(old_ctx);
+
+	return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+	for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+	{
+		MemoryContextDelete(tree->inner_slabs[i]);
+		MemoryContextDelete(tree->leaf_slabs[i]);
+	}
+
+	pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert the entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+	int			shift;
+	bool		replaced;
+	rt_node    *node;
+	rt_node    *parent = tree->root;
+
+	/* Empty tree, create the root */
+	if (!tree->root)
+		rt_new_root(tree, key);
+
+	/* Extend the tree if necessary */
+	if (key > tree->max_val)
+		rt_extend(tree, key);
+
+	Assert(tree->root);
+
+	shift = tree->root->shift;
+	node = tree->root;
+
+	/* Descend the tree until a leaf node */
+	while (shift >= 0)
+	{
+		rt_node    *child;
+
+		if (IS_LEAF_NODE(node))
+			break;
+
+		if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+			child = rt_node_add_new_child(tree, parent, node, key);
+
+		Assert(child);
+
+		parent = node;
+		node = child;
+		shift -= RT_NODE_SPAN;
+	}
+
+	/* arrived at a leaf */
+	Assert(IS_LEAF_NODE(node));
+
+	rt_node_insert_leaf(tree, parent, node, key, value, &replaced);
+
+	/* Update the statistics */
+	if (!replaced)
+		tree->num_keys++;
+
+	return replaced;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false.  On success, we set the value to *value_p,
+ * so it must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+	rt_node    *node;
+	int			shift;
+
+	Assert(value_p != NULL);
+
+	if (!tree->root || key > tree->max_val)
+		return false;
+
+	node = tree->root;
+	shift = tree->root->shift;
+
+	/* Descend the tree until a leaf node */
+	while (shift >= 0)
+	{
+		rt_node    *child;
+
+		if (IS_LEAF_NODE(node))
+			break;
+
+		if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+			return false;
+
+		node = child;
+		shift -= RT_NODE_SPAN;
+	}
+
+	/* We reached a leaf node, so search the corresponding slot */
+	Assert(IS_LEAF_NODE(node));
+	if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+		return false;
+
+	return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+	rt_node    *node;
+	int			shift;
+	rt_node    *stack[RT_MAX_LEVEL] = {0};
+	int			level;
+
+	if (!tree->root || key > tree->max_val)
+		return false;
+
+	/*
+	 * Descend the tree to search for the key while building a stack of the
+	 * nodes we visited.
+	 */
+	node = tree->root;
+	shift = tree->root->shift;
+	level = 0;
+	while (shift >= 0)
+	{
+		rt_node    *child;
+
+		/* Push the current node to the stack */
+		stack[level] = node;
+
+		if (IS_LEAF_NODE(node))
+			break;
+
+		if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+			return false;
+
+		node = child;
+		shift -= RT_NODE_SPAN;
+		level++;
+	}
+
+	Assert(IS_LEAF_NODE(node));
+
+	/* there is no key to delete */
+	if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+		return false;
+
+	/* Update the statistics */
+	tree->num_keys--;
+
+	/*
+	 * Delete the key from the leaf node and recursively delete the key in
+	 * inner nodes if necessary.
+	 */
+	Assert(IS_LEAF_NODE(stack[level]));
+	while (level >= 0)
+	{
+		rt_node    *node = stack[level--];
+
+		if (IS_LEAF_NODE(node))
+			rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+		else
+			rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+		/* If the node didn't become empty, we stop deleting the key */
+		if (!IS_EMPTY_NODE(node))
+			break;
+
+		/* The node became empty */
+		rt_free_node(tree, node);
+	}
+
+	/*
+	 * If we eventually freed the root node while recursively deleting empty
+	 * nodes, the tree is now empty. rt_free_node has already cleared
+	 * tree->root in that case, so we only need to reset max_val.
+	 */
+	if (tree->root == NULL)
+		tree->max_val = 0;
+
+	return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+	MemoryContext old_ctx;
+	rt_iter    *iter;
+	int			top_level;
+
+	old_ctx = MemoryContextSwitchTo(tree->context);
+
+	iter = (rt_iter *) palloc0(sizeof(rt_iter));
+	iter->tree = tree;
+
+	/* empty tree */
+	if (!iter->tree->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+	top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+	iter->stack_len = top_level;
+	iter->stack[top_level].node = iter->tree->root;
+	iter->stack[top_level].current_idx = -1;
+
+	/*
+	 * Descend to the leftmost leaf node from the root. The key is being
+	 * constructed while descending to the leaf.
+	 */
+	rt_update_iter_stack(iter, top_level);
+
+	MemoryContextSwitchTo(old_ctx);
+
+	return iter;
+}
+
+/*
+ * Update the stack of radix tree nodes while descending to the leaf from the
+ * 'from' level.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+	rt_node    *node = iter->stack[from].node;
+	int			level = from;
+
+	for (;;)
+	{
+		rt_node_iter *node_iter = &(iter->stack[level--]);
+		bool		found;
+
+		/* Set the node to this level */
+		rt_update_node_iter(iter, node_iter, node);
+
+		/* Finish if we reached the leaf node */
+		if (IS_LEAF_NODE(node))
+			break;
+
+		/* Advance to the next slot in the node */
+		node = (rt_node *) rt_node_iterate_next(iter, node_iter, &found);
+
+		/*
+		 * Since we always take the first slot in the node, we must find a
+		 * slot here.
+		 */
+		Assert(found);
+	}
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key.
+ * Otherwise return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+	bool		found = false;
+	void	   *slot;
+
+	/* Empty tree */
+	if (!iter->tree->root)
+		return false;
+
+	for (;;)
+	{
+		rt_node    *node;
+		rt_node_iter *node_iter;
+		int			level;
+
+		/*
+		 * Iterate over the node at each level from the bottom of the tree,
+		 * i.e., the leaf node, until we find the next slot.
+		 */
+		for (level = 0; level <= iter->stack_len; level++)
+		{
+			slot = rt_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+			if (found)
+				break;
+		}
+
+		/* We could not find any new key-value pair, the iteration finished */
+		if (!found)
+			break;
+
+		/* found the next slot at the leaf node, return it */
+		if (level == 0)
+		{
+			*key_p = iter->key;
+			*value_p = *((uint64 *) slot);
+			break;
+		}
+
+		/*
+		 * We have advanced slots in more than one node, including both the
+		 * leaf node and inner nodes. So we update the stack by descending to
+		 * the leftmost leaf node from this level.
+		 */
+		node = (rt_node *) slot;
+		node_iter = &(iter->stack[level - 1]);
+		rt_update_node_iter(iter, node_iter, node);
+		rt_update_iter_stack(iter, level - 1);
+	}
+
+	return found;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+	pfree(iter);
+}
+
+/*
+ * Iterate over the given radix tree node and return the next slot of the
+ * node, setting *found_p to true, if any.  Otherwise, set *found_p to false.
+ */
+static void *
+rt_node_iterate_next(rt_iter *iter, rt_node_iter *node_iter, bool *found_p)
+{
+	rt_node    *node = node_iter->node;
+	void	   *slot = NULL;
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+		case RT_NODE_KIND_16:
+		case RT_NODE_KIND_32:
+			{
+				node_iter->current_idx++;
+
+				if (node_iter->current_idx >= node->count)
+					goto not_found;
+
+				slot = chunk_array_node_get_slot(node, node_iter->current_idx);
+
+				/* Update the part of the key with the current chunk */
+				if (IS_LEAF_NODE(node))
+				{
+					uint8	   *chunks = chunk_array_node_get_chunks(node);
+
+					rt_iter_update_key(iter, chunks[node_iter->current_idx], 0);
+				}
+
+				break;
+			}
+		case RT_NODE_KIND_128:
+			{
+				rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+				int			i;
+
+				for (i = node_iter->current_idx + 1; i < 256; i++)
+				{
+					if (node_128_is_chunk_used(n128, i))
+						break;
+				}
+
+				if (i >= 256)
+					goto not_found;
+
+				node_iter->current_idx = i;
+				slot = node_128_get_slot(n128, i);
+
+				/* Update the part of the key */
+				if (IS_LEAF_NODE(n128))
+					rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+				break;
+			}
+		case RT_NODE_KIND_256:
+			{
+				rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+				int			i;
+
+				for (i = node_iter->current_idx + 1; i < 256; i++)
+				{
+					if (node_256_is_chunk_used(n256, i))
+						break;
+				}
+
+				if (i >= 256)
+					goto not_found;
+
+				node_iter->current_idx = i;
+				slot = node_256_get_slot(n256, i);
+
+				/* Update the part of the key */
+				if (IS_LEAF_NODE(n256))
+					rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+				break;
+			}
+	}
+
+	Assert(slot);
+	*found_p = true;
+	return slot;
+
+not_found:
+	*found_p = false;
+	return NULL;
+}
+
+/*
+ * Set the node to node_iter so we can begin iterating over the node.  Also,
+ * update the part of the key with the chunk of the given node.
+ */
+static void
+rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+					rt_node *node)
+{
+	node_iter->node = node;
+	node_iter->current_idx = -1;
+
+	rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
+static pg_attribute_always_inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+	iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+	iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+	return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+	Size		total = 0;
+
+	for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+	{
+		total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+		total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+	}
+
+	return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(node->count >= 0);
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+		case RT_NODE_KIND_16:
+		case RT_NODE_KIND_32:
+			{
+				uint8	   *chunks = chunk_array_node_get_chunks(node);
+
+				/* Check if the chunks in the node are sorted */
+				for (int i = 1; i < node->count; i++)
+					Assert(chunks[i - 1] < chunks[i]);
+
+				break;
+			}
+		case RT_NODE_KIND_128:
+			{
+				rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+				int			cnt = 0;
+
+				for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+				{
+					if (!node_128_is_chunk_used(n128, i))
+						continue;
+
+					/* Check if the corresponding slot is used */
+					Assert(node_128_is_slot_used(n128, n128->slot_idxs[i]));
+
+					cnt++;
+				}
+
+				Assert(n128->n.count == cnt);
+				break;
+			}
+		case RT_NODE_KIND_256:
+			{
+				rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+				int			cnt = 0;
+
+				for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+					cnt += pg_popcount32(n256->isset[i]);
+
+				/* Check that the number of used chunks matches */
+				Assert(n256->n.count == cnt);
+
+				break;
+			}
+	}
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+	fprintf(stderr, "num_keys = " UINT64_FORMAT ", height = %u, n4 = %u, n16 = %u, n32 = %u, n128 = %u, n256 = %u\n",
+			tree->num_keys,
+			tree->root->shift / RT_NODE_SPAN,
+			tree->cnt[0],
+			tree->cnt[1],
+			tree->cnt[2],
+			tree->cnt[3],
+			tree->cnt[4]);
+	/* rt_dump(tree); */
+}
+
+static void
+rt_print_slot(StringInfo buf, uint8 chunk, uint64 value, int idx, bool is_leaf, int level)
+{
+	char		space[128] = {0};
+
+	if (level > 0)
+		sprintf(space, "%*c", level * 4, ' ');
+
+	if (is_leaf)
+		appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+						 space,
+						 idx,
+						 chunk,
+						 value);
+	else
+		appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+						 space,
+						 idx,
+						 chunk);
+}
+
+static void
+rt_dump_node(rt_node *node, int level, StringInfo buf, bool recurse)
+{
+	bool		is_leaf = IS_LEAF_NODE(node);
+
+	appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+					 IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+					 (node->kind == RT_NODE_KIND_4) ? 4 :
+					 (node->kind == RT_NODE_KIND_16) ? 16 :
+					 (node->kind == RT_NODE_KIND_32) ? 32 :
+					 (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+					 node->count, node->shift, node->chunk);
+
+	switch (node->kind)
+	{
+		case RT_NODE_KIND_4:
+		case RT_NODE_KIND_16:
+		case RT_NODE_KIND_32:
+			{
+				uint8	   *chunks = chunk_array_node_get_chunks(node);
+
+				for (int i = 0; i < node->count; i++)
+				{
+					if (IS_LEAF_NODE(node))
+					{
+						uint64	   *values = rt_node_get_values(node);
+
+						rt_print_slot(buf, chunks[i],
+									  values[i],
+									  i, is_leaf, level);
+					}
+					else
+						rt_print_slot(buf, chunks[i],
+									  UINT64_MAX,
+									  i, is_leaf, level);
+
+					if (!is_leaf)
+					{
+						if (recurse)
+						{
+							rt_node   **children = rt_node_get_children(node);
+							StringInfoData buf2;
+
+							initStringInfo(&buf2);
+							rt_dump_node(children[i],
+										 level + 1, &buf2, recurse);
+							appendStringInfo(buf, "%s", buf2.data);
+						}
+						else
+							appendStringInfo(buf, "\n");
+					}
+				}
+
+				break;
+			}
+		case RT_NODE_KIND_128:
+			{
+				rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+				uint8	   *tmp = (uint8 *) n128->isset;
+
+				appendStringInfo(buf, "slot_idxs:");
+				for (int j = 0; j < 256; j++)
+				{
+					if (!node_128_is_chunk_used(n128, j))
+						continue;
+
+					appendStringInfo(buf, " [%d]=%d, ", j, n128->slot_idxs[j]);
+				}
+				appendStringInfo(buf, "\nisset-bitmap:");
+				for (int j = 0; j < 16; j++)
+				{
+					appendStringInfo(buf, "%X ", (uint8) tmp[j]);
+				}
+				appendStringInfo(buf, "\n");
+
+				for (int i = 0; i < 256; i++)
+				{
+					void	   *slot;
+
+					if (!node_128_is_chunk_used(n128, i))
+						continue;
+
+					slot = node_128_get_slot(n128, i);
+
+					if (is_leaf)
+						rt_print_slot(buf, i, *(uint64 *) slot,
+									  i, is_leaf, level);
+					else
+						rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+					if (!is_leaf)
+					{
+						if (recurse)
+						{
+							StringInfoData buf2;
+
+							initStringInfo(&buf2);
+							rt_dump_node((rt_node *) slot,
+										 level + 1, &buf2, recurse);
+							appendStringInfo(buf, "%s", buf2.data);
+						}
+						else
+							appendStringInfo(buf, "\n");
+					}
+				}
+				break;
+			}
+		case RT_NODE_KIND_256:
+			{
+				rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+				for (int i = 0; i < 256; i++)
+				{
+					void	   *slot;
+
+					if (!node_256_is_chunk_used(n256, i))
+						continue;
+
+					slot = node_256_get_slot(n256, i);
+
+					if (is_leaf)
+						rt_print_slot(buf, i, *(uint64 *) slot, i, is_leaf, level);
+					else
+						rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+					if (!is_leaf)
+					{
+						if (recurse)
+						{
+							StringInfoData buf2;
+
+							initStringInfo(&buf2);
+							rt_dump_node((rt_node *) slot, level + 1, &buf2, recurse);
+							appendStringInfo(buf, "%s", buf2.data);
+						}
+						else
+							appendStringInfo(buf, "\n");
+					}
+				}
+				break;
+			}
+	}
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+	StringInfoData buf;
+	rt_node    *node;
+	int			shift;
+	int			level = 0;
+
+	elog(NOTICE, "-----------------------------------------------------------");
+	elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+	if (!tree->root)
+	{
+		elog(NOTICE, "tree is empty");
+		return;
+	}
+
+	if (key > tree->max_val)
+	{
+		elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+			 key, key);
+		return;
+	}
+
+	initStringInfo(&buf);
+	node = tree->root;
+	shift = tree->root->shift;
+	while (shift >= 0)
+	{
+		rt_node    *child;
+
+		rt_dump_node(node, level, &buf, false);
+
+		if (IS_LEAF_NODE(node))
+		{
+			uint64		dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+			rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+			break;
+		}
+
+		if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+			break;
+
+		node = child;
+		shift -= RT_NODE_SPAN;
+		level++;
+	}
+
+	elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+	StringInfoData buf;
+
+	initStringInfo(&buf);
+
+	elog(NOTICE, "-----------------------------------------------------------");
+	elog(NOTICE, "max_val = %lu", tree->max_val);
+	rt_dump_node(tree->root, 0, &buf, true);
+	elog(NOTICE, "\n%s", buf.data);
+	elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..38cc6abf4c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ *	  Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+/* #define RT_DEBUG 1 */
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif							/* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 6c31c8707c..8252ec41c4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -25,6 +25,7 @@ SUBDIRS = \
 		  test_parser \
 		  test_pg_dump \
 		  test_predtest \
+		  test_radixtree \
 		  test_rbtree \
 		  test_regex \
 		  test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+	$(WIN32RES) \
+	test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark.  If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE:  testing radix tree node types with shift "0"
+NOTICE:  testing radix tree node types with shift "8"
+NOTICE:  testing radix tree node types with shift "16"
+NOTICE:  testing radix tree node types with shift "24"
+NOTICE:  testing radix tree node types with shift "32"
+NOTICE:  testing radix tree node types with shift "40"
+NOTICE:  testing radix tree node types with shift "48"
+NOTICE:  testing radix tree node types with shift "56"
+NOTICE:  testing radix tree with pattern "all ones"
+NOTICE:  testing radix tree with pattern "alternating bits"
+NOTICE:  testing radix tree with pattern "clusters of ten"
+NOTICE:  testing radix tree with pattern "clusters of hundred"
+NOTICE:  testing radix tree with pattern "one-every-64k"
+NOTICE:  testing radix tree with pattern "sparse"
+NOTICE:  testing radix tree with pattern "single values, distance > 2^32"
+NOTICE:  testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE:  testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree 
+----------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..a4aa80a99c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ *		Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed.  That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+	4,		/* RT_NODE_KIND_4 */
+	16,		/* RT_NODE_KIND_16 */
+	32,		/* RT_NODE_KIND_32 */
+	128,	/* RT_NODE_KIND_128 */
+	256		/* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+	char	   *test_name;		/* short name of the test, for humans */
+	char	   *pattern_str;	/* a bit pattern */
+	uint64		spacing;		/* pattern repeats at this interval */
+	uint64		num_values;		/* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+	{
+		"all ones", "1111111111",
+		10, 1000000
+	},
+	{
+		"alternating bits", "0101010101",
+		10, 1000000
+	},
+	{
+		"clusters of ten", "1111111111",
+		10000, 1000000
+	},
+	{
+		"clusters of hundred",
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+		10000, 10000000
+	},
+	{
+		"one-every-64k", "1",
+		65536, 1000000
+	},
+	{
+		"sparse", "100000000000000000000000000000001",
+		10000000, 1000000
+	},
+	{
+		"single values, distance > 2^32", "1",
+		UINT64CONST(10000000000), 100000
+	},
+	{
+		"clusters, distance > 2^32", "10101010",
+		UINT64CONST(10000000000), 1000000
+	},
+	{
+		"clusters, distance > 2^60", "10101010",
+		UINT64CONST(2000000000000000000),
+		23						/* can't be much higher than this, or we
+								 * overflow uint64 */
+	}
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+	radix_tree *radixtree;
+	uint64 dummy;
+
+	radixtree = rt_create(CurrentMemoryContext);
+
+	if (rt_search(radixtree, 0, &dummy))
+		elog(ERROR, "rt_search on empty tree returned true");
+
+	if (rt_search(radixtree, 1, &dummy))
+		elog(ERROR, "rt_search on empty tree returned true");
+
+	if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+		elog(ERROR, "rt_search on empty tree returned true");
+
+	if (rt_num_entries(radixtree) != 0)
+		elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+	rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+	for (int i = start; i < end; i++)
+	{
+		uint64 key = ((uint64) i << shift);
+		uint64 val;
+
+		if (!rt_search(radixtree, key, &val))
+			elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+				 key, end);
+		if (val != key)
+			elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+				 key, val, key);
+	}
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+	uint64 num_entries;
+
+	for (int i = 0; i < 256; i++)
+	{
+		uint64 key = ((uint64) i << shift);
+		bool found;
+
+		found = rt_set(radixtree, key, key);
+
+		if (found)
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+		for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+		{
+			/*
+			 * After filling all slots in each node type, check if the values are
+			 * stored properly.
+			 */
+			if (i == (rt_node_max_entries[j] - 1))
+			{
+				check_search_on_node(radixtree, shift,
+									 (j == 0) ? 0 : rt_node_max_entries[j - 1],
+									 rt_node_max_entries[j]);
+				break;
+			}
+		}
+	}
+
+	num_entries = rt_num_entries(radixtree);
+
+	if (num_entries != 256)
+		elog(ERROR,
+			 "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+			 num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+	uint64 num_entries;
+
+	for (int i = 0; i < 256; i++)
+	{
+		uint64	key = ((uint64) i << shift);
+		bool	found;
+
+		found = rt_delete(radixtree, key);
+
+		if (!found)
+			elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+	}
+
+	num_entries = rt_num_entries(radixtree);
+
+	/* The tree must be empty */
+	if (num_entries != 0)
+		elog(ERROR,
+			 "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+			 num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs in each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+	radix_tree *radixtree;
+
+	elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+	radixtree = rt_create(CurrentMemoryContext);
+
+	/*
+	 * Insert and search entries for every node type at the 'shift' level,
+	 * then delete all entries to make it empty, and insert and search
+	 * entries again.
+	 */
+	test_node_types_insert(radixtree, shift);
+	test_node_types_delete(radixtree, shift);
+	test_node_types_insert(radixtree, shift);
+
+	rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+	radix_tree *radixtree;
+	rt_iter *iter;
+	MemoryContext radixtree_ctx;
+	TimestampTz starttime;
+	TimestampTz endtime;
+	uint64		n;
+	uint64		last_int;
+	uint64		ndeleted;
+	uint64		nbefore;
+	uint64		nafter;
+	int			patternlen;
+	uint64	   *pattern_values;
+	uint64		pattern_num_values;
+
+	elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+	if (rt_test_stats)
+		fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+	/* Pre-process the pattern, creating an array of integers from it. */
+	patternlen = strlen(spec->pattern_str);
+	pattern_values = palloc(patternlen * sizeof(uint64));
+	pattern_num_values = 0;
+	for (int i = 0; i < patternlen; i++)
+	{
+		if (spec->pattern_str[i] == '1')
+			pattern_values[pattern_num_values++] = i;
+	}
+
+	/*
+	 * Allocate the radix tree.
+	 *
+	 * Allocate it in a separate memory context, so that we can print its
+	 * memory usage easily.
+	 */
+	radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+										  "radixtree test",
+										  ALLOCSET_SMALL_SIZES);
+	MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+	radixtree = rt_create(radixtree_ctx);
+
+	/*
+	 * Add values to the set.
+	 */
+	starttime = GetCurrentTimestamp();
+
+	n = 0;
+	last_int = 0;
+	while (n < spec->num_values)
+	{
+		uint64		x = 0;
+
+		for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+		{
+			bool found;
+
+			x = last_int + pattern_values[i];
+
+			found = rt_set(radixtree, x, x);
+
+			if (found)
+				elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+			n++;
+		}
+		last_int += spec->spacing;
+	}
+
+	endtime = GetCurrentTimestamp();
+
+	if (rt_test_stats)
+		fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+				spec->num_values, (int) (endtime - starttime) / 1000);
+
+	/*
+	 * Print stats on the amount of memory used.
+	 *
+	 * We print the usage reported by rt_memory_usage(), as well as the
+	 * stats from the memory context.  They should be in the same ballpark,
+	 * but it's hard to automate testing that, so if you're making changes to
+	 * the implementation, just observe that manually.
+	 */
+	if (rt_test_stats)
+	{
+		uint64		mem_usage;
+
+		/*
+		 * Also print memory usage as reported by rt_memory_usage().  It
+		 * should be in the same ballpark as the usage reported by
+		 * MemoryContextStats().
+		 */
+		mem_usage = rt_memory_usage(radixtree);
+		fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+				mem_usage, (double) mem_usage / spec->num_values);
+
+		MemoryContextStats(radixtree_ctx);
+	}
+
+	/* Check that rt_num_entries works */
+	n = rt_num_entries(radixtree);
+	if (n != spec->num_values)
+		elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+	/*
+	 * Test random-access probes with rt_search()
+	 */
+	starttime = GetCurrentTimestamp();
+
+	for (n = 0; n < 100000; n++)
+	{
+		bool		found;
+		bool		expected;
+		uint64		x;
+		uint64		v;
+
+		/*
+		 * Pick next value to probe at random.  We limit the probes to the
+		 * last integer that we added to the set, plus an arbitrary constant
+		 * (1000).  There's no point in probing the whole 0 - 2^64 range, if
+		 * only a small part of the integer space is used.  We would very
+		 * rarely hit values that are actually in the set.
+		 */
+		x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+		/* Do we expect this value to be present in the set? */
+		if (x >= last_int)
+			expected = false;
+		else
+		{
+			uint64		idx = x % spec->spacing;
+
+			if (idx >= patternlen)
+				expected = false;
+			else if (spec->pattern_str[idx] == '1')
+				expected = true;
+			else
+				expected = false;
+		}
+
+		/* Is it present according to rt_search() ? */
+		found = rt_search(radixtree, x, &v);
+
+		if (found != expected)
+			elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+		if (found && (v != x))
+			elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+				 v, x);
+	}
+	endtime = GetCurrentTimestamp();
+	if (rt_test_stats)
+		fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+				n, (int) (endtime - starttime) / 1000);
+
+	/*
+	 * Test iterator
+	 */
+	starttime = GetCurrentTimestamp();
+
+	iter = rt_begin_iterate(radixtree);
+	n = 0;
+	last_int = 0;
+	while (n < spec->num_values)
+	{
+		for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+		{
+			uint64		expected = last_int + pattern_values[i];
+			uint64		x;
+			uint64		val;
+
+			if (!rt_iterate_next(iter, &x, &val))
+				break;
+
+			if (x != expected)
+				elog(ERROR,
+					 "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+					 x, expected, i);
+			if (val != expected)
+				elog(ERROR,
+					 "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", val, expected, i);
+			n++;
+		}
+		last_int += spec->spacing;
+	}
+	endtime = GetCurrentTimestamp();
+	if (rt_test_stats)
+		fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+				n, (int) (endtime - starttime) / 1000);
+
+	if (n < spec->num_values)
+		elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+	if (n > spec->num_values)
+		elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+	/*
+	 * Test random-access probes with rt_delete()
+	 */
+	starttime = GetCurrentTimestamp();
+
+	nbefore = rt_num_entries(radixtree);
+	ndeleted = 0;
+	for (n = 0; n < 100000; n++)
+	{
+		bool		found;
+		uint64		x;
+		uint64		v;
+
+		/*
+		 * Pick next value to probe at random.  We limit the probes to the
+		 * last integer that we added to the set, plus an arbitrary constant
+		 * (1000).  There's no point in probing the whole 0 - 2^64 range, if
+		 * only a small part of the integer space is used.  We would very
+		 * rarely hit values that are actually in the set.
+		 */
+		x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+		/* Is it present according to rt_search() ? */
+		found = rt_search(radixtree, x, &v);
+
+		if (!found)
+			continue;
+
+		/* If the key is found, delete it and check again */
+		if (!rt_delete(radixtree, x))
+			elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+		if (rt_search(radixtree, x, &v))
+			elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+		if (rt_delete(radixtree, x))
+			elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+		ndeleted++;
+	}
+	endtime = GetCurrentTimestamp();
+	if (rt_test_stats)
+		fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+				ndeleted, (int) (endtime - starttime) / 1000);
+
+	nafter = rt_num_entries(radixtree);
+
+	/* Check that rt_num_entries works */
+	if ((nbefore - ndeleted) != nafter)
+		elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+			 nafter, (nbefore - ndeleted), ndeleted);
+
+	MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+	test_empty();
+
+	for (int shift = 0; shift <= (64 - 8); shift += 8)
+		test_node_types(shift);
+
+	/* Test different test patterns, with lots of entries */
+	for (int i = 0; i < lengthof(test_specs); i++)
+		test_pattern(&test_specs[i]);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
-- 
2.31.1

From 39f0019d95eb4808d235a07d107aee2ff46856e2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v6 3/3] tool for measuring radix tree performance

---
 contrib/bench_radix_tree/Makefile             |  21 ++
 .../bench_radix_tree--1.0.sql                 |  42 +++
 contrib/bench_radix_tree/bench_radix_tree.c   | 301 ++++++++++++++++++
 .../bench_radix_tree/bench_radix_tree.control |   6 +
 contrib/bench_radix_tree/expected/bench.out   |  13 +
 contrib/bench_radix_tree/sql/bench.sql        |  16 +
 6 files changed, 399 insertions(+)
 create mode 100644 contrib/bench_radix_tree/Makefile
 create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
 create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
 create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
 create mode 100644 contrib/bench_radix_tree/expected/bench.out
 create mode 100644 contrib/bench_radix_tree/sql/bench.sql

diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..b8f70e12d1
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+	bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..6663abe6a4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,42 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..5806ef7519
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,301 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ *	  contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+#define TIDS_PER_BLOCK_FOR_LOAD		30
+#define TIDS_PER_BLOCK_FOR_LOOKUP	50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+
+static radix_tree *rt = NULL;
+static ItemPointer itemptrs = NULL;
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+	uint32 upper;
+	uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+	int64 tid_i;
+
+	Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+	tid_i = ItemPointerGetOffsetNumber(tid);
+	tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+	/* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+	*off = tid_i & ((1 << 6) - 1);
+	upper = tid_i >> 6;
+	Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+	Assert(*off < 64);
+
+	return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+	return (int) floor(pg_prng_double(state) * ((upper-lower)+0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates shuffle implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+	/* fixed seed for reproducibility */
+	pg_prng_state state;
+
+	pg_prng_seed(&state, 0);
+
+	for (int i = 0; i < nitems - 1; i++)
+	{
+		int j = shuffle_randrange(&state, i, nitems - 1);
+		ItemPointerData t = itemptrs[j];
+
+		itemptrs[j] = itemptrs[i];
+		itemptrs[i] = t;
+	}
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p)
+{
+	ItemPointer	tids;
+	uint64	maxitems;
+	uint64	ntids = 0;
+
+	maxitems = ((uint64) maxblk - minblk + 1) * ntids_per_blk;
+	tids = MemoryContextAllocHuge(TopTransactionContext,
+								  sizeof(ItemPointerData) * maxitems);
+
+	for (BlockNumber blk = minblk; blk <= maxblk; blk++)
+	{
+		for (OffsetNumber off = FirstOffsetNumber;
+			 off <= ntids_per_blk; off++)
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+			ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+			ntids++;
+		}
+	}
+
+	*ntids_p = ntids;
+	return tids;
+}
+
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+	BlockNumber lblk,
+				rblk;
+	OffsetNumber loff,
+				roff;
+
+	lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+	rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+	if (lblk < rblk)
+		return -1;
+	if (lblk > rblk)
+		return 1;
+
+	loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+	roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+	if (loff < roff)
+		return -1;
+	if (loff > roff)
+		return 1;
+
+	return 0;
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+	BlockNumber minblk = PG_GETARG_INT32(0);
+	BlockNumber maxblk = PG_GETARG_INT32(1);
+	uint64	ntids;
+	uint64	key;
+	uint64	last_key = PG_UINT64_MAX;
+	uint64	val = 0;
+	ItemPointer tids;
+	TupleDesc	tupdesc;
+	TimestampTz	start_time,	end_time;
+	long	secs;
+	int		usecs;
+	int64	rt_load_ms, rt_search_ms, ar_load_ms, ar_search_ms;
+	Datum	values[7];
+	bool	nulls[7];
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids);
+
+	/* measure the load time of the radix tree */
+	rt = rt_create(CurrentMemoryContext);
+	start_time = GetCurrentTimestamp();
+	for (int i = 0; i < ntids; i++)
+	{
+		ItemPointer tid = &(tids[i]);
+		uint32	off;
+
+		CHECK_FOR_INTERRUPTS();
+
+		key = tid_to_key_off(tid, &off);
+
+		if (last_key != PG_UINT64_MAX && last_key != key)
+		{
+			rt_set(rt, last_key, val);
+			val = 0;
+		}
+
+		last_key = key;
+		val |= (uint64) 1 << off;
+	}
+	if (last_key != PG_UINT64_MAX)
+		rt_set(rt, last_key, val);
+
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	rt_load_ms = secs * 1000 + usecs / 1000;
+
+	/* measure the load time of the array */
+	itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+									  sizeof(ItemPointerData) * ntids);
+	start_time = GetCurrentTimestamp();
+	for (int i = 0; i < ntids; i++)
+	{
+		ItemPointerSetBlockNumber(&(itemptrs[i]),
+								  ItemPointerGetBlockNumber(&(tids[i])));
+		ItemPointerSetOffsetNumber(&(itemptrs[i]),
+								   ItemPointerGetOffsetNumber(&(tids[i])));
+	}
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	ar_load_ms = secs * 1000 + usecs / 1000;
+
+	if (shuffle)
+		shuffle_itemptrs(tids, ntids);
+
+	/* measure the search time of the radix tree */
+	start_time = GetCurrentTimestamp();
+	for (int i = 0; i < ntids; i++)
+	{
+		ItemPointer tid = &(tids[i]);
+		uint64	key, val;
+		uint32	off;
+
+		CHECK_FOR_INTERRUPTS();
+
+		key = tid_to_key_off(tid, &off);
+
+		rt_search(rt, key, &val);
+	}
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	rt_search_ms = secs * 1000 + usecs / 1000;
+
+	/* next, measure the search time of the array */
+	start_time = GetCurrentTimestamp();
+	for (int i = 0; i < ntids; i++)
+	{
+		ItemPointer tid = &(tids[i]);
+
+		bsearch((void *) tid,
+				(void *) itemptrs,
+				ntids,
+				sizeof(ItemPointerData),
+				vac_cmp_itemptr);
+	}
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	ar_search_ms = secs * 1000 + usecs / 1000;
+
+	MemSet(nulls, false, sizeof(nulls));
+	values[0] = Int64GetDatum(rt_num_entries(rt));
+	values[1] = Int64GetDatum(rt_memory_usage(rt));
+	values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+	values[3] = Int64GetDatum(rt_load_ms);
+	values[4] = Int64GetDatum(ar_load_ms);
+	values[5] = Int64GetDatum(rt_search_ms);
+	values[6] = Int64GetDatum(ar_search_ms);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+	return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+	return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+	uint64	cnt = (uint64) PG_GETARG_INT64(0);
+	radix_tree *rt;
+	pg_prng_state state;
+	TupleDesc	tupdesc;
+	TimestampTz	start_time,	end_time;
+	long	secs;
+	int		usecs;
+	int64	load_time_ms;
+	Datum	values[2];
+	bool	nulls[2];
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	pg_prng_seed(&state, 0);
+	rt = rt_create(CurrentMemoryContext);
+
+	start_time = GetCurrentTimestamp();
+	for (uint64 i = 0; i < cnt; i++)
+	{
+		uint64 key = pg_prng_uint64(&state);
+
+		rt_set(rt, key, key);
+	}
+	end_time = GetCurrentTimestamp();
+
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	load_time_ms = secs * 1000 + usecs / 1000;
+
+	MemSet(nulls, false, sizeof(nulls));
+	values[0] = Int64GetDatum(rt_memory_usage(rt));
+	values[1] = Int64GetDatum(load_time_ms);
+
+	rt_free(rt);
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
-- 
2.31.1
