Re: tuple radix sort

John Naylor Sun, 15 Feb 2026 00:41:04 -0800

I've committed v7-0001 with the above review and a couple more
cosmetic adjustments. Most notably, I've never really liked the name
sort_byvalue_datum(). It really is the entry point to radix sort, and
it only diverts to qsort if after the NULL partitioning phase there
aren't enough non-null tuples to justify the overhead of radix sort.
So for better symmetry with qsort_tuple, I've renamed
sort_byvalue_datum() to radix_sort_tuple(), which then calls out to
now-named radix_sort_recursive(), which will call itself until it
completes or diverts. Another change down below:

On Fri, Feb 13, 2026 at 10:25 AM cca5507 <[email protected]> wrote:
> I think we need to add a comment to explain why we do the
> check. The cost of this check is not small.

(presorted check) The cost should only matter for pathological inputs,
and I haven't found it to matter much overall. I've left it
uncommented for now. However, while rebasing 0002 to deal with recent
review comments, I had an idea: In yesterday's commit, I moved the
presorted check down to just before invoking the actual radix sort.
That way, with the attached v8-0002 the common prefix detection is
done at the same time as the presorted check. That makes 0002 a
smaller patch and by doing both in the same for-loop it's easier to
read and can reduce the number of memory reads. We can consider more
commentary here, but the motivation do avoid unnecessary work should
be fairly obvious.

Next step : Test whether it's worth it for the common prefix detection
to use the normalized datum.

--
John Naylor
Amazon Web Services

From 14df5854de4b73da8792c015859e8d7ecf09d4d0 Mon Sep 17 00:00:00 2001
From: John Naylor <[email protected]>
Date: Tue, 10 Feb 2026 19:10:20 +0700
Subject: [PATCH v8 4/4] WIP Call verify_memtuples_sorted() after qsort for
 consistency

The regression test changes are from a user defined function that
is now called a couple more times.
---
 src/backend/utils/sort/tuplesort.c      | 1 +
 src/test/regress/expected/namespace.out | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 629d598d8b7..dea3347667f 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3033,6 +3033,7 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 						state->base.comparetup,
 						state);
 		}
+		verify_memtuples_sorted(state);
 	}
 }
 
diff --git a/src/test/regress/expected/namespace.out b/src/test/regress/expected/namespace.out
index dbbda72d395..b78abdf21ea 100644
--- a/src/test/regress/expected/namespace.out
+++ b/src/test/regress/expected/namespace.out
@@ -149,6 +149,8 @@ NOTICE:  current search_path: pg_catalog, pg_temp
 NOTICE:  current search_path: pg_catalog, pg_temp
 NOTICE:  current search_path: pg_catalog, pg_temp
 NOTICE:  current search_path: pg_catalog, pg_temp
+NOTICE:  current search_path: pg_catalog, pg_temp
+NOTICE:  current search_path: pg_catalog, pg_temp
 REFRESH MATERIALIZED VIEW test_maint_search_path.test_maint_mv;
 NOTICE:  current search_path: pg_catalog, pg_temp
 NOTICE:  current search_path: pg_catalog, pg_temp
-- 
2.53.0

From a7dfc1c9fa535dfceac48f8e91ad2cb516872855 Mon Sep 17 00:00:00 2001
From: John Naylor <[email protected]>
Date: Tue, 10 Feb 2026 20:02:19 +0700
Subject: [PATCH v8 3/4] WIP Add possible message changes

Not sure if necessary so kept out of main commit
---
 src/backend/utils/sort/tuplesort.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2203cfb725d..629d598d8b7 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2243,7 +2243,7 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 	state->currentRun++;
 
 	if (trace_sort)
-		elog(LOG, "worker %d starting quicksort of run %d: %s",
+		elog(LOG, "worker %d starting internal sort of run %d: %s",
 			 state->worker, state->currentRun,
 			 pg_rusage_show(&state->ru_start));
 
@@ -2254,7 +2254,7 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 	tuplesort_sort_memtuples(state);
 
 	if (trace_sort)
-		elog(LOG, "worker %d finished quicksort of run %d: %s",
+		elog(LOG, "worker %d finished internal sort of run %d: %s",
 			 state->worker, state->currentRun,
 			 pg_rusage_show(&state->ru_start));
 
-- 
2.53.0

From ab8ba71b018423779289c442f05a369e373dc41d Mon Sep 17 00:00:00 2001
From: John Naylor <[email protected]>
Date: Sat, 14 Feb 2026 11:41:34 +0700
Subject: [PATCH v8 2/4] Detect common prefix to avoid wasted work during radix
 sort

Start radix sort at the most significant byte position that has more
than one distinct byte in the input. This skips passes where radix
sort would count the distinct bytes just to find only a single one,
in which case there is nothing further to do for that pass. This can
give a few percent speedup for integers that have some zero upper
bytes, which is common for those that didn't arrive via abbreviation.

Reviewed-by: Chengpeng Yan <[email protected]>
Discussion: https://postgr.es/m/canwcazypgmdsswaa18foxjgxapzvdypswpokfcx32dwh3qz...@mail.gmail.com
---
 src/backend/utils/sort/tuplesort.c | 48 ++++++++++++++++++++++++++----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 1fc440ea6ca..2203cfb725d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -104,6 +104,7 @@
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -2909,28 +2910,65 @@ radix_sort_tuple(SortTuple *data, size_t n, Tuplesortstate *state)
 	}
 	else
 	{
+		int			common_prefix;
+		Datum		ref_datum;
+		Datum		common_upper_bits = 0;
 		bool		presorted = true;
 
+		Assert(not_null_count > 0);
+		ref_datum = not_null_start[0].datum1;
+
+		/* compute the common prefix of all datums */
 		for (SortTuple *st = not_null_start + 1;
 			 st < not_null_start + not_null_count;
 			 st++)
 		{
-			if (COMPARETUP(state, st - 1, st) > 0)
-			{
+			Datum		this_datum = st->datum1;
+
+			/*
+			 * Accumulate bits that represent a difference from the reference
+			 * datum.
+			 */
+			common_upper_bits |= ref_datum ^ this_datum;
+
+			/* do a presorted check while we're at it */
+			if (presorted && COMPARETUP(state, st - 1, st) > 0)
 				presorted = false;
-				break;
-			}
 
 			CHECK_FOR_INTERRUPTS();
 		}
 
 		if (presorted)
 			return;
+		else if (common_upper_bits == 0)
+		{
+			/*
+			 * All NOT NULL tuples have the same datum, so we can skip radix
+			 * sort. Sort using the tiebreak comparator if necessary.
+			 */
+			if (state->base.onlyKey == NULL)
+			{
+				qsort_tuple(not_null_start,
+							not_null_count,
+							state->base.comparetup_tiebreak,
+							state);
+			}
+		}
 		else
 		{
+			int			diffpos;
+
+			/*
+			 * The upper bits of common_upper_bits are zero where all datums
+			 * have the same bits. The byte position of the leftmost one bit
+			 * is the byte where radix sort should start.
+			 */
+			diffpos = pg_leftmost_one_pos64(DatumGetUInt64(common_upper_bits));
+			common_prefix = sizeof(Datum) - 1 - (diffpos / BITS_PER_BYTE);
+
 			radix_sort_recursive(not_null_start,
 								 not_null_count,
-								 0,
+								 common_prefix,
 								 state);
 		}
 	}
-- 
2.53.0

Re: tuple radix sort

Reply via email to