On Tue, Nov 24, 2015 at 4:33 PM, Peter Geoghegan <p...@heroku.com> wrote:
> So, the bottom line is: This patch seems very good, is unlikely to
> have any notable downside (no case has been shown to be regressed),
> but has yet to receive code review. I am working on a new version with
> the first two commits consolidated, and better comments, but that will
> have the same code, unless I find bugs or am dissatisfied. It mostly
> needs thorough code review, and to a lesser extent some more
> performance testing.

I'm currently spending a lot of time working on parallel CREATE INDEX.
I should not delay posting a new version of my patch series any
further, though. I hope to polish up parallel CREATE INDEX to be able
to show people something in a couple of weeks.

This version features consolidated commits, removal of the
multipass_warning parameter, and improved comments and commit
messages. Functionality is almost entirely unchanged.

The only functional changes are:

* The function useselection() is taught to distrust an obviously bogus
caller reltuples hint: when, immediately after LACKMEM() first becomes
true, the hint is already less than half of what we know to be the
minimum number of tuples the sort must sort, it is probably a generic
estimate, and is disregarded.

* Prefetching only occurs when writing tuples. Explicit prefetching
appears to hurt in some cases, as David Rowley has shown over on the
dedicated thread. But writing tuples may still be a simple enough case
to benefit consistently, because the processing there is relatively
uniform, giving memory latency something predictable to hide behind
(before, the same prefetching instructions were also used for CREATE
INDEX and for aggregates, for example). A minimal sketch of the hint
appears below.
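
To make the write-time hint concrete, here is a minimal, self-contained
sketch (names like write_tuples(), DemoSortTuple, and write_one() are
made up for illustration; this is not the patch's dumpbatch() loop):

    #include <stddef.h>

    typedef struct DemoSortTuple
    {
        void       *tuple;          /* pointer to palloc'd "tuple proper" */
    } DemoSortTuple;

    static void
    write_tuples(DemoSortTuple *tuples, size_t ntuples,
                 void (*write_one) (void *))
    {
        size_t      i;

        for (i = 0; i < ntuples; i++)
        {
            write_one(tuples[i].tuple);

    #ifdef __GNUC__
            /*
             * Hint a read of the "tuple proper" two elements ahead, with
             * no expectation of temporal locality.  Skip the hint when
             * there is no "tuple proper" (e.g. pass-by-value datum case).
             */
            if (tuples[i].tuple != NULL && i + 2 < ntuples)
                __builtin_prefetch(tuples[i + 2].tuple, 0, 0);
    #endif
        }
    }

The memory fetch for the tuple that will be written two iterations from
now overlaps with the work of writing the current tuple.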

Maybe we should consider trying to get patch 0002 (the memory
pool/merge patch) committed first, something Greg Stark suggested
privately. That might actually be an easier way of integrating this
work: it changes nothing about the algorithm we use for merging (it
only improves memory locality), and so is really an independent piece
of work (albeit one that makes a huge overall difference, because the
other patches increase the time spent merging both in absolute terms
and, especially, as a proportion of the total). A rough sketch of the
pooling technique follows.
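
For reference, a minimal sketch of the general technique (hypothetical
names; a simplification, not 0002's mergepool()/tupproperalloc() code):
one large allocation is carved into per-tape partitions, each consumed
sequentially with a bump allocator and logically reset when its tape
needs another round of prereading:

    #include <stdlib.h>

    typedef struct DemoTapePool
    {
        char       *base;       /* single large allocation for all tapes */
        size_t      per_tape;   /* bytes apportioned to each tape */
        size_t     *used;       /* bytes consumed so far, per tape */
    } DemoTapePool;

    /* Bump-allocate from a tape's partition; malloc() on rare overflow */
    static void *
    pool_alloc(DemoTapePool *pool, int tape, size_t size)
    {
        if (pool->used[tape] + size > pool->per_tape)
            return malloc(size);    /* rare "overflow" case */

        pool->used[tape] += size;
        return pool->base + (size_t) tape * pool->per_tape +
            (pool->used[tape] - size);
    }

    /* Logically free a tape's partition once its preread tuples are gone */
    static void
    pool_reset(DemoTapePool *pool, int tape)
    {
        pool->used[tape] = 0;
    }

Because each tape's tuples are consumed in the order they were read,
the partition can simply be reset rather than freed piecemeal.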

-- 
Peter Geoghegan
From 7ff4a3e4448b25ba524e40c0db57307466f167d9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 3/6] Perform memory prefetching when writing memtuples

This patch is based on, but is quite distinct from, a separately
submitted, more general version that performs prefetching in several
places [1].  This version only performs prefetching of each "tuple
proper" during the writing of batches of tuples (an entire run, written
following a quicksort).  The case for prefetching each "tuple proper" at
several sites now seems weak due to differences in CPU microarchitecture.
However, it might still be that there is a consistent improvement
observable when writing out tuples, because that involves a particularly
tight inner loop, with relatively predictable processing to hide memory
latency behind.  A helpful generic prefetch hint may be possible for
this case, even if it proves impossible elsewhere.

This has been shown to appreciably help on both a POWER7 server
processor [2], and an Intel Mobile processor.

[1] https://commitfest.postgresql.org/6/305/
[2] CAM3SWZR5rv3+F3FOKf35=dti7otmmcdfoe2vogur0pddg3j...@mail.gmail.com
---
 config/c-compiler.m4               | 17 +++++++++++++++++
 configure                          | 31 +++++++++++++++++++++++++++++++
 configure.in                       |  1 +
 src/backend/utils/sort/tuplesort.c | 15 +++++++++++++++
 src/include/c.h                    | 14 ++++++++++++++
 src/include/pg_config.h.in         |  3 +++
 src/include/pg_config.h.win32      |  3 +++
 src/include/pg_config_manual.h     | 10 ++++++++++
 8 files changed, 94 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 550d034..8be2122 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -271,6 +271,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index 8c61390..cd8db5d 100755
--- a/configure
+++ b/configure
@@ -11338,6 +11338,37 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index b5868b0..0e25b89 100644
--- a/configure.in
+++ b/configure.in
@@ -1320,6 +1320,7 @@ PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_BSWAP64
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9821bb2..a4707c6 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3373,6 +3373,21 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->tp_tapenum[state->destTape],
 				 &state->memtuples[i]);
 		state->memtupcount--;
+
+		/*
+		 * Perform memory prefetch of "tuple proper" of the SortTuple that's
+		 * two places ahead of tuple just written.  Testing shows that this
+		 * significantly boosts performance, by hiding memory latency behind
+		 * CPU processing of writing tuples.
+		 *
+		 * Don't do this for pass-by-value datum sorts;  even though hinting a
+		 * NULL address does not affect correctness, it would have a noticeable
+		 * overhead here.
+		 */
+#ifdef USE_MEM_PREFETCH
+		if (state->memtuples[i].tuple != NULL && i + 2 < memtupwrite)
+			pg_read_prefetch(state->memtuples[i + 2].tuple);
+#endif
 	}
 	markrunend(state, state->tp_tapenum[state->destTape]);
 	state->tp_runs[state->destTape]++;
diff --git a/src/include/c.h b/src/include/c.h
index 8163b00..3d05ff8 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -927,6 +927,20 @@ typedef NameData *Name;
 #define pg_unreachable() abort()
 #endif
 
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_read_prefetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup with this;  any use of the intrinsic
+ * should be carefully tested.  It's okay to pass it an invalid or NULL
+ * address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_read_prefetch(addr)		__builtin_prefetch((addr), 0, 0)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index a20e0cd..ad1d17a 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -666,6 +666,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 8566065..990c3c3 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -514,6 +514,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e278fa0..4c7b1d5 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -153,6 +153,16 @@
 #endif
 
 /*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching.  Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms.  If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
  * USE_SSL code should be compiled only when compiling with an SSL
  * implementation.  (Currently, only OpenSSL is supported, but we might add
  * more implementations in the future.)
-- 
1.9.1

From 135f7fa6cd9fd543a14433cb208336d39301e3f1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Mon, 24 Aug 2015 13:22:11 -0700
Subject: [PATCH 2/6] Use "tuple proper" memory pool for tuplesort merge

Allocate a large pool of memory for use by READTUP() routines.  This
accelerates the final on-the-fly merge phase significantly.  As tapes
are pre-read, memory is reused from the pool, virtually eliminating
retail palloc() calls.

Sequentially consuming memory from the pool accelerates merging because
the natural order of all subsequent access (access during merging
proper) is sequential per tape; cache characteristics are improved.  In
addition, modern microarchitectures and compilers may even automatically
perform prefetching with a sequential access pattern.
---
 src/backend/utils/sort/tuplesort.c | 346 +++++++++++++++++++++++++++++++++----
 1 file changed, 315 insertions(+), 31 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index da0453e..9821bb2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -345,6 +345,23 @@ struct Tuplesortstate
 	int64	   *mergeavailmem;	/* availMem for prereading each tape */
 	int			mergefreelist;	/* head of freelist of recycled slots */
 	int			mergefirstfree; /* first slot never used in this merge */
+	int64		spacePerTape;	/* space (memory) segmented to each tape */
+
+	/*
+	 * During final on-the-fly merge steps, an ad-hoc "tuple proper" memory
+	 * pool is used, reducing palloc() overhead.  Importantly, this also
+	 * results in far more cache efficient merging, since each tape's tuples
+	 * must naturally be accessed sequentially (in sorted order).
+	 *
+	 * These variables manage each active tape's ownership of memory pool
+	 * space.  The underlying buffer is allocated once, just before
+	 * TSS_FINALMERGE prereading.
+	 */
+	bool		usePool;		/* using "tuple proper" memory pool? */
+	char	  **mergetupstart;	/* start of each run's partition */
+	char	  **mergetupcur;	/* current offset into each run's partition */
+	char	  **mergetuptail;	/* last appended item's start point */
+	char	  **mergeoverflow;	/* single retail palloc() "overflow" */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -510,9 +527,13 @@ static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
 static void mergememruns(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state);
+static void beginmerge(Tuplesortstate *state, bool finalMerge);
+static void mergepool(Tuplesortstate *state);
+static void mergepoolone(Tuplesortstate *state, int srcTape,
+					  SortTuple *stup, bool *should_free);
 static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void mergeprereadone(Tuplesortstate *state, int srcTape,
+					  SortTuple *stup, bool *should_free);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -524,6 +545,7 @@ static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
+static void *tupproperalloc(Tuplesortstate *state, int tapenum, Size size);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -652,6 +674,7 @@ tuplesort_begin_common(int workMem, double rowNumHint, bool randomAccess)
 	 * inittapes(), if needed
 	 */
 
+	state->usePool = false;		/* memory pool not used until final merge */
 	state->result_tape = -1;	/* flag that result tape has not been formed */
 
 	MemoryContextSwitchTo(oldcontext);
@@ -1763,6 +1786,7 @@ tuplesort_performsort(Tuplesortstate *state)
  * Internal routine to fetch the next tuple in either forward or back
  * direction into *stup.  Returns FALSE if no more tuples.
  * If *should_free is set, the caller must pfree stup.tuple when done with it.
+ * Otherwise, caller should not use tuple following next call here.
  */
 static bool
 tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
@@ -1774,6 +1798,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 	{
 		case TSS_SORTEDINMEM:
 			Assert(forward || state->randomAccess);
+			Assert(!state->usePool);
 			*should_free = false;
 			if (forward)
 			{
@@ -1818,6 +1843,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
+			Assert(!state->usePool);
 			*should_free = true;
 			if (forward)
 			{
@@ -1903,6 +1929,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 
 		case TSS_MEMTAPEMERGE:
 			Assert(forward);
+			Assert(!state->usePool);
 			/* For now, assume tuple returned from memory */
 			*should_free = false;
 
@@ -2001,7 +2028,9 @@ just_memtuples:
 
 		case TSS_FINALMERGE:
 			Assert(forward);
-			*should_free = true;
+			Assert(state->usePool);
+			/* For now, assume "tuple proper" is from memory pool */
+			*should_free = false;
 
 			/*
 			 * This code should match the inner loop of mergeonerun().
@@ -2009,18 +2038,17 @@ just_memtuples:
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				Size		tuplen;
 				int			tupIndex;
 				SortTuple  *newtup;
 
+				/*
+				 * Returned tuple is still counted in our memory space most
+				 * of the time.  See mergepoolone() for discussion of why
+				 * caller may occasionally be required to free returned
+				 * tuple, and how the preread memory pool is managed with
+				 * regard to edge cases more generally.
+				 */
 				*stup = state->memtuples[0];
-				/* returned tuple is no longer counted in our memory space */
-				if (stup->tuple)
-				{
-					tuplen = GetMemoryChunkSpace(stup->tuple);
-					state->availMem += tuplen;
-					state->mergeavailmem[srcTape] += tuplen;
-				}
 				tuplesort_heap_siftup(state, false);
 				if ((tupIndex = state->mergenext[srcTape]) == 0)
 				{
@@ -2028,9 +2056,11 @@ just_memtuples:
 					 * out of preloaded data on this tape, try to read more
 					 *
 					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry.  See mergepreread() comments.
+					 * tape that's run dry, though not before preparing its
+					 * partition within the memory pool for a new round of
+					 * sequential consumption.  See mergepreread() comments.
 					 */
-					mergeprereadone(state, srcTape);
+					mergeprereadone(state, srcTape, stup, should_free);
 
 					/*
 					 * if still no data, we've reached end of run on this tape
@@ -2092,6 +2122,8 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
  * Fetch the next tuple in either forward or back direction.
  * Returns NULL if no more tuples.  If *should_free is set, the
  * caller must pfree the returned tuple when done with it.
+ * If it is not set, caller should not use tuple following next
+ * call here.
  */
 HeapTuple
 tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
@@ -2111,6 +2143,8 @@ tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
  * Fetch the next index tuple in either forward or back direction.
  * Returns NULL if no more tuples.  If *should_free is set, the
  * caller must pfree the returned tuple when done with it.
+ * If it is not set, caller should not use tuple following next
+ * call here.
  */
 IndexTuple
 tuplesort_getindextuple(Tuplesortstate *state, bool forward,
@@ -2414,6 +2448,10 @@ inittapes(Tuplesortstate *state)
 	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
 	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
 	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
+	state->mergetupstart = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergetupcur = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergetuptail = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2595,7 +2633,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state);
+				beginmerge(state, true);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2687,7 +2725,7 @@ mergeonerun(Tuplesortstate *state)
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state);
+	beginmerge(state, false);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2837,13 +2875,12 @@ mergememruns(Tuplesortstate *state)
  * fill the merge heap with the first tuple from each active tape.
  */
 static void
-beginmerge(Tuplesortstate *state)
+beginmerge(Tuplesortstate *state, bool finalMerge)
 {
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
 	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2882,19 +2919,29 @@ beginmerge(Tuplesortstate *state)
 	Assert(activeTapes > 0);
 	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
 	Assert(slotsPerTape > 0);
-	spacePerTape = state->availMem / activeTapes;
+	state->spacePerTape = state->availMem / activeTapes;
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
 		if (state->mergeactive[srcTape])
 		{
 			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
+			state->mergeavailmem[srcTape] = state->spacePerTape;
 		}
 	}
 
 	/*
+	 * Preallocate "tuple proper" pool memory, and partition pool among
+	 * tapes.  Actual memory allocation is performed here at most once per
+	 * sort, just in advance of the final on-the-fly merge step.  That
+	 * implies that the optimization is never used by randomAccess callers,
+	 * since no on-the-fly final merge step will occur.
+	 */
+	if (finalMerge)
+		mergepool(state);
+
+	/*
 	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
+	 * tape.  This may use "tuple proper" pool.
 	 */
 	mergepreread(state);
 
@@ -2920,6 +2967,152 @@ beginmerge(Tuplesortstate *state)
 }
 
 /*
+ * mergepool - initialize "tuple proper" memory pool
+ *
+ * This allows sequential access to sorted tuples buffered in memory from
+ * tapes/runs on disk during a final on-the-fly merge step.
+ */
+static void
+mergepool(Tuplesortstate *state)
+{
+	char	   *tupProperPool;
+	Size		poolSize = state->spacePerTape * state->activeTapes;
+	int			srcTape;
+	int			i;
+
+	/* Heap should be empty here */
+	Assert(state->memtupcount == 0);
+	Assert(state->activeTapes > 0);
+	Assert(!state->randomAccess);
+
+	/*
+	 * For the purposes of tuplesort's memory accounting, the memory pool is
+	 * not special, and so per-active-tape mergeavailmem is decreased only
+	 * as memory is consumed from the pool through regular USEMEM() calls
+	 * (see mergeprereadone()).  All memory is actually allocated here all
+	 * at once, with only rare exceptions, so freeing memory consumed from
+	 * the pool must for the most part also occur at a higher, "logical"
+	 * level.
+	 *
+	 * mergeavailmem is reset specially (logically freed) when the partition
+	 * space is exhausted for its tape when a new round of prereading is
+	 * required.  This is okay because there is no "global" availMem
+	 * consumer from here on.  The inability for such a consumer to consume
+	 * memory from the pool is therefore of no concern.
+	 *
+	 * mergeavailmem is only relied on by the on-the-fly merge step to
+	 * determine that space has been exhausted for a tape.  There still
+	 * needs to be some leeway to perform retail palloc() calls because
+	 * LACKMEM() is only a soft limit.  In general, mergeavailmem may be
+	 * insufficient memory for storing even one tuple.
+	 */
+	tupProperPool = MemoryContextAllocHuge(state->sortcontext, poolSize);
+	state->usePool = true;
+
+	i = 0;
+	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
+	{
+		if (!state->mergeactive[srcTape])
+			continue;
+
+		state->mergetupstart[srcTape] = tupProperPool + (i++ * state->spacePerTape);
+		/* Initialize current point into buffer */
+		state->mergetupcur[srcTape] = state->mergetupstart[srcTape];
+		state->mergetuptail[srcTape] = state->mergetupstart[srcTape];
+		state->mergeoverflow[srcTape] = NULL;
+	}
+	Assert(i == state->activeTapes);
+}
+
+/*
+ * mergepoolone - prepare preallocated pool for one merge input tape
+ *
+ * This is called following the exhaustion of preread tuples for one input
+ * tape.  While no memory is actually deallocated here (allocations should
+ * also almost never happen), this routine could be said to logically free
+ * the source tape's segment of the memory pool.  Of course, all that
+ * actually occurs is that the memory pool state for the source tape is
+ * reset to indicate that all memory may be reused.  Because the tape's
+ * mergeavailmem is also reset to the general per-active-tape share,
+ * calling here will generally result in LACKMEM() ceasing to apply.
+ *
+ * This routine must deal with fixing up the tuple that is about to be
+ * returned to the client, due to fine aspects of memory management.
+ */
+static void
+mergepoolone(Tuplesortstate *state, int srcTape, SortTuple *stup,
+			 bool *should_free)
+{
+	Size		lastTupLen;
+
+	/* By here, final on-the-fly merge step actually underway */
+	Assert(state->status == TSS_FINALMERGE);
+
+	/*
+	 * Tuple about to be returned to caller ("stup") is final preread tuple
+	 * from tape, just removed from the top of the heap.  Special steps
+	 * around memory management must be performed for that tuple.
+	 */
+	if (!state->mergeoverflow[srcTape])
+	{
+		/*
+		 * Mark tuple proper buffer range for reuse, but be careful to move
+		 * final, tail tuple to start of space for next run so that it's
+		 * available to caller when stup is returned, and remains available
+		 * at least until the next tuple is requested.
+		 */
+		lastTupLen =
+			state->mergetupcur[srcTape] - state->mergetuptail[srcTape];
+		state->mergetupcur[srcTape] = state->mergetupstart[srcTape];
+
+		memmove(state->mergetupcur[srcTape],
+				state->mergetuptail[srcTape],
+				lastTupLen);
+
+		/* Make SortTuple at top of the heap point to new "tuple proper" */
+		stup->tuple = (void *) state->mergetupcur[srcTape];
+		state->mergetupcur[srcTape] += lastTupLen;
+	}
+	else
+	{
+		/*
+		 * Handle an "overflow" retail palloc.
+		 *
+		 * This is needed on the rare occasions when very little work_mem is
+		 * available, particularly relative to the size of individual tuples
+		 * passed by caller.  Sometimes, tuplesort will only store one tuple
+		 * per tape in memtuples, so some amount of dynamic allocation is
+		 * inevitable.
+		 */
+		Size		tuplen;
+
+		/* No moving of last "tuple proper" required */
+		lastTupLen = 0;
+		state->mergetupcur[srcTape] = state->mergetupstart[srcTape];
+
+		 /* Returned tuple is no longer counted in our memory space */
+		if (stup->tuple)
+		{
+			Assert(stup->tuple == (void *) state->mergeoverflow[srcTape]);
+			tuplen = GetMemoryChunkSpace(stup->tuple);
+			state->availMem += tuplen;
+			state->mergeavailmem[srcTape] += tuplen;
+			/* Caller should free palloc'd tuple proper */
+			*should_free = true;
+		}
+		state->mergeoverflow[srcTape] = NULL;
+	}
+
+	/*
+	 * Give back tape's range in memory pool (accounting for tail contents
+	 * having been moved to front in the common case where there was no
+	 * handling of an "overflow" retail palloc).
+	 */
+	state->mergetuptail[srcTape] = state->mergetupstart[srcTape];
+	state->mergeavailmem[srcTape] = state->spacePerTape - lastTupLen;
+}
+
+/*
  * mergepreread - load tuples from merge input tapes
  *
  * This routine exists to improve sequentiality of reads during a merge pass,
@@ -2940,7 +3133,8 @@ beginmerge(Tuplesortstate *state)
  * that state and so no point in scanning through all the tapes to fix one.
  * (Moreover, there may be quite a lot of inactive tapes in that state, since
  * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.)
+ * merge we can expect most of the tapes to be active.  Plus, only FINALMERGE
+ * state has to consider memory management for the "tuple proper" memory pool.)
  */
 static void
 mergepreread(Tuplesortstate *state)
@@ -2948,7 +3142,7 @@ mergepreread(Tuplesortstate *state)
 	int			srcTape;
 
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
+		mergeprereadone(state, srcTape, NULL, NULL);
 }
 
 /*
@@ -2957,9 +3151,15 @@ mergepreread(Tuplesortstate *state)
  * Read tuples from the specified tape until it has used up its free memory
  * or array slots; but ensure that we have at least one tuple, if any are
  * to be had.
+ *
+ * FINALMERGE state passes *rtup and *should_free variables, to have
+ * pool-related memory management responsibilities handled by
+ * mergepoolone().  Otherwise a memory pool isn't used, and this is not
+ * required.
  */
 static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+mergeprereadone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
+				bool *should_free)
 {
 	unsigned int tuplen;
 	SortTuple	stup;
@@ -2969,6 +3169,26 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
 
 	if (!state->mergeactive[srcTape])
 		return;					/* tape's run is already exhausted */
+
+	/*
+	 * Manage memory pool segment for tape (if pool is in use).
+	 *
+	 * This is also a natural point to redraw partitioning boundaries inside
+	 * the preread memory pool, by donating the unneeded memory of our tape
+	 * (once it becomes exhausted) to some adjacent active tape.
+	 *
+	 * For now we don't bother, though.  It does not seem to help much even
+	 * in the event of a strong logical/physical correlation, where earlier
+	 * runs will be drained before later runs return their earliest/lowest
+	 * tuples.  A contributing factor must be that the size of memtuples
+	 * during the merge phase is based on memory accounting with significant
+	 * palloc() overhead.  There is almost always initially somewhat more
+	 * memory for each tape in the pool than is needed, because the number
+	 * of slots available tends to be the limiting factor then.
+	 */
+	if (rtup)
+		mergepoolone(state, srcTape, rtup, should_free);
+
 	priorAvail = state->availMem;
 	state->availMem = state->mergeavailmem[srcTape];
 	while ((state->mergeavailslots[srcTape] > 0 && !LACKMEM(state)) ||
@@ -3616,6 +3836,68 @@ markrunend(Tuplesortstate *state, int tapenum)
 	LogicalTapeWrite(state->tapeset, tapenum, (void *) &len, sizeof(len));
 }
 
+/*
+ * Allocate memory either using palloc(), or using a dedicated memory pool
+ * "logical allocation" during tuple preloading.  READTUP() routines call
+ * here in place of a palloc() and USEMEM() call.
+ *
+ * READTUP() routines may receive memory from the memory pool when calling
+ * here, but at that point the tuples cannot subsequently be used with
+ * WRITETUP() routines, since they are unprepared for any "tuple proper" not
+ * allocated with a retail palloc().
+ *
+ * In the main, it doesn't seem worth optimizing the case where WRITETUP()
+ * must be called following consuming memory from the "tuple proper" pool
+ * (and allocating the pool earlier).  During initial sorts of runs, the
+ * order in which tuples will finally need to be written out is
+ * unpredictable, so no big improvement in cache efficiency should be
+ * expected from using a memory pool.  Multi-pass sorts are usually
+ * unnecessary and generally best avoided, so again, the non-use of a pool
+ * is not considered a problem.  However, randomAccess callers may
+ * appreciably benefit from using a memory pool, and may be revisited as a
+ * target for memory pooling in the future.
+ */
+static void *
+tupproperalloc(Tuplesortstate *state, int tapenum, Size size)
+{
+	Size		reserve_size = MAXALIGN(size);
+	char	   *ret;
+
+	if (!state->usePool)
+	{
+		/* Memory pool not in use */
+		ret = palloc(size);
+		USEMEM(state, GetMemoryChunkSpace(ret));
+	}
+	else if (state->mergetupcur[tapenum] + reserve_size <
+			 state->mergetupstart[tapenum] + state->spacePerTape)
+	{
+		/*
+		 * Usual case -- caller is returned memory from its tape's partition
+		 * in buffer, since there is an adequate supply of memory.
+		 */
+		ret = state->mergetuptail[tapenum] = state->mergetupcur[tapenum];
+
+		/*
+		 * Logically, for accounting purposes, memory from our pool has only
+		 * been allocated now.  We expect reserve_size to actually diminish
+		 * our tape's mergeavailmem, not "global" availMem.
+		 */
+		state->mergetupcur[tapenum] += reserve_size;
+		USEMEM(state, reserve_size);
+	}
+	else
+	{
+		/* Should only use overflow allocation once per tape per level */
+		Assert(state->mergeoverflow[tapenum] == NULL);
+		/* palloc() in ordinary way */
+		ret = state->mergeoverflow[tapenum] = palloc(size);
+		USEMEM(state, GetMemoryChunkSpace(ret));
+	}
+
+	return ret;
+}
+
 
 /*
  * Routines specialized for HeapTuple (actually MinimalTuple) case
@@ -3778,6 +4060,7 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
+	Assert(!state->usePool);
 	FREEMEM(state, GetMemoryChunkSpace(tuple));
 	heap_free_minimal_tuple(tuple);
 }
@@ -3788,11 +4071,10 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) palloc(tuplen);
+	MinimalTuple tuple = (MinimalTuple) tupproperalloc(state, tapenum, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	/* read in the tuple proper */
 	tuple->t_len = tuplen;
 	LogicalTapeReadExact(state->tapeset, tapenum,
@@ -4013,6 +4295,7 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
 
+	Assert(!state->usePool);
 	FREEMEM(state, GetMemoryChunkSpace(tuple));
 	heap_freetuple(tuple);
 }
@@ -4022,9 +4305,9 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int tuplen)
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
-	HeapTuple	tuple = (HeapTuple) palloc(t_len + HEAPTUPLESIZE);
+	HeapTuple	tuple = (HeapTuple) tupproperalloc(state, tapenum,
+												   t_len + HEAPTUPLESIZE);
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	/* Reconstruct the HeapTupleData header */
 	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
 	tuple->t_len = t_len;
@@ -4315,6 +4598,7 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
+	Assert(!state->usePool);
 	FREEMEM(state, GetMemoryChunkSpace(tuple));
 	pfree(tuple);
 }
@@ -4324,9 +4608,8 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) palloc(tuplen);
+	IndexTuple	tuple = (IndexTuple) tupproperalloc(state, tapenum, tuplen);
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
 	if (state->randomAccess)	/* need trailing length word? */
@@ -4406,6 +4689,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
 
+	Assert(!state->usePool);
 	if (stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
@@ -4436,14 +4720,13 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = palloc(tuplen);
+		void	   *raddr = tupproperalloc(state, tapenum, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
 		stup->datum1 = PointerGetDatum(raddr);
 		stup->isnull1 = false;
 		stup->tuple = raddr;
-		USEMEM(state, GetMemoryChunkSpace(raddr));
 	}
 
 	if (state->randomAccess)	/* need trailing length word? */
@@ -4457,6 +4740,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 static void
 free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
 {
+	Assert(!state->usePool);
 	FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 	pfree(stup->tuple);
 }
-- 
1.9.1

From d336cf08f5bb1b121ea470abde34ec6de6a37ead Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 1/6] Quicksort when performing external sorts

Add "quicksort with spillover".  This allows an external sort that has a
work_mem setting just short of one that would allow an internal sort to
perform almost as well as an internal sort, quicksorting perhaps almost
all tuples rather than performing a degenerate heapsort.  For these
marginal external sorts, most I/O can be avoided.  Sort performance is
made much more predictable as tables gradually increase in size.

The benefits of replacement selection (the currentRun + nextRun heap)
are seen only where incremental spilling, rather than spilling in
batches, allows a "quicksort with spillover" to ultimately write out
almost no tuples.  A heap is otherwise not worth its cost in CPU cache
misses, even when it can produce runs that are much larger than with
the new approach.  "Quicksort with spillover" is the sole remaining
justification for using a heap for even the first run.

For that reason, tuplesort callers now provide a total row estimate
hint, used to determine whether using a heap is worthwhile for the
first run (now the only run where it's even possible).  The row
estimate feeds a simple ad-hoc cost model.  If "quicksort with
spillover" is unlikely to save much I/O, or if I/O is unlikely to be a
significant cost overall, it is worth avoiding any use of a heap in the
first place.
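
To illustrate the shape of that decision only (hypothetical names and
threshold; this is not the cost model actually used by useselection()):

    #include <stdbool.h>

    static bool
    heap_worthwhile(double row_hint, double rows_fitting_in_memory)
    {
        double      spill_fraction;

        /* Hint may be missing or bogus; also covers the internal-sort case */
        if (row_hint <= 0 || row_hint <= rows_fitting_in_memory)
            return false;

        /* Fraction of the input that would still be written out to tape */
        spill_fraction = (row_hint - rows_fitting_in_memory) / row_hint;

        /* Only a marginal external sort avoids most I/O this way */
        return spill_fraction < 0.25;   /* threshold purely illustrative */
    }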
---
 src/backend/access/hash/hash.c         |   2 +-
 src/backend/access/hash/hashsort.c     |   4 +-
 src/backend/access/nbtree/nbtree.c     |  11 +-
 src/backend/access/nbtree/nbtsort.c    |  10 +-
 src/backend/catalog/index.c            |   1 +
 src/backend/commands/cluster.c         |   4 +-
 src/backend/commands/explain.c         |  13 +-
 src/backend/executor/nodeAgg.c         |  26 +-
 src/backend/executor/nodeSort.c        |   1 +
 src/backend/utils/adt/orderedsetaggs.c |  13 +-
 src/backend/utils/sort/tuplesort.c     | 762 ++++++++++++++++++++++++++++-----
 src/include/access/hash.h              |   3 +-
 src/include/access/nbtree.h            |   2 +-
 src/include/executor/nodeAgg.h         |   2 +
 src/include/utils/tuplesort.h          |  18 +-
 15 files changed, 748 insertions(+), 124 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 24b06a5..8f71980 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,7 +86,7 @@ hashbuild(PG_FUNCTION_ARGS)
 	 * one page.
 	 */
 	if (num_buckets >= (uint32) NBuffers)
-		buildstate.spool = _h_spoolinit(heap, index, num_buckets);
+		buildstate.spool = _h_spoolinit(heap, index, num_buckets, reltuples);
 	else
 		buildstate.spool = NULL;
 
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index c67c057..5c7e137 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -44,7 +44,8 @@ struct HSpool
  * create and initialize a spool structure
  */
 HSpool *
-_h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
+_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+			 double reltuples)
 {
 	HSpool	   *hspool = (HSpool *) palloc0(sizeof(HSpool));
 	uint32		hash_mask;
@@ -71,6 +72,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 												   index,
 												   hash_mask,
 												   maintenance_work_mem,
+												   reltuples,
 												   false);
 
 	return hspool;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..0957e0f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "optimizer/plancat.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -85,7 +86,9 @@ btbuild(PG_FUNCTION_ARGS)
 	Relation	index = (Relation) PG_GETARG_POINTER(1);
 	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
 	IndexBuildResult *result;
+	BlockNumber relpages;
 	double		reltuples;
+	double		allvisfrac;
 	BTBuildState buildstate;
 
 	buildstate.isUnique = indexInfo->ii_Unique;
@@ -100,6 +103,9 @@ btbuild(PG_FUNCTION_ARGS)
 		ResetUsage();
 #endif   /* BTREE_BUILD_STATS */
 
+	/* Estimate the number of rows currently present in the table */
+	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
+
 	/*
 	 * We expect to be called exactly once for any index relation. If that's
 	 * not the case, big trouble's what we have.
@@ -108,14 +114,15 @@ btbuild(PG_FUNCTION_ARGS)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false);
+	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false,
+									 reltuples);
 
 	/*
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.
 	 */
 	if (indexInfo->ii_Unique)
-		buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+		buildstate.spool2 = _bt_spoolinit(heap, index, false, true, reltuples);
 
 	/* do the heap scan */
 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..0d4a5ea 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -149,7 +149,8 @@ static void _bt_load(BTWriteState *wstate,
  * create and initialize a spool structure
  */
 BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead,
+			  double reltuples)
 {
 	BTSpool    *btspool = (BTSpool *) palloc0(sizeof(BTSpool));
 	int			btKbytes;
@@ -165,10 +166,15 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
 	 * unique index actually requires two BTSpool objects.  We expect that the
 	 * second one (for dead tuples) won't get very full, so we give it only
 	 * work_mem.
+	 *
+	 * reltuples hint does not account for factors like whether or not this is
+	 * a partial index, or if this is second BTSpool object, because it seems
+	 * more conservative to estimate high.
 	 */
 	btKbytes = isdead ? work_mem : maintenance_work_mem;
 	btspool->sortstate = tuplesort_begin_index_btree(heap, index, isunique,
-													 btKbytes, false);
+													 btKbytes, reltuples,
+													 false);
 
 	return btspool;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..88ee81d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2835,6 +2835,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator,
 											InvalidOid, false,
 											maintenance_work_mem,
+											ivinfo.num_heap_tuples,
 											false);
 	state.htups = state.itups = state.tups_inserted = 0;
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..23f6459 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -891,7 +891,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	/* Set up sorting if wanted */
 	if (use_sort)
 		tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
-											maintenance_work_mem, false);
+											maintenance_work_mem,
+											OldHeap->rd_rel->reltuples,
+											false);
 	else
 		tuplesort = NULL;
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 183d3d9..65d8c3b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2114,20 +2114,27 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 		const char *sortMethod;
 		const char *spaceType;
 		long		spaceUsed;
+		int			rowsSortedMem;
 
-		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed,
+							&rowsSortedMem);
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
-							 sortMethod, spaceType, spaceUsed);
+			appendStringInfo(es->str,
+							 "Sort Method: %s  %s: %ldkB  Rows In Memory: %d\n",
+							 sortMethod,
+							 spaceType,
+							 spaceUsed,
+							 rowsSortedMem);
 		}
 		else
 		{
 			ExplainPropertyText("Sort Method", sortMethod, es);
 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
 			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Rows In Memory", rowsSortedMem, es);
 		}
 	}
 }
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..f580cca 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -520,6 +520,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
+												  sortnode->plan.plan_rows,
 												  false);
 	}
 
@@ -588,7 +589,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, false);
+									  work_mem, agg_input_rows(aggstate),
+									  false);
 		else
 			pertrans->sortstates[aggstate->current_set] =
 				tuplesort_begin_heap(pertrans->evaldesc,
@@ -597,7 +599,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, false);
+									 work_mem, agg_input_rows(aggstate),
+									 false);
 	}
 
 	/*
@@ -1439,6 +1442,25 @@ find_hash_columns(AggState *aggstate)
 }
 
 /*
+ * Estimate the number of rows input to the sorter.
+ *
+ * Exported for use by ordered-set aggregates.
+ */
+double
+agg_input_rows(AggState *aggstate)
+{
+	Plan	   *outerNode;
+
+	/*
+	 * Get information about the size of the relation to be sorted (it's the
+	 * "outer" subtree of this node)
+	 */
+	outerNode = outerPlanState(aggstate)->plan;
+
+	return outerNode->plan_rows;
+}
+
+/*
  * Estimate per-hash-table-entry overhead for the planner.
  *
  * Note that the estimate does not include space for pass-by-reference
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..e4b1104 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -89,6 +89,7 @@ ExecSort(SortState *node)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
+											  plannode->plan.plan_rows,
 											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..b51a945 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -20,6 +20,7 @@
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
 #include "executor/executor.h"
+#include "executor/nodeAgg.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/tlist.h"
@@ -103,6 +104,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 {
 	OSAPerGroupState *osastate;
 	OSAPerQueryState *qstate;
+	AggState		 *aggstate;
 	MemoryContext gcontext;
 	MemoryContext oldcontext;
 
@@ -117,8 +119,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	/*
 	 * We keep a link to the per-query state in fn_extra; if it's not there,
 	 * create it, and do the per-query setup we need.
+	 *
+	 * aggstate is used to get hint on total number of tuples for tuplesort.
 	 */
 	qstate = (OSAPerQueryState *) fcinfo->flinfo->fn_extra;
+	aggstate = (AggState *) fcinfo->context;
 	if (qstate == NULL)
 	{
 		Aggref	   *aggref;
@@ -276,13 +281,17 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortOperators,
 												   qstate->sortCollations,
 												   qstate->sortNullsFirsts,
-												   work_mem, false);
+												   work_mem,
+												   agg_input_rows(aggstate),
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
 													qstate->sortCollation,
 													qstate->sortNullsFirst,
-													work_mem, false);
+													work_mem,
+													agg_input_rows(aggstate),
+													false);
 
 	osastate->number_of_rows = 0;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6572af8..da0453e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -8,17 +8,19 @@
  * if necessary).  It works efficiently for both small and large amounts
  * of data.  Small amounts are sorted in-memory using qsort().  Large
  * amounts are sorted using temporary files and a standard external sort
- * algorithm.
+ * algorithm with numerous significant enhancements.
  *
  * See Knuth, volume 3, for more than you want to know about the external
- * sorting algorithm.  We divide the input into sorted runs using replacement
- * selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase
- * merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D
- * are implemented by logtape.c, which avoids space wastage by recycling
- * disk space as soon as each block is read from its "tape".
+ * sorting algorithm.  Historically, we divided the input into sorted runs
+ * using replacement selection, in the form of a priority tree implemented
+ * as a heap (essentially his Algorithm 5.2.3H -- although that strategy is
+ * often avoided altogether), but there are now numerous special case
+ * optimizations.  We still merge the runs using polyphase merge, Knuth's
+ * Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D are
+ * implemented by logtape.c, which avoids space wastage by recycling disk
+ * space as soon as each block is read from its "tape".
  *
- * We do not form the initial runs using Knuth's recommended replacement
+ * We never form the initial runs using Knuth's recommended replacement
  * selection data structure (Algorithm 5.4.1R), because it uses a fixed
  * number of records in memory at all times.  Since we are dealing with
  * tuples that may vary considerably in size, we want to be able to vary
@@ -28,7 +30,25 @@
  * Algorithm 5.4.1R, each record is stored with the run number that it
  * must go into, and we use (run number, key) as the ordering key for the
  * heap.  When the run number at the top of the heap changes, we know that
- * no more records of the prior run are left in the heap.
+ * no more records of the prior run are left in the heap.  Note that there
+ * are in practice only ever two valid run numbers: 0 (first), and 1
+ * (second).  This is due to the greatly reduced use of replacement
+ * selection starting in PostgreSQL 9.6.
+ *
+ * Starting with PostgreSQL 9.6, a heap is only used when incremental
+ * spilling can enable a "quicksort with spillover" to avoid most I/O
+ * (which is why there are now only two valid run numbers possible in
+ * practice).  We prefer a simple hybrid sort-merge strategy most of the
+ * time, where runs are sorted in much the same way as the entire input of
+ * an internal sort is sorted.  There are several reasons for this.
+ * Maintaining a priority tree/heap has poor cache characteristics.  At the
+ * same time, the growth in main memory sizes has greatly diminished the
+ * value of having runs that are larger than available work_mem, even in
+ * the case where there is partially sorted input and runs can be made far
+ * larger by using a heap; we'll tend to still get a single-pass merge step
+ * in the end.  Even if we don't, the additional random I/O that could have
+ * been avoided with a heap will tend to be relatively inexpensive due to
+ * modern block device characteristics.
  *
  * The approximate amount of memory allowed for any one sort operation
  * is specified in kilobytes by the caller (most pass work_mem).  Initially,
@@ -160,11 +180,15 @@ bool		optimize_bounded_sort = true;
  * described above.  Accordingly, "tuple" is always used in preference to
  * datum1 as the authoritative value for pass-by-reference cases.
  *
- * While building initial runs, tupindex holds the tuple's run number.  During
- * merge passes, we re-use it to hold the input tape number that each tuple in
- * the heap was read from, or to hold the index of the next tuple pre-read
- * from the same tape in the case of pre-read entries.  tupindex goes unused
- * if the sort occurs entirely in memory.
+ * While building initial runs, tupindex holds the tuple's run number.
+ * Historically, the run number could meaningfully distinguish many runs, but
+ * currently it only meaningfully distinguishes the first run with any other
+ * run (in practice this is limited to the second run), since replacement
+ * selection is abandoned after the first run.  During merge passes, we re-use
+ * it to hold the input tape number that each tuple in the heap was read from,
+ * or to hold the index of the next tuple pre-read from the same tape in the
+ * case of pre-read entries.  tupindex goes unused if the sort occurs entirely
+ * in memory.
  */
 typedef struct
 {
@@ -186,6 +210,7 @@ typedef enum
 	TSS_BUILDRUNS,				/* Loading tuples; writing to tape */
 	TSS_SORTEDINMEM,			/* Sort completed entirely in memory */
 	TSS_SORTEDONTAPE,			/* Sort completed, final run is on tape */
+	TSS_MEMTAPEMERGE,			/* Performing memory/tape merge on-the-fly */
 	TSS_FINALMERGE				/* Performing final merge on-the-fly */
 } TupSortStatus;
 
@@ -214,6 +239,7 @@ struct Tuplesortstate
 {
 	TupSortStatus status;		/* enumerated value as shown above */
 	int			nKeys;			/* number of columns in sort key */
+	double		rowNumHint;		/* caller's hint of total # of rows */
 	bool		randomAccess;	/* did caller request random access? */
 	bool		bounded;		/* did caller specify a maximum number of
 								 * tuples to return? */
@@ -280,6 +306,13 @@ struct Tuplesortstate
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
 	/*
+	 * While building initial runs, this indicates if the replacement
+	 * selection strategy is in use.  When it isn't, then a simple hybrid
+	 * sort-merge strategy is in use instead.
+	 */
+	bool		replaceActive;
+
+	/*
 	 * While building initial runs, this is the current output run number
 	 * (starting at 0).  Afterwards, it is the number of initial runs we made.
 	 */
@@ -327,12 +360,22 @@ struct Tuplesortstate
 	int			activeTapes;	/* # of active input tapes in merge pass */
 
 	/*
+	 * These variables are used after tuplesort_performsort() for the
+	 * TSS_MEMTAPEMERGE case.  This is a special, optimized final on-the-fly
+	 * merge pass involving merging the result tape with memtuples that were
+	 * quicksorted (but never made it out to a tape).
+	 */
+	SortTuple	tape_cache;		/* cached tape tuple from prior call */
+	bool		cached;			/* tape_cache holds pending tape tuple */
+	bool		just_memtuples;	/* merge only fetching from memtuples */
+
+	/*
 	 * These variables are used after completion of sorting to keep track of
 	 * the next tuple to return.  (In the tape case, the tape's current read
 	 * position is also critical state.)
 	 */
 	int			result_tape;	/* actual tape number of finished output */
-	int			current;		/* array index (only used if SORTEDINMEM) */
+	int			current;		/* memtuples array index */
 	bool		eof_reached;	/* reached EOF (needed for cursors) */
 
 	/* markpos_xxx holds marked position for mark and restore */
@@ -457,19 +500,24 @@ struct Tuplesortstate
 	} while(0)
 
 
-static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
+static Tuplesortstate *tuplesort_begin_common(int workMem, double rowNumHint,
+									  bool randomAccess);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
+static bool useselection(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
+static void mergememruns(Tuplesortstate *state);
 static void beginmerge(Tuplesortstate *state);
 static void mergepreread(Tuplesortstate *state);
 static void mergeprereadone(Tuplesortstate *state, int srcTape);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
+static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
 static void sort_bounded_heap(Tuplesortstate *state);
+static void tuplesort_sort_memtuples(Tuplesortstate *state);
 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 					  int tupleindex, bool checkIndex);
 static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
@@ -533,12 +581,14 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
  * Each variant of tuplesort_begin has a workMem parameter specifying the
  * maximum number of kilobytes of RAM to use before spilling data to disk.
  * (The normal value of this parameter is work_mem, but some callers use
- * other values.)  Each variant also has a randomAccess parameter specifying
- * whether the caller needs non-sequential access to the sort result.
+ * other values.)  Each variant also has a rowNumHint parameter giving the
+ * expected total number of rows to be sorted, and a randomAccess parameter
+ * specifying whether the caller needs non-sequential access to the sort
+ * result.  Since
+ * rowNumHint is just a hint, it's acceptable for it to be zero or negative.
  */
 
 static Tuplesortstate *
-tuplesort_begin_common(int workMem, bool randomAccess)
+tuplesort_begin_common(int workMem, double rowNumHint, bool randomAccess)
 {
 	Tuplesortstate *state;
 	MemoryContext sortcontext;
@@ -568,6 +618,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 #endif
 
 	state->status = TSS_INITIAL;
+	state->rowNumHint = rowNumHint;
 	state->randomAccess = randomAccess;
 	state->bounded = false;
 	state->boundUsed = false;
@@ -613,9 +664,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess)
+					 int workMem, double rowNumHint,
+					 bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
@@ -683,9 +736,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 Tuplesortstate *
 tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
-						int workMem, bool randomAccess)
+						int workMem,
+						double rowNumHint, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	ScanKey		indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
@@ -776,9 +831,11 @@ Tuplesortstate *
 tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
-							int workMem, bool randomAccess)
+							int workMem,
+							double rowNumHint, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	ScanKey		indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
@@ -851,9 +908,11 @@ Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
 						   uint32 hash_mask,
-						   int workMem, bool randomAccess)
+						   int workMem,
+						   double rowNumHint, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	MemoryContext oldcontext;
 
 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -886,9 +945,10 @@ tuplesort_begin_index_hash(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag,
-					  int workMem, bool randomAccess)
+					  int workMem, double rowNumHint, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	MemoryContext oldcontext;
 	int16		typlen;
 	bool		typbyval;
@@ -1490,22 +1550,61 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 
 			/*
 			 * Insert the tuple into the heap, with run number currentRun if
-			 * it can go into the current run, else run number currentRun+1.
-			 * The tuple can go into the current run if it is >= the first
-			 * not-yet-output tuple.  (Actually, it could go into the current
-			 * run if it is >= the most recently output tuple ... but that
-			 * would require keeping around the tuple we last output, and it's
-			 * simplest to let writetup free each tuple as soon as it's
-			 * written.)
+			 * it can go into the current run, else run number INT_MAX (some
+			 * later run).  The tuple can go into the current run if it is
+			 * >= the first not-yet-output tuple.  (Actually, it could go
+			 * into the current run if it is >= the most recently output
+			 * tuple ... but that would require keeping around the tuple we
+			 * last output, and it's simplest to let writetup free each
+			 * tuple as soon as it's written.)
 			 *
-			 * Note there will always be at least one tuple in the heap at
-			 * this point; see dumptuples.
+			 * Note that this only applies if the currentRun is 0 (prior to
+			 * giving up on heapification).  There is no meaningful
+			 * distinction between any two runs in memory except the first
+			 * and second run.  When the currentRun is not 0, there is no
+			 * guarantee that any tuples are already stored in memory here,
+			 * and if there are any they're in no significant order.
 			 */
-			Assert(state->memtupcount > 0);
-			if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			Assert(!state->replaceActive || state->memtupcount > 0);
+			if (state->replaceActive &&
+				COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			{
+				/*
+				 * Unlike classic replacement selection, which this module was
+				 * previously based on, only run 0 is treated as a priority
+				 * queue through heapification.  The second run (run 1) is
+				 * appended indifferently below, and will never be trusted to
+				 * maintain the heap invariant beyond simply not getting in
+				 * the way of spilling run 0 incrementally.  In other words,
+				 * second run tuples may be sifted out of the way of first
+				 * run tuples; COMPARETUP() will never be called for run
+				 * 1 tuples.  For any later run's tuples, not even
+				 * HEAPCOMPARE() will be called.
+				 */
 				tuplesort_heap_insert(state, tuple, state->currentRun, true);
+			}
 			else
-				tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+			{
+				/*
+				 * Note that unlike Knuth, we do not care about the second
+				 * run's tuples when loading runs.  After the first run is
+				 * complete, tuples will not be dumped incrementally at all,
+				 * but as long as the first run (run 0) is current it will
+				 * be maintained.  dumptuples does not trust that the second
+				 * or subsequent runs are heapified (beyond merely not
+				 * getting in the way of the first, fully heapified run,
+				 * which only matters for the second run, run 1).  Anything
+				 * past the first run will be quicksorted.
+				 *
+				 * Past the first run, there is no need to differentiate runs
+				 * in memory (only the first and second runs will ever be
+				 * usefully differentiated).  Use a generic INT_MAX run
+				 * number (just to be tidy).  There should always be room to
+				 * store the incoming tuple.
+				 */
+				tuple->tupindex = INT_MAX;
+				state->memtuples[state->memtupcount++] = *tuple;
+			}
 
 			/*
 			 * If we are over the memory limit, dump tuples till we're under.
@@ -1580,20 +1679,9 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * We were able to accumulate all the tuples within the allowed
-			 * amount of memory.  Just qsort 'em and we're done.
+			 * amount of memory.  Just quicksort 'em and we're done.
 			 */
-			if (state->memtupcount > 1)
-			{
-				/* Can we use the single-key sort function? */
-				if (state->onlyKey != NULL)
-					qsort_ssup(state->memtuples, state->memtupcount,
-							   state->onlyKey);
-				else
-					qsort_tuple(state->memtuples,
-								state->memtupcount,
-								state->comparetup,
-								state);
-			}
+			tuplesort_sort_memtuples(state);
 			state->current = 0;
 			state->eof_reached = false;
 			state->markpos_offset = 0;
@@ -1620,12 +1708,27 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * Finish tape-based sort.  First, flush all tuples remaining in
-			 * memory out to tape; then merge until we have a single remaining
-			 * run (or, if !randomAccess, one run per tape). Note that
-			 * mergeruns sets the correct state->status.
+			 * memory out to tape where that's required (when more than one
+			 * run's tuples made it to tape, or when the caller required
+			 * random access).  Then, either merge until we have a single
+			 * remaining run on tape, or merge runs in memory by sorting
+			 * them into one single in-memory run.  Note that
+			 * mergeruns/mergememruns sets the correct state->status.
 			 */
-			dumptuples(state, true);
-			mergeruns(state);
+			if (state->currentRun > 0 || state->randomAccess)
+			{
+				dumptuples(state, true);
+				mergeruns(state);
+			}
+			else
+			{
+				/*
+				 * "Quicksort with spillover".  Only possible for
+				 * !randomAccess callers, just as with the tape-based
+				 * on-the-fly merge.
+				 */
+				mergememruns(state);
+			}
 			state->eof_reached = false;
 			state->markpos_block = 0L;
 			state->markpos_offset = 0;
@@ -1644,6 +1747,9 @@ tuplesort_performsort(Tuplesortstate *state)
 			elog(LOG, "performsort done (except %d-way final merge): %s",
 				 state->activeTapes,
 				 pg_rusage_show(&state->ru_start));
+		else if (state->status == TSS_MEMTAPEMERGE)
+			elog(LOG, "performsort done (except memory/tape final merge): %s",
+				 pg_rusage_show(&state->ru_start));
 		else
 			elog(LOG, "performsort done: %s",
 				 pg_rusage_show(&state->ru_start));
@@ -1795,6 +1901,104 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			READTUP(state, stup, state->result_tape, tuplen);
 			return true;
 
+		case TSS_MEMTAPEMERGE:
+			Assert(forward);
+			/* For now, assume tuple returned from memory */
+			*should_free = false;
+
+			/*
+			 * Should be at least one memtuple (work_mem should be roughly
+			 * fully consumed)
+			 */
+			Assert(state->memtupcount > 0);
+
+			if (state->eof_reached)
+				return false;
+
+			if (state->just_memtuples)
+				goto just_memtuples;
+
+			/*
+			 * Merge together quicksorted memtuples array run (which often
+			 * consists of what were originally 2 "heap-wise" runs), and
+			 * tuples sorted on tape (which originally formed the head of
+			 * the first heap-wise run).
+			 *
+			 * Exhaust the supply of tape tuples first.
+			 *
+			 * "stup" is always initially set to the current tape tuple if
+			 * any remain, which may be cached from previous call, or read
+			 * from tape when nothing cached.
+			 */
+			if (state->cached)
+				*stup = state->tape_cache;
+			else if ((tuplen = getlen(state, state->result_tape, true)) != 0)
+				READTUP(state, stup, state->result_tape, tuplen);
+			else
+			{
+				/* Supply of tape tuples was just exhausted */
+				state->just_memtuples = true;
+				goto just_memtuples;
+			}
+
+			/*
+			 * Kludge:  Trigger abbreviated tie-breaker if in-memory tuples
+			 * use abbreviation (writing tuples to tape never preserves
+			 * abbreviated keys).  Do this by assigning the current
+			 * in-memory tuple's abbreviated key to the tape tuple directly.
+			 *
+			 * It doesn't seem worth generating a new abbreviated key for
+			 * the tape tuple, and this approach is simpler than
+			 * "unabbreviating" the memtuple from a "common" routine like
+			 * this.
+			 */
+			if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+				stup->datum1 = state->memtuples[state->current].datum1;
+
+			/*
+			 * Compare current tape tuple to current memtuple.
+			 *
+			 * Since we always start with at least one memtuple, and since tape
+			 * tuples are always returned before equal memtuples, it follows
+			 * that there must be at least one memtuple left to return here.
+			 */
+			Assert(state->current < state->memtupcount);
+
+			if (COMPARETUP(state, stup, &state->memtuples[state->current]) <= 0)
+			{
+				/*
+				 * Tape tuple less than or equal to memtuple array current
+				 * position.  Return it.
+				 */
+				state->cached = false;
+				/* Caller can free tape tuple memory */
+				*should_free = true;
+			}
+			else
+			{
+				/*
+				 * Tape tuple greater than memtuple array's current tuple.
+				 *
+				 * Return the current memtuple, and cache the tape tuple for
+				 * a later call.  It will be returned on the next or some
+				 * subsequent call.
+				 */
+				state->tape_cache = *stup;
+				state->cached = true;
+				*stup = state->memtuples[state->current++];
+			}
+			return true;
+
+just_memtuples:
+			/* Just return memtuples -- merging done */
+			if (state->current < state->memtupcount)
+			{
+				*stup = state->memtuples[state->current++];
+				return true;
+			}
+			state->eof_reached = true;
+			return false;
+
 		case TSS_FINALMERGE:
 			Assert(forward);
 			*should_free = true;
@@ -2004,6 +2208,7 @@ tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
 			return false;
 
 		case TSS_SORTEDONTAPE:
+		case TSS_MEMTAPEMERGE:
 		case TSS_FINALMERGE:
 
 			/*
@@ -2066,6 +2271,89 @@ tuplesort_merge_order(int64 allowedMem)
 }
 
 /*
+ * useselection - determine if one replacement selection run should be
+ * attempted.
+ *
+ * This is called when we just ran out of memory, and must consider costs
+ * and benefits of replacement selection for first run, which can result in
+ * a "quicksort with spillover".  Note that replacement selection is always
+ * abandoned after the first run.
+ */
+static bool
+useselection(Tuplesortstate *state)
+{
+	int64		memNowUsed = state->allowedMem - state->availMem;
+	double		avgTupleSize;
+	int			increments;
+	double		crossover;
+	bool		useSelection;
+
+	/* For randomAccess callers, "quicksort with spillover" is never used */
+	if (state->randomAccess)
+		return false;
+
+	/*
+	 * Crossover point is somewhere between where memtuples is between 40%
+	 * The crossover point lies somewhere between memtuples holding 40% of
+	 * the total tuples to sort and it holding all but one of them.  This
+	 * weighs the approximate savings in I/O against the generic cost of
+	 * heap sorting.
+	avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+	/*
+	 * Starting from a threshold of 90%, refund 7.5% per 32 byte
+	 * average-size-increment.
+	 */
+	increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+	crossover = 0.90 - (increments * 0.075);
+
+	/*
+	 * Clamp, making either outcome possible regardless of average size.
+	 *
+	 * 40% is about the minimum point at which "quicksort with spillover"
+	 * can still occur without a logical/physical correlation.
+	 */
+	crossover = Max(0.40, Min(crossover, 0.85));
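+
+	/*
+	 * Illustrative arithmetic (hypothetical numbers):  with an average
+	 * tuple size of 96 bytes, increments = 96 / 32 = 3, so crossover =
+	 * 0.90 - (3 * 0.075) = 0.675, which the clamp above leaves unchanged.
+	 * Replacement selection would then only be attempted when memtuples
+	 * already holds more than 67.5% of the hinted total row count
+	 * (subject to the MaxAllocSize adjustment below).
+	 */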
+
+	/*
+	 * The point where the overhead of maintaining the heap invariant is
+	 * likely to dominate over any saving in I/O is somewhat arbitrarily
+	 * assumed to be the point where memtuples' size exceeds MaxAllocSize
+	 * (note that overall memory consumption may be far greater).  Past
+	 * this point, only the most compelling cases use replacement selection
+	 * for their first run.
+	 *
+	 * This is not about cache characteristics so much as the O(n log n)
+	 * cost of sorting larger runs dominating over the O(n) cost of
+	 * writing/reading tuples.
+	 */
+	if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+		crossover = avgTupleSize > 32 ? 0.90 : 0.95;
+
+	useSelection = state->memtupcount > state->rowNumHint * crossover;
+
+	/*
+	 * Sanity check:  If the hint is less than half of the number of input
+	 * tuples already known to exist (those currently in memtuples), then
+	 * the hint is clearly bogus -- it's probably a generic estimate.  Do
+	 * not proceed.
+	 */
+	if (state->rowNumHint < state->memtupcount * 0.5)
+		useSelection = false;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "%s used at row %d crossover %.3f (est %.2f rows %.2f runs)",
+			 useSelection ?
+			 "replacement selection (quicksort with spillover) strategy" :
+			 "hybrid sort-merge strategy",
+			 state->memtupcount, crossover, state->rowNumHint,
+			 state->rowNumHint / state->memtupcount);
+#endif
+
+	return useSelection;
+}
+
+/*
  * inittapes - initialize for tape sorting.
  *
  * This is called only if we have found we don't have room to sort in memory.
@@ -2074,7 +2362,6 @@ static void
 inittapes(Tuplesortstate *state)
 {
 	int			maxTapes,
-				ntuples,
 				j;
 	int64		tapeSpace;
 
@@ -2133,23 +2420,38 @@ inittapes(Tuplesortstate *state)
 	state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
 
 	/*
-	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
-	 * marked as belonging to run number zero.
-	 *
-	 * NOTE: we pass false for checkIndex since there's no point in comparing
-	 * indexes in this step, even though we do intend the indexes to be part
-	 * of the sort key...
+	 * Give replacement selection a try when the number of tuples to be
+	 * sorted has a reasonable chance of enabling a "quicksort with
+	 * spillover".
+	 * There will be a switch to a simple hybrid sort-merge strategy after
+	 * the first run (iff there is to be a second on-tape run).
 	 */
-	ntuples = state->memtupcount;
-	state->memtupcount = 0;		/* make the heap empty */
-	for (j = 0; j < ntuples; j++)
+	state->replaceActive = useselection(state);
+	state->cached = false;
+	state->just_memtuples = false;
+
+	if (state->replaceActive)
 	{
-		/* Must copy source tuple to avoid possible overwrite */
-		SortTuple	stup = state->memtuples[j];
+		/*
+		 * Convert the unsorted contents of memtuples[] into a heap. Each
+		 * tuple is marked as belonging to run number zero.
+		 *
+		 * NOTE: we pass false for checkIndex since there's no point in
+		 * comparing indexes in this step, even though we do intend the
+		 * indexes to be part of the sort key...
+		 */
+		int			ntuples = state->memtupcount;
 
-		tuplesort_heap_insert(state, &stup, 0, false);
+		state->memtupcount = 0;		/* make the heap empty */
+
+		for (j = 0; j < ntuples; j++)
+		{
+			/* Must copy source tuple to avoid possible overwrite */
+			SortTuple	stup = state->memtuples[j];
+
+			tuplesort_heap_insert(state, &stup, 0, false);
+		}
+		Assert(state->memtupcount == ntuples);
 	}
-	Assert(state->memtupcount == ntuples);
 
 	state->currentRun = 0;
 
@@ -2229,6 +2531,14 @@ mergeruns(Tuplesortstate *state)
 	 * volume is between 1X and 2X workMem), we can just use that tape as the
 	 * finished output, rather than doing a useless merge.  (This obvious
 	 * optimization is not in Knuth's algorithm.)
+	 *
+	 * This should almost be dead code in practice, because
+	 * tuplesort_performsort() does not allow replacement selection to
+	 * continue past the first run.
+	 * It is perhaps still possible that certain edge-cases will leave the
+	 * final dumptuples() call as a no-op, resulting in only one run
+	 * remaining on tape, allowing this optimization to still be used.
+	 * Before the role of replacement selection was significantly diminished
+	 * to allow quicksorting of runs, this code was more useful.
 	 */
 	if (state->currentRun == 1)
 	{
@@ -2430,6 +2740,95 @@ mergeonerun(Tuplesortstate *state)
 }
 
 /*
+ * mergememruns -- merge runs in memory into a new in-memory run.
+ *
+ * This allows tuplesort to avoid dumping many tuples in the common case
+ * where work_mem is only somewhat less than the amount that would allow
+ * an internal sort to proceed.  This technique is called "quicksort
+ * with spillover", and uses a heap (this is the sole reason why this
+ * module continues to support a heap that distinguishes tuples by run
+ * number, in the style of replacement selection).
+ *
+ * The useselection() cost model aims to limit the use of the technique
+ * to cases where it allows most I/O to be avoided.  Prior to PostgreSQL
+ * 9.6, replacement selection could sometimes produce runs that are far
+ * larger than work_mem in size, but currently the goal of incremental
+ * heap spilling is to avoid I/O (i.e. the heap lets the majority of
+ * tuples avoid being spilled entirely).  The traditional goal
+ * of replacement selection is to maximize the size of every run, which
+ * is a significant difference.
+ *
+ * Merging here actually means quicksorting, without regard to the
+ * "heap-wise" run number of each memtuple.  Note that this in-memory
+ * merge is distinct from the final on-the-fly merge step that will
+ * follow.  This routine merges what was originally the tail of the
+ * first run with what was originally the entire second run in advance
+ * of the on-the-fly merge step.
+ *
+ * The final on-the-fly merge occurs between the new all-in-memory run
+ * created by this routine, and what was originally the first part of
+ * the first heap-wise run (and is now simply the first run), which is
+ * already sorted on tape by the time we reach here.
+ *
+ * In the event of partially sorted input to the heap, there will
+ * sometimes only be one heap-wise run when we reach here (and so the
+ * tail of the heap-wise run is in memory, while the head is on tape).
+ * This makes the final on-the-fly merge step faster because the new
+ * in-memory run (a part of the one and only heap-wise run) and the
+ * on-tape run naturally cover disjoint ranges, allowing us to quickly
+ * exhaust the on-tape run, and return the majority of tuples from
+ * memory without any actual merging/comparing (much like a conventional
+ * internal sort).  This requires no special handling.
+ *
+ * Note:  We currently don't bother to avoid *all* comparisons during
+ * the final on-the-fly merge in the event of only encountering a single
+ * heap-wise run (when we're still returning tuples from tape, the
+ * lowest/first in-memory tuple will be compared against every tape
+ * tuple).  In the future, avoiding even these comparisons might be
+ * worthwhile for certain cases.
+ */
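+
+/*
+ * Illustrative sequence (hypothetical figures):  suppose roughly 1 million
+ * input rows, of which roughly 700,000 fit in memory when LACKMEM() first
+ * becomes true.  If useselection()'s cost model opts in, dumptuples()
+ * incrementally spills the lowest few hundred thousand tuples to the
+ * result tape as the head of run 0.  Once input is exhausted, this routine
+ * quicksorts the tuples still in memory into a single in-memory run, and
+ * tuplesort_gettuple_common() then merges that run with the on-tape run on
+ * the fly (TSS_MEMTAPEMERGE), so most tuples are never written out at all.
+ */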
+static void
+mergememruns(Tuplesortstate *state)
+{
+	Assert(state->replaceActive);
+	Assert(!state->randomAccess);
+
+	/*
+	 * Write an end-of-run marker on the output tape, since the tail of what
+	 * was conceptually the same heap-wise run (run 0) is now to be merged
+	 * with a possible second heap-wise run, to make a new in-memory run.
+	 */
+	markrunend(state, state->currentRun);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "starting quicksort of %d tuples to create single in-memory run: %s",
+			 state->memtupcount, pg_rusage_show(&state->ru_start));
+#endif
+
+	/*
+	 * Internal sort to merge (part of) heap-wise run 0 with (entire) heap-wise
+	 * run 1.
+	 *
+	 * (If there was only ever a single heap-wise run, tuples are still needed
+	 * in sorted order.  Don't actually distinguish that case.)
+	 */
+	tuplesort_sort_memtuples(state);
+	state->current = 0;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished quicksorting %d tuples to create single in-memory run: %s",
+			 state->memtupcount, pg_rusage_show(&state->ru_start));
+#endif
+
+	state->result_tape = state->tp_tapenum[state->destTape];
+	/* Must freeze and rewind the finished output tape */
+	LogicalTapeFreeze(state->tapeset, state->result_tape);
+	state->status = TSS_MEMTAPEMERGE;
+}
+
+/*
  * beginmerge - initialize for a merge pass
  *
  * We decrease the counts of real and dummy runs for each tape, and mark
@@ -2608,21 +3007,23 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
 }
 
 /*
- * dumptuples - remove tuples from heap and write to tape
+ * dumptuples - remove tuples from memtuples and write to tape
  *
  * This is used during initial-run building, but not during merging.
  *
- * When alltuples = false, dump only enough tuples to get under the
- * availMem limit (and leave at least one tuple in the heap in any case,
- * since puttuple assumes it always has a tuple to compare to).  We also
- * insist there be at least one free slot in the memtuples[] array.
+ * When alltuples = false and replacement selection is still active, dump
+ * only enough tuples to get under the availMem limit (and leave at least
+ * one tuple in memtuples, since puttuple will then assume it is a heap that
+ * has a tuple to compare to).  We also insist there be at least one free
+ * slot in the memtuples[] array.
  *
- * When alltuples = true, dump everything currently in memory.
+ * When alltuples = true, dump everything currently in memory in one batch.
  * (This case is only used at end of input data.)
  *
- * If we empty the heap, close out the current run and return (this should
- * only happen at end of input data).  If we see that the tuple run number
- * at the top of the heap has changed, start a new run.
+ * If, when replacement selection is active, we see that the tuple run
+ * number at the top of the heap has changed, start a new run.  This must be
+ * the first run, because replacement selection is always abandoned for all
+ * further runs.
  */
 static void
 dumptuples(Tuplesortstate *state, bool alltuples)
@@ -2631,21 +3032,37 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		   (LACKMEM(state) && state->memtupcount > 1) ||
 		   state->memtupcount >= state->memtupsize)
 	{
+		if (state->replaceActive && !alltuples)
+		{
+			/*
+			 * Still holding out for a case favorable to replacement selection.
+			 * Still incrementally spilling using heap.
+			 *
+			 * Dump the heap's frontmost entry, and sift up to remove it from
+			 * the heap.
+			 */
+			Assert(state->memtupcount > 1);
+			WRITETUP(state, state->tp_tapenum[state->destTape],
+					 &state->memtuples[0]);
+			tuplesort_heap_siftup(state, true);
+		}
+		else
+		{
+			/*
+			 * Once committed to quicksorting runs, never incrementally
+			 * spill
+			 */
+			dumpbatch(state, alltuples);
+			break;
+		}
+
 		/*
-		 * Dump the heap's frontmost entry, and sift up to remove it from the
-		 * heap.
+		 * If top run number has changed, we've finished the current run
+		 * (this can only be the first run, run 0), and will no longer spill
+		 * incrementally.
 		 */
 		Assert(state->memtupcount > 0);
-		WRITETUP(state, state->tp_tapenum[state->destTape],
-				 &state->memtuples[0]);
-		tuplesort_heap_siftup(state, true);
-
-		/*
-		 * If the heap is empty *or* top run number has changed, we've
-		 * finished the current run.
-		 */
-		if (state->memtupcount == 0 ||
-			state->currentRun != state->memtuples[0].tupindex)
+		if (state->memtuples[0].tupindex != 0)
 		{
 			markrunend(state, state->tp_tapenum[state->destTape]);
 			state->currentRun++;
@@ -2654,24 +3071,105 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 
 #ifdef TRACE_SORT
 			if (trace_sort)
-				elog(LOG, "finished writing%s run %d to tape %d: %s",
-					 (state->memtupcount == 0) ? " final" : "",
+				elog(LOG, "finished writing heapsorted run %d to tape %d: %s",
 					 state->currentRun, state->destTape,
 					 pg_rusage_show(&state->ru_start));
 #endif
 
 			/*
-			 * Done if heap is empty, else prepare for new run.
+			 * Heap cannot be empty, so prepare for new run and give up on
+			 * replacement selection.
 			 */
-			if (state->memtupcount == 0)
-				break;
-			Assert(state->currentRun == state->memtuples[0].tupindex);
 			selectnewtape(state);
+			/* All future runs will only use dumpbatch/quicksort */
+			state->replaceActive = false;
 		}
 	}
 }
 
 /*
+ * dumpbatch - sort and dump all memtuples, forming one run on tape
+ *
+ * Second or subsequent runs are never heapified by this module (although
+ * heapification still respects run number differences between the first and
+ * second runs), and a heap (replacement selection priority queue) is often
+ * avoided in the first place.
+ *
+ * This helper handles the case where replacement selection was never in use,
+ * and all tuples are to be quicksorted and dumped in batch.  This does not
+ * include "quicksort with spillover".  See mergememruns() for details of why
+ * that must be handled differently.
+ */
+static void
+dumpbatch(Tuplesortstate *state, bool alltuples)
+{
+	int			memtupwrite;
+	int			i;
+
+	Assert(state->status == TSS_BUILDRUNS);
+
+	if (state->memtupcount == 0)
+	{
+		/*
+		 * The final call might require no sorting, in the rare case where we
+		 * just so happened to have previously LACKMEM()'d at the point where
+		 * exactly all remaining tuples were loaded into memory, just before
+		 * input was exhausted.  (Now it's actually apparent that input is
+		 * exhausted -- this must be the alltuples case, since that's the
+		 * only case that tries to dump anything when memtupcount <= 1.)
+		 *
+		 * Do not increment run number, or mark the run as over -- this is a
+		 * no-op.
+		 */
+		Assert(alltuples);
+		return;
+	}
+
+	state->currentRun++;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "starting quicksort of run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	/*
+	 * In the future, try to predict when we're just short of consuming all
+	 * tuples, and provide burst memory capacity here to avoid one tiny run.
+	 * This might be introduced alongside a new scheme for controlling memory
+	 * usage.
+	 */
+	tuplesort_sort_memtuples(state);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished quicksorting run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	memtupwrite = state->memtupcount;
+	for (i = 0; i < memtupwrite; i++)
+	{
+		WRITETUP(state, state->tp_tapenum[state->destTape],
+				 &state->memtuples[i]);
+		state->memtupcount--;
+	}
+	markrunend(state, state->tp_tapenum[state->destTape]);
+	state->tp_runs[state->destTape]++;
+	state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished writing run %d to tape %d: %s",
+			 state->currentRun, state->destTape,
+			 pg_rusage_show(&state->ru_start));
+#endif
+
+	if (!alltuples)
+		selectnewtape(state);
+}
+
+/*
  * tuplesort_rescan		- rewind and replay the scan
  */
 void
@@ -2781,7 +3279,8 @@ void
 tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
-					long *spaceUsed)
+					long *spaceUsed,
+					int *rowsSortedMem)
 {
 	/*
 	 * Note: it might seem we should provide both memory and disk usage for a
@@ -2810,15 +3309,23 @@ tuplesort_get_stats(Tuplesortstate *state,
 				*sortMethod = "top-N heapsort";
 			else
 				*sortMethod = "quicksort";
+			*rowsSortedMem = state->memtupcount;
 			break;
 		case TSS_SORTEDONTAPE:
 			*sortMethod = "external sort";
+			*rowsSortedMem = 0;
+			break;
+		case TSS_MEMTAPEMERGE:
+			*sortMethod = "quicksort with spillover";
+			*rowsSortedMem = state->memtupcount;
 			break;
 		case TSS_FINALMERGE:
 			*sortMethod = "external merge";
+			*rowsSortedMem = 0;
 			break;
 		default:
 			*sortMethod = "still in progress";
+			*rowsSortedMem = -1;
 			break;
 	}
 }
@@ -2829,10 +3336,19 @@ tuplesort_get_stats(Tuplesortstate *state,
  *
  * Compare two SortTuples.  If checkIndex is true, use the tuple index
  * as the front of the sort key; otherwise, no.
+ *
+ * Note that for checkIndex callers, the heap invariant is never maintained
+ * beyond the first run, and so there are no COMPARETUP() calls beyond the
+ * first run.  It is assumed that checkIndex callers are maintaining the
+ * heap invariant for a replacement selection priority queue, but those
+ * callers do not go on to trust the heap to be fully-heapified past the
+ * first run.  Once currentRun isn't the first, memtuples is no longer a
+ * heap at all.
  */
 
 #define HEAPCOMPARE(tup1,tup2) \
-	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
+	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
+					(tup1)->tupindex != 0) ? \
 	 ((tup1)->tupindex) - ((tup2)->tupindex) : \
 	 COMPARETUP(state, tup1, tup2))
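+
+/*
+ * For example (purely illustrative):  under checkIndex, a run 0 tuple
+ * always sorts ahead of a run 1 (tupindex INT_MAX) tuple on tupindex
+ * alone, and two run 1 tuples compare as equal without COMPARETUP() ever
+ * being called -- their relative order within the heap is insignificant.
+ */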
 
@@ -2931,6 +3447,30 @@ sort_bounded_heap(Tuplesortstate *state)
 }
 
 /*
+ * Sort all memtuples.
+ *
+ * Quicksort is tuplesort's internal sort algorithm.  It is also generally
+ * preferred to replacement selection for building runs during external
+ * sorts, except where incremental spilling is thought to be particularly
+ * beneficial.
+ */
+static void
+tuplesort_sort_memtuples(Tuplesortstate *state)
+{
+	if (state->memtupcount > 1)
+	{
+		/* Can we use the single-key sort function? */
+		if (state->onlyKey != NULL)
+			qsort_ssup(state->memtuples, state->memtupcount,
+					   state->onlyKey);
+		else
+			qsort_tuple(state->memtuples,
+						state->memtupcount,
+						state->comparetup,
+						state);
+	}
+}
+
+/*
  * Insert a new tuple into an empty or existing heap, maintaining the
  * heap invariant.  Caller is responsible for ensuring there's room.
  *
@@ -2958,6 +3498,17 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 	memtuples = state->memtuples;
 	Assert(state->memtupcount < state->memtupsize);
 
+	/*
+	 * Once incremental heap spilling is abandoned, this routine should not be
+	 * called when loading runs.  memtuples will be an array of tuples in no
+	 * significant order, so calling here is inappropriate.  Even when
+	 * incremental spilling is still in progress, this routine does not handle
+	 * the second run's tuples (those are only heapified to the limited
+	 * extent needed to keep them out of the way of the first run's
+	 * tuples).
+	 */
+	Assert(!checkIndex || tupleindex == 0);
+
 	CHECK_FOR_INTERRUPTS();
 
 	/*
@@ -2989,6 +3540,13 @@ tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
 	int			i,
 				n;
 
+	/*
+	 * Once incremental heap spilling is abandoned, this routine should not be
+	 * called when loading runs.  memtuples will be an array of tuples in no
+	 * significant order, so calling here is inappropriate.
+	 */
+	Assert(!checkIndex || state->currentRun == 0);
+
 	if (--state->memtupcount <= 0)
 		return;
 
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 97cb859..95acc1d 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -335,7 +335,8 @@ extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
 
-extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
+extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+				 double reltuples);
 extern void _h_spooldestroy(HSpool *hspool);
 extern void _h_spool(HSpool *hspool, ItemPointer self,
 		 Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9e48efd..5504b7b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -743,7 +743,7 @@ extern void BTreeShmemInit(void);
 typedef struct BTSpool BTSpool; /* opaque type known only within nbtsort.c */
 
 extern BTSpool *_bt_spoolinit(Relation heap, Relation index,
-			  bool isunique, bool isdead);
+			  bool isunique, bool isdead, double reltuples);
 extern void _bt_spooldestroy(BTSpool *btspool);
 extern void _bt_spool(BTSpool *btspool, ItemPointer self,
 		  Datum *values, bool *isnull);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index fe3b81a..e6144f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -21,6 +21,8 @@ extern TupleTableSlot *ExecAgg(AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
+extern double agg_input_rows(AggState *aggstate);
+
 extern Size hash_agg_entry_size(int numAggs);
 
 extern Datum aggregate_dummy(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..11a5fb7 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -62,22 +62,27 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess);
+					 int workMem,
+					 double rowNumHint, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
-						int workMem, bool randomAccess);
+						int workMem,
+						double rowNumHint, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
-							int workMem, bool randomAccess);
+							int workMem,
+							double rowNumHint, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
 						   uint32 hash_mask,
-						   int workMem, bool randomAccess);
+						   int workMem,
+						   double rowNumHint, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag,
-					  int workMem, bool randomAccess);
+					  int workMem,
+					  double rowNumHint, bool randomAccess);
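+
+/*
+ * Hypothetical caller sketch (illustrative values only):  pass the best
+ * available row estimate as rowNumHint, or zero when no estimate exists,
+ * since the value is only a hint:
+ *
+ *		state = tuplesort_begin_heap(tupDesc, nkeys, attNums, sortOperators,
+ *									 sortCollations, nullsFirstFlags,
+ *									 work_mem, estimatedRows, false);
+ */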
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
 
@@ -109,7 +114,8 @@ extern void tuplesort_end(Tuplesortstate *state);
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
-					long *spaceUsed);
+					long *spaceUsed,
+					int *rowsSortedMem);
 
 extern int	tuplesort_merge_order(int64 allowedMem);
 
-- 
1.9.1
