On Sun, Feb 14, 2016 at 8:01 PM, Peter Geoghegan <p...@heroku.com> wrote:
> The query I'm testing is: "reindex index pgbench_accounts_pkey;"
>
> Now, with a maintenance_work_mem of 5MB, the most recent revision of
> my patch takes about 54.2 seconds to complete this, as compared to
> master's 44.4 seconds. So, clearly a noticeable regression there of
> just under 20%. I did not see a regression with a 5MB
> maintenance_work_mem when pgbench scale was 100, though.

I've fixed this regression, and possibly all regressions where workMem
exceeds 4MB. I've done so without resorting to making the heap
structure more complicated, or using a heap any more often than
replacement_sort_mem allows (so replacement_sort_mem becomes something
a bit different from what we discussed, Robert -- more on that later).
This seems like an "everybody wins" situation, because in this
revision the patch series is now appreciably *faster* where the amount
of memory available is only a tiny fraction of the total input size.

Jeff Janes deserves a lot of credit for helping me to figure out how
to do this. I couldn't get over his complaint about the regression he
saw a few months back. He spoke of an "anti-sweetspot" in polyphase
merge, and how he suspected that to be the real culprit (after all,
most of his time was spent merging, with or without the patch
applied). He also said that reverting the memory batch/pool patch made
things go a bit faster, somewhat ameliorating his regression (when
just the quicksort patch was applied). This made no sense to me, since
I understood the memory batching patch to be orthogonal to the
quicksort thing, capable of being applied independently, and more or
less a strict improvement on master, no matter what the variables of
the sort are. Jeff's regressed case especially made no sense to me
(and, I gather, to him), given that the regression involved no
correlation, and so clearly didn't hinge on master generating far
fewer/longer runs than the patch (that's the issue we've discussed
more than any other now -- it seems to have been a red herring). As I
suspected out loud on February 14th, replacement selection mostly just
*masked* the real problem: palloc() fragmentation. There doesn't seem
to be much of an issue with the scheduling of polyphase merging once
palloc() fragmentation is fixed. I've created a new revision
incorporating this new insight.

New Revision
============

Attached revision of patch series:

1. Creates a separate memory context for tuplesort's copies of
caller's tuples, which can be reset at key points, avoiding
fragmentation. Every SortTuple.tuple is allocated there (with trivial
exception); *everything else*, including the memtuples array, is
allocated in the existing tuplesort context, which becomes the parent
of this new "caller's tuples" context. Roughly speaking, that means
that about half of the total memory for the sort is managed by each
context in common cases. Even with a high work_mem budget, memory
fragmentation could previously get so bad that tuplesort would in
effect claim a share of memory from the OS *significantly* higher than
the work_mem budget allotted to its sort. And with low work_mem
settings, fragmentation previously made palloc() thrash the sort,
especially during non-final merging. In this latest revision,
tuplesort now gets to use almost 100% of the memory that palloc()
requested from the OS, in the cases tested.
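
For illustration only (the real code is in 0001-*, and the context
names here are placeholders of mine, not necessarily the patch's), the
arrangement amounts to something like this:

    /* sketch of tuplesort_begin_common() -- not the patch verbatim */
    state->sortcontext = AllocSetContextCreate(CurrentMemoryContext,
                                               "TupleSort main",
                                               ALLOCSET_DEFAULT_MINSIZE,
                                               ALLOCSET_DEFAULT_INITSIZE,
                                               ALLOCSET_DEFAULT_MAXSIZE);
    state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                "Caller tuples",
                                                ALLOCSET_DEFAULT_MINSIZE,
                                                ALLOCSET_DEFAULT_INITSIZE,
                                                ALLOCSET_DEFAULT_MAXSIZE);

    /* copytup()/readtup() put every SortTuple.tuple in tuplecontext */

    /* later, once an entire quicksorted run has been written to tape: */
    MemoryContextReset(state->tuplecontext);

Resetting the child context wholesale is what keeps fragmentation from
accumulating across runs.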

2. Loses the "quicksort with spillover" case entirely, making the
quicksort patch significantly simpler. A *lot* of code was thrown out.

This change is especially significant because it allowed me to remove
the cost model that Robert took issue with so vocally. "Quicksort with
spillover" was always far less important than the basic idea of using
quicksort for external sorts, so I'm not sad to see it go. And, I
thought that the cost model was pretty bad myself.

3. Fixes cost_sort(), making optimizer account for the fact that runs
are now about sort_mem-sized, not (sort_mem * 2)-sized.

While I was at it, I made cost_sort() more optimistic about the amount
of random I/O required relative to sequential I/O. This additional
change to cost_sort() was probably overdue.
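
To make the practical effect concrete (these are illustrative numbers
of mine, not taken from any benchmark), the run-count estimate simply
loses its old factor of a half:

    /* how cost_sort()'s run estimate changes; hypothetical 1GB input
     * sorted with sort_mem = 16MB */
    double      input_bytes = 1024.0 * 1024.0 * 1024.0;
    double      sort_mem_bytes = 16.0 * 1024.0 * 1024.0;
    double      nruns_old = (input_bytes / sort_mem_bytes) * 0.5;   /* 32 */
    double      nruns_new = input_bytes / sort_mem_bytes;           /* 64 */

    /* ...while the assumed page access mix changes from 75% sequential
     * + 25% random to 90% sequential + 10% random */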

4. Restores the ability of replacement selection to generate one run
and avoid any merging (previously, only one really long run and one
short run were possible, because at the time I conceptualized
replacement selection as being all about enabling "quicksort with
spillover", which quicksorted that second run in memory). This
only-one-run case is the case that Robert particularly cared about,
and it's fully restored when RS is in use (which can still only happen
for the first run, just never for the benefit of the now-axed
"quicksort with spillover" case).

5. Adds a new GUC, "replacement_sort_mem". The default setting is
16MB. Docs are included. If work_mem/maintenance_work_mem is less than
or equal to this, the first (and hopefully only) run uses replacement
selection.
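
Roughly (a sketch only -- the real check is the new useselection()
routine in 0002-*), the policy amounts to:

    /* decide, once we first run out of workMem, how to form the first run;
     * allowedMem is workMem/maintenance_work_mem in bytes */
    int64       rs_mem_bytes = replacement_sort_mem * (int64) 1024;

    if (state->allowedMem <= rs_mem_bytes)
        state->replaceActive = true;    /* heapify; aim for one long run */
    else
        state->replaceActive = false;   /* quicksort runs, then merge */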

"replacement_sort_mem" is a different thing to the concept for a GUC
that Robert and I discussed (only the name is the same). That other
concept related to the hybrid heap/quicksort idea (it controlled how
big the heap portion of memtuples was supposed to be, in a speculative
world where the patch took that "hybrid" approach [1] at all). In
light of this new information about palloc() fragmentation, and given
the risk to tuplesort's stability posed by implementing this "hybrid"
algorithm, this seems like a good compromise. I cannot see an upside
to pursuing the "hybrid" approach now. I regret reversing my position
on that, but that's just how it happened. Since Robert was seemingly
only worried about regressions, which are fixed now for a variety of
cases that I tested, I'm optimistic that this will be acceptable to
him. I believe that replacement_sort_mem as implemented here is quite
useful, although mostly because I see some further opportunities for
it.

Replacement Selection uses
--------------------------

What opportunities, you ask? Maybe CREATE INDEX can be made to accept
a "presorted" parameter, letting the user promise that the input is
more or less presorted. This allows tuplesort to only use a tiny heap,
while having it throw an error if it cannot produce one long run (i.e.
CREATE INDEX is documented as raising an error if the input is not
more or less presorted). The nice thing about that idea is that we can
be very optimistic about the data actually being more or less
presorted, so the implementation doesn't *actually* produce one long
run -- it produces one long *index*, with IndexTuples passed back to
nbtsort.c as soon as the heap fills for the first time, a bit like an
on-the-fly merge. Unlike an on-the-fly merge, no tapes or temp files
are actually involved; we write out IndexTuples by actually writing
out the index optimistically. There is a significant saving by using a
heap *because there is no need for a TSS_SORTEDONTAPE pass over the
data*. We succeed at doing it all at once with a tiny heap, or die
trying. So, in a later version of Postgres (9.7?),
replacement_sort_mem becomes more important because of this
"presorted" CREATE INDEX parameter. That's a relatively easy patch to
write, but it's not 9.6 material.

Commits
-------

Note that the attached revision makes the batch memory patch the first
commit in the patch series. It might be useful to get that one out of
the way first, since I imagine it is now considered the least
controversial, and is perhaps the simpler of the two big patches in
the series. I'm not very optimistic about the memory prefetch patch
0003-* getting committed, but so far I've only seen it help, and all
of my testing is based on having it applied. In any case, it's clearly
far less important than the other two patches.

Testing
-------

N.B.: The debug patch, 0004-*, should not be applied during benchmarking.

I've used amcheck [2] to test this latest revision -- the tool ought
not to see any problems with any index created with the patch applied.
Reviewers might find it helpful to use amcheck, too. As 9.6 is
stabilized, I anticipate that amcheck will give us a fighting chance
at early detection of any bugs that might have slipped into tuplesort,
or a B-Tree operator class. Since we still don't even have one single
test of the external sort code [3], it's just as well. If we wanted to
test external sorting, maybe we'd do that by adding tests to amcheck
that are not run by default, much like test_decoding, which tests
logical decoding but is not targeted by "make installcheck"; that
would allow the tests to be fairly comprehensive without being
annoying. Using amcheck neatly side-steps issues with the portability
of "expected" pg_regress output when collatable type sorting is
tested.

Thoughts?

[1] 
http://www.postgresql.org/message-id/CA+TgmoY87y9FuZ=ne7jayh2emtovm9jp9alffwunjf3utq4...@mail.gmail.com
[2] https://commitfest.postgresql.org/9/561/
[3] 
http://pgci.eisentraut.org/jenkins/job/postgresql_master_coverage/Coverage/src/backend/utils/sort/tuplesort.c.gcov.html
-- 
Peter Geoghegan
From 1d5478914d46b60efee0d7f2c8caed377b895584 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Thu, 10 Mar 2016 14:52:18 -0800
Subject: [PATCH 4/4] Add MemoryContextStats() calls for debugging

N.B.:  This patch is only intended to help demonstrate that tuplesort's
use of memory remains honest (within budget).  It may have an
appreciable overhead, and should not be applied for performance tests.

Reviewers may find it interesting to consider how commenting out the new
MemoryContextReset() calls affects palloc() fragmentation in practice.
Doing so should not cause a memory leak; bytes "used" should not change
in debug output.  The "total" bytes in tuplesort memory contexts may
balloon, though.
---
 src/backend/utils/sort/tuplesort.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 0029daa..7b81f2c 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2621,6 +2621,7 @@ mergeonerun(Tuplesortstate *state)
 	 * out, and replacing it with next tuple from same tape (if there is
 	 * another one).
 	 */
+	MemoryContextStats(state->sortcontext);
 	while (state->memtupcount > 0)
 	{
 		/* write the tuple to destTape */
@@ -2928,6 +2929,7 @@ mergebatch(Tuplesortstate *state, int64 spacePerTape)
 
 	state->batchUsed = true;
 	state->spacePerTape = spacePerTape;
+	MemoryContextStats(state->sortcontext);
 }
 
 /*
-- 
1.9.1

From 02b8acf42f245e724f2a9f8fe7c5fe46db58871c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 3/4] Perform memory prefetching when writing memtuples

This patch is based on, but quite distinct from, a separately submitted,
more general version that performs prefetching in several places [1].
This version only performs prefetching of each "tuple proper" during
the writing of batches of tuples (an entire run, written following a
quicksort).  The case for prefetching each "tuple proper" at several
sites now seems weak due to differences across CPU microarchitectures.
However, it might still be that there is a consistent improvement
observable when writing out tuples, because that involves a particularly
tight inner loop, with relatively predictable processing to hide memory
latency behind.  A helpful generic prefetch hint may be possible for
this case, even if it proves impossible elsewhere.

This has been shown to appreciably help on both a POWER7 server
processor [2], and an Intel Mobile processor.

[1] https://commitfest.postgresql.org/6/305/
[2] CAM3SWZR5rv3+F3FOKf35=dti7otmmcdfoe2vogur0pddg3j...@mail.gmail.com
---
 config/c-compiler.m4               | 17 +++++++++++++++++
 configure                          | 31 +++++++++++++++++++++++++++++++
 configure.in                       |  1 +
 src/backend/utils/sort/tuplesort.c | 14 ++++++++++++++
 src/include/c.h                    | 14 ++++++++++++++
 src/include/pg_config.h.in         |  3 +++
 src/include/pg_config.h.win32      |  3 +++
 src/include/pg_config_manual.h     | 10 ++++++++++
 8 files changed, 93 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 550d034..8be2122 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -271,6 +271,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index b3f3abe..d1bfdf6 100755
--- a/configure
+++ b/configure
@@ -11387,6 +11387,37 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index 0bd90d7..ab40ca1 100644
--- a/configure.in
+++ b/configure.in
@@ -1333,6 +1333,7 @@ PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_BSWAP64
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f6ff61a..0029daa 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3335,6 +3335,20 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->tp_tapenum[state->destTape],
 				 &state->memtuples[i]);
 		state->memtupcount--;
+
+		/*
+		 * Perform memory prefetch of the tuple pointed to by the SortTuple
+		 * that's two places ahead of tuple just written.  Testing shows that
+		 * this significantly boosts performance.
+		 *
+		 * Don't do this for pass-by-value datum sorts, where it will never
+		 * help.  Note that hinting a NULL address does not affect correctness,
+		 * so NULL values are not a concern here.
+		 */
+#ifdef USE_MEM_PREFETCH
+		if (state->tuples && i + 2 < memtupwrite)
+			pg_read_prefetch(state->memtuples[i + 2].tuple);
+#endif
 	}
 	/*
 	 * Reset tuple memory, now that no caller tuples are needed in memory.
diff --git a/src/include/c.h b/src/include/c.h
index 7c57430..93b53bb 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -927,6 +927,20 @@ typedef NameData *Name;
 #define pg_unreachable() abort()
 #endif
 
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_read_prefetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup with this;  any use of the intrinsic
+ * should be carefully tested.  It's okay to pass it an invalid or NULL
+ * address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_read_prefetch(addr)		__builtin_prefetch((addr), 0, 0)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index b3ceea5..e169153 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -669,6 +669,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 8566065..990c3c3 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -514,6 +514,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index ef89521..1dd3125 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -148,6 +148,16 @@
 #endif
 
 /*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching.  Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms.  If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
  * USE_SSL code should be compiled only when compiling with an SSL
  * implementation.  (Currently, only OpenSSL is supported, but we might add
  * more implementations in the future.)
-- 
1.9.1

From 416833b63e1beafe2f0c619167c3323eab3d680d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 2/4] Quicksort when performing external sorts

The benefits of replacement selection (the currentRun + nextRun heap)
are seen only where incremental spilling allows the writing of one large
run, avoiding any final merge step.  It is not generally worth the cost
in CPU cache misses to use a heap, even when it can produce runs that
are much larger than with the new approach.  It might be that
replacement selection masks certain "anti-sweetspots" with polyphase
merge, but that doesn't seem like a good reason to keep it, because it's
hard to predict when that will happen.

A simple hybrid sort-merge strategy is therefore now usually used when
tuplesort performs an external sort.  A new GUC, replacement_sort_mem,
governs if replacement selection is still attempted for the first run,
though.  In cases where not a lot of memory is used, it might serve to
get the "one run, no merge" case that is the best case for replacement
selection.

As a further optimization, put a quasi-arbitrary upper bound (500 tapes)
on the number of tapes that tuplesort uses, to avoid excessive random
I/O.  Knuth identified 7 tapes as the "sweet spot" for polyphase
merging.  Commit df700e6b sized the number of tapes only according to
available buffer space (the number of tapes was as high as possible
while still allowing at least 32 * BLCKSZ buffer space per tape),
rejecting Knuth's "sweet spot", which helped a lot when the sort thereby
completed in one pass.  However, it's still true today that there are
unlikely to be benefits from increasing the number of tapes from 7 once
the amount of data to be sorted far exceeds available memory.  More
importantly, with very high work_mem settings and even larger data
volumes, the tapes previously wasted as much as 8% of the available
memory budget;  tens of thousands of tapes could be logically allocated
for a sort that will only benefit from a few dozen.  It didn't matter
that the memory wasn't actually allocated, because tuplesort accounting
still charged for the memory for the number of tapes indicated by the
heuristic ahead of merging.
---
 doc/src/sgml/config.sgml                      |  51 +++
 src/backend/optimizer/path/costsize.c         |  14 +-
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/misc/guc.c                  |  13 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/sort/tuplesort.c            | 442 ++++++++++++++++++++------
 src/include/miscadmin.h                       |   1 +
 7 files changed, 420 insertions(+), 103 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a09ceb2..7a1ebf2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1472,6 +1472,57 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-replacement-sort-mem" xreflabel="replacement_sort_mem">
+      <term><varname>replacement_sort_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>replacement_sort_mem</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies an inclusive cap on the amount of memory (as
+        specified by <varname>work_mem</varname> or
+        <varname>maintenance_work_mem</varname>) at which external
+        sort operations (including those performed by <command>CREATE
+        INDEX</> and <command>CLUSTER</>) generate their first run
+        using a <firstterm>replacement selection</firstterm> priority
+        queue.  This can be useful in memory constrained environments
+        where rows that are input into a sort operation are known to
+        have a strong correlation in logical order (their final order)
+        to physical input order.  Note that this does not include
+        cases with an <emphasis>inverse</emphasis> correlation. It is
+        possible for the replacement selection algorithm to generate
+        one long run that requires no merging, where the default
+        strategy would result in many runs that must be merged to
+        produce a final sorted output.  This may allow sort operations
+        to complete sooner.
+       </para>
+       <para>
+        The value defaults to sixteen megabytes (<literal>16MB</>).
+        Note that higher values are typically not more effective, and
+        may be counter-productive.  The default
+        <varname>maintenance_work_mem</varname> value effectively
+        prevents utility command external sorts (e.g., sorts used by
+        <command>CREATE INDEX</> when a B-Tree index is built) from
+        ever being affected, provided <varname>replacement_sort_mem</>
+        is itself set to its default value.  One strategy that may be
+        effective is to temporarily set
+        <varname>maintenance_work_mem</varname> to a lower setting,
+        equal to <varname>replacement_sort_mem</>, when a correlation
+        is expected.
+       </para>
+       <para>
+        The priority queue can be thought of as a buffer that can
+        reorder input rows, shuffling the rows into their final,
+        logical order.  Rows are incrementally output,  unless and
+        until a row is input <emphasis>after</> a row with a higher
+        sort order has already been output, at which point replacement
+        selection is abandoned.  When replacement selection is
+        abandoned, the sort is completed using the default strategy.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-autovacuum-work-mem" xreflabel="autovacuum_work_mem">
       <term><varname>autovacuum_work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ffff3c0..99c8892 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1421,16 +1421,16 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * total, but we will also need to write and read each tuple once per
  * merge pass.  We expect about ceil(logM(r)) merge passes where r is the
  * number of initial runs formed and M is the merge order used by tuplesort.c.
- * Since the average initial run should be about twice sort_mem, we have
- *		disk traffic = 2 * relsize * ceil(logM(p / (2*sort_mem)))
+ * Since the average initial run should be about sort_mem, we have
+ *		disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
  *		cpu = comparison_cost * t * log2(t)
  *
  * If the sort is bounded (i.e., only the first k result tuples are needed)
  * and k tuples can fit into sort_mem, we use a heap method that keeps only
  * k tuples in the heap; this will require about t*log2(k) tuple comparisons.
  *
- * The disk traffic is assumed to be 3/4ths sequential and 1/4th random
- * accesses (XXX can't we refine that guess?)
+ * The disk traffic is assumed to be 90% sequential and 10% random
+ * accesses.
  *
  * By default, we charge two operator evals per tuple comparison, which should
  * be in the right ballpark in most cases.  The caller can tweak this by
@@ -1498,7 +1498,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
 		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
+		double		nruns = input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1518,9 +1518,9 @@ cost_sort(Path *path, PlannerInfo *root,
 		else
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
-		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
+		/* Assume 90% of accesses are sequential, 10% are not */
 		startup_cost += npageaccesses *
-			(seq_page_cost * 0.75 + random_page_cost * 0.25);
+			(seq_page_cost * 0.90 + random_page_cost * 0.10);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
 	{
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index ccd9c8e..d8afa3f 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -108,6 +108,7 @@ bool		enableFsync = true;
 bool		allowSystemTableMods = false;
 int			work_mem = 1024;
 int			maintenance_work_mem = 16384;
+int			replacement_sort_mem = 16384;
 
 /*
  * Primary determinants of sizes of shared-memory structures.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea5a09a..55a16c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1898,6 +1898,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replacement_sort_mem", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("Sets the maximum memory to be used for the replacement selection priority queue."),
+			gettext_noop("This much memory sets the maximum size of the replacement selection "
+						 "priority queue used by external sort operations, including utility "
+						 "commands and query workspace sort operations."),
+			GUC_UNIT_KB
+		},
+		&replacement_sort_mem,
+		16384, 64, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee3d378..b420c54 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -125,6 +125,7 @@
 # actively intend to use prepared transactions.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
+#replacement_sort_mem = 16MB		# min 64kB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #dynamic_shared_memory_type = posix	# the default is the first option
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c3f666e..f6ff61a 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -11,14 +11,16 @@
  * algorithm.
  *
  * See Knuth, volume 3, for more than you want to know about the external
- * sorting algorithm.  We divide the input into sorted runs using replacement
- * selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase
- * merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D
- * are implemented by logtape.c, which avoids space wastage by recycling
- * disk space as soon as each block is read from its "tape".
+ * sorting algorithm.  Historically, we divided the input into sorted runs
+ * using replacement selection, in the form of a priority tree implemented
+ * as a heap (essentially his Algorithm 5.2.3H -- although that strategy is
+ * often avoided altogether), but that can now only happen for the first
+ * run.  We merge the runs using polyphase merge, Knuth's Algorithm
+ * 5.4.2D.  The logical "tapes" used by Algorithm D are implemented by
+ * logtape.c, which avoids space wastage by recycling disk space as soon
+ * as each block is read from its "tape".
  *
- * We do not form the initial runs using Knuth's recommended replacement
+ * We never form the initial runs using Knuth's recommended replacement
  * selection data structure (Algorithm 5.4.1R), because it uses a fixed
  * number of records in memory at all times.  Since we are dealing with
  * tuples that may vary considerably in size, we want to be able to vary
@@ -28,7 +30,30 @@
  * Algorithm 5.4.1R, each record is stored with the run number that it
  * must go into, and we use (run number, key) as the ordering key for the
  * heap.  When the run number at the top of the heap changes, we know that
- * no more records of the prior run are left in the heap.
+ * no more records of the prior run are left in the heap.  Note that there
+ * are in practice only ever two distinct run numbers, due to the greatly
+ * reduced use of replacement selection in PostgreSQL 9.6.
+ *
+ * In PostgreSQL 9.6, a heap (based on Knuth's Algorithm H, with some small
+ * customizations) is only used with the aim of producing just one run,
+ * thereby avoiding all merging.  Only the first run can use replacement
+ * selection, which is why there are now only two possible valid run
+ * numbers, and why heapification is customized to not distinguish between
+ * tuples in the second run (those will be quicksorted).  We generally
+ * prefer a simple hybrid sort-merge strategy, where runs are sorted in much
+ * the same way as the entire input of an internal sort is sorted (using
+ * qsort()).  The replacement_sort_mem GUC controls the limited remaining
+ * use of replacement selection for the first run.
+ *
+ * There are several reasons to favor a hybrid sort-merge strategy.
+ * Maintaining a priority tree/heap has poor CPU cache characteristics.
+ * Furthermore, the growth in main memory sizes has greatly diminished the
+ * value of having runs that are larger than available memory, even in the
+ * case where there is partially sorted input and runs can be made far
+ * larger by using a heap.  In most cases, a single-pass merge step is all
+ * that is required even when runs are no larger than available memory.
+ * Avoiding multiple merge passes was traditionally considered to be the
+ * major advantage of using replacement selection.
  *
  * The approximate amount of memory allowed for any one sort operation
  * is specified in kilobytes by the caller (most pass work_mem).  Initially,
@@ -36,13 +61,12 @@
  * we haven't exceeded workMem.  If we reach the end of the input without
  * exceeding workMem, we sort the array using qsort() and subsequently return
  * tuples just by scanning the tuple array sequentially.  If we do exceed
- * workMem, we construct a heap using Algorithm H and begin to emit tuples
- * into sorted runs in temporary tapes, emitting just enough tuples at each
- * step to get back within the workMem limit.  Whenever the run number at
- * the top of the heap changes, we begin a new run with a new output tape
- * (selected per Algorithm D).  After the end of the input is reached,
- * we dump out remaining tuples in memory into a final run (or two),
- * then merge the runs using Algorithm D.
+ * workMem, we begin to emit tuples into sorted runs in temporary tapes.
+ * When tuples are dumped in batch after quicksorting, we begin a new run
+ * with a new output tape (selected per Algorithm D).  After the end of the
+ * input is reached, we dump out remaining tuples in memory into a final run
+ * (or two, when replacement selection is still used), then merge the runs
+ * using Algorithm D.
  *
  * When merging runs, we use a heap containing just the frontmost tuple from
  * each source run; we repeatedly output the smallest tuple and insert the
@@ -84,7 +108,7 @@
  * code we determine the number of tapes M on the basis of workMem: we want
  * workMem/M to be large enough that we read a fair amount of data each time
  * we preread from a tape, so as to maintain the locality of access described
- * above.  Nonetheless, with large workMem we can have many tapes.
+ * above.  Nonetheless, with large workMem we can have a few hundred tapes.
  *
  *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
@@ -136,7 +160,7 @@ bool		optimize_bounded_sort = true;
 
 /*
  * The objects we actually sort are SortTuple structs.  These contain
- * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
+ * a pointer to the tuple itself (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
  * can be freed by a simple pfree() (except during final on-the-fly merge,
  * when memory is used in batch).  SortTuples also contain the tuple's
@@ -162,15 +186,19 @@ bool		optimize_bounded_sort = true;
  * described above.  Accordingly, "tuple" is always used in preference to
  * datum1 as the authoritative value for pass-by-reference cases.
  *
- * While building initial runs, tupindex holds the tuple's run number.  During
- * merge passes, we re-use it to hold the input tape number that each tuple in
- * the heap was read from, or to hold the index of the next tuple pre-read
- * from the same tape in the case of pre-read entries.  tupindex goes unused
- * if the sort occurs entirely in memory.
+ * While building initial runs, tupindex holds the tuple's run number.
+ * Historically, the run number could meaningfully distinguish many runs, but
+ * currently it only meaningfully distinguishes the first run with any other
+ * run (in practice this is limited to the second run), since replacement
+ * selection is always abandoned after the first run.  During merge passes, we
+ * re-use it to hold the input tape number that each tuple in the heap was
+ * read from, or to hold the index of the next tuple pre-read from the same
+ * tape in the case of pre-read entries.  tupindex goes unused if the sort
+ * occurs entirely in memory.
  */
 typedef struct
 {
-	void	   *tuple;			/* the tuple proper */
+	void	   *tuple;			/* the tuple itself */
 	Datum		datum1;			/* value of first key column */
 	bool		isnull1;		/* is first key column NULL? */
 	int			tupindex;		/* see notes above */
@@ -203,6 +231,7 @@ typedef enum
  * tape during a preread cycle (see discussion at top of file).
  */
 #define MINORDER		6		/* minimum merge order */
+#define MAXORDER		500		/* maximum merge order */
 #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
 #define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
 
@@ -284,15 +313,22 @@ struct Tuplesortstate
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
 	/*
-	 * Memory for tuples is sometimes allocated in batch, rather than
-	 * incrementally.  This implies that incremental memory accounting has been
-	 * abandoned.  Currently, this only happens for the final on-the-fly merge
-	 * step.  Large batch allocations can store tuples (e.g. IndexTuples)
+	 * Memory for copies of caller's tuples is sometimes allocated in batch,
+	 * rather than incrementally.  This implies that incremental memory
+	 * accounting has been abandoned.  Currently, this only happens for the
+	 * final on-the-fly merge step.  Large batch allocations can store tuples
 	 * without palloc() fragmentation and other overhead.
 	 */
 	bool		batchUsed;
 
 	/*
+	 * While building initial runs, this indicates if the replacement
+	 * selection strategy is in use.  When it isn't, then a simple hybrid
+	 * sort-merge strategy is in use instead (runs are quicksorted).
+	 */
+	bool		replaceActive;
+
+	/*
 	 * While building initial runs, this is the current output run number
 	 * (starting at 0).  Afterwards, it is the number of initial runs we made.
 	 */
@@ -493,6 +529,7 @@ struct Tuplesortstate
 static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
+static bool useselection(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
@@ -508,8 +545,10 @@ static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
 static void mergepreread(Tuplesortstate *state);
 static void mergeprereadone(Tuplesortstate *state, int srcTape);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
+static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
 static void sort_bounded_heap(Tuplesortstate *state);
+static void tuplesort_sort_memtuples(Tuplesortstate *state);
 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 					  int tupleindex, bool checkIndex);
 static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
@@ -1566,22 +1605,61 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 
 			/*
 			 * Insert the tuple into the heap, with run number currentRun if
-			 * it can go into the current run, else run number currentRun+1.
-			 * The tuple can go into the current run if it is >= the first
-			 * not-yet-output tuple.  (Actually, it could go into the current
-			 * run if it is >= the most recently output tuple ... but that
-			 * would require keeping around the tuple we last output, and it's
-			 * simplest to let writetup free each tuple as soon as it's
-			 * written.)
+			 * it can go into the current run, else run number INT_MAX (some
+			 * later run).  The tuple can go into the current run if it is
+			 * >= the first not-yet-output tuple.  (Actually, it could go
+			 * into the current run if it is >= the most recently output
+			 * tuple ... but that would require keeping around the tuple we
+			 * last output, and it's simplest to let writetup free each
+			 * tuple as soon as it's written.)
 			 *
-			 * Note there will always be at least one tuple in the heap at
-			 * this point; see dumptuples.
+			 * Note that this only applies if the currentRun is 0 (prior to
+			 * giving up on heapification).  There is no meaningful
+			 * distinction between any two runs in memory except the first
+			 * and second run.  When the currentRun is not 0, there is no
+			 * guarantee that any tuples are already stored in memory here,
+			 * and if there are any they're in no significant order.
 			 */
-			Assert(state->memtupcount > 0);
-			if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			Assert(!state->replaceActive || state->memtupcount > 0);
+			if (state->replaceActive &&
+				COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			{
+				/*
+				 * Unlike classic replacement selection, which this module was
+				 * previously based on, only run 0 is treated as a priority
+				 * queue through heapification.  The second run (run 1) is
+				 * appended indifferently below, and will never be trusted to
+				 * maintain the heap invariant beyond simply not getting in
+				 * the way of spilling run 0 incrementally.  In other words,
+				 * second run tuples may be sifted out of the way of first
+				 * run tuples; COMPARETUP() will never be called for run
+				 * 1 tuples.  However, not even HEAPCOMPARE() will be
+				 * called for a subsequent run's tuples.
+				 */
 				tuplesort_heap_insert(state, tuple, state->currentRun, true);
+			}
 			else
-				tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+			{
+				/*
+				 * Note that unlike Knuth, we do not care about the second
+				 * run's tuples when loading runs.  After the first run is
+				 * complete, tuples will not be dumped incrementally at all,
+				 * but as long as the first run (run 0) is current it will
+				 * be maintained.  dumptuples does not trust that the second
+				 * or subsequent runs are heapified (beyond merely not
+				 * getting in the way of the first, fully heapified run,
+				 * which only matters for the second run, run 1).  Anything
+				 * past the first run will be quicksorted.
+				 *
+				 * Past the first run, there is no need to differentiate runs
+				 * in memory (only the first and second runs will ever be
+				 * usefully differentiated).  Use a generic INT_MAX run
+				 * number (just to be tidy).  There should always be room to
+				 * store the incoming tuple.
+				 */
+				tuple->tupindex = INT_MAX;
+				state->memtuples[state->memtupcount++] = *tuple;
+			}
 
 			/*
 			 * If we are over the memory limit, dump tuples till we're under.
@@ -1656,20 +1734,9 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * We were able to accumulate all the tuples within the allowed
-			 * amount of memory.  Just qsort 'em and we're done.
+			 * amount of memory.  Just quicksort 'em and we're done.
 			 */
-			if (state->memtupcount > 1)
-			{
-				/* Can we use the single-key sort function? */
-				if (state->onlyKey != NULL)
-					qsort_ssup(state->memtuples, state->memtupcount,
-							   state->onlyKey);
-				else
-					qsort_tuple(state->memtuples,
-								state->memtupcount,
-								state->comparetup,
-								state);
-			}
+			tuplesort_sort_memtuples(state);
 			state->current = 0;
 			state->eof_reached = false;
 			state->markpos_offset = 0;
@@ -2175,13 +2242,52 @@ tuplesort_merge_order(int64 allowedMem)
 	mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
 		(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
 
-	/* Even in minimum memory, use at least a MINORDER merge */
+	/*
+	 * Even in minimum memory, use at least a MINORDER merge.  Also cap the
+	 * maximum merge order to MAXORDER.
+	 *
+	 * When allowedMem is drastically lower than what is required for an
+	 * internal sort, it is unlikely that there are benefits to increasing the
+	 * number of tapes beyond Knuth's "sweet spot" of 7, especially with modern
+	 * hardware that is typically memory bound.  Furthermore, in the common
+	 * case where there are less than MAXORDER runs, the overhead per-tape if
+	 * we were not to cap could be significant with high allowedMem.
+	 */
 	mOrder = Max(mOrder, MINORDER);
+	mOrder = Min(mOrder, MAXORDER);
 
 	return mOrder;
 }
 
 /*
+ * useselection - determine if one replacement selection run should be
+ * attempted.
+ */
+static bool
+useselection(Tuplesortstate *state)
+{
+	int64		replacementSortMemBytes =
+		replacement_sort_mem * (int64) 1024;
+
+	/*
+	 * Tuplesort just ran out of memory, and so must perform an external sort.
+	 * It can sometimes be useful to use a replacement selection heap if it
+	 * results in one large run, and if there is very little workMem.  See
+	 * remarks on the "one run" optimization within mergeruns().
+	 *
+	 * The use of replacement selection in limited circumstances does not have
+	 * much to do with the traditional advantage of replacement selection,
+	 * which is to double the average run length, given randomly ordered input
+	 * tuples.  Note that replacement selection is always abandoned after the
+	 * first run.
+	 */
+	if (state->allowedMem <= replacementSortMemBytes)
+		return true;
+
+	return false;
+}
+
+/*
  * inittapes - initialize for tape sorting.
  *
  * This is called only if we have found we don't have room to sort in memory.
@@ -2190,7 +2296,6 @@ static void
 inittapes(Tuplesortstate *state)
 {
 	int			maxTapes,
-				ntuples,
 				j;
 	int64		tapeSpace;
 
@@ -2253,23 +2358,35 @@ inittapes(Tuplesortstate *state)
 	state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
 
 	/*
-	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
-	 * marked as belonging to run number zero.
-	 *
-	 * NOTE: we pass false for checkIndex since there's no point in comparing
-	 * indexes in this step, even though we do intend the indexes to be part
-	 * of the sort key...
+	 * Give replacement selection a try based on user setting.  There will
+	 * be a switch to a simple hybrid sort-merge strategy after the first
+	 * run (iff there is to be a second on-tape run).
 	 */
-	ntuples = state->memtupcount;
-	state->memtupcount = 0;		/* make the heap empty */
-	for (j = 0; j < ntuples; j++)
+	state->replaceActive = useselection(state);
+
+	if (state->replaceActive)
 	{
-		/* Must copy source tuple to avoid possible overwrite */
-		SortTuple	stup = state->memtuples[j];
+		/*
+		 * Convert the unsorted contents of memtuples[] into a heap. Each
+		 * tuple is marked as belonging to run number zero.
+		 *
+		 * NOTE: we pass false for checkIndex since there's no point in
+		 * comparing indexes in this step, even though we do intend the
+		 * indexes to be part of the sort key...
+		 */
+		int			ntuples = state->memtupcount;
 
-		tuplesort_heap_insert(state, &stup, 0, false);
+		state->memtupcount = 0;		/* make the heap empty */
+
+		for (j = 0; j < ntuples; j++)
+		{
+			/* Must copy source tuple to avoid possible overwrite */
+			SortTuple	stup = state->memtuples[j];
+
+			tuplesort_heap_insert(state, &stup, 0, false);
+		}
+		Assert(state->memtupcount == ntuples);
 	}
-	Assert(state->memtupcount == ntuples);
 
 	state->currentRun = 0;
 
@@ -2362,9 +2479,9 @@ mergeruns(Tuplesortstate *state)
 
 	/*
 	 * If we produced only one initial run (quite likely if the total data
-	 * volume is between 1X and 2X workMem), we can just use that tape as the
-	 * finished output, rather than doing a useless merge.  (This obvious
-	 * optimization is not in Knuth's algorithm.)
+	 * volume is between 1X and 2X workMem when replacement selection is used),
+	 * we can just use that tape as the finished output, rather than doing a
+	 * useless merge.  (This obvious optimization is not in Knuth's algorithm.)
 	 */
 	if (state->currentRun == 1)
 	{
@@ -3079,21 +3196,23 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
 }
 
 /*
- * dumptuples - remove tuples from heap and write to tape
+ * dumptuples - remove tuples from memtuples and write to tape
  *
  * This is used during initial-run building, but not during merging.
  *
- * When alltuples = false, dump only enough tuples to get under the
- * availMem limit (and leave at least one tuple in the heap in any case,
- * since puttuple assumes it always has a tuple to compare to).  We also
- * insist there be at least one free slot in the memtuples[] array.
+ * When alltuples = false and replacement selection is still active, dump
+ * only enough tuples to get under the availMem limit (and leave at least
+ * one tuple in memtuples, since puttuple will then assume it is a heap that
+ * has a tuple to compare to).  We also insist there be at least one free
+ * slot in the memtuples[] array.
  *
- * When alltuples = true, dump everything currently in memory.
+ * When alltuples = true, always batch dump everything currently in memory.
  * (This case is only used at end of input data.)
  *
- * If we empty the heap, close out the current run and return (this should
- * only happen at end of input data).  If we see that the tuple run number
- * at the top of the heap has changed, start a new run.
+ * If, when replacement selection is active, we see that the tuple run
+ * number at the top of the heap has changed, start a new run.  This must be
+ * the first run, because replacement selection is always abandoned for all
+ * further runs.
  */
 static void
 dumptuples(Tuplesortstate *state, bool alltuples)
@@ -3102,21 +3221,37 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		   (LACKMEM(state) && state->memtupcount > 1) ||
 		   state->memtupcount >= state->memtupsize)
 	{
-		/*
-		 * Dump the heap's frontmost entry, and sift up to remove it from the
-		 * heap.
-		 */
-		Assert(state->memtupcount > 0);
-		WRITETUP(state, state->tp_tapenum[state->destTape],
-				 &state->memtuples[0]);
-		tuplesort_heap_siftup(state, true);
+		if (state->replaceActive)
+		{
+			/*
+			 * Still holding out for a case favorable to replacement selection.
+			 * Still incrementally spilling using heap.
+			 *
+			 * Dump the heap's frontmost entry, and sift up to remove it from
+			 * the heap.
+			 */
+			Assert(state->memtupcount > 0);
+			WRITETUP(state, state->tp_tapenum[state->destTape],
+					 &state->memtuples[0]);
+			tuplesort_heap_siftup(state, true);
+		}
+		else
+		{
+			/*
+			 * Once committed to quicksorting runs, never incrementally
+			 * spill
+			 */
+			dumpbatch(state, alltuples);
+			break;
+		}
 
 		/*
-		 * If the heap is empty *or* top run number has changed, we've
-		 * finished the current run.
+		 * If top run number has changed, we've finished the current run
+		 * (this can only be the first run, run 0), and will no longer spill
+		 * incrementally.
 		 */
 		if (state->memtupcount == 0 ||
-			state->currentRun != state->memtuples[0].tupindex)
+			state->memtuples[0].tupindex != 0)
 		{
 			markrunend(state, state->tp_tapenum[state->destTape]);
 			state->currentRun++;
@@ -3125,24 +3260,104 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 
 #ifdef TRACE_SORT
 			if (trace_sort)
-				elog(LOG, "finished writing%s run %d to tape %d: %s",
-					 (state->memtupcount == 0) ? " final" : "",
+				elog(LOG, "finished incremental write of %s run %d to tape %d: %s",
+					 (state->memtupcount == 0) ? "only" : "first",
 					 state->currentRun, state->destTape,
 					 pg_rusage_show(&state->ru_start));
 #endif
 
 			/*
-			 * Done if heap is empty, else prepare for new run.
+			 * Done if heap is empty, which is possible when there is only one
+			 * run.  Otherwise, prepare for new run, abandoning replacement
+			 * selection.
 			 */
 			if (state->memtupcount == 0)
 				break;
-			Assert(state->currentRun == state->memtuples[0].tupindex);
+			state->replaceActive = false;
+
+			/*
+			 * First tuple of next run should have generic non-first (INT_MAX)
+			 * run number.  In practice, this must actually be the second run.
+			 */
+			Assert(state->memtuples[0].tupindex == INT_MAX);
 			selectnewtape(state);
 		}
 	}
 }
 
 /*
+ * dumpbatch - sort and dump all memtuples, forming one run on tape
+ *
+ * Second or subsequent runs are never heapified by this module (although
+ * heapification still respects run number differences between the first and
+ * second runs), and a heap (replacement selection priority queue) is often
+ * avoided in the first place.
+ */
+static void
+dumpbatch(Tuplesortstate *state, bool alltuples)
+{
+	int			memtupwrite;
+	int			i;
+
+	Assert(state->status == TSS_BUILDRUNS);
+
+	/*
+	 * Final call might require no sorting, in rare cases where we just so
+	 * happen to have previously LACKMEM()'d at the point where exactly all
+	 * remaining tuples are loaded into memory, just before input was
+	 * exhausted.
+	 *
+	 * Do not increment run number, or mark the run as over -- this is a
+	 * no-op.
+	 */
+	if (state->memtupcount == 0)
+		return;
+
+	state->currentRun++;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "starting quicksort of run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	tuplesort_sort_memtuples(state);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished quicksorting run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	memtupwrite = state->memtupcount;
+	for (i = 0; i < memtupwrite; i++)
+	{
+		WRITETUP(state, state->tp_tapenum[state->destTape],
+				 &state->memtuples[i]);
+		state->memtupcount--;
+	}
+	/*
+	 * Reset tuple memory, now that no caller tuples are needed in memory.
+	 * This prevents fragmentation.
+	 */
+	MemoryContextReset(state->tuplecontext);
+
+	markrunend(state, state->tp_tapenum[state->destTape]);
+	state->tp_runs[state->destTape]++;
+	state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished writing run %d to tape %d: %s",
+			 state->currentRun, state->destTape,
+			 pg_rusage_show(&state->ru_start));
+#endif
+
+	if (!alltuples)
+		selectnewtape(state);
+}
+
+/*
  * tuplesort_rescan		- rewind and replay the scan
  */
 void
@@ -3300,10 +3515,19 @@ tuplesort_get_stats(Tuplesortstate *state,
  *
  * Compare two SortTuples.  If checkIndex is true, use the tuple index
  * as the front of the sort key; otherwise, no.
+ *
+ * Note that for checkIndex callers, the heap invariant is never
+ * maintained beyond the first run, and so there are no COMPARETUP()
+ * calls beyond the first run (run 0).  It is assumed that checkIndex
+ * callers are maintaining the heap invariant for a replacement
+ * selection priority queue, but those callers do not go on to trust
+ * the heap to be fully-heapified past the first run.  Once currentRun
+ * isn't the first, memtuples is no longer a heap at all.
  */
 
 #define HEAPCOMPARE(tup1,tup2) \
-	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
+	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
+					(tup1)->tupindex != 0) ? \
 	 ((tup1)->tupindex) - ((tup2)->tupindex) : \
 	 COMPARETUP(state, tup1, tup2))
 
@@ -3402,6 +3626,30 @@ sort_bounded_heap(Tuplesortstate *state)
 }
 
 /*
+ * Sort all memtuples.
+ *
+ * Quicksort is tuplesort's internal sort algorithm.  It is also generally
+ * preferred to replacement selection of runs during external sorts, except
+ * where incrementally spilling (using a heap) is particularly beneficial.
+ */
+static void
+tuplesort_sort_memtuples(Tuplesortstate *state)
+{
+	if (state->memtupcount > 1)
+	{
+		/* Can we use the single-key sort function? */
+		if (state->onlyKey != NULL)
+			qsort_ssup(state->memtuples, state->memtupcount,
+					   state->onlyKey);
+		else
+			qsort_tuple(state->memtuples,
+						state->memtupcount,
+						state->comparetup,
+						state);
+	}
+}
+
+/*
  * Insert a new tuple into an empty or existing heap, maintaining the
  * heap invariant.  Caller is responsible for ensuring there's room.
  *
@@ -3428,6 +3676,7 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 
 	memtuples = state->memtuples;
 	Assert(state->memtupcount < state->memtupsize);
+	Assert(!checkIndex || tupleindex == 0);
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3460,6 +3709,7 @@ tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
 	int			i,
 				n;
 
+	Assert(!checkIndex || state->currentRun == 0);
 	if (--state->memtupcount <= 0)
 		return;
 
@@ -3745,7 +3995,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
-	/* read in the tuple proper */
+	/* read in the tuple itself */
 	tuple->t_len = tuplen;
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tupbody, tupbodylen);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index cc7833e..e8dd5a6 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -238,6 +238,7 @@ extern bool enableFsync;
 extern bool allowSystemTableMods;
 extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT int replacement_sort_mem;
 
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
-- 
1.9.1

From 2b38f90a6d57e9a4eb7944d84c69d404cc7066b7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Mon, 24 Aug 2015 13:22:11 -0700
Subject: [PATCH 1/4] Allocate memory in batch for tuplesort merge

Allocate a large buffer of memory for use by READTUP() routines.  This
accelerates the final on-the-fly merge phase significantly.  As tapes
are preread, memory is reused from each tape's segment of that buffer,
virtually eliminating retail palloc() calls.

Sequentially consuming memory from the large batch allocation
accelerates merging because the natural order of all subsequent access
(access during merging proper) is sequential per tape; cache
characteristics are improved.  In addition, modern microarchitectures
and compilers may automatically perform prefetching with a sequential
access pattern.

Also, use a dedicated memory context to store caller-passed tuples, such
as IndexTuples, and reset the context at certain key points during
external sorts.  This significantly decreases memory fragmentation,
which was shown to be a significant problem in previous versions of the
external quicksort patch by Jeff Janes [1].

[1] CAMkU=1zkbozkx-nqe-kjffmynm2hmgyl9askdeuhhwxassj...@mail.gmail.com
---
 src/backend/utils/sort/tuplesort.c | 556 ++++++++++++++++++++++++++++++++++---
 1 file changed, 516 insertions(+), 40 deletions(-)
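
[For orientation before reading the diff: the dedicated-context part of this
patch series boils down to the following pattern, condensed from its hunks
(the surrounding tuplesort state and error handling are omitted, so this
fragment is illustrative rather than compilable on its own):

	/* child context that holds nothing but copies of caller tuples */
	tuplecontext = AllocSetContextCreate(sortcontext,
										 "Caller tuples",
										 ALLOCSET_DEFAULT_MINSIZE,
										 ALLOCSET_DEFAULT_INITSIZE,
										 ALLOCSET_DEFAULT_MAXSIZE);

	/* COPYTUP()/READTUP() routines copy each caller tuple into it */
	oldcontext = MemoryContextSwitchTo(tuplecontext);
	tuple = ExecCopySlotMinimalTuple(slot);
	MemoryContextSwitchTo(oldcontext);

	/*
	 * Once a run has been written out (see dumptuples() and mergeonerun()),
	 * no caller tuple is needed in memory, so everything is released in a
	 * single call rather than with per-tuple pfree()s:
	 */
	MemoryContextReset(tuplecontext);

The memtuples array itself stays in the parent sort context, since there is
never a need to free it early.]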

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 67d86ed..c3f666e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -138,7 +138,8 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree().  SortTuples also contain the tuple's
+ * can be freed by a simple pfree() (except during final on-the-fly merge,
+ * when memory is used in batch).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
  * Storing the first key column lets us save heap_getattr or index_getattr
@@ -220,11 +221,13 @@ struct Tuplesortstate
 								 * tuples to return? */
 	bool		boundUsed;		/* true if we made use of a bounded heap */
 	int			bound;			/* if bounded, the maximum number of tuples */
+	bool		tuples;			/* Can SortTuple.tuple ever be set? */
 	int64		availMem;		/* remaining memory available, in bytes */
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
-	MemoryContext sortcontext;	/* memory context holding all sort data */
+	MemoryContext sortcontext;	/* memory context holding most sort data */
+	MemoryContext tuplecontext;	/* memory context holding tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
 
 	/*
@@ -281,6 +284,15 @@ struct Tuplesortstate
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
 	/*
+	 * Memory for tuples is sometimes allocated in batch, rather than
+	 * incrementally.  This implies that incremental memory accounting has been
+	 * abandoned.  Currently, this only happens for the final on-the-fly merge
+	 * step.  Large batch allocations can store tuples (e.g. IndexTuples)
+	 * without palloc() fragmentation and other overhead.
+	 */
+	bool		batchUsed;
+
+	/*
 	 * While building initial runs, this is the current output run number
 	 * (starting at 0).  Afterwards, it is the number of initial runs we made.
 	 */
@@ -315,6 +327,21 @@ struct Tuplesortstate
 	int			mergefirstfree; /* first slot never used in this merge */
 
 	/*
+	 * Per-tape batch state, when final on-the-fly merge consumes memory from
+	 * just a few large allocations.
+	 *
+	 * Aside from the general benefits of performing fewer individual retail
+	 * palloc() calls, this also helps make merging more cache efficient, since
+	 * each tape's tuples must naturally be accessed sequentially (in sorted
+	 * order).
+	 */
+	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
+	char	  **mergetuples;	/* Each tape's memory allocation */
+	char	  **mergecurrent;	/* Current offset into each tape's memory */
+	char	  **mergetail;		/* Last item's start point for each tape */
+	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
+
+	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
 	 * number, ie, an index into the tp_xxx[] arrays.  Be careful to keep
 	 * "logical" and "actual" tape numbers straight!
@@ -389,9 +416,8 @@ struct Tuplesortstate
 	 * tuplesort_begin_datum and used only by the DatumTuple routines.
 	 */
 	Oid			datumType;
-	/* we need typelen and byval in order to know how to copy the Datums. */
+	/* we need typelen in order to know how to copy the Datums. */
 	int			datumTypeLen;
-	bool		datumTypeByVal;
 
 	/*
 	 * Resource snapshot for time of sort start.
@@ -405,7 +431,7 @@ struct Tuplesortstate
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define LACKMEM(state)		((state)->availMem < 0)
+#define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
 
@@ -447,7 +473,13 @@ struct Tuplesortstate
  * rather than the originally-requested size.  This is important since
  * palloc can add substantial overhead.  It's not a complete answer since
  * we won't count any wasted space in palloc allocation blocks, but it's
- * a lot better than what we were doing before 7.3.
+ * a lot better than what we were doing before 7.3.  As of 9.6, a
+ * separate memory context is used for caller-passed tuples.  Resetting
+ * it at certain key points significantly ameliorates fragmentation.
+ * Note that this places a responsibility on readtup and copytup routines
+ * to use the right memory context for these tuples (and to not use the
+ * reset context for anything whose lifetime needs to span multiple
+ * external sort runs).
  */
 
 /* When using this macro, beware of double evaluation of len */
@@ -465,7 +497,14 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state);
+static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
+static void batchmemtuples(Tuplesortstate *state);
+static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
+static void mergebatchone(Tuplesortstate *state, int srcTape,
+					  SortTuple *stup, bool *should_free);
+static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
+							   SortTuple *rtup, bool *should_free);
+static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
 static void mergepreread(Tuplesortstate *state);
 static void mergeprereadone(Tuplesortstate *state, int srcTape);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
@@ -477,6 +516,7 @@ static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
+static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -543,6 +583,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 {
 	Tuplesortstate *state;
 	MemoryContext sortcontext;
+	MemoryContext tuplecontext;
 	MemoryContext oldcontext;
 
 	/*
@@ -550,12 +591,27 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	 * needed by the sort will live inside this context.
 	 */
 	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
-										"TupleSort",
+										"TupleSort main",
 										ALLOCSET_DEFAULT_MINSIZE,
 										ALLOCSET_DEFAULT_INITSIZE,
 										ALLOCSET_DEFAULT_MAXSIZE);
 
 	/*
+	 * Caller tuple (e.g. IndexTuple) memory context.
+	 *
+	 * A dedicated child context used exclusively for caller-passed tuples
+	 * eases memory management.  Resetting at key points reduces fragmentation.
+	 * Note that the memtuples array of SortTuples is allocated in the parent
+	 * context, not this context, because there is no need to free memtuples
+	 * early.
+	 */
+	tuplecontext = AllocSetContextCreate(sortcontext,
+										 "Caller tuples",
+										 ALLOCSET_DEFAULT_MINSIZE,
+										 ALLOCSET_DEFAULT_INITSIZE,
+										 ALLOCSET_DEFAULT_MAXSIZE);
+
+	/*
 	 * Make the Tuplesortstate within the per-sort context.  This way, we
 	 * don't need a separate pfree() operation for it at shutdown.
 	 */
@@ -571,10 +627,12 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	state->status = TSS_INITIAL;
 	state->randomAccess = randomAccess;
 	state->bounded = false;
+	state->tuples = true;
 	state->boundUsed = false;
 	state->allowedMem = workMem * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->sortcontext = sortcontext;
+	state->tuplecontext = tuplecontext;
 	state->tapeset = NULL;
 
 	state->memtupcount = 0;
@@ -587,6 +645,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 						ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
 
 	state->growmemtuples = true;
+	state->batchUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
 
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
@@ -922,7 +981,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	/* lookup necessary attributes of the datum type */
 	get_typlenbyval(datumType, &typlen, &typbyval);
 	state->datumTypeLen = typlen;
-	state->datumTypeByVal = typbyval;
+	state->tuples = !typbyval;
 
 	/* Prepare SortSupport data */
 	state->sortKeys = (SortSupport) palloc0(sizeof(SortSupportData));
@@ -1258,7 +1317,7 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 							  ItemPointer self, Datum *values,
 							  bool *isnull)
 {
-	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
 	SortTuple	stup;
 	Datum		original;
 	IndexTuple	tuple;
@@ -1273,6 +1332,8 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 							 RelationGetDescr(state->indexRel),
 							 &stup.isnull1);
 
+	MemoryContextSwitchTo(state->sortcontext);
+
 	if (!state->sortKeys || !state->sortKeys->abbrev_converter || stup.isnull1)
 	{
 		/*
@@ -1333,7 +1394,7 @@ tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
 void
 tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 {
-	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+	MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
 	SortTuple	stup;
 
 	/*
@@ -1348,7 +1409,7 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 	 * identical to stup.tuple.
 	 */
 
-	if (isNull || state->datumTypeByVal)
+	if (isNull || !state->tuples)
 	{
 		/*
 		 * Set datum1 to zeroed representation for NULLs (to be consistent, and
@@ -1357,6 +1418,7 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 		stup.datum1 = !isNull ? val : (Datum) 0;
 		stup.isnull1 = isNull;
 		stup.tuple = NULL;		/* no separate storage */
+		MemoryContextSwitchTo(state->sortcontext);
 	}
 	else
 	{
@@ -1365,6 +1427,7 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 		stup.isnull1 = false;
 		stup.tuple = DatumGetPointer(original);
 		USEMEM(state, GetMemoryChunkSpace(stup.tuple));
+		MemoryContextSwitchTo(state->sortcontext);
 
 		if (!state->sortKeys->abbrev_converter)
 		{
@@ -1670,6 +1733,7 @@ tuplesort_performsort(Tuplesortstate *state)
  * Internal routine to fetch the next tuple in either forward or back
  * direction into *stup.  Returns FALSE if no more tuples.
  * If *should_free is set, the caller must pfree stup.tuple when done with it.
+ * Otherwise, the caller must not use the tuple after the next call here.
  */
 static bool
 tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
@@ -1681,6 +1745,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 	{
 		case TSS_SORTEDINMEM:
 			Assert(forward || state->randomAccess);
+			Assert(!state->batchUsed);
 			*should_free = false;
 			if (forward)
 			{
@@ -1725,6 +1790,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
+			Assert(!state->batchUsed);
 			*should_free = true;
 			if (forward)
 			{
@@ -1810,7 +1876,9 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 
 		case TSS_FINALMERGE:
 			Assert(forward);
-			*should_free = true;
+			Assert(state->batchUsed || !state->tuples);
+			/* For now, assume tuple is stored in tape's batch memory */
+			*should_free = false;
 
 			/*
 			 * This code should match the inner loop of mergeonerun().
@@ -1818,18 +1886,17 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				Size		tuplen;
 				int			tupIndex;
 				SortTuple  *newtup;
 
+				/*
+				 * Returned tuple is still counted in our memory space most
+				 * of the time.  See mergebatchone() for discussion of why
+				 * caller may occasionally be required to free returned
+				 * tuple, and how preread memory is managed with regard to
+				 * edge cases more generally.
+				 */
 				*stup = state->memtuples[0];
-				/* returned tuple is no longer counted in our memory space */
-				if (stup->tuple)
-				{
-					tuplen = GetMemoryChunkSpace(stup->tuple);
-					state->availMem += tuplen;
-					state->mergeavailmem[srcTape] += tuplen;
-				}
 				tuplesort_heap_siftup(state, false);
 				if ((tupIndex = state->mergenext[srcTape]) == 0)
 				{
@@ -1837,15 +1904,25 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 					 * out of preloaded data on this tape, try to read more
 					 *
 					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry.  See mergepreread() comments.
+					 * tape that's run dry, though not before preparing its
+					 * batch memory for a new round of sequential consumption.
+					 * See mergepreread() comments.
 					 */
+					if (state->batchUsed)
+						mergebatchone(state, srcTape, stup, should_free);
+
 					mergeprereadone(state, srcTape);
 
 					/*
 					 * if still no data, we've reached end of run on this tape
 					 */
 					if ((tupIndex = state->mergenext[srcTape]) == 0)
+					{
+						/* Free tape's buffer, avoiding dangling pointer */
+						if (state->batchUsed)
+							mergebatchfreetape(state, srcTape, stup, should_free);
 						return true;
+					}
 				}
 				/* pull next preread tuple from list, insert in heap */
 				newtup = &state->memtuples[tupIndex];
@@ -1912,6 +1989,8 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
  * Fetch the next tuple in either forward or back direction.
  * Returns NULL if no more tuples.  If *should_free is set, the
  * caller must pfree the returned tuple when done with it.
+ * If it is not set, the caller must not use the tuple after the
+ * next call here.
  */
 HeapTuple
 tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
@@ -1931,6 +2010,8 @@ tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
  * Fetch the next index tuple in either forward or back direction.
  * Returns NULL if no more tuples.  If *should_free is set, the
  * caller must pfree the returned tuple when done with it.
+ * If it is not set, the caller must not use the tuple after the
+ * next call here.
  */
 IndexTuple
 tuplesort_getindextuple(Tuplesortstate *state, bool forward,
@@ -1979,7 +2060,7 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 	if (state->sortKeys->abbrev_converter && abbrev)
 		*abbrev = stup.datum1;
 
-	if (stup.isnull1 || state->datumTypeByVal)
+	if (stup.isnull1 || !state->tuples)
 	{
 		*val = stup.datum1;
 		*isNull = stup.isnull1;
@@ -2162,6 +2243,10 @@ inittapes(Tuplesortstate *state)
 	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
 	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
 	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
+	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2320,7 +2405,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state);
+				beginmerge(state, state->tuples);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2412,7 +2497,7 @@ mergeonerun(Tuplesortstate *state)
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state);
+	beginmerge(state, false);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2451,6 +2536,12 @@ mergeonerun(Tuplesortstate *state)
 	}
 
 	/*
+	 * Reset tuple memory, now that no caller tuples are needed in memory.
+	 * This prevents fragmentation.
+	 */
+	MemoryContextReset(state->tuplecontext);
+
+	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape, and increment its count of real runs.
 	 */
@@ -2471,9 +2562,12 @@ mergeonerun(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
+ *
+ * finalMergeBatch indicates if this is the beginning of a final on-the-fly
+ * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state)
+beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 {
 	int			activeTapes;
 	int			tapenum;
@@ -2511,6 +2605,18 @@ beginmerge(Tuplesortstate *state)
 	state->mergefreelist = 0;	/* nothing in the freelist */
 	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
 
+	if (finalMergeBatch)
+	{
+		/* Refund buffer overhead budgeted for tapes never actually allocated */
+		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
+
+		/*
+		 * Grow memtuples one last time, since the palloc() overhead no longer
+		 * incurred can make a big difference
+		 */
+		batchmemtuples(state);
+	}
+
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
@@ -2518,7 +2624,7 @@ beginmerge(Tuplesortstate *state)
 	Assert(activeTapes > 0);
 	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
 	Assert(slotsPerTape > 0);
-	spacePerTape = state->availMem / activeTapes;
+	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
 		if (state->mergeactive[srcTape])
@@ -2529,6 +2635,15 @@ beginmerge(Tuplesortstate *state)
 	}
 
 	/*
+	 * Preallocate tuple batch memory for each tape.  This is the memory used
+	 * for tuples themselves (not SortTuples), so it's never used by
+	 * pass-by-value datum sorts.  Memory allocation is performed here at most
+	 * once per sort, just in advance of the final on-the-fly merge step.
+	 */
+	if (finalMergeBatch)
+		mergebatch(state, spacePerTape);
+
+	/*
 	 * Preread as many tuples as possible (and at least one) from each active
 	 * tape
 	 */
@@ -2551,11 +2666,319 @@ beginmerge(Tuplesortstate *state)
 			tup->tupindex = state->mergefreelist;
 			state->mergefreelist = tupIndex;
 			state->mergeavailslots[srcTape]++;
+
+#ifdef TRACE_SORT
+			if (trace_sort && finalMergeBatch)
+			{
+				int64		perTapeKB = (spacePerTape + 1023) / 1024;
+				int64		usedSpaceKB;
+				int			usedSlots;
+
+				/*
+				 * Report how effective batchmemtuples() was in balancing
+				 * the number of slots against the need for memory for the
+				 * underlying tuples (e.g. IndexTuples).  The big preread of
+				 * all tapes when switching to FINALMERGE state should be
+				 * fairly representative of memory utilization during the
+				 * final merge step, and in any case is the only point at
+				 * which all tapes are guaranteed to have depleted either
+				 * their batch memory allowance or slot allowance.  Ideally,
+				 * both will be completely depleted for every tape by now.
+				 */
+				usedSpaceKB = (state->mergecurrent[srcTape] -
+							   state->mergetuples[srcTape] + 1023) / 1024;
+				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
+
+				elog(LOG, "tape %d initially used %ld KB of %ld KB batch "
+					 "(%2.3f) and %d out of %d slots (%2.3f)", srcTape,
+					 usedSpaceKB, perTapeKB,
+					 (double) usedSpaceKB / (double) perTapeKB,
+					 usedSlots, slotsPerTape,
+					 (double) usedSlots / (double) slotsPerTape);
+			}
+#endif
 		}
 	}
 }
 
 /*
+ * batchmemtuples - grow memtuples without palloc overhead
+ *
+ * When called, availMem should be approximately the amount of memory we'd
+ * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
+ * that were allocated with palloc() overhead, and in doing so use up all
+ * allocated slots.  However, though slots and tuple memory are in balance
+ * following the last grow_memtuples() call, that's predicated on the observed
+ * average tuple size for the "final" grow_memtuples() call, which includes
+ * palloc overhead.
+ *
+ * This will perform an actual final grow_memtuples() call without any palloc()
+ * overhead, rebalancing the use of memory between slots and tuples.
+ */
+static void
+batchmemtuples(Tuplesortstate *state)
+{
+	int64			refund;
+	int64			availMemLessRefund;
+	int				memtupsize = state->memtupsize;
+
+	/* For simplicity, assume no memtuples are actually currently counted */
+	Assert(state->memtupcount == 0);
+
+	/*
+	 * Refund STANDARDCHUNKHEADERSIZE per tuple.
+	 *
+	 * This sometimes fails to make memory use perfectly balanced, but it
+	 * should never make the situation worse.  Note that Assert-enabled builds
+	 * get a larger refund, due to a varying STANDARDCHUNKHEADERSIZE.
+	 */
+	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
+	availMemLessRefund = state->availMem - refund;
+
+	/*
+	 * To establish balanced memory use after refunding palloc overhead,
+	 * temporarily have our accounting indicate that we've allocated all
+	 * memory we're allowed to use, less that refund, and call grow_memtuples()
+	 * to have it increase the number of slots.
+	 */
+	state->growmemtuples = true;
+	USEMEM(state, availMemLessRefund);
+	(void) grow_memtuples(state);
+	/* Should not matter, but be tidy */
+	FREEMEM(state, availMemLessRefund);
+	state->growmemtuples = false;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+	{
+		Size	OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
+		Size	NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
+
+		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
+			 (double) NewKb / (double) OldKb,
+			 memtupsize, OldKb,
+			 state->memtupsize, NewKb);
+	}
+#endif
+}
+
+/*
+ * mergebatch - initialize tuple memory in batch
+ *
+ * This allows sequential access to sorted tuples buffered in memory from
+ * tapes/runs on disk during a final on-the-fly merge step.  Note that the
+ * memory is not used for SortTuples, but for the underlying tuples (e.g.
+ * MinimalTuples).
+ *
+ * Note that when batch memory is used, there is a simple division of space
+ * into large buffers (one per active tape).  The conventional incremental
+ * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
+ * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
+ * performed, which is then immediately detected in a way that is analogous to
+ * LACKMEM().  This keeps each tape's use of memory fair, which is always a
+ * goal.
+ */
+static void
+mergebatch(Tuplesortstate *state, int64 spacePerTape)
+{
+	int		srcTape;
+
+	Assert(state->activeTapes > 0);
+	Assert(state->tuples);
+
+	/*
+	 * For the purposes of tuplesort's memory accounting, the batch allocation
+	 * is special, and regular memory accounting through USEMEM() calls is
+	 * abandoned (see mergeprereadone()).
+	 */
+	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
+	{
+		char	   *mergetuples;
+
+		if (!state->mergeactive[srcTape])
+			continue;
+
+		/* Allocate buffer for each active tape */
+		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
+											 spacePerTape);
+
+		/* Initialize state for tape */
+		state->mergetuples[srcTape] = mergetuples;
+		state->mergecurrent[srcTape] = mergetuples;
+		state->mergetail[srcTape] = mergetuples;
+		state->mergeoverflow[srcTape] = NULL;
+	}
+
+	state->batchUsed = true;
+	state->spacePerTape = spacePerTape;
+}
+
+/*
+ * mergebatchone - prepare batch memory for one merge input tape
+ *
+ * This is called following the exhaustion of preread tuples for one input
+ * tape.  All that actually occurs is that the state for the source tape is
+ * reset to indicate that all memory may be reused.
+ *
+ * This routine must deal with fixing up the tuple that is about to be returned
+ * to the client, due to "overflow" allocations.
+ */
+static void
+mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
+			  bool *should_free)
+{
+	Assert(state->batchUsed);
+
+	/*
+	 * The tuple about to be returned to the caller (*rtup) is the final
+	 * preread tuple from this tape, just removed from the top of the heap.
+	 * Special steps around memory management must be performed for that
+	 * tuple, to make sure it isn't overwritten early.
+	 */
+	if (!state->mergeoverflow[srcTape])
+	{
+		Size	tupLen;
+
+		/*
+		 * Mark the tuple buffer range for reuse, but be careful to move the
+		 * final (tail) tuple to the start of the space reused in the next
+		 * round of preloading, so that it remains available to the caller
+		 * after *rtup is returned, at least until the next tuple is requested.
+		 */
+		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
+		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
+		memmove(state->mergecurrent[srcTape], state->mergetail[srcTape],
+				tupLen);
+
+		/* Make SortTuple at top of the merge heap point to new tuple */
+		rtup->tuple = (void *) state->mergecurrent[srcTape];
+
+		state->mergetail[srcTape] = state->mergecurrent[srcTape];
+		state->mergecurrent[srcTape] += tupLen;
+	}
+	else
+	{
+		/*
+		 * Handle an "overflow" retail palloc.
+		 *
+		 * This is needed when we run out of tuple memory for the tape.
+		 */
+		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
+		state->mergetail[srcTape] = state->mergetuples[srcTape];
+
+		if (rtup->tuple)
+		{
+			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
+			/* Caller should free palloc'd tuple */
+			*should_free = true;
+		}
+		state->mergeoverflow[srcTape] = NULL;
+	}
+}
+
+/*
+ * mergebatchfreetape - handle final clean-up for batch memory once tape is
+ * about to become exhausted
+ *
+ * All tuples have been returned from the tape, but one final tuple, *rtup,
+ * is still to be passed back to the caller.  Free the tape's batch buffer,
+ * while ensuring that the final tuple is managed appropriately.
+ */
+static void
+mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
+				   bool *should_free)
+{
+	Assert(state->batchUsed);
+	Assert(state->status == TSS_FINALMERGE);
+
+	/*
+	 * Tuple may or may not already be an overflow allocation from
+	 * mergebatchone()
+	 */
+	if (!*should_free && rtup->tuple)
+	{
+		/*
+		 * Final tuple still in tape's batch allocation.
+		 *
+		 * Return palloc()'d copy to caller, and have it freed in a similar
+		 * manner to overflow allocation.  Otherwise, we'd free batch memory
+		 * and pass back a pointer to garbage.  Note that we deliberately
+		 * allocate this in the parent tuplesort context, to be on the safe
+		 * side.
+		 */
+		Size		tuplen;
+		void	   *oldTuple = rtup->tuple;
+
+		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
+		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
+		memcpy(rtup->tuple, oldTuple, tuplen);
+		*should_free = true;
+	}
+
+	/* Free spacePerTape-sized buffer */
+	pfree(state->mergetuples[srcTape]);
+}
+
+/*
+ * mergebatchalloc - allocate memory for one tuple using a batch memory
+ * "logical allocation".
+ *
+ * This is used for the final on-the-fly merge phase only.  READTUP() routines
+ * receive memory from here in place of palloc() and USEMEM() calls.
+ *
+ * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
+ * contiguous order (while allowing safe reuse of memory made available to
+ * each tape).  This maximizes locality of access as tuples are returned by
+ * final merge.
+ *
+ * Caller must not subsequently attempt to free memory returned here.  In
+ * general, only mergebatch* functions know how memory returned from
+ * here should be freed, and this function's caller must ensure that batch
+ * memory management code will definitely have the opportunity to do the right
+ * thing during the final on-the-fly merge.
+ */
+static void *
+mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
+{
+	Size		reserve_tuplen = MAXALIGN(tuplen);
+	char	   *ret;
+
+	/* Should overflow at most once before mergebatchone() call: */
+	Assert(state->mergeoverflow[tapenum] == NULL);
+	Assert(state->batchUsed);
+
+	/* It should be possible to use precisely spacePerTape memory at once */
+	if (state->mergecurrent[tapenum] + reserve_tuplen <=
+		state->mergetuples[tapenum] + state->spacePerTape)
+	{
+		/*
+		 * Usual case: return a pointer into the tape's buffer, and advance
+		 * the tape's current offset to record how much of the buffer has
+		 * been consumed during this round of preloading.
+		 */
+		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
+		state->mergecurrent[tapenum] += reserve_tuplen;
+	}
+	else
+	{
+		/*
+		 * Allocate memory, and record as tape's overflow allocation.  This
+		 * will be detected quickly, in a similar fashion to a LACKMEM()
+		 * condition, and should not happen again before a new round of
+		 * preloading for caller's tape.  Note that we deliberately allocate
+		 * this in the parent tuplesort context, to be on the safe side.
+		 *
+		 * Sometimes overflow never occurs at all, because merging runs out
+		 * of slots before running out of memory.
+		 */
+		ret = state->mergeoverflow[tapenum] =
+			MemoryContextAlloc(state->sortcontext, tuplen);
+	}
+
+	return ret;
+}
+
+/*
  * mergepreread - load tuples from merge input tapes
  *
  * This routine exists to improve sequentiality of reads during a merge pass,
@@ -2576,7 +2999,9 @@ beginmerge(Tuplesortstate *state)
  * that state and so no point in scanning through all the tapes to fix one.
  * (Moreover, there may be quite a lot of inactive tapes in that state, since
  * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.)
+ * merge we can expect most of the tapes to be active.  Plus, only
+ * FINALMERGE state has to consider memory management for a batch
+ * allocation.)
  */
 static void
 mergepreread(Tuplesortstate *state)
@@ -2605,9 +3030,20 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
 
 	if (!state->mergeactive[srcTape])
 		return;					/* tape's run is already exhausted */
+
+	/*
+	 * Manage per-tape availMem.  This only actually matters when batch
+	 * memory is not in use.
+	 */
 	priorAvail = state->availMem;
 	state->availMem = state->mergeavailmem[srcTape];
-	while ((state->mergeavailslots[srcTape] > 0 && !LACKMEM(state)) ||
+
+	/*
+	 * When batch memory is used in the final on-the-fly merge, only the
+	 * mergeoverflow test is relevant; otherwise, only LACKMEM() matters.
+	 */
+	while ((state->mergeavailslots[srcTape] > 0 &&
+			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
 		   state->mergenext[srcTape] == 0)
 	{
 		/* read next tuple, if any */
@@ -3093,6 +3529,42 @@ markrunend(Tuplesortstate *state, int tapenum)
 	LogicalTapeWrite(state->tapeset, tapenum, (void *) &len, sizeof(len));
 }
 
+/*
+ * Get memory for tuple from within READTUP() routine.  Allocate
+ * memory and account for that, or consume from tape's batch
+ * allocation.
+ *
+ * Memory returned here in the final on-the-fly merge case is recycled
+ * from the tape's batch allocation.  Otherwise, callers must pfree() or
+ * reset the tuple child memory context, and account for that with a
+ * FREEMEM().  Currently, this only ever needs to happen in WRITETUP()
+ * routines.
+ */
+static void *
+readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+{
+	if (state->batchUsed)
+	{
+		/*
+		 * No USEMEM() call, because during final on-the-fly merge
+		 * accounting is based on tape-private state. ("Overflow"
+		 * allocations are detected as an indication that a new round
+		 * of preloading is required.  Preloading marks the existing
+		 * contents of the tape's batch buffer for reuse.)
+		 */
+		return mergebatchalloc(state, tapenum, tuplen);
+	}
+	else
+	{
+		char	   *ret;
+
+		/* Batch allocation yet to be performed */
+		ret = MemoryContextAlloc(state->tuplecontext, tuplen);
+		USEMEM(state, GetMemoryChunkSpace(ret));
+		return ret;
+	}
+}
+
 
 /*
  * Routines specialized for HeapTuple (actually MinimalTuple) case
@@ -3171,6 +3643,7 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 	Datum		original;
 	MinimalTuple tuple;
 	HeapTupleData htup;
+	MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
 
 	/* copy the tuple into sort storage */
 	tuple = ExecCopySlotMinimalTuple(slot);
@@ -3184,6 +3657,8 @@ copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
 							state->tupDesc,
 							&stup->isnull1);
 
+	MemoryContextSwitchTo(oldcontext);
+
 	if (!state->sortKeys->abbrev_converter || stup->isnull1)
 	{
 		/*
@@ -3266,11 +3741,10 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) palloc(tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	/* read in the tuple proper */
 	tuple->t_len = tuplen;
 	LogicalTapeReadExact(state->tapeset, tapenum,
@@ -3409,12 +3883,15 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 {
 	HeapTuple	tuple = (HeapTuple) tup;
 	Datum		original;
+	MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
 
 	/* copy the tuple into sort storage */
 	tuple = heap_copytuple(tuple);
 	stup->tuple = (void *) tuple;
 	USEMEM(state, GetMemoryChunkSpace(tuple));
 
+	MemoryContextSwitchTo(oldcontext);
+
 	/*
 	 * set up first-column key value, and potentially abbreviate, if it's a
 	 * simple column
@@ -3501,9 +3978,10 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int tuplen)
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
-	HeapTuple	tuple = (HeapTuple) palloc(t_len + HEAPTUPLESIZE);
+	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
+												  tapenum,
+												  t_len + HEAPTUPLESIZE);
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	/* Reconstruct the HeapTupleData header */
 	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
 	tuple->t_len = t_len;
@@ -3722,7 +4200,7 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 	Datum		original;
 
 	/* copy the tuple into sort storage */
-	newtuple = (IndexTuple) palloc(tuplen);
+	newtuple = (IndexTuple) MemoryContextAlloc(state->tuplecontext, tuplen);
 	memcpy(newtuple, tuple, tuplen);
 	USEMEM(state, GetMemoryChunkSpace(newtuple));
 	stup->tuple = (void *) newtuple;
@@ -3804,9 +4282,8 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) palloc(tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
 	if (state->randomAccess)	/* need trailing length word? */
@@ -3864,7 +4341,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		waddr = NULL;
 		tuplen = 0;
 	}
-	else if (state->datumTypeByVal)
+	else if (!state->tuples)
 	{
 		waddr = &stup->datum1;
 		tuplen = sizeof(Datum);
@@ -3906,7 +4383,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 		stup->isnull1 = true;
 		stup->tuple = NULL;
 	}
-	else if (state->datumTypeByVal)
+	else if (!state->tuples)
 	{
 		Assert(tuplen == sizeof(Datum));
 		LogicalTapeReadExact(state->tapeset, tapenum,
@@ -3916,14 +4393,13 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = palloc(tuplen);
+		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
 		stup->datum1 = PointerGetDatum(raddr);
 		stup->isnull1 = false;
 		stup->tuple = raddr;
-		USEMEM(state, GetMemoryChunkSpace(raddr));
 	}
 
 	if (state->randomAccess)	/* need trailing length word? */
-- 
1.9.1
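
[To summarize the batch allocation mechanics in this patch
(mergebatchalloc()/mergebatchone()) for anyone skimming: each active tape gets
one large buffer, tuples are carved from it sequentially, and a single retail
"overflow" allocation acts as the signal that the buffer must be recycled
before the next round of preloading.  A deliberately simplified, standalone
sketch follows -- TapeBatch, batch_alloc() and batch_recycle() are invented
names, plain malloc()/free() stand in for palloc()/pfree(), and the real code
additionally moves the still-needed tail tuple to the front of the buffer
before reuse:

#include <stdlib.h>

typedef struct TapeBatch
{
	char   *buf;		/* one big allocation for this tape */
	size_t	size;		/* total buffer size (spacePerTape in the patch) */
	size_t	used;		/* bytes consumed sequentially so far */
	void   *overflow;	/* retail allocation made once the buffer is full */
} TapeBatch;

static void *
batch_alloc(TapeBatch *tape, size_t len)
{
	size_t	reserve = (len + 7) & ~(size_t) 7;	/* crude stand-in for MAXALIGN */

	if (tape->used + reserve <= tape->size)
	{
		void   *ret = tape->buf + tape->used;

		tape->used += reserve;		/* bump the tape's offset */
		return ret;
	}

	/* buffer exhausted: fall back to a retail "overflow" allocation */
	tape->overflow = malloc(len);
	return tape->overflow;
}

static void
batch_recycle(TapeBatch *tape)
{
	/* called once the tape's preread tuples are used up: reuse from the top */
	tape->used = 0;
	free(tape->overflow);		/* free(NULL) is a no-op */
	tape->overflow = NULL;
}
]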
