On Fri, Nov 29, 2019 at 03:01:46PM +0900, Michael Paquier wrote:
On Sun, Sep 29, 2019 at 01:00:49AM +0200, Tomas Vondra wrote:
OK. I'll try extending the set of synthetic queries in [1] to also do
something like this and generate similar plans.

Any progress on that?

Please note that the latest patch does not apply anymore, so a rebase
is needed.  I am switching the patch to waiting on author for now.
--

Ah, thanks for reminding me. I've added a couple more queries with two
joins (previously there were only queries with a single join; I hadn't
expected another join to make such a difference, but it seems I was wrong).

So yes, there seem to be 6 different GUCs / places where considering
incremental sort makes a difference (the number in parentheses says how
many of the 4960 tested combinations were affected):

- create_ordered_paths_parallel (50)
- create_partial_grouping_paths_2 (228)
- standard_join_search (94)
- add_paths_to_grouping_rel (2148)
- set_rel_pathlist (156)
- apply_scanjoin_target_to_paths (286)

Clearly some of the places are more important than others, plus there
are some overlaps (two GUCs producing the same plan, etc.).

Plus there are four GUCs that did not affect any queries at all:

- create_partial_grouping_paths
- gather_grouping_paths
- create_ordered_paths
- add_paths_to_grouping_rel_parallel
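
For anyone who wants to poke at this without the full test harness, a
minimal sketch of the comparison the harness automates (the table and
data here are made up; enable_incrementalsort itself comes from the
patch series):

  CREATE TABLE t (a int, b int);
  INSERT INTO t SELECT i % 100, i FROM generate_series(1, 100000) i;
  CREATE INDEX ON t (a);
  ANALYZE t;

  SET enable_incrementalsort = on;
  EXPLAIN (COSTS OFF) SELECT * FROM t ORDER BY a, b LIMIT 10;

  SET enable_incrementalsort = off;
  EXPLAIN (COSTS OFF) SELECT * FROM t ORDER BY a, b LIMIT 10;

The dev GUCs in the attached patches are used the same way, except each
one only gates consideration of incremental sort at the one call site
it's named after.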

Anyway, this might serve as a way to prioritize the effort. All the
test changes are in the original repo at

  https://github.com/tvondra/incremental-sort-tests-2

and I'm also attaching the rebased patches - the changes were pretty
minor, hopefully that helps others. All the patches with dev GUCs are
in

https://github.com/tvondra/postgres/tree/incremental-sort-20191129


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 14beb5a9f3282d452844cffa86bafb59df831343 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Fri, 29 Nov 2019 19:41:26 +0100
Subject: [PATCH 01/13] Consider low startup cost when adding partial path

45be99f8cd5d606086e0a458c9c72910ba8a613d added `add_partial_path` with
the comment:

> Neither do we need to consider startup costs:
> parallelism is only used for plans that will be run to completion.
> Therefore, this routine is much simpler than add_path: it needs to
> consider only pathkeys and total cost.

I'm not entirely sure if that is still true or not--I can't easily come
up with a scenario in which it's not, but I also can't come up with an
inherent reason why such a scenario cannot exist.

Regardless, the in-progress incremental sort patch uncovered a new case
where it definitely no longer holds: a higher cost plan can end up being
chosen because a low startup cost partial path is ignored in favor of a
lower total cost partial path, and a limit is applied on top of that
which would normally favor the lower startup cost plan.
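
To illustrate (hypothetical numbers, not taken from an actual plan):
suppose two partial paths with the same pathkeys, one with startup cost
1 and total cost 1000, the other with startup cost 800 and total cost
900. The old logic keeps only the second, but for a query like

  SELECT * FROM t ORDER BY a, b LIMIT 10;

the Limit node reads only a small fraction of the input, so a plan
built on the low startup cost path would have been cheaper.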
---
 src/backend/optimizer/util/pathnode.c | 47 ++++++++++-----------------
 1 file changed, 18 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 60c93ee7c5..f24ba587e5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -777,41 +777,30 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
                /* Unless pathkeys are incompatible, keep just one of the two paths. */
                if (keyscmp != PATHKEYS_DIFFERENT)
                {
-                       if (new_path->total_cost > old_path->total_cost * STD_FUZZ_FACTOR)
-                       {
-                               /* New path costs more; keep it only if pathkeys are better. */
-                               if (keyscmp != PATHKEYS_BETTER1)
-                                       accept_new = false;
-                       }
-                       else if (old_path->total_cost > new_path->total_cost
-                                        * STD_FUZZ_FACTOR)
+                       PathCostComparison costcmp;
+
+                       /*
+                        * Do a fuzzy cost comparison with standard fuzziness limit.
+                        */
+                       costcmp = compare_path_costs_fuzzily(new_path, old_path,
+                                                                                STD_FUZZ_FACTOR);
+
+                       if (costcmp == COSTS_BETTER1)
                        {
-                               /* Old path costs more; keep it only if pathkeys are better. */
-                               if (keyscmp != PATHKEYS_BETTER2)
+                               if (keyscmp == PATHKEYS_BETTER1)
                                        remove_old = true;
                        }
-                       else if (keyscmp == PATHKEYS_BETTER1)
-                       {
-                               /* Costs are about the same, new path has better pathkeys. */
-                               remove_old = true;
-                       }
-                       else if (keyscmp == PATHKEYS_BETTER2)
+                       else if (costcmp == COSTS_BETTER2)
                        {
-                               /* Costs are about the same, old path has better pathkeys. */
-                               accept_new = false;
-                       }
-                       else if (old_path->total_cost > new_path->total_cost * 1.0000000001)
-                       {
-                               /* Pathkeys are the same, and the old path costs more. */
-                               remove_old = true;
+                               if (keyscmp == PATHKEYS_BETTER2)
+                                       accept_new = false;
                        }
-                       else
+                       else if (costcmp == COSTS_EQUAL)
                        {
-                               /*
-                                * Pathkeys are the same, and new path isn't materially
-                                * cheaper.
-                                */
-                               accept_new = false;
+                               if (keyscmp == PATHKEYS_BETTER1)
+                                       remove_old = true;
+                               else if (keyscmp == PATHKEYS_BETTER2)
+                                       accept_new = false;
                        }
                }
 
-- 
2.21.0

From 3f631de90190efa582085dcd84f1f1f395d10beb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Fri, 29 Nov 2019 19:47:28 +0100
Subject: [PATCH 02/13] Implement incremental sort

Incremental sort is an optimized variant of multikey sort for cases
when the input is already sorted by a prefix of the sort keys. For
example when a sort by (key1, key2 ... keyN) is requested, and the
input is already sorted by (key1, key2 ... keyM), M < N, we can
divide the input into groups where keys (key1, ... keyM) are equal,
and only sort on the remaining columns.

The implemented algorithm operates in two different modes:
  - Fetching a minimum number of tuples without checking prefix key
    group membership and sorting on all columns when safe.
  - Fetching all tuples for a single prefix key group and sorting on
    solely the unsorted columns.
We always begin in the first mode, and employ a heuristic to switch
into the second mode if we believe it's beneficial.

Sorting incrementally can potentially use less memory (and possibly
avoid spilling to disk), avoid fetching and sorting all tuples in the
dataset (particularly useful when a LIMIT clause has been specified),
and begin returning tuples before the entire result set is available.
Small datasets which fit entirely in memory and must be fully realized
and sorted may be slightly slower, which we reflect in the costing
implementation.

The hybrid mode approach allows us to optimize for both very small
groups (where the overhead of a new tuplesort is high) and very large
groups (where we can lower cost by not having to sort on already sorted
columns), albeit at some extra cost while switching between modes.
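
As an illustration of the intended plan shape, a sketch (the schema is
invented for this example and the EXPLAIN output is abbreviated by
hand):

  CREATE TABLE events (ts timestamptz, user_id int, payload text);
  CREATE INDEX ON events (ts);

  EXPLAIN (COSTS OFF)
  SELECT * FROM events ORDER BY ts, user_id LIMIT 100;

  --  Limit
  --    ->  Incremental Sort
  --          Sort Key: ts, user_id
  --          Presorted Key: ts
  --          ->  Index Scan using events_ts_idx on events

Only the user_id ordering within each group of equal ts values has to
be established at execution time; the ts ordering comes from the index.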

Co-authored-by: Alexander Korotkov <a.korot...@postgrespro.ru>
---
 doc/src/sgml/config.sgml                      |   14 +
 src/backend/commands/explain.c                |  211 ++-
 src/backend/executor/Makefile                 |    1 +
 src/backend/executor/execAmi.c                |   13 +
 src/backend/executor/execParallel.c           |   18 +
 src/backend/executor/execProcnode.c           |   33 +
 src/backend/executor/nodeIncrementalSort.c    | 1107 ++++++++++++++++
 src/backend/executor/nodeSort.c               |    3 +-
 src/backend/nodes/copyfuncs.c                 |   49 +-
 src/backend/nodes/outfuncs.c                  |   25 +-
 src/backend/nodes/readfuncs.c                 |   37 +-
 src/backend/optimizer/path/allpaths.c         |    4 +
 src/backend/optimizer/path/costsize.c         |  194 ++-
 src/backend/optimizer/path/pathkeys.c         |   61 +-
 src/backend/optimizer/plan/createplan.c       |  129 +-
 src/backend/optimizer/plan/planner.c          |   71 +-
 src/backend/optimizer/plan/setrefs.c          |    1 +
 src/backend/optimizer/plan/subselect.c        |    1 +
 src/backend/optimizer/util/pathnode.c         |   51 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/sort/tuplesort.c            |  194 ++-
 src/include/executor/execdebug.h              |    2 +
 src/include/executor/nodeIncrementalSort.h    |   30 +
 src/include/nodes/execnodes.h                 |   68 +
 src/include/nodes/nodes.h                     |    3 +
 src/include/nodes/pathnodes.h                 |    9 +
 src/include/nodes/plannodes.h                 |   11 +
 src/include/optimizer/cost.h                  |   10 +
 src/include/optimizer/pathnode.h              |    6 +
 src/include/optimizer/paths.h                 |    2 +
 src/include/utils/tuplesort.h                 |    3 +
 .../expected/drop-index-concurrently-1.out    |    2 +-
 .../regress/expected/incremental_sort.out     | 1160 +++++++++++++++++
 .../regress/expected/partition_aggregate.out  |    2 +
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/parallel_schedule            |    2 +-
 src/test/regress/serial_schedule              |    1 +
 src/test/regress/sql/incremental_sort.sql     |   78 ++
 src/test/regress/sql/partition_aggregate.sql  |    2 +
 39 files changed, 3505 insertions(+), 115 deletions(-)
 create mode 100644 src/backend/executor/nodeIncrementalSort.c
 create mode 100644 src/include/executor/nodeIncrementalSort.h
 create mode 100644 src/test/regress/expected/incremental_sort.out
 create mode 100644 src/test/regress/sql/incremental_sort.sql

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4ec13f3311..ba764671bb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4467,6 +4467,20 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
      </listitem>
     </varlistentry>
 
+     <varlistentry id="guc-enable-incrementalsort" xreflabel="enable_incrementalsort">
+      <term><varname>enable_incrementalsort</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_incrementalsort</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of incremental sort
+        steps. The default is <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-indexscan" xreflabel="enable_indexscan">
       <term><varname>enable_indexscan</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb3434a3..8a3bf8a4e5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -80,6 +80,8 @@ static void show_upper_qual(List *qual, const char *qlabel,
                                                        ExplainState *es);
 static void show_sort_keys(SortState *sortstate, List *ancestors,
                                                  ExplainState *es);
+static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+                                          List *ancestors, ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
                                                                  ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
@@ -93,7 +95,7 @@ static void show_grouping_set_keys(PlanState *planstate,
 static void show_group_keys(GroupState *gstate, List *ancestors,
                                                        ExplainState *es);
 static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
-                                                                int nkeys, AttrNumber *keycols,
+                                                                int nkeys, int nPresortedKeys, AttrNumber *keycols,
                                                                 Oid *sortOperators, Oid *collations, bool *nullsFirst,
                                                                 List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
@@ -101,6 +103,8 @@ static void show_sortorder_options(StringInfo buf, Node *sortexpr,
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
                                                        List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
+static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
+                                                                          ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
                                                               ExplainState *es);
@@ -1215,6 +1219,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
                case T_Sort:
                        pname = sname = "Sort";
                        break;
+               case T_IncrementalSort:
+                       pname = sname = "Incremental Sort";
+                       break;
                case T_Group:
                        pname = sname = "Group";
                        break;
@@ -1841,6 +1848,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
                        show_sort_keys(castNode(SortState, planstate), ancestors, es);
                        show_sort_info(castNode(SortState, planstate), es);
                        break;
+               case T_IncrementalSort:
+                       show_incremental_sort_keys(castNode(IncrementalSortState, planstate),
+                                                                          ancestors, es);
+                       show_incremental_sort_info(castNode(IncrementalSortState, planstate),
+                                                                          es);
+                       break;
                case T_MergeAppend:
                        show_merge_append_keys(castNode(MergeAppendState, planstate),
                                                                   ancestors, es);
@@ -2175,12 +2188,29 @@ show_sort_keys(SortState *sortstate, List *ancestors, ExplainState *es)
        Sort       *plan = (Sort *) sortstate->ss.ps.plan;
 
        show_sort_group_keys((PlanState *) sortstate, "Sort Key",
-                                                plan->numCols, plan->sortColIdx,
+                                                plan->numCols, 0, plan->sortColIdx,
                                                 plan->sortOperators, plan->collations,
                                                 plan->nullsFirst,
                                                 ancestors, es);
 }
 
+/*
+ * Show the sort keys for an IncrementalSort node.
+ */
+static void
+show_incremental_sort_keys(IncrementalSortState *incrsortstate,
+                                                  List *ancestors, ExplainState *es)
+{
+       IncrementalSort *plan = (IncrementalSort *) incrsortstate->ss.ps.plan;
+
+       show_sort_group_keys((PlanState *) incrsortstate, "Sort Key",
+                                                plan->sort.numCols, plan->presortedCols,
+                                                plan->sort.sortColIdx,
+                                                plan->sort.sortOperators, plan->sort.collations,
+                                                plan->sort.nullsFirst,
+                                                ancestors, es);
+}
+
 /*
  * Likewise, for a MergeAppend node.
  */
@@ -2191,7 +2221,7 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
        MergeAppend *plan = (MergeAppend *) mstate->ps.plan;
 
        show_sort_group_keys((PlanState *) mstate, "Sort Key",
-                                                plan->numCols, plan->sortColIdx,
+                                                plan->numCols, 0, plan->sortColIdx,
                                                 plan->sortOperators, plan->collations,
                                                 plan->nullsFirst,
                                                 ancestors, es);
@@ -2215,7 +2245,7 @@ show_agg_keys(AggState *astate, List *ancestors,
                        show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
                else
                        show_sort_group_keys(outerPlanState(astate), "Group Key",
-                                                                plan->numCols, plan->grpColIdx,
+                                                                plan->numCols, 0, plan->grpColIdx,
                                                                 NULL, NULL, NULL,
                                                                 ancestors, es);
 
@@ -2284,7 +2314,7 @@ show_grouping_set_keys(PlanState *planstate,
        if (sortnode)
        {
                show_sort_group_keys(planstate, "Sort Key",
-                                                        sortnode->numCols, sortnode->sortColIdx,
+                                                        sortnode->numCols, 0, sortnode->sortColIdx,
                                                         sortnode->sortOperators, sortnode->collations,
                                                         sortnode->nullsFirst,
                                                         ancestors, es);
@@ -2341,7 +2371,7 @@ show_group_keys(GroupState *gstate, List *ancestors,
        /* The key columns refer to the tlist of the child plan */
        ancestors = lcons(gstate, ancestors);
        show_sort_group_keys(outerPlanState(gstate), "Group Key",
-                                                plan->numCols, plan->grpColIdx,
+                                                plan->numCols, 0, plan->grpColIdx,
                                                 NULL, NULL, NULL,
                                                 ancestors, es);
        ancestors = list_delete_first(ancestors);
@@ -2354,13 +2384,14 @@ show_group_keys(GroupState *gstate, List *ancestors,
  */
 static void
 show_sort_group_keys(PlanState *planstate, const char *qlabel,
-                                        int nkeys, AttrNumber *keycols,
+                                        int nkeys, int nPresortedKeys, AttrNumber *keycols,
                                         Oid *sortOperators, Oid *collations, bool *nullsFirst,
                                         List *ancestors, ExplainState *es)
 {
        Plan       *plan = planstate->plan;
        List       *context;
        List       *result = NIL;
+       List       *resultPresorted = NIL;
        StringInfoData sortkeybuf;
        bool            useprefix;
        int                     keyno;
@@ -2400,9 +2431,13 @@ show_sort_group_keys(PlanState *planstate, const char *qlabel,
                                                                   nullsFirst[keyno]);
                /* Emit one property-list item per sort key */
                result = lappend(result, pstrdup(sortkeybuf.data));
+               if (keyno < nPresortedKeys)
+                       resultPresorted = lappend(resultPresorted, exprstr);
        }
 
        ExplainPropertyList(qlabel, result, es);
+       if (nPresortedKeys > 0)
+               ExplainPropertyList("Presorted Key", resultPresorted, es);
 }
 
 /*
@@ -2612,6 +2647,168 @@ show_sort_info(SortState *sortstate, ExplainState *es)
        }
 }
 
+/*
+ * If it's EXPLAIN ANALYZE, show tuplesort stats for an incremental sort node
+ */
+static void
+show_incremental_sort_info(IncrementalSortState *incrsortstate,
+                                                  ExplainState *es)
+{
+       if (es->analyze && incrsortstate->sort_Done &&
+               incrsortstate->fullsort_state != NULL)
+       {
+               /* TODO: is it valid to get space used etc. only once given we re-use the sort? */
+               /* TODO: maybe show average, min, max sort group size? */
+
+               Tuplesortstate *fullsort_state = incrsortstate->fullsort_state;
+               TuplesortInstrumentation fullsort_stats;
+               const char *fullsort_sortMethod;
+               const char *fullsort_spaceType;
+               Tuplesortstate *prefixsort_state = incrsortstate->prefixsort_state;
+               TuplesortInstrumentation prefixsort_stats;
+               const char *prefixsort_sortMethod;
+               const char *prefixsort_spaceType;
+
+               tuplesort_get_stats(fullsort_state, &fullsort_stats);
+               fullsort_sortMethod = tuplesort_method_name(fullsort_stats.sortMethod);
+               fullsort_spaceType = tuplesort_space_type_name(fullsort_stats.spaceType);
+               if (prefixsort_state != NULL)
+               {
+                       tuplesort_get_stats(prefixsort_state, &prefixsort_stats);
+                       prefixsort_sortMethod = tuplesort_method_name(prefixsort_stats.sortMethod);
+                       prefixsort_spaceType = tuplesort_space_type_name(prefixsort_stats.spaceType);
+               }
+
+               if (es->format == EXPLAIN_FORMAT_TEXT)
+               {
+                       appendStringInfoSpaces(es->str, es->indent * 2);
+                       appendStringInfo(es->str, "Sort Method: Full: %s  %s: %ldkB",
+                                                        fullsort_sortMethod, fullsort_spaceType,
+                                                        fullsort_stats.spaceUsed);
+                       if (prefixsort_state != NULL)
+                               appendStringInfo(es->str, ", Prefix-only: %s %s: %ldkB\n",
+                                                                prefixsort_sortMethod, prefixsort_spaceType,
+                                                                prefixsort_stats.spaceUsed);
+                       else
+                               appendStringInfo(es->str, "\n");
+                       appendStringInfoSpaces(es->str, es->indent * 2);
+                       appendStringInfo(es->str, "Sort Groups: Full:  %ld",
+                                                        incrsortstate->fullsort_group_count);
+                       if (prefixsort_state != NULL)
+                               appendStringInfo(es->str, ", Prefix-only: %ld\n",
+                                                        incrsortstate->prefixsort_group_count);
+                       else
+                               appendStringInfo(es->str, "\n");
+               }
+               else
+               {
+                       /* TODO */
+                       ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+                       ExplainPropertyInteger("Full Sort Space Used", "kB",
+                                       fullsort_stats.spaceUsed, es);
+                       ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+                       ExplainPropertyInteger("Full Sort Groups", NULL,
+                                                                  incrsortstate->fullsort_group_count, es);
+
+                       if (prefixsort_state != NULL)
+                       {
+                               ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+                               ExplainPropertyInteger("Prefix Sort Space Used", "kB",
+                                               prefixsort_stats.spaceUsed, es);
+                               ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+                               ExplainPropertyInteger("Prefix Sort Groups", NULL,
+                                                                          incrsortstate->prefixsort_group_count, es);
+                       }
+               }
+       }
+
+       if (incrsortstate->shared_info != NULL)
+       {
+               int                     n;
+               bool            opened_group = false;
+
+               for (n = 0; n < incrsortstate->shared_info->num_workers; n++)
+               {
+                       IncrementalSortInfo *incsort_info =
+                               &incrsortstate->shared_info->sinfo[n];
+                       TuplesortInstrumentation *fullsort_instrument;
+                       const char *fullsort_sortMethod;
+                       const char *fullsort_spaceType;
+                       long            fullsort_spaceUsed;
+                       int64           fullsort_group_count;
+                       TuplesortInstrumentation *prefixsort_instrument;
+                       const char *prefixsort_sortMethod;
+                       const char *prefixsort_spaceType;
+                       long            prefixsort_spaceUsed;
+                       int64           prefixsort_group_count;
+
+                       fullsort_instrument = &incsort_info->fullsort_instrument;
+                       fullsort_group_count = incsort_info->fullsort_group_count;
+
+                       prefixsort_instrument = &incsort_info->prefixsort_instrument;
+                       prefixsort_group_count = incsort_info->prefixsort_group_count;
+
+                       if (fullsort_instrument->sortMethod == SORT_TYPE_STILL_IN_PROGRESS)
+                               continue;               /* ignore any unfilled slots */
+
+                       fullsort_sortMethod = tuplesort_method_name(
+                                       fullsort_instrument->sortMethod);
+                       fullsort_spaceType = tuplesort_space_type_name(
+                                       fullsort_instrument->spaceType);
+                       fullsort_spaceUsed = fullsort_instrument->spaceUsed;
+
+                       if (prefixsort_instrument)
+                       {
+                               prefixsort_sortMethod = tuplesort_method_name(
+                                               prefixsort_instrument->sortMethod);
+                               prefixsort_spaceType = tuplesort_space_type_name(
+                                               prefixsort_instrument->spaceType);
+                               prefixsort_spaceUsed = prefixsort_instrument->spaceUsed;
+                       }
+
+                       if (es->format == EXPLAIN_FORMAT_TEXT)
+                       {
+                               appendStringInfoSpaces(es->str, es->indent * 2);
+                               appendStringInfo(es->str,
+                                                                "Worker %d: Full Sort Method: %s  %s: %ldkB  Groups: %ld",
+                                                                n, fullsort_sortMethod, fullsort_spaceType,
+                                                                fullsort_spaceUsed, fullsort_group_count);
+                               if (prefixsort_instrument)
+                                       appendStringInfo(es->str,
+                                                                        ", Prefix Sort Method: %s  %s: %ldkB  Groups: %ld\n",
+                                                                        prefixsort_sortMethod, prefixsort_spaceType,
+                                                                        prefixsort_spaceUsed, prefixsort_group_count);
+                               else
+                                       appendStringInfo(es->str, "\n");
+                       }
+                       else
+                       {
+                               if (!opened_group)
+                               {
+                                       ExplainOpenGroup("Workers", "Workers", false, es);
+                                       opened_group = true;
+                               }
+                               ExplainOpenGroup("Worker", NULL, true, es);
+                               ExplainPropertyInteger("Worker Number", NULL, n, es);
+                               ExplainPropertyText("Full Sort Method", fullsort_sortMethod, es);
+                               ExplainPropertyInteger("Full Sort Space Used", "kB", fullsort_spaceUsed, es);
+                               ExplainPropertyText("Full Sort Space Type", fullsort_spaceType, es);
+                               ExplainPropertyInteger("Full Sort Groups", NULL, fullsort_group_count, es);
+                               if (prefixsort_instrument)
+                               {
+                                       ExplainPropertyText("Prefix Sort Method", prefixsort_sortMethod, es);
+                                       ExplainPropertyInteger("Prefix Sort Space Used", "kB", prefixsort_spaceUsed, es);
+                                       ExplainPropertyText("Prefix Sort Space Type", prefixsort_spaceType, es);
+                                       ExplainPropertyInteger("Prefix Sort Groups", NULL, prefixsort_group_count, es);
+                               }
+                               ExplainCloseGroup("Worker", NULL, true, es);
+                       }
+               }
+               if (opened_group)
+                       ExplainCloseGroup("Workers", "Workers", false, es);
+       }
+}
+
 /*
  * Show information on hash buckets/batches.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index a983800e4b..e06258e134 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -63,6 +63,7 @@ OBJS = \
        nodeSeqscan.o \
        nodeSetOp.o \
        nodeSort.o \
+       nodeIncrementalSort.o \
        nodeSubplan.o \
        nodeSubqueryscan.o \
        nodeTableFuncscan.o \
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 779d3dccea..0d0fec82f1 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -30,6 +30,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -252,6 +253,10 @@ ExecReScan(PlanState *node)
                        ExecReScanSort((SortState *) node);
                        break;
 
+               case T_IncrementalSortState:
+                       ExecReScanIncrementalSort((IncrementalSortState *) node);
+                       break;
+
                case T_GroupState:
                        ExecReScanGroup((GroupState *) node);
                        break;
@@ -557,8 +562,16 @@ ExecSupportsBackwardScan(Plan *node)
                case T_CteScan:
                case T_Material:
                case T_Sort:
+                       /* these don't evaluate tlist */
                        return true;
 
+               case T_IncrementalSort:
+                       /*
+                        * Unlike full sort, incremental sort keeps only a single group
+                        * of tuples in memory, so it can't scan backwards.
+                        */
+                       return false;
+
                case T_LockRows:
                case T_Limit:
                        return ExecSupportsBackwardScan(outerPlan(node));
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index b256642665..2aedc2b1af 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -31,6 +31,7 @@
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeSeqscan.h"
@@ -280,6 +281,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
                        /* even when not parallel-aware, for EXPLAIN ANALYZE */
                        ExecSortEstimate((SortState *) planstate, e->pcxt);
                        break;
+               case T_IncrementalSortState:
+                       /* even when not parallel-aware, for EXPLAIN ANALYZE */
+                       ExecIncrementalSortEstimate((IncrementalSortState *) planstate, e->pcxt);
+                       break;
 
                default:
                        break;
@@ -493,6 +498,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
                        /* even when not parallel-aware, for EXPLAIN ANALYZE */
                        ExecSortInitializeDSM((SortState *) planstate, d->pcxt);
                        break;
+               case T_IncrementalSortState:
+                       /* even when not parallel-aware, for EXPLAIN ANALYZE */
+                       ExecIncrementalSortInitializeDSM((IncrementalSortState *) planstate, d->pcxt);
+                       break;
 
                default:
                        break;
@@ -955,6 +964,7 @@ ExecParallelReInitializeDSM(PlanState *planstate,
                        break;
                case T_HashState:
                case T_SortState:
+               case T_IncrementalSortState:
                        /* these nodes have DSM state, but no reinitialization is required */
                        break;
 
@@ -1015,6 +1025,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
                case T_SortState:
                        ExecSortRetrieveInstrumentation((SortState *) 
planstate);
                        break;
+               case T_IncrementalSortState:
+                       ExecIncrementalSortRetrieveInstrumentation((IncrementalSortState *) planstate);
+                       break;
                case T_HashState:
                        ExecHashRetrieveInstrumentation((HashState *) planstate);
                        break;
@@ -1301,6 +1314,11 @@ ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
                        /* even when not parallel-aware, for EXPLAIN ANALYZE */
                        ExecSortInitializeWorker((SortState *) planstate, 
pwcxt);
                        break;
+               case T_IncrementalSortState:
+                       /* even when not parallel-aware, for EXPLAIN ANALYZE */
+                       ExecIncrementalSortInitializeWorker((IncrementalSortState *) planstate,
+                                                                                               pwcxt);
+                       break;
 
                default:
                        break;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7a6e285149..a5e928f8f1 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -88,6 +88,7 @@
 #include "executor/nodeGroup.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
+#include "executor/nodeIncrementalSort.h"
 #include "executor/nodeIndexonlyscan.h"
 #include "executor/nodeIndexscan.h"
 #include "executor/nodeLimit.h"
@@ -313,6 +314,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
                                                                                                estate, eflags);
                        break;
                        break;
 
+               case T_IncrementalSort:
+                       result = (PlanState *) ExecInitIncrementalSort((IncrementalSort *) node,
+                                                                                                          estate, eflags);
+                       break;
+
                case T_Group:
                        result = (PlanState *) ExecInitGroup((Group *) node,
                                                                                                  estate, eflags);
@@ -693,6 +699,10 @@ ExecEndNode(PlanState *node)
                        ExecEndSort((SortState *) node);
                        break;
 
+               case T_IncrementalSortState:
+                       ExecEndIncrementalSort((IncrementalSortState *) node);
+                       break;
+
                case T_GroupState:
                        ExecEndGroup((GroupState *) node);
                        break;
@@ -839,6 +849,29 @@ ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
                        sortState->bound = tuples_needed;
                }
        }
+       else if (IsA(child_node, IncrementalSortState))
+       {
+               /*
+                * If it is an IncrementalSort node, notify it that it can use bounded sort.
+                *
+                * Note: it is the responsibility of nodeIncrementalSort.c to react properly to
+                * changes of these parameters.  If we ever redesign this, it'd be a
+                * good idea to integrate this signaling with the parameter-change
+                * mechanism.
+                */
+               IncrementalSortState  *sortState = (IncrementalSortState *) child_node;
+
+               if (tuples_needed < 0)
+               {
+                       /* make sure flag gets reset if needed upon rescan */
+                       sortState->bounded = false;
+               }
+               else
+               {
+                       sortState->bounded = true;
+                       sortState->bound = tuples_needed;
+               }
+       }
        else if (IsA(child_node, AppendState))
        {
                /*
diff --git a/src/backend/executor/nodeIncrementalSort.c b/src/backend/executor/nodeIncrementalSort.c
new file mode 100644
index 0000000000..c3b903e568
--- /dev/null
+++ b/src/backend/executor/nodeIncrementalSort.c
@@ -0,0 +1,1107 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.c
+ *       Routines to handle incremental sorting of relations.
+ *
+ * DESCRIPTION
+ *
+ *             Incremental sort is an optimized variant of multikey sort for cases
+ *             when the input is already sorted by a prefix of the sort keys.  For
+ *             example when a sort by (key1, key2 ... keyN) is requested, and the
+ *             input is already sorted by (key1, key2 ... keyM), M < N, we can
+ *             divide the input into groups where keys (key1, ... keyM) are equal,
+ *             and only sort on the remaining columns.
+ *
+ *             Consider the following example.  We have input tuples consisting of
+ *             two integers (X, Y) already presorted by X, while it's required to
+ *             sort them by both X and Y.  Let the input tuples be the following:
+ *
+ *             (1, 5)
+ *             (1, 2)
+ *             (2, 9)
+ *             (2, 1)
+ *             (2, 5)
+ *             (3, 3)
+ *             (3, 7)
+ *
+ *             The incremental sort algorithm would split the input into the following
+ *             groups, which have equal X, and then sort them by Y individually:
+ *
+ *                     (1, 5) (1, 2)
+ *                     (2, 9) (2, 1) (2, 5)
+ *                     (3, 3) (3, 7)
+ *
+ *             After sorting these groups and putting them all together, we would get
+ *             the following result which is sorted by X and Y, as requested:
+ *
+ *             (1, 2)
+ *             (1, 5)
+ *             (2, 1)
+ *             (2, 5)
+ *             (2, 9)
+ *             (3, 3)
+ *             (3, 7)
+ *
+ *             Incremental sort may be more efficient than plain sort, particularly
+ *             on large datasets, as it reduces the amount of data to sort at once,
+ *             making it more likely it fits into work_mem (eliminating the need to
+ *             spill to disk).  But the main advantage of incremental sort is that
+ *             it can start producing rows early, before sorting the whole dataset,
+ *             which is a significant benefit especially for queries with LIMIT.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *       src/backend/executor/nodeIncrementalSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/execdebug.h"
+#include "executor/nodeIncrementalSort.h"
+#include "miscadmin.h"
+#include "utils/lsyscache.h"
+#include "utils/tuplesort.h"
+
+/*
+ * Prepare information for presorted_keys comparison.
+ */
+static void
+preparePresortedCols(IncrementalSortState *node)
+{
+       IncrementalSort    *plannode = (IncrementalSort *) node->ss.ps.plan;
+       int                                     presortedCols,
+                                               i;
+
+       Assert(IsA(plannode, IncrementalSort));
+       presortedCols = plannode->presortedCols;
+
+       node->presorted_keys = (PresortedKeyData *) palloc(presortedCols *
+                                                                                sizeof(PresortedKeyData));
+
+       /* Pre-cache comparison functions for each pre-sorted key. */
+       for (i = 0; i < presortedCols; i++)
+       {
+               Oid                                     equalityOp,
+                                                       equalityFunc;
+               PresortedKeyData   *key;
+
+               key = &node->presorted_keys[i];
+               key->attno = plannode->sort.sortColIdx[i];
+
+               equalityOp = get_equality_op_for_ordering_op(
+                                                                               plannode->sort.sortOperators[i], NULL);
+               if (!OidIsValid(equalityOp))
+                       elog(ERROR, "missing equality operator for ordering operator %u",
+                                       plannode->sort.sortOperators[i]);
+
+               equalityFunc = get_opcode(equalityOp);
+               if (!OidIsValid(equalityFunc))
+                       elog(ERROR, "missing function for operator %u", equalityOp);
+
+               /* Lookup the comparison function */
+               fmgr_info_cxt(equalityFunc, &key->flinfo, CurrentMemoryContext);
+
+               /* We can initialize the callinfo just once and re-use it */
+               key->fcinfo = palloc0(SizeForFunctionCallInfo(2));
+               InitFunctionCallInfoData(*key->fcinfo, &key->flinfo, 2,
+                                                               plannode->sort.collations[i], NULL, NULL);
+               key->fcinfo->args[0].isnull = false;
+               key->fcinfo->args[1].isnull = false;
+       }
+}
+
+/*
+ * Check whether a given tuple belongs to the current sort group.
+ *
+ * We do this by comparing its first 'presortedCols' column values to
+ * the pivot tuple of the current group.
+ *
+ */
+static bool
+isCurrentGroup(IncrementalSortState *node, TupleTableSlot *pivot, TupleTableSlot *tuple)
+{
+       int presortedCols, i;
+
+       Assert(IsA(node->ss.ps.plan, IncrementalSort));
+
+       presortedCols = ((IncrementalSort *) node->ss.ps.plan)->presortedCols;
+
+       /*
+        * That the input is sorted by keys (0, ... n) implies that the tail keys
+        * are more likely to change. Therefore we do our comparison starting from
+        * the last pre-sorted column to optimize for early detection of
+        * inequality and to minimize the number of function calls.
+        */
+       for (i = presortedCols - 1; i >= 0; i--)
+       {
+               Datum                           datumA,
+                                                       datumB,
+                                                       result;
+               bool                            isnullA,
+                                                       isnullB;
+               AttrNumber                      attno = node->presorted_keys[i].attno;
+               PresortedKeyData   *key;
+
+               datumA = slot_getattr(pivot, attno, &isnullA);
+               datumB = slot_getattr(tuple, attno, &isnullB);
+
+               /* Special case for NULL-vs-NULL, else use standard comparison */
+               if (isnullA || isnullB)
+               {
+                       if (isnullA == isnullB)
+                               continue;
+                       else
+                               return false;
+               }
+
+               key = &node->presorted_keys[i];
+
+               key->fcinfo->args[0].value = datumA;
+               key->fcinfo->args[1].value = datumB;
+
+               /* just for paranoia's sake, we reset isnull each time */
+               key->fcinfo->isnull = false;
+
+               result = FunctionCallInvoke(key->fcinfo);
+
+               /* Check for null result, since caller is clearly not expecting one */
+               if (key->fcinfo->isnull)
+                       elog(ERROR, "function %u returned NULL", key->flinfo.fn_oid);
+
+               if (!DatumGetBool(result))
+                       return false;
+       }
+       return true;
+}
+
+/*
+ * Switch to presorted prefix mode.
+ *
+ * When we determine that we've likely encountered a large batch of tuples all
+ * having the same presorted prefix values, we want to optimize tuplesort by
+ * only sorting on unsorted suffix keys.
+ *
+ * The problem is that we've already accumulated several tuples in another
+ * tuplesort configured to sort by all columns (assuming that there may be
+ * more than one prefix key group). So to switch to presorted prefix mode we
+ * have to go back and look at all the tuples we've already accumulated and
+ * verify they're all part of the same prefix key group before sorting them
+ * solely by unsorted suffix keys.
+ *
+ * While it's likely that all the already-fetched tuples are part of a single
+ * prefix group, we also have to handle the possibility that there is at least
+ * one different prefix key group before the large prefix key group.
+ */
+static void
+switchToPresortedPrefixMode(IncrementalSortState *node)
+{
+       ScanDirection           dir;
+       int64 nTuples = 0;
+       bool lastTuple = false;
+       bool firstTuple = true;
+       TupleDesc                   tupDesc;
+       PlanState                  *outerNode;
+       IncrementalSort    *plannode = (IncrementalSort *) node->ss.ps.plan;
+
+       dir = node->ss.ps.state->es_direction;
+       outerNode = outerPlanState(node);
+       tupDesc = ExecGetResultType(outerNode);
+
+       if (node->prefixsort_state == NULL)
+       {
+               Tuplesortstate *prefixsort_state;
+               int presortedCols = plannode->presortedCols;
+
+               /*
+                * Optimize the sort by assuming the prefix columns are all equal
+                * and thus we only need to sort by any remaining columns.
+                */
+               prefixsort_state = tuplesort_begin_heap(
+                               tupDesc,
+                               plannode->sort.numCols - presortedCols,
+                               &(plannode->sort.sortColIdx[presortedCols]),
+                               &(plannode->sort.sortOperators[presortedCols]),
+                               &(plannode->sort.collations[presortedCols]),
+                               &(plannode->sort.nullsFirst[presortedCols]),
+                               work_mem,
+                               NULL,
+                               false);
+               node->prefixsort_state = prefixsort_state;
+       }
+       else
+       {
+               /* Next group of presorted data */
+               tuplesort_reset(node->prefixsort_state);
+       }
+
+       /*
+        * If the current node has a bound, then it's reasonably likely that a
+        * large prefix key group will benefit from bounded sort, so configure
+        * the tuplesort to allow for that optimization.
+        */
+       if (node->bounded)
+       {
+               SO1_printf("Setting bound on presorted prefix tuplesort to: %ld\n",
+                               node->bound - node->bound_Done);
+               tuplesort_set_bound(node->prefixsort_state,
+                               node->bound - node->bound_Done);
+       }
+
+       for (;;)
+       {
+               lastTuple = node->n_fullsort_remaining - nTuples == 1;
+
+               /*
+                * When we encounter multiple prefix key groups inside the full sort
+                * tuplesort we have to carry over the last read tuple into the next
+                * batch.
+                */
+               if (firstTuple && !TupIsNull(node->transfer_tuple))
+               {
+                       tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+                       nTuples++;
+
+                       /* The carried over tuple is our new group pivot tuple. */
+                       ExecCopySlot(node->group_pivot, node->transfer_tuple);
+               }
+               else
+               {
+                       tuplesort_gettupleslot(node->fullsort_state,
+                                       ScanDirectionIsForward(dir),
+                                       false, node->transfer_tuple, NULL);
+
+                       /*
+                        * If this is our first time through the loop, then we need to save the
+                        * first tuple we get as our new group pivot.
+                        */
+                       if (TupIsNull(node->group_pivot))
+                               ExecCopySlot(node->group_pivot, node->transfer_tuple);
+
+                       if (isCurrentGroup(node, node->group_pivot, node->transfer_tuple))
+                       {
+                               tuplesort_puttupleslot(node->prefixsort_state, node->transfer_tuple);
+                               nTuples++;
+                       }
+                       else
+                       {
+                               /* The tuple isn't part of the current batch so we need to carry
+                                * it over into the next set of tuples we transfer out of the full
+                                * sort tuplesort into the presorted prefix tuplesort. We don't
+                                * actually have to do anything special to save the tuple since
+                                * we've already loaded it into the node->transfer_tuple slot, and,
+                                * even though that slot points to memory inside the full sort
+                                * tuplesort, we can't reset that tuplesort anyway until we've
+                                * fully transferred out its tuples, so this reference is safe.
+                                * We do need to reset the group pivot tuple though since we've
+                                * finished the current prefix key group.
+                                */
+                               ExecClearTuple(node->group_pivot);
+                               break;
+                       }
+               }
+
+               firstTuple = false;
+
+               if (lastTuple)
+                       /*
+                        * We retain the current group pivot tuple since we haven't yet
+                        * found the end of the current prefix key group.
+                        */
+                       break;
+       }
+
+       /*
+        * Track how many tuples remain in the full sort batch so that we know if
+        * we need to sort multiple prefix key groups before processing tuples
+        * remaining in the large single prefix key group we think we've
+        * encountered.
+        */
+       SO1_printf("Moving %ld tuples to presorted prefix tuplesort\n", nTuples);
+       node->n_fullsort_remaining -= nTuples;
+       SO1_printf("Setting n_fullsort_remaining to %ld\n", node->n_fullsort_remaining);
+
+       if (lastTuple)
+       {
+               /*
+                * We've confirmed that all tuples remaining in the full sort batch
+                * are in the same prefix key group and moved all of those tuples into
+                * the presorted prefix tuplesort. Now we can save our pivot comparison
+                * tuple and continue fetching tuples from the outer execution node to
+                * load into the presorted prefix tuplesort.
+                */
+               ExecCopySlot(node->group_pivot, node->transfer_tuple);
+               SO_printf("Setting execution_status to INCSORT_LOADPREFIXSORT (switchToPresortedPrefixMode)\n");
+               node->execution_status = INCSORT_LOADPREFIXSORT;
+
+               /* Make sure we clear the transfer tuple slot so that next time we
+                * encounter a large prefix key group we don't incorrectly assume
+                * we have a tuple carried over from the previous group.
+                */
+               ExecClearTuple(node->transfer_tuple);
+       }
+       else
+       {
+               /*
+                * We finished a group but didn't consume all of the tuples from the
+                * full sort batch sorter, so we'll sort this batch, let the inner node
+                * read out all of those tuples, and then come back around to find
+                * another batch.
+                */
+               SO1_printf("Sorting presorted prefix tuplesort with %ld tuples\n", nTuples);
+               tuplesort_performsort(node->prefixsort_state);
+               node->prefixsort_group_count++;
+
+               if (node->bounded)
+               {
+                       /*
+                        * If the current node has a bound, and we've already sorted n
+                        * tuples, then the functional bound remaining is
+                        * (original bound - n), so store the current number of processed
+                        * tuples for use in configuring sorting bound.
+                        */
+                       SO2_printf("Changing bound_Done from %ld to %ld\n",
+                                       Min(node->bound, node->bound_Done + nTuples), node->bound_Done);
+                       node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+               }
+
+               SO_printf("Setting execution_status to INCSORT_READPREFIXSORT  (switchToPresortedPrefixMode)\n");
+               node->execution_status = INCSORT_READPREFIXSORT;
+       }
+}
+
+/*
+ * Sorting many small groups with tuplesort is inefficient. In order to
+ * cope with this problem we don't start a new group until the current one
+ * contains at least DEFAULT_MIN_GROUP_SIZE tuples (unfortunately this also
+ * means we can't assume small groups of tuples all have the same prefix keys.)
+ * When we have a bound that's less than DEFAULT_MIN_GROUP_SIZE we start looking
+ * for the new group as soon as we've met our bound to avoid fetching more
+ * tuples than we absolutely have to fetch.
+ */
+#define DEFAULT_MIN_GROUP_SIZE 32
+
+/*
+ * While we've optimized for small prefix key groups by not starting our prefix
+ * key comparisons until we've reached a minimum number of tuples, we don't want
+ * that optimization to cause us to lose out on the benefits of being able to
+ * assume a large group of tuples is fully presorted by its prefix keys.
+ * Therefore we use the DEFAULT_MAX_FULL_SORT_GROUP_SIZE cutoff as a heuristic
+ * for determining when we believe we've encountered a large group, and, if we
+ * get to that point without finding a new prefix key group we transition to
+ * presorted prefix key mode.
+ */
+#define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)
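
To make the interplay of these two constants concrete, here is a minimal
standalone sketch, not part of the patch itself, of the decisions the two
heuristics above drive (the helper names and argument types are hypothetical):

    #define DEFAULT_MIN_GROUP_SIZE 32
    #define DEFAULT_MAX_FULL_SORT_GROUP_SIZE (2 * DEFAULT_MIN_GROUP_SIZE)

    /* Hypothetical helper: minimum batch size for full sort mode. */
    static long
    min_group_size(bool bounded, long bound_remaining)
    {
        /*
         * With a small remaining bound there's no point accumulating more
         * tuples than the bound itself before checking for a group change.
         */
        if (bounded && bound_remaining < DEFAULT_MIN_GROUP_SIZE)
            return bound_remaining;
        return DEFAULT_MIN_GROUP_SIZE;
    }

    /* Hypothetical helper: when to assume we've hit one large group. */
    static bool
    should_switch_to_prefix_mode(long ntuples_in_batch)
    {
        return ntuples_in_batch > DEFAULT_MAX_FULL_SORT_GROUP_SIZE;
    }
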
+
+/* ----------------------------------------------------------------
+ *             ExecIncrementalSort
+ *
+ *             Assuming that the outer subtree returns tuples presorted by some
+ *             prefix of the target sort columns, performs an incremental sort.
+ *             The implemented algorithm operates in two different modes:
+ *               - Fetching a minimum number of tuples without checking prefix
+ *                 key group membership and sorting on all columns when safe.
+ *               - Fetching all tuples for a single prefix key group and
+ *                 sorting on solely the unsorted columns.
+ *             We always begin in the first mode, and employ a heuristic to
+ *             switch into the second mode if we believe it's beneficial.
+ *
+ *             Sorting incrementally can potentially use less memory, avoid
+ *             fetching and sorting all tuples in the dataset, and begin
+ *             returning tuples before the entire result set is available.
+ *
+ *             The hybrid mode approach allows us to optimize for both very
+ *             small groups (where the overhead of a new tuplesort is high) and
+ *             very large groups (where we can lower cost by not having to sort
+ *             on already sorted columns), albeit at some extra cost while
+ *             switching between modes.
+ *
+ *             Conditions:
+ *               -- none.
+ *
+ *             Initial States:
+ *               -- the outer child is prepared to return the first tuple.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecIncrementalSort(PlanState *pstate)
+{
+       IncrementalSortState *node = castNode(IncrementalSortState, pstate);
+       EState                     *estate;
+       ScanDirection           dir;
+       Tuplesortstate     *read_sortstate;
+       Tuplesortstate     *fullsort_state;
+       TupleTableSlot     *slot;
+       IncrementalSort    *plannode = (IncrementalSort *) node->ss.ps.plan;
+       PlanState                  *outerNode;
+       TupleDesc                       tupDesc;
+       int64                           nTuples = 0;
+       int64                           minGroupSize;
+
+       CHECK_FOR_INTERRUPTS();
+
+       estate = node->ss.ps.state;
+       dir = estate->es_direction;
+       fullsort_state = node->fullsort_state;
+
+       if (node->execution_status == INCSORT_READFULLSORT
+                       || node->execution_status == INCSORT_READPREFIXSORT)
+       {
+               /*
+                * Return next tuple from the current sorted group set if
+                * available.
+                */
+               read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+                       fullsort_state : node->prefixsort_state;
+               slot = node->ss.ps.ps_ResultTupleSlot;
+               if (tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+                                                                  false, slot, NULL) || node->finished)
+                       /*
+                        * TODO: there isn't a good test case for the node->finished
+                        * case directly, but lots of other stuff fails if it's not
+                        * there. If the outer node will fail when trying to fetch
+                        * too many tuples, then things break if that test isn't here.
+                        */
+                       return slot;
+               else if (node->n_fullsort_remaining > 0)
+               {
+                       /*
+                        * When we transition to presorted prefix mode, we might have
+                        * accumulated at least one additional prefix key group in
+                        * the full sort tuplesort. The first call to
+                        * switchToPresortedPrefixMode() pulled one of those groups
+                        * out, and we've returned those tuples to the caller, but if
+                        * we have tuples remaining in that tuplesort (i.e.,
+                        * n_fullsort_remaining > 0) at this point we need to do that
+                        * again.
+                        */
+                       SO1_printf("Re-calling switchToPresortedPrefixMode() because n_fullsort_remaining is > 0 (%ld)\n",
+                                       node->n_fullsort_remaining);
+                       switchToPresortedPrefixMode(node);
+               }
+               else
+               {
+                       /*
+                        * If we don't have any already sorted tuples to read, and
+                        * we're not in the middle of transitioning into presorted
+                        * prefix sort mode, then it's time to start the process all
+                        * over again by building a new full sort group.
+                        */
+                       SO_printf("Setting execution_status to INCSORT_LOADFULLSORT (n_fullsort_remaining <= 0)\n");
+                       node->execution_status = INCSORT_LOADFULLSORT;
+               }
+       }
+
+       /*
+        * Want to scan subplan in the forward direction while creating the
+        * sorted data.
+        */
+       estate->es_direction = ForwardScanDirection;
+
+       outerNode = outerPlanState(node);
+       tupDesc = ExecGetResultType(outerNode);
+
+       if (node->execution_status == INCSORT_LOADFULLSORT)
+       {
+               /*
+                * Initialize tuplesort module (only needed before the first
+                * group).
+                */
+               if (fullsort_state == NULL)
+               {
+                       /*
+                        * Initialize presorted column support structures for
+                        * isCurrentGroup().
+                        */
+                       preparePresortedCols(node);
+
+                       /*
+                        * Since we optimize small prefix key groups by accumulating
+                        * a minimum number of tuples before sorting, we can't assume
+                        * that a group of tuples all have the same prefix key
+                        * values. Hence we set up the full sort tuplesort to sort by
+                        * all requested sort columns.
+                        */
+                       fullsort_state = tuplesort_begin_heap(
+                                       tupDesc,
+                                       plannode->sort.numCols,
+                                       plannode->sort.sortColIdx,
+                                       plannode->sort.sortOperators,
+                                       plannode->sort.collations,
+                                       plannode->sort.nullsFirst,
+                                       work_mem,
+                                       NULL,
+                                       false);
+                       node->fullsort_state = fullsort_state;
+               }
+               else
+               {
+                       /* Reset sort for a new prefix key group. */
+                       tuplesort_reset(fullsort_state);
+               }
+
+               /*
+                * Calculate the remaining tuples left to sort if the node is
+                * bounded, and configure both bounded sort and the minimum group
+                * size accordingly.
+                */
+               if (node->bounded)
+               {
+                       int64 currentBound = node->bound - node->bound_Done;
+
+                       /*
+                        * Bounded sort isn't likely to be a useful optimization for
+                        * full sort mode since we limit full sort mode to a
+                        * relatively small number of tuples and tuplesort doesn't
+                        * switch over to top-n heap sort anyway unless it hits
+                        * (2 * bound) tuples.
+                        */
+                       if (currentBound < DEFAULT_MIN_GROUP_SIZE)
+                               tuplesort_set_bound(fullsort_state, currentBound);
+
+                       minGroupSize = Min(DEFAULT_MIN_GROUP_SIZE, currentBound);
+               }
+               else
+                       minGroupSize = DEFAULT_MIN_GROUP_SIZE;
+
+               /*
+                * Because we have to read the next tuple to find out that we've
+                * encountered a new prefix key group, on subsequent groups we
+                * have to carry over that extra tuple and add it to the new
+                * group's sort here.
+                */
+               if (!TupIsNull(node->group_pivot))
+               {
+                       tuplesort_puttupleslot(fullsort_state, node->group_pivot);
+                       nTuples++;
+
+                       /*
+                        * We're in full sort mode accumulating a minimum number of
+                        * tuples and not checking for prefix key equality yet, so we
+                        * can't assume the group pivot tuple will remain the same --
+                        * unless we're using a minimum group size of 1, in which
+                        * case the pivot is obviously still the pivot.
+                        */
+                       if (nTuples != minGroupSize)
+                               ExecClearTuple(node->group_pivot);
+               }
+
+               for (;;)
+               {
+                       /*
+                        * TODO: do we need to check for interrupts inside these
+                        * loops or will the outer node handle that?
+                        */
+
+                       slot = ExecProcNode(outerNode);
+
+                       /*
+                        * When the outer node can't provide us any more tuples, then
+                        * we can sort the current group and return those tuples.
+                        */
+                       if (TupIsNull(slot))
+                       {
+                               node->finished = true;
+
+                               SO1_printf("Sorting fullsort with %ld tuples\n", nTuples);
+                               tuplesort_performsort(fullsort_state);
+                               node->fullsort_group_count++;
+
+                               SO_printf("Setting execution_status to INCSORT_READFULLSORT (final tuple)\n");
+                               node->execution_status = INCSORT_READFULLSORT;
+                               break;
+                       }
+
+                       /* Accumulate the next group of presorted tuples. */
+                       if (nTuples < minGroupSize)
+                       {
+                               /*
+                                * If we haven't yet hit our target minimum group
+                                * size, then don't bother checking for inclusion in
+                                * the current prefix group since a large number of
+                                * very tiny sorts is inefficient.
+                                */
+                               tuplesort_puttupleslot(fullsort_state, slot);
+                               nTuples++;
+
+                               /* Keep the last tuple of our minimal group as a pivot. */
+                               if (nTuples == minGroupSize)
+                                       ExecCopySlot(node->group_pivot, slot);
+                       }
+                       else
+                       {
+                               /*
+                                * Once we've accumulated a minimum number of tuples,
+                                * we start checking for a new prefix key group. Only
+                                * after we find changed prefix keys can we guarantee
+                                * sort stability of the tuples we've already
+                                * accumulated.
+                                */
+                               if (isCurrentGroup(node, node->group_pivot, slot))
+                               {
+                                       /*
+                                        * As long as the prefix keys match the pivot
+                                        * tuple then load the tuple into the
+                                        * tuplesort.
+                                        */
+                                       tuplesort_puttupleslot(fullsort_state, slot);
+                                       nTuples++;
+                               }
+                               else
+                               {
+                                       /*
+                                        * Since the tuple we fetched isn't part of
+                                        * the current prefix key group we can't sort
+                                        * it as part of this sort group. Instead we
+                                        * need to carry it over to the next group. We
+                                        * use the group_pivot slot as a temp
+                                        * container for that purpose even though we
+                                        * won't actually treat it as a group pivot.
+                                        */
+                                       ExecCopySlot(node->group_pivot, slot);
+
+                                       if (node->bounded)
+                                       {
+                                               /*
+                                                * If the current node has a bound,
+                                                * and we've already sorted n tuples,
+                                                * then the functional bound remaining
+                                                * is (original bound - n), so store
+                                                * the current number of processed
+                                                * tuples for use in configuring
+                                                * sorting bound.
+                                                */
+                                               SO2_printf("Changing bound_Done from %ld to %ld\n",
+                                                               node->bound_Done,
+                                                               Min(node->bound, node->bound_Done + nTuples));
+                                               node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+                                       }
+
+                                       /*
+                                        * Once we find changed prefix keys we can
+                                        * complete the sort and begin reading out
+                                        * the sorted tuples.
+                                        */
+                                       SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+                                       tuplesort_performsort(fullsort_state);
+                                       node->fullsort_group_count++;
+                                       SO_printf("Setting execution_status to INCSORT_READFULLSORT (found end of group)\n");
+                                       node->execution_status = INCSORT_READFULLSORT;
+                                       break;
+                               }
+                       }
+
+                       /*
+                        * Once we've processed DEFAULT_MAX_FULL_SORT_GROUP_SIZE
+                        * tuples then we make the assumption that it's likely that
+                        * we've found a large group of tuples having a single prefix
+                        * key (as long as the last tuple didn't shift us into
+                        * reading from the full sort mode tuplesort).
+                        */
+                       if (nTuples > DEFAULT_MAX_FULL_SORT_GROUP_SIZE &&
+                                       node->execution_status != INCSORT_READFULLSORT)
+                       {
+                               /*
+                                * The group pivot we have stored has already been
+                                * put into the tuplesort; we don't want to carry it
+                                * over.
+                                */
+                               ExecClearTuple(node->group_pivot);
+
+                               /*
+                                * Unfortunately the tuplesort API doesn't include a
+                                * way to retrieve tuples unless a sort has been
+                                * performed, so we perform the sort even though we
+                                * could just as easily rely on FIFO retrieval
+                                * semantics when transferring them to the presorted
+                                * prefix tuplesort.
+                                */
+                               SO1_printf("Sorting fullsort tuplesort with %ld tuples\n", nTuples);
+                               tuplesort_performsort(fullsort_state);
+                               node->fullsort_group_count++;
+
+                               /*
+                                * If the full sort tuplesort happened to switch into
+                                * top-n heapsort mode then we will only be able to
+                                * retrieve currentBound tuples (since the tuplesort
+                                * will have only retained the top-n tuples). This is
+                                * safe even though we haven't yet completed fetching
+                                * the current prefix key group because the tuples
+                                * we've "lost" already sort "below" the retained
+                                * ones, and we're already contractually guaranteed to
+                                * not need any more than the currentBound tuples.
+                                */
+                               if (tuplesort_used_bound(node->fullsort_state))
+                               {
+                                       int64 currentBound = node->bound - node->bound_Done;
+
+                                       SO2_printf("Read %ld tuples, but setting to %ld because we used bounded sort\n",
+                                                       nTuples, Min(currentBound, nTuples));
+                                       nTuples = Min(currentBound, nTuples);
+                               }
+
+                               SO1_printf("Setting n_fullsort_remaining to %ld and calling switchToPresortedPrefixMode()\n",
+                                               nTuples);
+
+                               /*
+                                * Track the number of tuples we need to move from
+                                * the fullsort to presorted prefix sort (we might
+                                * have multiple prefix key groups, so we need a way
+                                * to see if we've actually finished).
+                                */
+                               node->n_fullsort_remaining = nTuples;
+
+                               /* Transition the tuples to the presorted prefix tuplesort. */
+                               switchToPresortedPrefixMode(node);
+
+                               /*
+                                * Since we know we had tuples to move to the
+                                * presorted prefix tuplesort, we know that unless
+                                * that transition has verified that all tuples
+                                * belonged to the same prefix key group (in which
+                                * case we can go straight to continuing to load
+                                * tuples into that tuplesort), we should have a tuple
+                                * to return here.
+                                *
+                                * Either way, the appropriate execution status should
+                                * have been set by switchToPresortedPrefixMode(), so
+                                * we can drop out of the loop here and let the
+                                * appropriate path kick in.
+                                */
+                               break;
+                       }
+               }
+       }
+
+       if (node->execution_status == INCSORT_LOADPREFIXSORT)
+       {
+               /*
+                * Since we only enter this state after determining that all
+                * remaining tuples in the full sort tuplesort have the same
+                * prefix, we've already established a current group pivot tuple
+                * (though it wasn't carried over; it has already been put into
+                * the prefix sort tuplesort).
+                */
+               Assert(!TupIsNull(node->group_pivot));
+
+               for (;;)
+               {
+                       slot = ExecProcNode(outerNode);
+
+                       /* Check to see if there are no more tuples to fetch. */
+                       if (TupIsNull(slot))
+                       {
+                               node->finished = true;
+                               break;
+                       }
+
+                       if (isCurrentGroup(node, node->group_pivot, slot))
+                       {
+                               /*
+                                * Fetch tuples and put them into the presorted
+                                * prefix tuplesort until we find changed prefix keys.
+                                * Only then can we guarantee sort stability of the
+                                * tuples we've already accumulated.
+                                */
+                               tuplesort_puttupleslot(node->prefixsort_state, slot);
+                               nTuples++;
+                       }
+                       else
+                       {
+                               /*
+                                * Since the tuple we fetched isn't part of the
+                                * current prefix key group we can't sort it as part
+                                * of this sort group. Instead we need to carry it
+                                * over to the next group. We use the group_pivot slot
+                                * as a temp container for that purpose even though we
+                                * won't actually treat it as a group pivot.
+                                */
+                               ExecCopySlot(node->group_pivot, slot);
+                               break;
+                       }
+               }
+
+               /* Perform the sort and return the tuples to the caller. */
+               SO1_printf("Sorting presorted prefix tuplesort with >= %ld tuples\n", nTuples);
+               tuplesort_performsort(node->prefixsort_state);
+               node->prefixsort_group_count++;
+               SO_printf("Setting execution_status to INCSORT_READPREFIXSORT (found end of group)\n");
+               node->execution_status = INCSORT_READPREFIXSORT;
+
+               if (node->bounded)
+               {
+                       /*
+                        * If the current node has a bound, and we've already sorted
+                        * n tuples, then the functional bound remaining is
+                        * (original bound - n), so store the current number of
+                        * processed tuples for use in configuring sorting bound.
+                        */
+                       SO2_printf("Changing bound_Done from %ld to %ld\n",
+                                       node->bound_Done,
+                                       Min(node->bound, node->bound_Done + nTuples));
+                       node->bound_Done = Min(node->bound, node->bound_Done + nTuples);
+               }
+       }
+
+       /* Restore to user specified direction. */
+       estate->es_direction = dir;
+
+       /*
+        * Remember that we've begun our scan and sort so we know how to handle
+        * rescan.
+        */
+       node->sort_Done = true;
+
+       /* Record shared stats if we're a parallel worker. */
+       if (node->shared_info && node->am_worker)
+       {
+               IncrementalSortInfo *incsort_info =
+                       &node->shared_info->sinfo[ParallelWorkerNumber];
+
+               Assert(IsParallelWorker());
+               Assert(ParallelWorkerNumber <= node->shared_info->num_workers);
+
+               tuplesort_get_stats(fullsort_state, &incsort_info->fullsort_instrument);
+               incsort_info->fullsort_group_count = node->fullsort_group_count;
+
+               if (node->prefixsort_state)
+               {
+                       tuplesort_get_stats(node->prefixsort_state,
+                                       &incsort_info->prefixsort_instrument);
+                       incsort_info->prefixsort_group_count = node->prefixsort_group_count;
+               }
+       }
+
+       /*
+        * Get the first or next tuple from tuplesort. Returns NULL if no more
+        * tuples.
+        */
+       read_sortstate = node->execution_status == INCSORT_READFULLSORT ?
+               fullsort_state : node->prefixsort_state;
+       slot = node->ss.ps.ps_ResultTupleSlot;
+       (void) tuplesort_gettupleslot(read_sortstate, ScanDirectionIsForward(dir),
+                                                                 false, slot, NULL);
+       return slot;
+}
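
For orientation, the state transitions implemented above can be summarized in
a short sketch (my summary, using the enum values and helpers from the patch):

    /*
     * INCSORT_LOADFULLSORT
     *     accumulate at least minGroupSize tuples, sorting on all columns;
     *     on a prefix key change or end of input -> INCSORT_READFULLSORT;
     *     past DEFAULT_MAX_FULL_SORT_GROUP_SIZE  -> switchToPresortedPrefixMode().
     *
     * switchToPresortedPrefixMode()
     *     all transferred tuples share one prefix -> INCSORT_LOADPREFIXSORT;
     *     otherwise a group is ready to return    -> INCSORT_READPREFIXSORT.
     *
     * INCSORT_LOADPREFIXSORT
     *     accumulate the rest of the group, sorting only the suffix columns;
     *     on a prefix key change or end of input -> INCSORT_READPREFIXSORT.
     *
     * INCSORT_READFULLSORT / INCSORT_READPREFIXSORT
     *     drain the corresponding tuplesort; when empty, either re-enter
     *     switchToPresortedPrefixMode() (while n_fullsort_remaining > 0) or
     *     loop back to INCSORT_LOADFULLSORT.
     */
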
+
+/* ----------------------------------------------------------------
+ *             ExecInitIncrementalSort
+ *
+ *             Creates the run-time state information for the sort node
+ *             produced by the planner and initializes its outer subtree.
+ * ----------------------------------------------------------------
+ */
+IncrementalSortState *
+ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags)
+{
+       IncrementalSortState   *incrsortstate;
+
+       SO_printf("ExecInitIncrementalSort: initializing sort node\n");
+
+       /*
+        * Incremental sort can't be used with either EXEC_FLAG_REWIND,
+        * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only the
+        * current bucket in tuplesortstate.
+        */
+       Assert((eflags & (EXEC_FLAG_REWIND |
+                                         EXEC_FLAG_BACKWARD |
+                                         EXEC_FLAG_MARK)) == 0);
+
+       /*
+        * create state structure
+        */
+       incrsortstate = makeNode(IncrementalSortState);
+       incrsortstate->ss.ps.plan = (Plan *) node;
+       incrsortstate->ss.ps.state = estate;
+       incrsortstate->ss.ps.ExecProcNode = ExecIncrementalSort;
+
+       incrsortstate->bounded = false;
+       incrsortstate->sort_Done = false;
+       incrsortstate->finished = false;
+       incrsortstate->fullsort_state = NULL;
+       incrsortstate->prefixsort_state = NULL;
+       incrsortstate->group_pivot = NULL;
+       incrsortstate->transfer_tuple = NULL;
+       incrsortstate->n_fullsort_remaining = 0;
+       incrsortstate->bound_Done = 0;
+       incrsortstate->fullsort_group_count = 0;
+       incrsortstate->prefixsort_group_count = 0;
+       incrsortstate->presorted_keys = NULL;
+
+       /*
+        * Miscellaneous initialization
+        *
+        * Sort nodes don't initialize their ExprContexts because they never
+        * call ExecQual or ExecProject.
+        */
+
+       /*
+        * initialize child nodes
+        *
+        * We shield the child node from the need to support REWIND, BACKWARD,
+        * or MARK/RESTORE.
+        */
+       eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
+
+       outerPlanState(incrsortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+
+       /*
+        * Initialize scan slot and type.
+        */
+       ExecCreateScanSlotFromOuterPlan(estate, &incrsortstate->ss, &TTSOpsMinimalTuple);
+
+       /*
+        * Initialize return slot and type. No need to initialize projection
+        * info because this node doesn't do projections.
+        */
+       ExecInitResultTupleSlotTL(&incrsortstate->ss.ps, &TTSOpsMinimalTuple);
+       incrsortstate->ss.ps.ps_ProjInfo = NULL;
+
+       /* make standalone slot to store previous tuple from outer node */
+       incrsortstate->group_pivot = MakeSingleTupleTableSlot(
+                       ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+       incrsortstate->transfer_tuple = MakeSingleTupleTableSlot(
+                       ExecGetResultType(outerPlanState(incrsortstate)), &TTSOpsMinimalTuple);
+
+       SO_printf("ExecInitIncrementalSort: sort node initialized\n");
+
+       return incrsortstate;
+}
+
+/* ----------------------------------------------------------------
+ *             ExecEndIncrementalSort(node)
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndIncrementalSort(IncrementalSortState *node)
+{
+       SO_printf("ExecEndIncrementalSort: shutting down sort node\n");
+
+       /*
+        * clean out the tuple table
+        */
+       ExecClearTuple(node->ss.ss_ScanTupleSlot);
+       /* must drop pointer to sort result tuple */
+       ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+       /* must drop standalone tuple slot from outer node */
+       ExecDropSingleTupleTableSlot(node->group_pivot);
+       ExecDropSingleTupleTableSlot(node->transfer_tuple);
+
+       /*
+        * Release tuplesort resources.
+        */
+       if (node->fullsort_state != NULL)
+               tuplesort_end(node->fullsort_state);
+       node->fullsort_state = NULL;
+       if (node->prefixsort_state != NULL)
+               tuplesort_end(node->prefixsort_state);
+       node->prefixsort_state = NULL;
+
+       /*
+        * Shut down the subplan.
+        */
+       ExecEndNode(outerPlanState(node));
+
+       SO_printf("ExecEndIncrementalSort: sort node shutdown\n");
+}
+
+void
+ExecReScanIncrementalSort(IncrementalSortState *node)
+{
+       PlanState  *outerPlan = outerPlanState(node);
+
+       /*
+        * If we haven't sorted yet, just return. If outerplan's chgParam is not
+        * NULL then it will be re-scanned by ExecProcNode, else no reason to
+        * re-scan it at all.
+        */
+       if (!node->sort_Done)
+               return;
+
+       /* must drop pointer to sort result tuple */
+       ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+
+       /*
+        * If subnode is to be rescanned then we forget previous sort results;
+        * we have to re-read the subplan and re-sort.  Also must re-sort if
+        * the bounded-sort parameters changed.  Unlike a plain Sort, we can
+        * never just rewind and rescan the sorted output, because we hold
+        * only the current batch in the tuplesort states.
+        */
+       node->sort_Done = false;
+       if (node->fullsort_state != NULL)
+       {
+               tuplesort_end(node->fullsort_state);
+               node->fullsort_state = NULL;
+       }
+       if (node->prefixsort_state != NULL)
+       {
+               tuplesort_end(node->prefixsort_state);
+               node->prefixsort_state = NULL;
+       }
+       node->bound_Done = 0;
+
+       /*
+        * if chgParam of subnode is not null then plan will be re-scanned by
+        * first ExecProcNode.
+        */
+       if (outerPlan->chgParam == NULL)
+               ExecReScan(outerPlan);
+}
+
+/* ----------------------------------------------------------------
+ *                                             Parallel Query Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *             ExecIncrementalSortEstimate
+ *
+ *             Estimate space required to propagate sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt)
+{
+       Size            size;
+
+       /* don't need this if not instrumenting or no workers */
+       if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+               return;
+
+       size = mul_size(pcxt->nworkers, sizeof(IncrementalSortInfo));
+       size = add_size(size, offsetof(SharedIncrementalSortInfo, sinfo));
+       shm_toc_estimate_chunk(&pcxt->estimator, size);
+       shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *             ExecIncrementalSortInitializeDSM
+ *
+ *             Initialize DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt)
+{
+       Size            size;
+
+       /* don't need this if not instrumenting or no workers */
+       if (!node->ss.ps.instrument || pcxt->nworkers == 0)
+               return;
+
+       size = offsetof(SharedIncrementalSortInfo, sinfo)
+               + pcxt->nworkers * sizeof(IncrementalSortInfo);
+       node->shared_info = shm_toc_allocate(pcxt->toc, size);
+       /* ensure any unfilled slots will contain zeroes */
+       memset(node->shared_info, 0, size);
+       node->shared_info->num_workers = pcxt->nworkers;
+       shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id,
+                                  node->shared_info);
+}
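
The sizing arithmetic above is the usual flexible-array-member pattern; a
minimal self-contained illustration (with a hypothetical stand-in struct, not
the patch's actual types) would be:

    #include <stddef.h>

    /* Hypothetical stand-in for SharedIncrementalSortInfo. */
    typedef struct SharedInfoExample
    {
        int     num_workers;
        double  sinfo[];        /* one entry per worker */
    } SharedInfoExample;

    /*
     * offsetof() counts the fixed header bytes; each worker then adds one
     * array element, matching the estimate and allocation calls above.
     */
    static size_t
    shared_info_size(int nworkers)
    {
        return offsetof(SharedInfoExample, sinfo) + nworkers * sizeof(double);
    }
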
+
+/* ----------------------------------------------------------------
+ *             ExecIncrementalSortInitializeWorker
+ *
+ *             Attach worker to DSM space for sort statistics.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pwcxt)
+{
+       node->shared_info =
+               shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, true);
+       node->am_worker = true;
+}
+
+/* ----------------------------------------------------------------
+ *             ExecIncrementalSortRetrieveInstrumentation
+ *
+ *             Transfer sort statistics from DSM to private memory.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node)
+{
+       Size            size;
+       SharedIncrementalSortInfo *si;
+
+       if (node->shared_info == NULL)
+               return;
+
+       size = offsetof(SharedIncrementalSortInfo, sinfo)
+               + node->shared_info->num_workers * sizeof(IncrementalSortInfo);
+       si = palloc(size);
+       memcpy(si, node->shared_info, size);
+       node->shared_info = si;
+}
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 92855278ad..3ea1b1bca1 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -93,7 +93,8 @@ ExecSort(PlanState *pstate)
                                                                                          plannode->collations,
                                                                                          plannode->nullsFirst,
                                                                                          work_mem,
-                                                                                         NULL, node->randomAccess);
+                                                                                         NULL,
+                                                                                         node->randomAccess);
                if (node->bounded)
                        tuplesort_set_bound(tuplesortstate, node->bound);
                node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 2f267e4bb6..5c3aef32c4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -924,6 +924,24 @@ _copyMaterial(const Material *from)
 }
 
 
+/*
+ * CopySortFields
+ *
+ *             This function copies the fields of the Sort node.  It is used by
+ *             all the copy functions for classes which inherit from Sort.
+ */
+static void
+CopySortFields(const Sort *from, Sort *newnode)
+{
+       CopyPlanFields((const Plan *) from, (Plan *) newnode);
+
+       COPY_SCALAR_FIELD(numCols);
+       COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
+       COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
+       COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
+       COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+}
+
 /*
  * _copySort
  */
@@ -935,13 +953,29 @@ _copySort(const Sort *from)
        /*
         * copy node superclass fields
         */
-       CopyPlanFields((const Plan *) from, (Plan *) newnode);
+       CopySortFields(from, newnode);
 
-       COPY_SCALAR_FIELD(numCols);
-       COPY_POINTER_FIELD(sortColIdx, from->numCols * sizeof(AttrNumber));
-       COPY_POINTER_FIELD(sortOperators, from->numCols * sizeof(Oid));
-       COPY_POINTER_FIELD(collations, from->numCols * sizeof(Oid));
-       COPY_POINTER_FIELD(nullsFirst, from->numCols * sizeof(bool));
+       return newnode;
+}
+
+
+/*
+ * _copyIncrementalSort
+ */
+static IncrementalSort *
+_copyIncrementalSort(const IncrementalSort *from)
+{
+       IncrementalSort    *newnode = makeNode(IncrementalSort);
+
+       /*
+        * copy node superclass fields
+        */
+       CopySortFields((const Sort *) from, (Sort *) newnode);
+
+       /*
+        * copy remainder of node
+        */
+       COPY_SCALAR_FIELD(presortedCols);
 
        return newnode;
 }
@@ -4875,6 +4909,9 @@ copyObjectImpl(const void *from)
                case T_Sort:
                        retval = _copySort(from);
                        break;
+               case T_IncrementalSort:
+                       retval = _copyIncrementalSort(from);
+                       break;
                case T_Group:
                        retval = _copyGroup(from);
                        break;
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b0dcd02ff6..9801f8fa2d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -833,10 +833,8 @@ _outMaterial(StringInfo str, const Material *node)
 }
 
 static void
-_outSort(StringInfo str, const Sort *node)
+_outSortInfo(StringInfo str, const Sort *node)
 {
-       WRITE_NODE_TYPE("SORT");
-
        _outPlanInfo(str, (const Plan *) node);
 
        WRITE_INT_FIELD(numCols);
@@ -846,6 +844,24 @@ _outSort(StringInfo str, const Sort *node)
        WRITE_BOOL_ARRAY(nullsFirst, node->numCols);
 }
 
+static void
+_outSort(StringInfo str, const Sort *node)
+{
+       WRITE_NODE_TYPE("SORT");
+
+       _outSortInfo(str, node);
+}
+
+static void
+_outIncrementalSort(StringInfo str, const IncrementalSort *node)
+{
+       WRITE_NODE_TYPE("INCREMENTALSORT");
+
+       _outSortInfo(str, (const Sort *) node);
+
+       WRITE_INT_FIELD(presortedCols);
+}
+
 static void
 _outUnique(StringInfo str, const Unique *node)
 {
@@ -3771,6 +3787,9 @@ outNode(StringInfo str, const void *obj)
                        case T_Sort:
                                _outSort(str, obj);
                                break;
+                       case T_IncrementalSort:
+                               _outIncrementalSort(str, obj);
+                               break;
                        case T_Unique:
                                _outUnique(str, obj);
                                break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb90c..c2847e8d3f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2117,12 +2117,13 @@ _readMaterial(void)
 }
 
 /*
- * _readSort
+ * ReadCommonSort
+ *     Assign the basic stuff of all nodes that inherit from Sort
  */
-static Sort *
-_readSort(void)
+static void
+ReadCommonSort(Sort *local_node)
 {
-       READ_LOCALS(Sort);
+       READ_TEMP_LOCALS();
 
        ReadCommonPlan(&local_node->plan);
 
@@ -2131,6 +2132,32 @@ _readSort(void)
        READ_OID_ARRAY(sortOperators, local_node->numCols);
        READ_OID_ARRAY(collations, local_node->numCols);
        READ_BOOL_ARRAY(nullsFirst, local_node->numCols);
+}
+
+/*
+ * _readSort
+ */
+static Sort *
+_readSort(void)
+{
+       READ_LOCALS_NO_FIELDS(Sort);
+
+       ReadCommonSort(local_node);
+
+       READ_DONE();
+}
+
+/*
+ * _readIncrementalSort
+ */
+static IncrementalSort *
+_readIncrementalSort(void)
+{
+       READ_LOCALS(IncrementalSort);
+
+       ReadCommonSort(&local_node->sort);
+
+       READ_INT_FIELD(presortedCols);
 
        READ_DONE();
 }
@@ -2765,6 +2792,8 @@ parseNodeString(void)
                return_value = _readMaterial();
        else if (MATCH("SORT", 4))
                return_value = _readSort();
+       else if (MATCH("INCREMENTALSORT", 15))
+               return_value = _readIncrementalSort();
        else if (MATCH("GROUP", 5))
                return_value = _readGroup();
        else if (MATCH("AGG", 3))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a51d..adca8322aa 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3881,6 +3881,10 @@ print_path(PlannerInfo *root, Path *path, int indent)
                        ptype = "Sort";
                        subpath = ((SortPath *) path)->subpath;
                        break;
+               case T_IncrementalSortPath:
+                       ptype = "IncrementalSort";
+                       subpath = ((SortPath *) path)->subpath;
+                       break;
                case T_GroupPath:
                        ptype = "Group";
                        subpath = ((GroupPath *) path)->subpath;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index c5f6593485..f6ec405d58 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -127,6 +127,7 @@ bool                enable_indexonlyscan = true;
 bool           enable_bitmapscan = true;
 bool           enable_tidscan = true;
 bool           enable_sort = true;
+bool           enable_incrementalsort = true;
 bool           enable_hashagg = true;
 bool           enable_nestloop = true;
 bool           enable_material = true;
@@ -1645,9 +1646,9 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
 }
 
 /*
- * cost_sort
- *       Determines and returns the cost of sorting a relation, including
- *       the cost of reading the input data.
+ * cost_tuplesort
+ *       Determines and returns the cost of sorting a relation using tuplesort,
+ *       not including the cost of reading the input data.
  *
  * If the total volume of data to sort is less than sort_mem, we will do
  * an in-memory sort, which requires no I/O and about t*log2(t) tuple
@@ -1674,39 +1675,23 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * specifying nonzero comparison_cost; typically that's used for any extra
  * work that has to be done to prepare the inputs to the comparison operators.
  *
- * 'pathkeys' is a list of sort keys
- * 'input_cost' is the total cost for reading the input data
  * 'tuples' is the number of tuples in the relation
  * 'width' is the average tuple width in bytes
  * 'comparison_cost' is the extra cost per comparison, if any
  * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
  * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
- *
- * NOTE: some callers currently pass NIL for pathkeys because they
- * can't conveniently supply the sort keys.  Since this routine doesn't
- * currently do anything with pathkeys anyway, that doesn't matter...
- * but if it ever does, it should react gracefully to lack of key data.
- * (Actually, the thing we'd most likely be interested in is just the number
- * of sort keys, which all callers *could* supply.)
  */
-void
-cost_sort(Path *path, PlannerInfo *root,
-                 List *pathkeys, Cost input_cost, double tuples, int width,
+static void
+cost_tuplesort(Cost *startup_cost, Cost *run_cost,
+                 double tuples, int width,
                  Cost comparison_cost, int sort_mem,
                  double limit_tuples)
 {
-       Cost            startup_cost = input_cost;
-       Cost            run_cost = 0;
        double          input_bytes = relation_byte_size(tuples, width);
        double          output_bytes;
        double          output_tuples;
        long            sort_mem_bytes = sort_mem * 1024L;
 
-       if (!enable_sort)
-               startup_cost += disable_cost;
-
-       path->rows = tuples;
-
        /*
         * We want to be sure the cost of a sort is never estimated as zero, even
         * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
@@ -1745,7 +1730,7 @@ cost_sort(Path *path, PlannerInfo *root,
                 *
                 * Assume about N log2 N comparisons
                 */
-               startup_cost += comparison_cost * tuples * LOG2(tuples);
+               *startup_cost = comparison_cost * tuples * LOG2(tuples);
 
                /* Disk costs */
 
@@ -1756,7 +1741,7 @@ cost_sort(Path *path, PlannerInfo *root,
                        log_runs = 1.0;
                npageaccesses = 2.0 * npages * log_runs;
                /* Assume 3/4ths of accesses are sequential, 1/4th are not */
-               startup_cost += npageaccesses *
+               *startup_cost += npageaccesses *
                        (seq_page_cost * 0.75 + random_page_cost * 0.25);
        }
        else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
@@ -1767,12 +1752,12 @@ cost_sort(Path *path, PlannerInfo *root,
                 * factor is a bit higher than for quicksort.  Tweak it so that the
                 * cost curve is continuous at the crossover point.
                 */
-               startup_cost += comparison_cost * tuples * LOG2(2.0 * output_tuples);
+               *startup_cost = comparison_cost * tuples * LOG2(2.0 * output_tuples);
        }
        else
        {
                /* We'll use plain quicksort on all the input tuples */
-               startup_cost += comparison_cost * tuples * LOG2(tuples);
+               *startup_cost = comparison_cost * tuples * LOG2(tuples);
        }
 
        /*
@@ -1783,8 +1768,163 @@ cost_sort(Path *path, PlannerInfo *root,
         * here --- the upper LIMIT will pro-rate the run cost so we'd be double
         * counting the LIMIT otherwise.
         */
-       run_cost += cpu_operator_cost * tuples;
+       *run_cost = cpu_operator_cost * tuples;
+}
+
+/*
+ * cost_full_sort
+ *     Determines and returns the cost of sorting a relation, including the
+ *     cost of reading the input data.
+ *
+ * For the precise description of how the cost is calculated, see the comment
+ * for cost_tuplesort().
+ */
+void
+cost_full_sort(Cost *startup_cost, Cost *run_cost,
+                          Cost input_total_cost, double tuples, int width,
+                          Cost comparison_cost, int sort_mem,
+                          double limit_tuples)
+{
+       cost_tuplesort(startup_cost, run_cost,
+                                  tuples, width,
+                                  comparison_cost, sort_mem,
+                                  limit_tuples);
+
+       if (!enable_sort)
+               *startup_cost += disable_cost;
+
+       *startup_cost += input_total_cost;
+}
+
+/*
+ * cost_incremental_sort
+ *     Determines and returns the cost of sorting a relation incrementally,
+ *     when the input path is already sorted by some of the pathkeys.
+ *
+ * 'presorted_keys' is the number of leading pathkeys by which the input path
+ * is sorted.
+ *
+ * We estimate the number of groups into which the relation is divided by the
+ * leading pathkeys, and then calculate the cost of sorting a single group
+ * with tuplesort using cost_tuplesort().
+ */
+void
+cost_incremental_sort(Path *path,
+                 PlannerInfo *root, List *pathkeys, int presorted_keys,
+                 Cost input_startup_cost, Cost input_total_cost,
+                 double input_tuples, int width, Cost comparison_cost, int sort_mem,
+                 double limit_tuples)
+{
+       Cost            startup_cost = 0,
+                               run_cost = 0,
+                               input_run_cost = input_total_cost - input_startup_cost;
+       double          group_tuples,
+                               input_groups;
+       Cost            group_startup_cost,
+                               group_run_cost,
+                               group_input_run_cost;
+       List       *presortedExprs = NIL;
+       ListCell   *l;
+       int                     i = 0;
+
+       Assert(presorted_keys != 0);
+
+       /*
+        * We want to be sure the cost of a sort is never estimated as zero, even
+        * if passed-in tuple count is zero.  Besides, mustn't do log(0)...
+        */
+       if (input_tuples < 2.0)
+               input_tuples = 2.0;
 
+       /* Extract presorted keys as list of expressions */
+       foreach(l, pathkeys)
+       {
+               PathKey    *key = (PathKey *) lfirst(l);
+               EquivalenceMember *member = (EquivalenceMember *)
+                       linitial(key->pk_eclass->ec_members);
+
+               presortedExprs = lappend(presortedExprs, member->em_expr);
+
+               i++;
+               if (i >= presorted_keys)
+                       break;
+       }
+
+       /* Estimate number of groups with equal presorted keys */
+       input_groups = estimate_num_groups(root, presortedExprs, input_tuples, NULL);
+       group_tuples = input_tuples / input_groups;
+       group_input_run_cost = input_run_cost / input_groups;
+
+       /*
+        * Estimate average cost of sorting of one group where presorted keys
+        * are equal.  Incremental sort is sensitive to distribution of tuples
+        * to the groups, where we're relying on quite rough assumptions.  Thus,
+        * we're pessimistic about incremental sort performance and increase
+        * its average group size by half.
+        */
+       cost_tuplesort(&group_startup_cost, &group_run_cost,
+                                  1.5 * group_tuples, width, comparison_cost, sort_mem,
+                                  limit_tuples);
+
+       /*
+        * Startup cost of incremental sort is the startup cost of its first group
+        * plus the cost of its input.
+        */
+       startup_cost += group_startup_cost
+               + input_startup_cost + group_input_run_cost;
+
+       /*
+        * After we started producing tuples from the first group, the cost of
+        * producing all the tuples is given by the cost to finish processing
+        * this group, plus the total cost to process the remaining groups,
+        * plus the remaining cost of input.
+        */
+       run_cost += group_run_cost
+               + (group_run_cost + group_startup_cost) * (input_groups - 1)
+               + group_input_run_cost * (input_groups - 1);
+
+       /*
+        * Incremental sort adds some overhead by itself. Firstly, it has to
+        * detect the sort groups. This is roughly equal to one extra copy and
+        * comparison per tuple. Secondly, it has to reset the tuplesort context
+        * for every group.
+        */
+       run_cost += (cpu_tuple_cost + comparison_cost) * input_tuples;
+       run_cost += 2.0 * cpu_tuple_cost * input_groups;
+
+       path->rows = input_tuples;
+       path->startup_cost = startup_cost;
+       path->total_cost = startup_cost + run_cost;
+}
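
To summarize the arithmetic above: with N input tuples split into G groups by
the presorted keys, and writing S_start/S_run for the cost_tuplesort()
estimates of one (pessimistically inflated) group of 1.5 * N / G tuples, the
function computes (my notation, not from the patch):

    startup = input_startup + S_start + input_run / G

    run     = S_run
            + (G - 1) * (S_start + S_run)
            + (G - 1) * input_run / G
            + (cpu_tuple_cost + comparison_cost) * N
            + 2 * cpu_tuple_cost * G

The last two terms are the per-tuple group-detection overhead and the
per-group tuplesort reset overhead described in the comments.
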
+
+/*
+ * cost_sort
+ *       Determines and returns the cost of sorting a relation, including
+ *       the cost of reading the input data.
+ *
+ * NOTE: some callers currently pass NIL for pathkeys because they
+ * can't conveniently supply the sort keys.  Since this routine doesn't
+ * currently do anything with pathkeys anyway, that doesn't matter...
+ * but if it ever does, it should react gracefully to lack of key data.
+ * (Actually, the thing we'd most likely be interested in is just the number
+ * of sort keys, which all callers *could* supply.)
+ */
+void
+cost_sort(Path *path, PlannerInfo *root,
+                 List *pathkeys, Cost input_cost, double tuples, int width,
+                 Cost comparison_cost, int sort_mem,
+                 double limit_tuples)
+{
+       Cost startup_cost;
+       Cost run_cost;
+
+       cost_full_sort(&startup_cost, &run_cost,
+                                  input_cost,
+                                  tuples, width, comparison_cost, sort_mem,
+                                  limit_tuples);
+
+       path->rows = tuples;
        path->startup_cost = startup_cost;
        path->total_cost = startup_cost + run_cost;
 }
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 2f4fea241a..e88a87e189 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -334,6 +334,51 @@ pathkeys_contained_in(List *keys1, List *keys2)
        return false;
 }
 
+
+/*
+ * pathkeys_common_contained_in
+ *    Same as pathkeys_contained_in, but also sets length of longest
+ *    common prefix of keys1 and keys2.
+ */
+bool
+pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common)
+{
+       int                     n = 0;
+       ListCell   *key1,
+                          *key2;
+
+       forboth(key1, keys1, key2, keys2)
+       {
+               PathKey    *pathkey1 = (PathKey *) lfirst(key1);
+               PathKey    *pathkey2 = (PathKey *) lfirst(key2);
+
+               if (pathkey1 != pathkey2)
+               {
+                       *n_common = n;
+                       return false;
+               }
+               n++;
+       }
+
+       *n_common = n;
+       return (key1 == NULL);
+}
+
+
+/*
+ * pathkeys_common
+ *    Returns length of longest common prefix of keys1 and keys2.
+ */
+int
+pathkeys_common(List *keys1, List *keys2)
+{
+       int             n;
+
+       (void) pathkeys_common_contained_in(keys1, keys2, &n);
+       return n;
+}
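
A minimal usage sketch of these helpers (caller context and variable names
hypothetical), showing how the boolean result and the prefix length combine:

    int     n_common;

    if (pathkeys_common_contained_in(query_pathkeys, path_pathkeys, &n_common))
    {
        /* query_pathkeys is fully satisfied: no sort is needed */
    }
    else if (n_common > 0)
    {
        /* shared prefix: an incremental sort on the remaining keys applies */
    }
    else
    {
        /* no common prefix: a full sort is required */
    }
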
+
+
 /*
  * get_cheapest_path_for_pathkeys
  *       Find the cheapest path (according to the specified criterion) that
@@ -1793,19 +1838,23 @@ right_merge_direction(PlannerInfo *root, PathKey *pathkey)
 static int
 pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
+       int     n_common_pathkeys;
+
        if (root->query_pathkeys == NIL)
                return 0;                               /* no special ordering requested */
 
        if (pathkeys == NIL)
                return 0;                               /* unordered path */
 
-       if (pathkeys_contained_in(root->query_pathkeys, pathkeys))
-       {
-               /* It's useful ... or at least the first N keys are */
-               return list_length(root->query_pathkeys);
-       }
+       (void) pathkeys_common_contained_in(root->query_pathkeys, pathkeys,
+                                                                               &n_common_pathkeys);
 
-       return 0;                                       /* path ordering not useful */
+       /*
+        * Return the number of path keys in common, or 0 if there are none.
+        * Any leading common pathkeys could be useful for ordering because
+        * we can use incremental sort on the remaining keys.
+        */
+       return n_common_pathkeys;
 }
 
 /*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index aee81bd755..39ae858894 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -98,6 +98,8 @@ static Plan *create_projection_plan(PlannerInfo *root,
                                                                        int flags);
 static Plan *inject_projection_plan(Plan *subplan, List *tlist, bool parallel_safe);
 static Sort *create_sort_plan(PlannerInfo *root, SortPath *best_path, int flags);
+static IncrementalSort *create_incrementalsort_plan(PlannerInfo *root,
+                                                                       IncrementalSortPath *best_path, int flags);
 static Group *create_group_plan(PlannerInfo *root, GroupPath *best_path);
 static Unique *create_upper_unique_plan(PlannerInfo *root, UpperUniquePath *best_path,
                                                                                int flags);
@@ -244,6 +246,10 @@ static MergeJoin *make_mergejoin(List *tlist,
 static Sort *make_sort(Plan *lefttree, int numCols,
                                           AttrNumber *sortColIdx, Oid *sortOperators,
                                           Oid *collations, bool *nullsFirst);
+static IncrementalSort *make_incrementalsort(Plan *lefttree,
+                 int numCols, int presortedCols,
+                 AttrNumber *sortColIdx, Oid *sortOperators,
+                 Oid *collations, bool *nullsFirst);
 static Plan *prepare_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
                                                                                Relids relids,
                                                                                const AttrNumber *reqColIdx,
@@ -258,6 +264,8 @@ static EquivalenceMember *find_ec_member_for_tle(EquivalenceClass *ec,
                                                                                                 Relids relids);
 static Sort *make_sort_from_pathkeys(Plan *lefttree, List *pathkeys,
                                                                         Relids relids);
+static IncrementalSort *make_incrementalsort_from_pathkeys(Plan *lefttree,
+                                               List *pathkeys, Relids relids, int presortedCols);
 static Sort *make_sort_from_groupcols(List *groupcls,
                                                                          AttrNumber *grpColIdx,
                                                                          Plan *lefttree);
@@ -460,6 +468,11 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, 
int flags)
                                                                                
         (SortPath *) best_path,
                                                                                
         flags);
                        break;
+               case T_IncrementalSort:
+                       plan = (Plan *) create_incrementalsort_plan(root,
+                                                                               
                                (IncrementalSortPath *) best_path,
+                                                                               
                                flags);
+                       break;
                case T_Group:
                        plan = (Plan *) create_group_plan(root,
                                                                                
          (GroupPath *) best_path);
@@ -1991,6 +2004,32 @@ create_sort_plan(PlannerInfo *root, SortPath *best_path, 
int flags)
        return plan;
 }
 
+/*
+ * create_incrementalsort_plan
+ *
+ *	  Do the same as create_sort_plan, but create an IncrementalSort plan.
+ */
+static IncrementalSort *
+create_incrementalsort_plan(PlannerInfo *root, IncrementalSortPath *best_path,
+							int flags)
+{
+	IncrementalSort    *plan;
+	Plan			   *subplan;
+
+	/* See comments in create_sort_plan() above */
+	subplan = create_plan_recurse(root, best_path->spath.subpath,
+								  flags | CP_SMALL_TLIST);
+	plan = make_incrementalsort_from_pathkeys(subplan,
+								best_path->spath.path.pathkeys,
+								IS_OTHER_REL(best_path->spath.subpath->parent) ?
+								best_path->spath.path.parent->relids : NULL,
+								best_path->presortedCols);
+
+	copy_generic_path_info(&plan->sort.plan, (Path *) best_path);
+
+	return plan;
+}
+
 /*
  * create_group_plan
  *
@@ -5082,17 +5121,24 @@ static void
 label_sort_with_costsize(PlannerInfo *root, Sort *plan, double limit_tuples)
 {
        Plan       *lefttree = plan->plan.lefttree;
-	Path		sort_path;		/* dummy for result of cost_sort */
+       Cost            startup_cost,
+                               run_cost;
 
-       cost_sort(&sort_path, root, NIL,
+       /*
+        * This function shouldn't have to deal with IncrementalSort plans
+        * because they are only created from corresponding Path nodes.
+        */
+       Assert(IsA(plan, Sort));
+
+       cost_full_sort(&startup_cost, &run_cost,
                          lefttree->total_cost,
                          lefttree->plan_rows,
                          lefttree->plan_width,
                          0.0,
                          work_mem,
                          limit_tuples);
-       plan->plan.startup_cost = sort_path.startup_cost;
-       plan->plan.total_cost = sort_path.total_cost;
+       plan->plan.startup_cost = startup_cost;
+       plan->plan.total_cost = startup_cost + run_cost;
        plan->plan.plan_rows = lefttree->plan_rows;
        plan->plan.plan_width = lefttree->plan_width;
        plan->plan.parallel_aware = false;
@@ -5673,9 +5719,12 @@ make_sort(Plan *lefttree, int numCols,
                  AttrNumber *sortColIdx, Oid *sortOperators,
                  Oid *collations, bool *nullsFirst)
 {
-       Sort       *node = makeNode(Sort);
-       Plan       *plan = &node->plan;
+       Sort       *node;
+       Plan       *plan;
+
+       node = makeNode(Sort);
 
+       plan = &node->plan;
        plan->targetlist = lefttree->targetlist;
        plan->qual = NIL;
        plan->lefttree = lefttree;
@@ -5689,6 +5738,37 @@ make_sort(Plan *lefttree, int numCols,
        return node;
 }
 
+/*
+ * make_incrementalsort --- basic routine to build an IncrementalSort plan node
+ *
+ * Caller must have built the sortColIdx, sortOperators, collations, and
+ * nullsFirst arrays already.
+ */
+static IncrementalSort *
+make_incrementalsort(Plan *lefttree, int numCols, int presortedCols,
+                 AttrNumber *sortColIdx, Oid *sortOperators,
+                 Oid *collations, bool *nullsFirst)
+{
+       IncrementalSort    *node;
+       Plan                       *plan;
+
+       node = makeNode(IncrementalSort);
+
+       plan = &node->sort.plan;
+       plan->targetlist = lefttree->targetlist;
+       plan->qual = NIL;
+       plan->lefttree = lefttree;
+       plan->righttree = NULL;
+       node->presortedCols = presortedCols;
+       node->sort.numCols = numCols;
+       node->sort.sortColIdx = sortColIdx;
+       node->sort.sortOperators = sortOperators;
+       node->sort.collations = collations;
+       node->sort.nullsFirst = nullsFirst;
+
+       return node;
+}
+
 /*
  * prepare_sort_from_pathkeys
  *       Prepare to sort according to given pathkeys
@@ -6035,6 +6115,42 @@ make_sort_from_pathkeys(Plan *lefttree, List *pathkeys, Relids relids)
                                         collations, nullsFirst);
 }
 
+/*
+ * make_incrementalsort_from_pathkeys
+ *	  Create sort plan to sort according to given pathkeys
+ *
+ *	  'lefttree' is the node which yields input tuples
+ *	  'pathkeys' is the list of pathkeys by which the result is to be sorted
+ *	  'relids' is the set of relations required by prepare_sort_from_pathkeys()
+ *	  'presortedCols' is the number of presorted columns in input tuples
+ */
+static IncrementalSort *
+make_incrementalsort_from_pathkeys(Plan *lefttree, List *pathkeys,
+								   Relids relids, int presortedCols)
+{
+	int			numsortkeys;
+	AttrNumber *sortColIdx;
+	Oid		   *sortOperators;
+	Oid		   *collations;
+	bool	   *nullsFirst;
+
+	/* Compute sort column info, and adjust lefttree as needed */
+	lefttree = prepare_sort_from_pathkeys(lefttree, pathkeys,
+										  relids,
+										  NULL,
+										  false,
+										  &numsortkeys,
+										  &sortColIdx,
+										  &sortOperators,
+										  &collations,
+										  &nullsFirst);
+
+	/* Now build the Sort node */
+	return make_incrementalsort(lefttree, numsortkeys, presortedCols,
+								sortColIdx, sortOperators,
+								collations, nullsFirst);
+}
+
 /*
  * make_sort_from_sortclauses
  *       Create sort plan to sort according to given sortclauses
@@ -6769,6 +6885,7 @@ is_projection_capable_path(Path *path)
                case T_Hash:
                case T_Material:
                case T_Sort:
+               case T_IncrementalSort:
                case T_Unique:
                case T_SetOp:
                case T_LockRows:
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7fe11b59a0..a153e77247 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4923,8 +4923,8 @@ create_distinct_paths(PlannerInfo *root,
  * Build a new upperrel containing Paths for ORDER BY evaluation.
  *
  * All paths in the result must satisfy the ORDER BY ordering.
- * The only new path we need consider is an explicit sort on the
- * cheapest-total existing path.
+ * The only new paths we need consider are an explicit full or
+ * incremental sort on the cheapest-total existing path.
  *
  * input_rel: contains the source-data Paths
  * target: the output tlist the result Paths must emit
@@ -4963,29 +4963,60 @@ create_ordered_paths(PlannerInfo *root,
 
        foreach(lc, input_rel->pathlist)
        {
-               Path       *path = (Path *) lfirst(lc);
+               Path       *input_path = (Path *) lfirst(lc);
+               Path       *sorted_path = input_path;
                bool            is_sorted;
+               int                     presorted_keys;
+
+		is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
+												 input_path->pathkeys,
+												 &presorted_keys);
 
-		is_sorted = pathkeys_contained_in(root->sort_pathkeys,
-										  path->pathkeys);
-		if (path == cheapest_input_path || is_sorted)
+		if (is_sorted)
 		{
-			if (!is_sorted)
+			/* Use the input path as is, but add a projection step if needed */
+			if (sorted_path->pathtarget != target)
+				sorted_path = apply_projection_to_path(root, ordered_rel,
+													   sorted_path, target);
+
+			add_path(ordered_rel, sorted_path);
+		}
+		else
+		{
+			if (input_path == cheapest_input_path)
 			{
-				/* An explicit sort here can take advantage of LIMIT */
-				path = (Path *) create_sort_path(root,
-												 ordered_rel,
-												 path,
-												 root->sort_pathkeys,
-												 limit_tuples);
+				/*
+				 * Sort the cheapest input path. An explicit sort here can
+				 * take advantage of LIMIT.
+				 */
+				sorted_path = (Path *) create_sort_path(root,
+														ordered_rel,
+														input_path,
+														root->sort_pathkeys,
+														limit_tuples);
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
+			}
+			if (enable_incrementalsort && presorted_keys > 0)
+			{
+				/* Also consider incremental sort. */
+				sorted_path = (Path *) create_incremental_sort_path(root,
+																	ordered_rel,
+																	input_path,
+																	root->sort_pathkeys,
+																	presorted_keys,
+																	limit_tuples);
+
+				/* Add projection step if needed */
+				if (sorted_path->pathtarget != target)
+					sorted_path = apply_projection_to_path(root, ordered_rel,
+														   sorted_path, target);
+
+				add_path(ordered_rel, sorted_path);
 			}
-
-			/* Add projection step if needed */
-			if (path->pathtarget != target)
-				path = apply_projection_to_path(root, ordered_rel,
-												path, target);
-
-			add_path(ordered_rel, path);
 		}
 	}
 
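
The net effect of the create_ordered_paths() rework is easier to see with
the projection boilerplate stripped away. A condensed sketch (full_sort()
and incremental_sort() are shorthand for the create_sort_path() and
create_incremental_sort_path() calls above, not real functions):

	is_sorted = pathkeys_common_contained_in(root->sort_pathkeys,
											 input_path->pathkeys,
											 &presorted_keys);
	if (is_sorted)
		add_path(ordered_rel, input_path);	/* usable as is */
	else
	{
		if (input_path == cheapest_input_path)
			add_path(ordered_rel, full_sort(input_path));	/* as before */
		if (enable_incrementalsort && presorted_keys > 0)
			add_path(ordered_rel, incremental_sort(input_path,
												   presorted_keys));
	}

Note that unlike before, this considers an (incremental) sort of every
partially sorted input path, not just of the cheapest one.
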
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 566ee96da8..c0370b2c70 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -652,6 +652,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
                case T_Material:
                case T_Sort:
+               case T_IncrementalSort:
                case T_Unique:
                case T_SetOp:
 
diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c
index 48b62a55de..42a2370071 100644
--- a/src/backend/optimizer/plan/subselect.c
+++ b/src/backend/optimizer/plan/subselect.c
@@ -2686,6 +2686,7 @@ finalize_plan(PlannerInfo *root, Plan *plan,
                case T_Hash:
                case T_Material:
                case T_Sort:
+               case T_IncrementalSort:
                case T_Unique:
                case T_SetOp:
                case T_Group:
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f24ba587e5..278189a4a9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2741,6 +2741,57 @@ create_set_projection_path(PlannerInfo *root,
        return pathnode;
 }
 
+/*
+ * create_incremental_sort_path
+ *       Creates a pathnode that represents performing an incremental sort.
+ *
+ * 'rel' is the parent relation associated with the result
+ * 'subpath' is the path representing the source of data
+ * 'pathkeys' represents the desired sort order
+ * 'presorted_keys' is the number of keys by which the input path is
+ *             already sorted
+ * 'limit_tuples' is the estimated bound on the number of output tuples,
+ *             or -1 if no LIMIT or couldn't estimate
+ */
+SortPath *
+create_incremental_sort_path(PlannerInfo *root,
+                                RelOptInfo *rel,
+                                Path *subpath,
+                                List *pathkeys,
+                                int presorted_keys,
+                                double limit_tuples)
+{
+       IncrementalSortPath *sort = makeNode(IncrementalSortPath);
+       SortPath *pathnode = &sort->spath;
+
+       pathnode->path.pathtype = T_IncrementalSort;
+       pathnode->path.parent = rel;
+       /* Sort doesn't project, so use source path's pathtarget */
+       pathnode->path.pathtarget = subpath->pathtarget;
+       /* For now, assume we are above any joins, so no parameterization */
+       pathnode->path.param_info = NULL;
+       pathnode->path.parallel_aware = false;
+       pathnode->path.parallel_safe = rel->consider_parallel &&
+               subpath->parallel_safe;
+       pathnode->path.parallel_workers = subpath->parallel_workers;
+       pathnode->path.pathkeys = pathkeys;
+
+       pathnode->subpath = subpath;
+
+       cost_incremental_sort(&pathnode->path,
+                         root, pathkeys, presorted_keys,
+                         subpath->startup_cost,
+                         subpath->total_cost,
+                         subpath->rows,
+                         subpath->pathtarget->width,
+						  0.0,			/* XXX comparison_cost shouldn't be 0? */
+                         work_mem, limit_tuples);
+
+       sort->presortedCols = presorted_keys;
+
+       return pathnode;
+}
+
 /*
  * create_sort_path
  *       Creates a pathnode that represents performing an explicit sort.
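
On the XXX above: passing 0.0 matches what the existing cost_sort() call
sites do (including label_sort_with_costsize() in this very patch), since
the per-comparison cost proper is added inside the costing code, so it may
be fine to keep. For intuition, the incremental sort cost model should
behave roughly like this (a back-of-the-envelope sketch, not the exact
formula in cost_incremental_sort()):

	startup_cost ~ input_startup_cost + sort_cost(one group)
	total_cost   ~ startup_cost + (num_groups - 1) * sort_cost(one group)

where num_groups is estimated from the number of distinct values of the
presorted keys, and each group is assumed to hold about
input_tuples / num_groups rows. The win over a full sort comes from
sorting many small groups instead of one big one, and from being able to
stop early under a LIMIT.
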
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5fccc9683e..4e7959edb7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -947,6 +947,15 @@ static struct config_bool ConfigureNamesBool[] =
                true,
                NULL, NULL, NULL
        },
+       {
+               {"enable_incrementalsort", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of incremental sort steps."),
+                       NULL
+               },
+               &enable_incrementalsort,
+               true,
+               NULL, NULL, NULL
+       },
        {
                {"enable_hashagg", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of hashed aggregation plans."),
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7947d2bca0..cb1aacb207 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -125,6 +125,15 @@
 #define PARALLEL_SORT(state)	((state)->shared == NULL ? 0 : \
 								 (state)->worker >= 0 ? 1 : 2)
 
+/*
+ * Initial size of memtuples array.  We're trying to select this size so that
+ * the array doesn't exceed ALLOCSET_SEPARATE_THRESHOLD and the allocation
+ * overhead is as small as possible.  However, we don't consider array sizes
+ * smaller than 1024 elements.
+ */
+#define INITIAL_MEMTUPSIZE Max(1024, \
+       ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
+
 /* GUC variables */
 #ifdef TRACE_SORT
 bool           trace_sort = false;
@@ -241,6 +250,14 @@ struct Tuplesortstate
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
 	int			tapeRange;		/* maxTapes-1 (Knuth's P) */
+	int64		maxSpace;		/* maximum amount of space occupied among sorts
+								 * of groups, either in-memory or on-disk */
+	bool		maxSpaceOnDisk; /* true when maxSpace is the value for on-disk
+								 * space, false when it's the value for
+								 * in-memory space */
+	TupSortStatus maxSpaceStatus;	/* sort status when maxSpace was reached */
+	MemoryContext maincontext;	/* memory context for tuple sort metadata that
+								 * persists across multiple batches */
 	MemoryContext sortcontext;	/* memory context holding most sort data */
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
@@ -647,6 +664,8 @@ static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+static void tuplesort_free(Tuplesortstate *state);
+static void tuplesort_updatemax(Tuplesortstate *state);
 
 /*
  * Special versions of qsort just for SortTuple objects.  qsort_tuple() sorts
@@ -682,6 +701,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
                                           bool randomAccess)
 {
        Tuplesortstate *state;
+       MemoryContext maincontext;
        MemoryContext sortcontext;
        MemoryContext tuplecontext;
        MemoryContext oldcontext;
@@ -691,13 +711,21 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
                elog(ERROR, "random access disallowed under parallel sort");
 
 	/*
-	 * Create a working memory context for this sort operation. All data
-	 * needed by the sort will live inside this context.
+	 * Memory context surviving tuplesort_reset.  This memory context holds
+	 * data which is useful to keep while sorting multiple similar batches.
 	 */
-	sortcontext = AllocSetContextCreate(CurrentMemoryContext,
+	maincontext = AllocSetContextCreate(CurrentMemoryContext,
 										"TupleSort main",
 										ALLOCSET_DEFAULT_SIZES);
 
+	/*
+	 * Create a working memory context for one sort operation.  The content of
+	 * this context is deleted by tuplesort_reset.
+	 */
+	sortcontext = AllocSetContextCreate(maincontext,
+										"TupleSort sort",
+										ALLOCSET_DEFAULT_SIZES);
+
        /*
         * Caller tuple (e.g. IndexTuple) memory context.
         *
@@ -715,7 +743,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
         * Make the Tuplesortstate within the per-sort context.  This way, we
         * don't need a separate pfree() operation for it at shutdown.
         */
-       oldcontext = MemoryContextSwitchTo(sortcontext);
+       oldcontext = MemoryContextSwitchTo(maincontext);
 
        state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
 
@@ -740,6 +768,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
        state->availMem = state->allowedMem;
        state->sortcontext = sortcontext;
        state->tuplecontext = tuplecontext;
+       state->maincontext = maincontext;
        state->tapeset = NULL;
 
        state->memtupcount = 0;
@@ -748,9 +777,7 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate,
         * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
         * see comments in grow_memtuples().
         */
-	state->memtupsize = Max(1024,
-							ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
-
+	state->memtupsize = INITIAL_MEMTUPSIZE;
        state->growmemtuples = true;
        state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
@@ -814,7 +841,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
        MemoryContext oldcontext;
        int                     i;
 
-       oldcontext = MemoryContextSwitchTo(state->sortcontext);
+       oldcontext = MemoryContextSwitchTo(state->maincontext);
 
        AssertArg(nkeys > 0);
 
@@ -890,7 +917,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
        Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
 
-       oldcontext = MemoryContextSwitchTo(state->sortcontext);
+       oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
        if (trace_sort)
@@ -985,7 +1012,7 @@ tuplesort_begin_index_btree(Relation heapRel,
        MemoryContext oldcontext;
        int                     i;
 
-       oldcontext = MemoryContextSwitchTo(state->sortcontext);
+       oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
        if (trace_sort)
@@ -1063,7 +1090,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 												   randomAccess);
        MemoryContext oldcontext;
 
-       oldcontext = MemoryContextSwitchTo(state->sortcontext);
+       oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
        if (trace_sort)
@@ -1106,7 +1133,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
        int16           typlen;
        bool            typbyval;
 
-       oldcontext = MemoryContextSwitchTo(state->sortcontext);
+       oldcontext = MemoryContextSwitchTo(state->maincontext);
 
 #ifdef TRACE_SORT
        if (trace_sort)
@@ -1223,17 +1250,19 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
        state->sortKeys->abbrev_full_comparator = NULL;
 }
 
+bool
+tuplesort_used_bound(Tuplesortstate *state)
+{
+       return state->boundUsed;
+}
+
 /*
- * tuplesort_end
- *
- *     Release resources and clean up.
+ * tuplesort_free
  *
- * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
- * pointing to garbage.  Be careful not to attempt to use or free such
- * pointers afterwards!
+ *     Internal routine for freeing resources of tuplesort.
  */
-void
-tuplesort_end(Tuplesortstate *state)
+static void
+tuplesort_free(Tuplesortstate *state)
 {
        /* context swap probably not needed, but let's be safe */
        MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -1294,7 +1323,111 @@ tuplesort_end(Tuplesortstate *state)
 	 * Free the per-sort memory context, thereby releasing all working memory,
         * including the Tuplesortstate struct itself.
         */
-       MemoryContextDelete(state->sortcontext);
+       MemoryContextReset(state->sortcontext);
+}
+
+/*
+ * tuplesort_end
+ *
+ *     Release resources and clean up.
+ *
+ * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
+ * pointing to garbage.  Be careful not to attempt to use or free such
+ * pointers afterwards!
+ */
+void
+tuplesort_end(Tuplesortstate *state)
+{
+       tuplesort_free(state);
+       MemoryContextDelete(state->maincontext);
+}
+
+/*
+ * tuplesort_updatemax
+ *
+ *     Update maximum resource usage statistics.
+ */
+static void
+tuplesort_updatemax(Tuplesortstate *state)
+{
+       int64   spaceUsed;
+       bool    spaceUsedOnDisk;
+
+	/*
+	 * Note: it might seem we should provide both memory and disk usage for a
+	 * disk-based sort.  However, the current code doesn't track memory space
+	 * accurately once we have begun to return tuples to the caller (since we
+	 * don't account for pfree's the caller is expected to do), so we cannot
+	 * rely on availMem in a disk sort.  This does not seem worth the overhead
+	 * to fix.  Is it worth creating an API for the memory context code to
+	 * tell us how much is actually used in sortcontext?
+	 */
+       if (state->tapeset)
+       {
+               spaceUsedOnDisk = true;
+               spaceUsed = LogicalTapeSetBlocks(state->tapeset) * BLCKSZ;
+       }
+       else
+       {
+               spaceUsedOnDisk = false;
+               spaceUsed = state->allowedMem - state->availMem;
+       }
+
+	/*
+	 * Sort evicts data to disk when it fails to fit the data into main
+	 * memory.  This is why we assume space used on disk is more important
+	 * for tracking resource usage than space used in memory.  Note that the
+	 * amount of space occupied by some set of tuples on disk might be less
+	 * than the amount of space occupied by the same tuples in memory, due
+	 * to a more compact representation.
+	 */
+	if ((spaceUsedOnDisk && !state->maxSpaceOnDisk) ||
+		(spaceUsedOnDisk == state->maxSpaceOnDisk && spaceUsed > state->maxSpace))
+       {
+               state->maxSpace = spaceUsed;
+               state->maxSpaceOnDisk = spaceUsedOnDisk;
+               state->maxSpaceStatus = state->status;
+       }
+}
+
+/*
+ * tuplesort_reset
+ *
+ *	Reset the tuplesort.  Reset all the data in the tuplesort, but leave the
+ *	meta-information in.  After tuplesort_reset, the tuplesort is ready to
+ *	start a new sort.  This allows us to avoid recreating the tuplesort (and
+ *	thus save resources) when sorting multiple small batches.
+ */
+void
+tuplesort_reset(Tuplesortstate *state)
+{
+       tuplesort_updatemax(state);
+       tuplesort_free(state);
+
+       state->status = TSS_INITIAL;
+       state->memtupcount = 0;
+       state->boundUsed = false;
+       state->tapeset = NULL;
+       state->currentRun = 0;
+       state->result_tape = -1;
+       state->bounded = false;
+       state->availMem = state->allowedMem;
+       state->lastReturnedTuple = NULL;
+       state->slabAllocatorUsed = false;
+       state->slabMemoryBegin = NULL;
+       state->slabMemoryEnd = NULL;
+       state->slabFreeHead = NULL;
+       state->growmemtuples = true;
+
+       if (state->memtupsize < INITIAL_MEMTUPSIZE)
+       {
+               if (state->memtuples)
+                       pfree(state->memtuples);
+		state->memtuples = (SortTuple *) palloc(INITIAL_MEMTUPSIZE * sizeof(SortTuple));
+               state->memtupsize = INITIAL_MEMTUPSIZE;
+       }
+
+       USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 }
 
 /*
@@ -2591,8 +2724,7 @@ mergeruns(Tuplesortstate *state)
         * Reset tuple memory.  We've freed all the tuples that we previously
         * allocated.  We will use the slab allocator from now on.
         */
-       MemoryContextDelete(state->tuplecontext);
-       state->tuplecontext = NULL;
+       MemoryContextResetOnly(state->tuplecontext);
 
        /*
 	 * We no longer need a large memtuples array.  (We will allocate a smaller
@@ -2642,7 +2774,8 @@ mergeruns(Tuplesortstate *state)
         * from each input tape.
         */
        state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	state->memtuples = (SortTuple *) MemoryContextAlloc(state->maincontext,
+														numInputTapes * sizeof(SortTuple));
        USEMEM(state, GetMemoryChunkSpace(state->memtuples));
 
        /*
@@ -3138,18 +3271,15 @@ tuplesort_get_stats(Tuplesortstate *state,
         * to fix.  Is it worth creating an API for the memory context code to
         * tell us how much is actually used in sortcontext?
         */
-       if (state->tapeset)
-       {
+       tuplesort_updatemax(state);
+
+       if (state->maxSpaceOnDisk)
                stats->spaceType = SORT_SPACE_TYPE_DISK;
-		stats->spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
-       }
        else
-       {
                stats->spaceType = SORT_SPACE_TYPE_MEMORY;
-		stats->spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
-       }
+       stats->spaceUsed = (state->maxSpace + 1023) / 1024;
 
-       switch (state->status)
+       switch (state->maxSpaceStatus)
        {
                case TSS_SORTEDINMEM:
                        if (state->boundUsed)
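
The intended use of tuplesort_reset() is one tuplesort object reused across
many small batches. A rough sketch of the calling pattern (fetch_tuple(),
same_group() and emit() are hypothetical placeholders, not real functions):

	/* begin once, with the full set of sort keys */
	Tuplesortstate *ts = tuplesort_begin_heap(tupDesc, nkeys, attNums,
											  sortOps, collations,
											  nullsFirst, work_mem,
											  NULL, false);

	for (;;)
	{
		/* load one group of tuples with equal presorted keys */
		while (fetch_tuple(slot) && same_group(slot))
			tuplesort_puttupleslot(ts, slot);

		tuplesort_performsort(ts);
		while (tuplesort_gettupleslot(ts, true, false, slot, NULL))
			emit(slot);

		tuplesort_reset(ts);	/* drop per-batch data, keep metadata */
	}
	tuplesort_end(ts);

tuplesort_updatemax() is what keeps EXPLAIN ANALYZE honest under this
pattern: each reset throws the per-batch numbers away, so the maximum
across batches is tracked separately and reported by tuplesort_get_stats().
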
diff --git a/src/include/executor/execdebug.h b/src/include/executor/execdebug.h
--- a/src/include/executor/execdebug.h
+++ b/src/include/executor/execdebug.h
@@ -86,10 +86,12 @@
 #define SO_nodeDisplay(l)                              nodeDisplay(l)
 #define SO_printf(s)                                   printf(s)
 #define SO1_printf(s, p)                               printf(s, p)
+#define SO2_printf(s, p1, p2)                  printf(s, p1, p2)
 #else
 #define SO_nodeDisplay(l)
 #define SO_printf(s)
 #define SO1_printf(s, p)
+#define SO2_printf(s, p1, p2)
 #endif                                                 /* EXEC_SORTDEBUG */
 
 /* ----------------
diff --git a/src/include/executor/nodeIncrementalSort.h b/src/include/executor/nodeIncrementalSort.h
new file mode 100644
index 0000000000..90d7a81711
--- /dev/null
+++ b/src/include/executor/nodeIncrementalSort.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeIncrementalSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeIncrementalSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEINCREMENTALSORT_H
+#define NODEINCREMENTALSORT_H
+
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern IncrementalSortState *ExecInitIncrementalSort(IncrementalSort *node, EState *estate, int eflags);
+extern void ExecEndIncrementalSort(IncrementalSortState *node);
+extern void ExecReScanIncrementalSort(IncrementalSortState *node);
+
+/* parallel instrumentation support */
+extern void ExecIncrementalSortEstimate(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeDSM(IncrementalSortState *node, ParallelContext *pcxt);
+extern void ExecIncrementalSortInitializeWorker(IncrementalSortState *node, ParallelWorkerContext *pcxt);
+extern void ExecIncrementalSortRetrieveInstrumentation(IncrementalSortState *node);
+
+#endif   /* NODEINCREMENTALSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1972,6 +1972,20 @@ typedef struct MaterialState
        Tuplestorestate *tuplestorestate;
 } MaterialState;
 
+
+/* ----------------
+ *	 When sorting by multiple keys, the input dataset could already be
+ *	 presorted by some prefix of those keys.  We call these "presorted
+ *	 keys".  PresortedKeyData represents information about one such key.
+ * ----------------
+ */
+typedef struct PresortedKeyData
+{
+	FmgrInfo			flinfo;		/* comparison function info */
+	FunctionCallInfo	fcinfo;		/* comparison function call info */
+	OffsetNumber		attno;		/* attribute number in tuple */
+} PresortedKeyData;
+
 /* ----------------
  *      Shared memory container for per-worker sort information
  * ----------------
@@ -2000,6 +2014,60 @@ typedef struct SortState
        SharedSortInfo *shared_info;    /* one entry per worker */
 } SortState;
 
+/* ----------------
+ *      Shared memory container for per-worker incremental sort information
+ * ----------------
+ */
+typedef struct IncrementalSortInfo
+{
+       TuplesortInstrumentation        fullsort_instrument;
+       int64                                           fullsort_group_count;
+       TuplesortInstrumentation        prefixsort_instrument;
+       int64                                           prefixsort_group_count;
+} IncrementalSortInfo;
+
+typedef struct SharedIncrementalSortInfo
+{
+       int                                                     num_workers;
+       IncrementalSortInfo                     sinfo[FLEXIBLE_ARRAY_MEMBER];
+} SharedIncrementalSortInfo;
+
+/* ----------------
+ *      IncrementalSortState information
+ * ----------------
+ */
+typedef enum
+{
+       INCSORT_LOADFULLSORT,
+       INCSORT_LOADPREFIXSORT,
+       INCSORT_READFULLSORT,
+       INCSORT_READPREFIXSORT,
+	INCSORT_READPREFIXSORT
+
+typedef struct IncrementalSortState
+{
+	ScanState	ss;				/* its first field is NodeTag */
+	bool		bounded;		/* is the result set bounded? */
+	int64		bound;			/* if bounded, how many tuples are needed */
+	bool		sort_Done;		/* sort completed yet? */
+	bool		finished;		/* fetching tuples from outer node is finished? */
+	int64		bound_Done;		/* value of bound we did the sort with */
+	IncrementalSortExecutionStatus execution_status;
+	int64		n_fullsort_remaining;
+	Tuplesortstate *fullsort_state; /* private state of tuplesort.c */
+	Tuplesortstate *prefixsort_state;	/* private state of tuplesort.c */
+	/* the keys by which the input path is already sorted */
+	PresortedKeyData *presorted_keys;
+	int64		fullsort_group_count;	/* number of groups with equal presorted keys */
+	int64		prefixsort_group_count; /* number of groups with equal presorted keys */
+	/* slot for pivot tuple defining values of presorted keys within group */
+	TupleTableSlot *group_pivot;
+	TupleTableSlot *transfer_tuple;
+	bool		am_worker;		/* are we a worker? */
+	SharedIncrementalSortInfo *shared_info; /* one entry per worker */
+} IncrementalSortState;
+
 /* ---------------------
  *     GroupState information
  * ---------------------
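
For reviewers new to the design: the four execution states and the two
Tuplesortstate pointers encode the core trick. My reading of the state
names (the executor itself comes later in the series):

	INCSORT_LOADFULLSORT	- accumulate tuples into fullsort_state,
							  sorting on all keys (cheap for small groups)
	INCSORT_READFULLSORT	- drain fullsort_state
	INCSORT_LOADPREFIXSORT	- a large group was detected: accumulate it
							  into prefixsort_state, sorting only the
							  suffix keys (the presorted prefix is constant
							  within the group, tracked via group_pivot)
	INCSORT_READPREFIXSORT	- drain prefixsort_state, then resume loading

n_fullsort_remaining then presumably counts the tuples still to be
returned from the full sort when switching modes in the middle of a group.
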
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index bce2d59b0d..f72336e84a 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -74,6 +74,7 @@ typedef enum NodeTag
        T_HashJoin,
        T_Material,
        T_Sort,
+       T_IncrementalSort,
        T_Group,
        T_Agg,
        T_WindowAgg,
@@ -130,6 +131,7 @@ typedef enum NodeTag
        T_HashJoinState,
        T_MaterialState,
        T_SortState,
+       T_IncrementalSortState,
        T_GroupState,
        T_AggState,
        T_WindowAggState,
@@ -245,6 +247,7 @@ typedef enum NodeTag
        T_ProjectionPath,
        T_ProjectSetPath,
        T_SortPath,
+       T_IncrementalSortPath,
        T_GroupPath,
        T_UpperUniquePath,
        T_AggPath,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 23a06d718e..aab2fda7dc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1618,6 +1618,15 @@ typedef struct SortPath
        Path       *subpath;            /* path representing input source */
 } SortPath;
 
+/*
+ * IncrementalSortPath
+ */
+typedef struct IncrementalSortPath
+{
+       SortPath        spath;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSortPath;
+
 /*
  * GroupPath represents grouping (of presorted input)
  *
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e355..bbf0739411 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -770,6 +770,17 @@ typedef struct Sort
        bool       *nullsFirst;         /* NULLS FIRST/LAST directions */
 } Sort;
 
+
+/* ----------------
+ *             incremental sort node
+ * ----------------
+ */
+typedef struct IncrementalSort
+{
+       Sort            sort;
+	int			presortedCols;	/* number of presorted columns */
+} IncrementalSort;
+
 /* ---------------
  *      group node -
  *             Used for queries with GROUP BY (but no aggregates) specified.
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index b3d0b4f6fb..b9d7a77e65 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,6 +53,7 @@ extern PGDLLIMPORT bool enable_indexonlyscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_incrementalsort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
 extern PGDLLIMPORT bool enable_material;
@@ -101,6 +102,15 @@ extern void cost_sort(Path *path, PlannerInfo *root,
 					  List *pathkeys, Cost input_cost, double tuples, int width,
                                          Cost comparison_cost, int sort_mem,
                                          double limit_tuples);
+extern void cost_full_sort(Cost *startup_cost, Cost *run_cost,
+                          Cost input_total_cost, double tuples, int width,
+                          Cost comparison_cost, int sort_mem,
+                          double limit_tuples);
+extern void cost_incremental_sort(Path *path,
+                 PlannerInfo *root, List *pathkeys, int presorted_keys,
+                 Cost input_startup_cost, Cost input_total_cost,
+                 double input_tuples, int width, Cost comparison_cost, int 
sort_mem,
+                 double limit_tuples);
 extern void cost_append(AppendPath *path);
 extern void cost_merge_append(Path *path, PlannerInfo *root,
 							  List *pathkeys, int n_streams,
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54971..1470d15c78 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -184,6 +184,12 @@ extern ProjectSetPath *create_set_projection_path(PlannerInfo *root,
 												  RelOptInfo *rel,
 												  Path *subpath,
 												  PathTarget *target);
+extern SortPath *create_incremental_sort_path(PlannerInfo *root,
+                                RelOptInfo *rel,
+                                Path *subpath,
+                                List *pathkeys,
+                                int presorted_keys,
+                                double limit_tuples);
 extern SortPath *create_sort_path(PlannerInfo *root,
 								  RelOptInfo *rel,
                                                                  Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index c6c34630c2..a3cd817cb5 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -188,6 +188,8 @@ typedef enum
 
 extern PathKeysComparison compare_pathkeys(List *keys1, List *keys2);
 extern bool pathkeys_contained_in(List *keys1, List *keys2);
+extern bool pathkeys_common_contained_in(List *keys1, List *keys2, int *n_common);
+extern int pathkeys_common(List *keys1, List *keys2);
 extern Path *get_cheapest_path_for_pathkeys(List *paths, List *pathkeys,
 											Relids required_outer,
 											CostSelector cost_criterion,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d774bc1152..cebeef2c60 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -215,6 +215,7 @@ extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
+extern bool tuplesort_used_bound(Tuplesortstate *state);
 
 extern void tuplesort_puttupleslot(Tuplesortstate *state,
 								   TupleTableSlot *slot);
@@ -239,6 +240,8 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 
 extern void tuplesort_end(Tuplesortstate *state);
 
+extern void tuplesort_reset(Tuplesortstate *state);
+
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
 extern const char *tuplesort_method_name(TuplesortMethod m);
diff --git a/src/test/isolation/expected/drop-index-concurrently-1.out b/src/test/isolation/expected/drop-index-concurrently-1.out
index 75dff56bc4..8e6adb66bb 100644
--- a/src/test/isolation/expected/drop-index-concurrently-1.out
+++ b/src/test/isolation/expected/drop-index-concurrently-1.out
@@ -21,7 +21,7 @@ QUERY PLAN
 
 Sort           
   Sort Key: id, data
-  ->  Seq Scan on test_dc
+  ->  Index Scan using test_dc_pkey on test_dc
         Filter: ((data)::text = '34'::text)
 step select2: SELECT * FROM test_dc WHERE data=34 ORDER BY id,data;
 id             data           
diff --git a/src/test/regress/expected/incremental_sort.out b/src/test/regress/expected/incremental_sort.out
new file mode 100644
index 0000000000..3a58efdf91
--- /dev/null
+++ b/src/test/regress/expected/incremental_sort.out
@@ -0,0 +1,1160 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Sort
+   Sort Key: tenk1.four, tenk1.ten
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(5 rows)
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+               QUERY PLAN                
+-----------------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: tenk1.four, tenk1.ten
+         Presorted Key: tenk1.four
+         ->  Sort
+               Sort Key: tenk1.four
+               ->  Seq Scan on tenk1
+(7 rows)
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+            QUERY PLAN             
+-----------------------------------
+ Incremental Sort
+   Sort Key: tenk1.four, tenk1.ten
+   Presorted Key: tenk1.four
+   ->  Sort
+         Sort Key: tenk1.four
+         ->  Seq Scan on tenk1
+(6 rows)
+
+reset work_mem;
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve that by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 1 | 50
+ 1 | 51
+ 1 | 52
+ 1 | 53
+ 1 | 54
+ 1 | 55
+ 1 | 56
+ 1 | 57
+ 1 | 58
+ 1 | 59
+ 1 | 60
+ 1 | 61
+ 1 | 62
+ 1 | 63
+ 1 | 64
+ 1 | 65
+ 1 | 66
+(66 rows)
+
+delete from t;
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 55;
+ a | b  
+---+----
+ 1 |  1
+ 1 |  2
+ 1 |  3
+ 1 |  4
+ 1 |  5
+ 1 |  6
+ 1 |  7
+ 1 |  8
+ 1 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 1 | 20
+ 1 | 21
+ 1 | 22
+ 1 | 23
+ 1 | 24
+ 1 | 25
+ 1 | 26
+ 1 | 27
+ 1 | 28
+ 1 | 29
+ 1 | 30
+ 1 | 31
+ 1 | 32
+ 1 | 33
+ 1 | 34
+ 1 | 35
+ 1 | 36
+ 1 | 37
+ 1 | 38
+ 1 | 39
+ 1 | 40
+ 1 | 41
+ 1 | 42
+ 1 | 43
+ 1 | 44
+ 1 | 45
+ 1 | 46
+ 1 | 47
+ 1 | 48
+ 1 | 49
+ 2 | 50
+ 2 | 51
+ 2 | 52
+ 2 | 53
+ 2 | 54
+ 2 | 55
+(55 rows)
+
+delete from t;
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 70;
+ a | b  
+---+----
+ 1 |  1
+ 2 |  2
+ 3 |  3
+ 4 |  4
+ 9 |  5
+ 9 |  6
+ 9 |  7
+ 9 |  8
+ 9 |  9
+ 9 | 10
+ 9 | 11
+ 9 | 12
+ 9 | 13
+ 9 | 14
+ 9 | 15
+ 9 | 16
+ 9 | 17
+ 9 | 18
+ 9 | 19
+ 9 | 20
+ 9 | 21
+ 9 | 22
+ 9 | 23
+ 9 | 24
+ 9 | 25
+ 9 | 26
+ 9 | 27
+ 9 | 28
+ 9 | 29
+ 9 | 30
+ 9 | 31
+ 9 | 32
+ 9 | 33
+ 9 | 34
+ 9 | 35
+ 9 | 36
+ 9 | 37
+ 9 | 38
+ 9 | 39
+ 9 | 40
+ 9 | 41
+ 9 | 42
+ 9 | 43
+ 9 | 44
+ 9 | 45
+ 9 | 46
+ 9 | 47
+ 9 | 48
+ 9 | 49
+ 9 | 50
+ 9 | 51
+ 9 | 52
+ 9 | 53
+ 9 | 54
+ 9 | 55
+ 9 | 56
+ 9 | 57
+ 9 | 58
+ 9 | 59
+ 9 | 60
+ 9 | 61
+ 9 | 62
+ 9 | 63
+ 9 | 64
+ 9 | 65
+ 9 | 66
+ 9 | 67
+ 9 | 68
+ 9 | 69
+ 9 | 70
+(70 rows)
+
+delete from t;
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a | b  
+---+----
+ 0 |  1
+ 0 |  2
+ 0 |  3
+ 0 |  4
+ 0 |  5
+ 0 |  6
+ 0 |  7
+ 0 |  8
+ 0 |  9
+ 1 | 10
+ 1 | 11
+ 1 | 12
+ 1 | 13
+ 1 | 14
+ 1 | 15
+ 1 | 16
+ 1 | 17
+ 1 | 18
+ 1 | 19
+ 2 | 20
+ 2 | 21
+ 2 | 22
+ 2 | 23
+ 2 | 24
+ 2 | 25
+ 2 | 26
+ 2 | 27
+ 2 | 28
+ 2 | 29
+ 3 | 30
+ 3 | 31
+ 3 | 32
+ 3 | 33
+ 3 | 34
+ 3 | 35
+ 3 | 36
+ 3 | 37
+ 3 | 38
+ 3 | 39
+ 4 | 40
+ 4 | 41
+ 4 | 42
+ 4 | 43
+ 4 | 44
+ 4 | 45
+ 4 | 46
+ 4 | 47
+ 4 | 48
+ 4 | 49
+ 5 | 50
+ 5 | 51
+ 5 | 52
+ 5 | 53
+ 5 | 54
+ 5 | 55
+ 5 | 56
+ 5 | 57
+ 5 | 58
+ 5 | 59
+ 6 | 60
+ 6 | 61
+ 6 | 62
+ 6 | 63
+ 6 | 64
+ 6 | 65
+ 6 | 66
+(66 rows)
+
+delete from t;
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 31;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+(31 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 32;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+(32 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 33;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+(33 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 65;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+(65 rows)
+
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+           QUERY PLAN            
+---------------------------------
+ Limit
+   ->  Incremental Sort
+         Sort Key: t.a, t.b
+         Presorted Key: t.a
+         ->  Sort
+               Sort Key: t.a
+               ->  Seq Scan on t
+(7 rows)
+
+select * from (select * from t order by a) s order by a, b limit 66;
+ a  | b  
+----+----
+  1 |  1
+  2 |  2
+  3 |  3
+  4 |  4
+  5 |  5
+  6 |  6
+  7 |  7
+  8 |  8
+  9 |  9
+ 10 | 10
+ 11 | 11
+ 12 | 12
+ 13 | 13
+ 14 | 14
+ 15 | 15
+ 16 | 16
+ 17 | 17
+ 18 | 18
+ 19 | 19
+ 20 | 20
+ 21 | 21
+ 22 | 22
+ 23 | 23
+ 24 | 24
+ 25 | 25
+ 26 | 26
+ 27 | 27
+ 28 | 28
+ 29 | 29
+ 30 | 30
+ 31 | 31
+ 32 | 32
+ 33 | 33
+ 34 | 34
+ 35 | 35
+ 36 | 36
+ 37 | 37
+ 38 | 38
+ 39 | 39
+ 40 | 40
+ 41 | 41
+ 42 | 42
+ 43 | 43
+ 44 | 44
+ 45 | 45
+ 46 | 46
+ 47 | 47
+ 48 | 48
+ 49 | 49
+ 50 | 50
+ 51 | 51
+ 52 | 52
+ 53 | 53
+ 54 | 54
+ 55 | 55
+ 56 | 56
+ 57 | 57
+ 58 | 58
+ 59 | 59
+ 60 | 60
+ 61 | 61
+ 62 | 62
+ 63 | 63
+ 64 | 64
+ 65 | 65
+ 66 | 66
+(66 rows)
+
+delete from t;
+drop table t;
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 10349ec29c..5f17afe0eb 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -8,6 +8,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 --
 -- Tests for list partitioned tables.
 --
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index a1c90eb905..01b7786f01 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -76,6 +76,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_gathermerge             | on
  enable_hashagg                 | on
  enable_hashjoin                | on
+ enable_incrementalsort         | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
  enable_material                | on
@@ -89,7 +90,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(17 rows)
+(18 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d33a4e143d..195c570efd 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -78,7 +78,7 @@ test: brin gin gist spgist privileges init_privs security_label collate matview
 # ----------
 # Another group of parallel tests
 # ----------
-test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8
+test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tidscan collate.icu.utf8 incremental_sort
 
 # rules cannot run concurrently with any test that creates
 # a view or rule in the public schema
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index f86f5c5682..7d72b8979e 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -88,6 +88,7 @@ test: select_distinct_on
 test: select_implicit
 test: select_having
 test: subselect
+test: incremental_sort
 test: union
 test: case
 test: join
diff --git a/src/test/regress/sql/incremental_sort.sql b/src/test/regress/sql/incremental_sort.sql
new file mode 100644
index 0000000000..b9df37412f
--- /dev/null
+++ b/src/test/regress/sql/incremental_sort.sql
@@ -0,0 +1,78 @@
+-- When we have to sort the entire table, incremental sort will
+-- be slower than plain sort, so it should not be used.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+
+-- When there is a LIMIT clause, incremental sort is beneficial because
+-- it only has to sort some of the groups, and not the entire table.
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten
+limit 1;
+
+-- When work_mem is not enough to sort the entire table, incremental sort
+-- may be faster if individual groups still fit into work_mem.
+set work_mem to '2MB';
+explain (costs off)
+select * from (select * from tenk1 order by four) t order by four, ten;
+reset work_mem;
+
+-- TODO: if an analyze happens here the plans might change; should we
+-- solve this by inserting extra rows, or by adding a GUC that would
+-- somehow force the type of plan we expect?
+create table t(a integer, b integer);
+
+-- A single large group tested around each mode transition point.
+insert into t(a, b) select 1, i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- An initial large group followed by a small group.
+insert into t(a, b) select (case when i < 50 then 1 else 2 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 55;
+select * from (select * from t order by a) s order by a, b limit 55;
+delete from t;
+
+-- An initial small group followed by a large group.
+insert into t(a, b) select (case when i < 5 then i else 9 end), i from generate_series(1, 100) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 70;
+select * from (select * from t order by a) s order by a, b limit 70;
+delete from t;
+
+-- Small groups of 10 tuples each tested around each mode transition point.
+insert into t(a, b) select i / 10, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+-- Small groups of only 1 tuple each tested around each mode transition point.
+insert into t(a, b) select i, i from generate_series(1, 70) n(i);
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 31;
+select * from (select * from t order by a) s order by a, b limit 31;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 32;
+select * from (select * from t order by a) s order by a, b limit 32;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 33;
+select * from (select * from t order by a) s order by a, b limit 33;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 65;
+select * from (select * from t order by a) s order by a, b limit 65;
+explain (costs off) select * from (select * from t order by a) s order by a, b limit 66;
+select * from (select * from t order by a) s order by a, b limit 66;
+delete from t;
+
+drop table t;
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index dcd6edbad2..6a8db29a07 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -9,6 +9,8 @@ SET enable_partitionwise_aggregate TO true;
 SET enable_partitionwise_join TO true;
 -- Disable parallel plans.
 SET max_parallel_workers_per_gather TO 0;
+-- Disable incremental sort, which can influence selected plans due to fuzz factor.
+SET enable_incrementalsort TO off;
 
 --
 -- Tests for list partitioned tables.
-- 
2.21.0
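
For anyone who wants to see the new GUC in action on a patched build, here is
a minimal sketch. It assumes only the enable_incrementalsort GUC added by
these patches (visible in the sysviews diff above); the table, index, and
data distribution are made up for illustration, and the exact plans may
differ depending on costing:

  -- groups of repeated 'a' values, presorted by an index on (a)
  create table t(a int, b int);
  insert into t(a, b) select i % 10, i from generate_series(1, 10000) n(i);
  create index on t(a);
  analyze t;
  -- with the GUC on, the presorted index path should let an
  -- Incremental Sort node appear under the Limit
  set enable_incrementalsort to on;
  explain (costs off) select * from t order by a, b limit 10;
  -- with it off, the planner should fall back to a full Sort
  set enable_incrementalsort to off;
  explain (costs off) select * from t order by a, b limit 10;
  reset enable_incrementalsort;
  drop table t;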
