Greetings, hackers:

I have worked up a patch to PostgreSQL which elides duplicate tuples during an external sort. The primary use case is when sorted input is being used to feed a DISTINCT operation. The idea is to throw out tuples that compare as identical whenever it's convenient, predicated on the assumption that even a single I/O is more expensive than some number of (potentially extra) comparisons. Obviously, this is where a cost model comes in, which has not been implemented. This patch is a work in progress.
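The eliding idea itself is simple enough to sketch outside of tuplesort.c. The following is a minimal standalone illustration, not code from the patch (the name write_run_dedup is mine): when a sorted run is written out, a tuple that compares equal to the previously written one can be dropped at the cost of a single comparison.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch: copy a sorted run to its destination (an output array here,
 * standing in for a tape), skipping any element equal to the last one
 * written.  One comparison per element buys a saved write per duplicate.
 */
static size_t
write_run_dedup(const int *sorted, size_t n, int *tape)
{
    size_t      written = 0;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (written > 0 && tape[written - 1] == sorted[i])
            continue;           /* elide: identical to last written value */
        tape[written++] = sorted[i];
    }
    return written;
}
```

With input {1, 1, 2, 3, 3}, only {1, 2, 3} reaches the tape.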
Being very new to hacking PostgreSQL, I've probably done a bunch of things wrong, but I've tried to test it thoroughly. A rough summary of the patch follows:

- A GUC variable enables or disables this capability.
- In nodeAgg.c, eliding duplicate tuples is enabled if the number of distinct columns is equal to the number of sort columns (and both are greater than zero).
- In createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first; ditto planner.c.
- All of the remaining changes are in tuplesort.c, and consist of:
  + A new macro, DISCARDTUP, and a new structure member, discardtup, are both defined and operate similarly to COMPARETUP, COPYTUP, etc.
  + In puttuple_common, when the state is TSS_BUILDRUNS, we *may* simply throw out the new tuple if it compares as identical to the tuple at the top of the heap. Since we're already performing this comparison, the elision is essentially free.
  + In mergeonerun, we may discard a tuple if it compares as identical to the *last written tuple*. This is a comparison that did not take place before, so it's not free, but it saves a write I/O.
  + We perform the same logic in dumptuples.

I have tested this patch with a bunch of different data; the results are summarized below.

Test 1: a corpus of approximately 1.5TB of textual data, consisting of 7.4 billion input rows, of which 12.6% are unique. This is a 'real-world' test.

  Before the patch: 44.25 hours using 274GB of temporary files
  After the patch:  36.44 hours using 125GB of temporary files
  Difference: -7.8 hours, 18% faster

Test 2: acquire the unique set of strings from a totally random set that contains no duplicates. This is a 'worst case'.

  Input: 700 million rows, 32GB
  Before: 18,796 seconds
  After:  19,358 seconds
  Difference: +562 seconds, *slower* by 3%

In this particular case, the performance varies from faster to slower across runs, and I'm not sure why. Statistical noise?
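Test 2's slowdown is consistent with a simple cost model: when every tuple is unique, the merge/dump paths pay one extra comparison per tuple and never save a write. A hedged sketch of that bookkeeping (plain ints in place of SortTuples; elide_cost_of is my name, not the patch's):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Back-of-envelope model of the worst case: scan a sorted run, counting
 * the extra comparisons the elide path performs (one per tuple after the
 * first) and the writes it would save (one per duplicate).  On
 * duplicate-free input the overhead is pure cost.
 */
struct elide_cost
{
    long        extra_comparisons;
    long        tuples_elided;
};

static struct elide_cost
elide_cost_of(const int *sorted, size_t n)
{
    struct elide_cost c = {0, 0};
    size_t      i;

    for (i = 1; i < n; i++)
    {
        c.extra_comparisons++;      /* compare against last written tuple */
        if (sorted[i] == sorted[i - 1])
            c.tuples_elided++;      /* duplicate: one write saved */
    }
    return c;
}
```

For unique input the elided count is zero while the comparison count grows with n, which is the overhead the worst-case test pays.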
Test 3: acquire the unique set of integers from a set of 100 million, all of which happen to have the value '1'. This is a 'best case'.

  Input: 100 million rows, 3.4GB on disk
  Before: 26.1 seconds using 1.3GB of temporary files
  After:  8.9 seconds using 8KB of disk
  Difference: -17.2 seconds, 66% faster

Test 4: similar to test 3, but using 1 billion rows.

  Before: 1450 seconds, 13GB of temporary files
  After:  468 seconds, 8KB of temporary files
  Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumbling about changes in pg_settings). Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of early September 2013. I have since rebased the patch against PostgreSQL 9.4 HEAD as of 19 January 2014 and re-run tests 2, 3, and 4, with similar results.

Regarding the patch: I've tried to conform to the indenting and style rules, including using pg_bsd_indent and pgindent, but I've not had 100% success.

The details on test 4, a simple example using 1 billion rows, all with the integer value '1' (a fabricated *best case*):
postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms

postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=1047226.447..1284922.616 rows=1000000000 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 13685248kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=29.832..450978.694 rows=1000000000 loops=1)
               Output: i
 Total runtime: 1449644.717 ms
(9 rows)

There are some issues that remain. For example, the following two queries behave very differently (they *are* different queries, of course).
Compare this:

postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(distinct i.i) from i ;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=894138.114..894138.115 rows=1 loops=1)
   Output: count(DISTINCT i)
   Buffers: shared hit=102 read=4424683, temp read=1466277 written=1466277
   ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.012..349142.072 rows=1000000000 loops=1)
         Output: i
         Buffers: shared hit=96 read=4424683
 Total runtime: 894138.233 ms
(7 rows)

Time: 894190.392 ms

to this (with this code path enabled):

postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(1) from (select distinct i.i from i) as sub ;
 Aggregate  (cost=182583418.28..182583418.29 rows=1 width=0) (actual time=451973.381..451973.417 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=192 read=4424587, temp read=1 written=1
   ->  Unique  (cost=177583418.27..182583418.27 rows=1 width=4) (actual time=451973.370..451973.372 rows=1 loops=1)
         Output: i.i
         Buffers: shared hit=192 read=4424587, temp read=1 written=1
         ->  Sort  (cost=177583418.27..180083418.27 rows=1000000000 width=4) (actual time=451973.366..451973.367 rows=1 loops=1)
               Output: i.i
               Sort Key: i.i
               Sort Method: external sort  Disk: 8kB
               Buffers: shared hit=192 read=4424587, temp read=1 written=1
               ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223587.173 rows=1000000000 loops=1)
                     Output: i.i
                     Buffers: shared hit=192 read=4424587
 Total runtime: 451994.583 ms

which also logged the following information:

LOG: tuplesort_end: opportunistically elided 999999999 tuples (1864133 extra comparisons): 0 during initial accrual, 998135866 during build, 0 during merge, 1864133 during dump

For comparison purposes, here is
the same query but with hash aggregation enabled:

 Aggregate  (cost=16924779.02..16924779.03 rows=1 width=0) (actual time=559379.525..559379.525 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=224 read=4424555
   ->  HashAggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=559379.513..559379.513 rows=1 loops=1)
         Output: i.i
         Group Key: i.i
         Buffers: shared hit=224 read=4424555
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223704.468 rows=1000000000 loops=1)
               Output: i.i
               Buffers: shared hit=224 read=4424555
 Total runtime: 559379.608 ms
(11 rows)

A note: this work was supported by my employer, Renesys Corporation ( http://www.renesys.com/ ). The patch was formatted using 'unified' diff, per http://wiki.postgresql.org/wiki/Submitting_a_Patch

--
Jon
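To make the diff below easier to follow, here is a simplified model of the bookkeeping shared by mergeonerun and dumptuples: remember the last tuple written to the destination tape (last_dumped_tup/have_tup in the patch) and skip any new tuple that compares equal to it. This is my own illustrative C, with ints standing in for SortTuples and the counters mirroring tuplesElided/extraComparisons:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model of the merge/dump elide logic: the first tuple is
 * always written; thereafter each candidate costs one extra comparison
 * against the last written tuple, and equal tuples are dropped instead
 * of written.
 */
struct elide_state
{
    int         last_written;
    int         have_tup;           /* is last_written valid yet? */
    long        tuples_elided;
    long        extra_comparisons;
};

static size_t
merge_write(struct elide_state *st, const int *sorted, size_t n, int *tape)
{
    size_t      written = 0;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (!st->have_tup)
        {
            tape[written++] = sorted[i];    /* first tuple: always write */
            st->last_written = sorted[i];
            st->have_tup = 1;
            continue;
        }
        st->extra_comparisons++;
        if (sorted[i] == st->last_written)
        {
            st->tuples_elided++;            /* duplicate: one write saved */
            continue;
        }
        tape[written++] = sorted[i];
        st->last_written = sorted[i];
    }
    return written;
}
```

On {1, 1, 1, 2, 2, 3} this writes three tuples, elides three, and performs five extra comparisons, matching the trade described above.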
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 8eae43d..cfdab17 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -2701,7 +2701,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot) state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator, InvalidOid, false, maintenance_work_mem, - false); + false, false); state.htups = state.itups = state.tups_inserted = 0; (void) index_bulk_delete(&ivinfo, NULL, diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 7e4bca5..2c09dec 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -115,6 +115,7 @@ #include "utils/tuplesort.h" #include "utils/datum.h" +extern bool trace_sort; /* * AggStatePerAggData - per-aggregate working state for the Agg scan @@ -343,6 +344,15 @@ initialize_aggregates(AggState *aggstate, */ if (peraggstate->numSortCols > 0) { + bool elideDuplicates = (peraggstate->numDistinctCols == peraggstate->numSortCols) && peraggstate->numDistinctCols > 0; + +#ifdef TRACE_SORT + if (trace_sort) + + elog(LOG, "initialize_aggregates: elideDuplicates = %s, numDistinctCols = %d, numSortCols = %d", + elideDuplicates ? "true" : "false", peraggstate->numDistinctCols, peraggstate->numSortCols); +#endif + /* * In case of rescan, maybe there could be an uncompleted sort * operation? Clean it up if so. 
@@ -361,14 +371,14 @@ initialize_aggregates(AggState *aggstate, peraggstate->sortOperators[0], peraggstate->sortCollations[0], peraggstate->sortNullsFirst[0], - work_mem, false) : + work_mem, false, elideDuplicates) : tuplesort_begin_heap(peraggstate->evaldesc, peraggstate->numSortCols, peraggstate->sortColIdx, peraggstate->sortOperators, peraggstate->sortCollations, peraggstate->sortNullsFirst, - work_mem, false); + work_mem, false, elideDuplicates); } /* diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c index b88571b..3afe8b3 100644 --- a/src/backend/executor/nodeSort.c +++ b/src/backend/executor/nodeSort.c @@ -89,7 +89,8 @@ ExecSort(SortState *node) plannode->collations, plannode->nullsFirst, work_mem, - node->randomAccess); + node->randomAccess, + plannode->elideDuplicates); if (node->bounded) tuplesort_set_bound(tuplesortstate, node->bound); node->tuplesortstate = (void *) tuplesortstate; diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 184d37a..1b5aaee 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -1063,6 +1063,11 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path) groupColPos++; } plan = (Plan *) make_sort_from_sortclauses(root, sortList, subplan); + { + Sort *s = (Sort *) plan; + + s->elideDuplicates = true; + } plan = (Plan *) make_unique(plan, sortList); } @@ -3763,6 +3768,7 @@ make_sort(PlannerInfo *root, Plan *lefttree, int numCols, node->sortOperators = sortOperators; node->collations = collations; node->nullsFirst = nullsFirst; + node->elideDuplicates = false; return node; } diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 35bda67..fe01511 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -1889,6 +1889,11 @@ grouping_planner(PlannerInfo *root, double tuple_fraction) result_plan, current_pathkeys, -1.0); + { + Sort 
*s = (Sort *) result_plan; + + s->elideDuplicates = true; + } } result_plan = (Plan *) make_unique(result_plan, diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c index e5324e2..724f894 100644 --- a/src/backend/utils/adt/orderedsetaggs.c +++ b/src/backend/utils/adt/orderedsetaggs.c @@ -280,13 +280,13 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples) qstate->sortOperators, qstate->sortCollations, qstate->sortNullsFirsts, - work_mem, false); + work_mem, false, false); else osastate->sortstate = tuplesort_begin_datum(qstate->sortColType, qstate->sortOperator, qstate->sortCollation, qstate->sortNullsFirst, - work_mem, false); + work_mem, false, false); osastate->number_of_rows = 0; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 1217098..f5d616d 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -130,6 +130,7 @@ extern char *SSLCipherSuites; extern char *SSLECDHCurve; extern bool SSLPreferServerCiphers; +extern bool enable_opportunistic_tuple_elide; #ifdef TRACE_SORT extern bool trace_sort; #endif @@ -733,6 +734,15 @@ static struct config_bool ConfigureNamesBool[] = NULL, NULL, NULL }, { + {"enable_opportunistic_tuple_elide", PGC_USERSET, QUERY_TUNING_METHOD, + gettext_noop("Enables the sort code to opportunistically elide tuples."), + NULL + }, + &enable_opportunistic_tuple_elide, + false, + NULL, NULL, NULL + }, + { {"enable_material", PGC_USERSET, QUERY_TUNING_METHOD, gettext_noop("Enables the planner's use of materialization."), NULL diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 8b520c1..047ac16 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -131,6 +131,7 @@ bool trace_sort = false; #ifdef DEBUG_BOUNDED_SORT bool optimize_bounded_sort = true; #endif +bool enable_opportunistic_tuple_elide = false; /* @@ -209,6 +210,17 @@ struct Tuplesortstate bool randomAccess; 
/* did caller request random access? */ bool bounded; /* did caller specify a maximum number of * tuples to return? */ + + int elideDuplicates; + int64 tuplesElided; + int64 tuplesElidedDuringInitial; + int64 tuplesElidedDuringBuild; + int64 tuplesElidedDuringMerge; + int64 tuplesElidedDuringDump; + int64 extraComparisons; + bool have_tup; + SortTuple last_dumped_tup; + bool boundUsed; /* true if we made use of a bounded heap */ int bound; /* if bounded, the maximum number of tuples */ int64 availMem; /* remaining memory available, in bytes */ @@ -248,6 +260,15 @@ struct Tuplesortstate SortTuple *stup); /* + * Function to discard a previously-stored tuple. pfree() the out-of-line + * data (not the SortTuple struct!), and increase state->availMem by the + * amount of memory space thereby released. + */ + void (*discardtup) (Tuplesortstate *state, int tapenum, + SortTuple *stup); + + + /* * Function to read a stored tuple from tape back into memory. 'len' is * the already-read length of the stored tuple. 
Create a palloc'd copy, * initialize tuple/datum1/isnull1 in the target SortTuple struct, and @@ -394,6 +415,7 @@ struct Tuplesortstate #define COMPARETUP(state,a,b) ((*(state)->comparetup) (a, b, state)) #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup)) #define WRITETUP(state,tape,stup) ((*(state)->writetup) (state, tape, stup)) +#define DISCARDTUP(state,tape,stup) ((*(state)->discardtup) (state, tape, stup)) #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len)) #define REVERSEDIRECTION(state) ((*(state)->reversedirection) (state)) #define LACKMEM(state) ((state)->availMem < 0) @@ -449,7 +471,7 @@ struct Tuplesortstate } while(0) -static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess); +static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess, bool elideDuplicates); static void puttuple_common(Tuplesortstate *state, SortTuple *tuple); static void inittapes(Tuplesortstate *state); static void selectnewtape(Tuplesortstate *state); @@ -471,6 +493,8 @@ static int comparetup_heap(const SortTuple *a, const SortTuple *b, static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup); +static void discardtup_heap(Tuplesortstate *state, int tapenum, + SortTuple *stup); static void readtup_heap(Tuplesortstate *state, SortTuple *stup, int tapenum, unsigned int len); static void reversedirection_heap(Tuplesortstate *state); @@ -479,6 +503,8 @@ static int comparetup_cluster(const SortTuple *a, const SortTuple *b, static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup); +static void discardtup_cluster(Tuplesortstate *state, int tapenum, + SortTuple *stup); static void readtup_cluster(Tuplesortstate *state, SortTuple *stup, int tapenum, unsigned int len); static int 
comparetup_index_btree(const SortTuple *a, const SortTuple *b, @@ -488,6 +514,8 @@ static int comparetup_index_hash(const SortTuple *a, const SortTuple *b, static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup); +static void discardtup_index(Tuplesortstate *state, int tapenum, + SortTuple *stup); static void readtup_index(Tuplesortstate *state, SortTuple *stup, int tapenum, unsigned int len); static void reversedirection_index_btree(Tuplesortstate *state); @@ -497,6 +525,8 @@ static int comparetup_datum(const SortTuple *a, const SortTuple *b, static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup); +static void discardtup_datum(Tuplesortstate *state, int tapenum, + SortTuple *stup); static void readtup_datum(Tuplesortstate *state, SortTuple *stup, int tapenum, unsigned int len); static void reversedirection_datum(Tuplesortstate *state); @@ -532,7 +562,7 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup); */ static Tuplesortstate * -tuplesort_begin_common(int workMem, bool randomAccess) +tuplesort_begin_common(int workMem, bool randomAccess, bool elideDuplicates) { Tuplesortstate *state; MemoryContext sortcontext; @@ -564,6 +594,28 @@ tuplesort_begin_common(int workMem, bool randomAccess) state->status = TSS_INITIAL; state->randomAccess = randomAccess; state->bounded = false; + + if (!enable_opportunistic_tuple_elide && elideDuplicates) + { + elog(LOG, "tuplesort_begin_common: enable_opportunistic_tuple_elide is false but elideDuplicates is true. Changing to false."); + state->elideDuplicates = false; + } + else + { + state->elideDuplicates = elideDuplicates; + if (trace_sort) + { + elog(LOG, "tuplesort_begin_common: elideDuplicates: %s", elideDuplicates ? 
"true" : "false"); + } + } + state->tuplesElided = 0; + state->tuplesElidedDuringInitial = 0; + state->tuplesElidedDuringBuild = 0; + state->tuplesElidedDuringMerge = 0; + state->tuplesElidedDuringDump = 0; + state->extraComparisons = 0; + state->have_tup = false; + state->boundUsed = false; state->allowedMem = workMem * (int64) 1024; state->availMem = state->allowedMem; @@ -600,9 +652,9 @@ tuplesort_begin_heap(TupleDesc tupDesc, int nkeys, AttrNumber *attNums, Oid *sortOperators, Oid *sortCollations, bool *nullsFirstFlags, - int workMem, bool randomAccess) + int workMem, bool randomAccess, bool elideDuplicates) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, elideDuplicates); MemoryContext oldcontext; int i; @@ -613,8 +665,8 @@ tuplesort_begin_heap(TupleDesc tupDesc, #ifdef TRACE_SORT if (trace_sort) elog(LOG, - "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c", - nkeys, workMem, randomAccess ? 't' : 'f'); + "tuplesort_begin_heap: nkeys = %d, workMem = %d, randomAccess = %c, elideDuplicates = %s", + nkeys, workMem, randomAccess ? 't' : 'f', state->elideDuplicates ? 
"true" : "false"); #endif state->nKeys = nkeys; @@ -628,6 +680,7 @@ tuplesort_begin_heap(TupleDesc tupDesc, state->comparetup = comparetup_heap; state->copytup = copytup_heap; state->writetup = writetup_heap; + state->discardtup = discardtup_heap; state->readtup = readtup_heap; state->reversedirection = reversedirection_heap; @@ -664,7 +717,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc, Relation indexRel, int workMem, bool randomAccess) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false); MemoryContext oldcontext; Assert(indexRel->rd_rel->relam == BTREE_AM_OID); @@ -674,7 +727,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc, #ifdef TRACE_SORT if (trace_sort) elog(LOG, - "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c", + "tuplesort_begin_cluster: nkeys = %d, workMem = %d, randomAccess = %c", RelationGetNumberOfAttributes(indexRel), workMem, randomAccess ? 't' : 'f'); #endif @@ -690,6 +743,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc, state->comparetup = comparetup_cluster; state->copytup = copytup_cluster; state->writetup = writetup_cluster; + state->discardtup = discardtup_cluster; state->readtup = readtup_cluster; state->reversedirection = reversedirection_index_btree; @@ -726,7 +780,7 @@ tuplesort_begin_index_btree(Relation heapRel, bool enforceUnique, int workMem, bool randomAccess) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false); MemoryContext oldcontext; oldcontext = MemoryContextSwitchTo(state->sortcontext); @@ -750,6 +804,7 @@ tuplesort_begin_index_btree(Relation heapRel, state->comparetup = comparetup_index_btree; state->copytup = copytup_index; state->writetup = writetup_index; + state->discardtup = discardtup_index; state->readtup = readtup_index; state->reversedirection = reversedirection_index_btree; @@ -769,7 +824,7 @@ 
tuplesort_begin_index_hash(Relation heapRel, uint32 hash_mask, int workMem, bool randomAccess) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false); MemoryContext oldcontext; oldcontext = MemoryContextSwitchTo(state->sortcontext); @@ -787,6 +842,7 @@ tuplesort_begin_index_hash(Relation heapRel, state->comparetup = comparetup_index_hash; state->copytup = copytup_index; state->writetup = writetup_index; + state->discardtup = discardtup_index; state->readtup = readtup_index; state->reversedirection = reversedirection_index_hash; @@ -803,9 +859,9 @@ tuplesort_begin_index_hash(Relation heapRel, Tuplesortstate * tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, bool nullsFirstFlag, - int workMem, bool randomAccess) + int workMem, bool randomAccess, bool elideDuplicates) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, elideDuplicates); MemoryContext oldcontext; int16 typlen; bool typbyval; @@ -815,8 +871,8 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, #ifdef TRACE_SORT if (trace_sort) elog(LOG, - "begin datum sort: workMem = %d, randomAccess = %c", - workMem, randomAccess ? 't' : 'f'); + "tuplesort_begin_datum: workMem = %d, randomAccess = %c, elideDuplicates = %s", + workMem, randomAccess ? 't' : 'f', state->elideDuplicates ? 
"true" : "false"); #endif state->nKeys = 1; /* always a one-column sort */ @@ -830,6 +886,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, state->comparetup = comparetup_datum; state->copytup = copytup_datum; state->writetup = writetup_datum; + state->discardtup = discardtup_datum; state->readtup = readtup_datum; state->reversedirection = reversedirection_datum; @@ -886,6 +943,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound) state->bounded = true; state->bound = (int) bound; + + if (state->elideDuplicates) + { + elog(LOG, "tuplesort_set_bound: elideDuplicates is true but this is a bounded sort. Changing to false."); + state->elideDuplicates = false; + } } /* @@ -930,6 +993,16 @@ tuplesort_end(Tuplesortstate *state) else elog(LOG, "internal sort ended, %ld KB used: %s", spaceUsed, pg_rusage_show(&state->ru_start)); + if (state->elideDuplicates) + { + elog(LOG, "tuplesort_end: opportunistically elided %ld tuples " \ + "(%ld extra comparisons): %ld during initial accrual, %ld during build, %ld during merge, %ld during dump", + state->tuplesElided, state->extraComparisons, + state->tuplesElidedDuringInitial, + state->tuplesElidedDuringBuild, + state->tuplesElidedDuringMerge, + state->tuplesElidedDuringDump); + } } TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed); @@ -1297,10 +1370,25 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple) * this point; see dumptuples. 
*/ Assert(state->memtupcount > 0); - if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0) - tuplesort_heap_insert(state, tuple, state->currentRun, true); - else - tuplesort_heap_insert(state, tuple, state->currentRun + 1, true); + + { + int ret = COMPARETUP(state, tuple, &state->memtuples[0]); + + if (ret == 0 && state->elideDuplicates) + { + free_sort_tuple(state, tuple); + state->tuplesElided++; + state->tuplesElidedDuringBuild++; + } + else if (ret >= 0) + { + tuplesort_heap_insert(state, tuple, state->currentRun, true); + } + else + { + tuplesort_heap_insert(state, tuple, state->currentRun + 1, true); + } + } /* * If we are over the memory limit, dump tuples till we're under. @@ -1403,6 +1491,10 @@ tuplesort_performsort(Tuplesortstate *state) pg_rusage_show(&state->ru_start)); } #endif + if (trace_sort && state->elideDuplicates) + { + elog(LOG, "performsort: opportunistically elided %ld tuples.", state->tuplesElided); + } MemoryContextSwitchTo(oldcontext); } @@ -1847,6 +1939,8 @@ inittapes(Tuplesortstate *state) if (trace_sort) elog(LOG, "switching to external sort with %d tapes: %s", maxTapes, pg_rusage_show(&state->ru_start)); + if (state->elideDuplicates) + elog(LOG, "inittapes: opportunistically elided %ld tuples so far.", state->tuplesElided); #endif /* @@ -2124,7 +2218,39 @@ mergeonerun(Tuplesortstate *state) /* write the tuple to destTape */ priorAvail = state->availMem; srcTape = state->memtuples[0].tupindex; - WRITETUP(state, destTape, &state->memtuples[0]); + + if (state->elideDuplicates && !state->have_tup) + { + WRITETUP(state, destTape, &state->memtuples[0]); + state->last_dumped_tup = state->memtuples[0]; + /* don't discard it */ + state->have_tup = true; + } + else + { /* have_tup || !elide duplicates */ + if (state->elideDuplicates) + { + if (COMPARETUP(state, &state->last_dumped_tup, &state->memtuples[0]) == 0) + { + DISCARDTUP(state, destTape, &state->memtuples[0]); + state->tuplesElided++; + state->tuplesElidedDuringMerge++; + } + else + 
{ + WRITETUP(state, destTape, &state->memtuples[0]); + DISCARDTUP(state, destTape, &state->last_dumped_tup); + state->last_dumped_tup = state->memtuples[0]; + } + state->extraComparisons++; + } + else + { + WRITETUP(state, destTape, &state->memtuples[0]); + DISCARDTUP(state, destTape, &state->memtuples[0]); + } + } + /* writetup adjusted total free space, now fix per-tape space */ spaceFreed = state->availMem - priorAvail; state->mergeavailmem[srcTape] += spaceFreed; @@ -2150,6 +2276,12 @@ mergeonerun(Tuplesortstate *state) state->mergeavailslots[srcTape]++; } + if (state->elideDuplicates && state->have_tup) + { + DISCARDTUP(state, state->tp_tapenum[state->destTape], &state->last_dumped_tup); + state->have_tup = false; + } + /* * When the heap empties, we're done. Write an end-of-run marker on the * output tape, and increment its count of real runs. @@ -2159,8 +2291,14 @@ mergeonerun(Tuplesortstate *state) #ifdef TRACE_SORT if (trace_sort) + { elog(LOG, "finished %d-way merge step: %s", state->activeTapes, pg_rusage_show(&state->ru_start)); + if (state->elideDuplicates) + { + elog(LOG, "mergeonerun: opportunistically elided %ld tuples.", state->tuplesElided); + } + } #endif } @@ -2370,9 +2508,42 @@ dumptuples(Tuplesortstate *state, bool alltuples) * Dump the heap's frontmost entry, and sift up to remove it from the * heap. 
*/ + int destTape = state->tp_tapenum[state->destTape]; + Assert(state->memtupcount > 0); - WRITETUP(state, state->tp_tapenum[state->destTape], - &state->memtuples[0]); + + if (state->elideDuplicates && !state->have_tup) + { + WRITETUP(state, destTape, &state->memtuples[0]); + state->last_dumped_tup = state->memtuples[0]; + /* don't discard it */ + state->have_tup = true; + } + else + { /* have_tup || !elide duplicates */ + if (state->elideDuplicates) + { + if (COMPARETUP(state, &state->last_dumped_tup, &state->memtuples[0]) == 0) + { + DISCARDTUP(state, destTape, &state->memtuples[0]); + state->tuplesElided++; + state->tuplesElidedDuringDump++; + } + else + { + WRITETUP(state, destTape, &state->memtuples[0]); + DISCARDTUP(state, destTape, &state->last_dumped_tup); + state->last_dumped_tup = state->memtuples[0]; + } + state->extraComparisons++; + } + else + { + WRITETUP(state, destTape, &state->memtuples[0]); + DISCARDTUP(state, destTape, &state->memtuples[0]); + } + } + tuplesort_heap_siftup(state, true); /* @@ -2389,10 +2560,16 @@ dumptuples(Tuplesortstate *state, bool alltuples) #ifdef TRACE_SORT if (trace_sort) + { elog(LOG, "finished writing%s run %d to tape %d: %s", (state->memtupcount == 0) ? " final" : "", state->currentRun, state->destTape, pg_rusage_show(&state->ru_start)); + if (state->elideDuplicates) + { + elog(LOG, "dumptuples: opportunistically elided %ld tuples.", state->tuplesElided); + } + } #endif /* @@ -2404,6 +2581,16 @@ dumptuples(Tuplesortstate *state, bool alltuples) selectnewtape(state); } } + + if (alltuples) + { + /* last run */ + if (state->elideDuplicates && state->have_tup) + { + DISCARDTUP(state, state->tp_tapenum[state->destTape], &state->last_dumped_tup); + state->have_tup = false; + } + } } /* @@ -2935,6 +3122,12 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup) if (state->randomAccess) /* need trailing length word? 
*/ LogicalTapeWrite(state->tapeset, tapenum, (void *) &tuplen, sizeof(tuplen)); +} + +static void +discardtup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup) +{ + MinimalTuple tuple = (MinimalTuple) stup->tuple; FREEMEM(state, GetMemoryChunkSpace(tuple)); heap_free_minimal_tuple(tuple); @@ -3123,6 +3316,13 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup) if (state->randomAccess) /* need trailing length word? */ LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen)); +} + + +static void +discardtup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup) +{ + HeapTuple tuple = (HeapTuple) stup->tuple; FREEMEM(state, GetMemoryChunkSpace(tuple)); heap_freetuple(tuple); @@ -3365,6 +3565,12 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup) if (state->randomAccess) /* need trailing length word? */ LogicalTapeWrite(state->tapeset, tapenum, (void *) &tuplen, sizeof(tuplen)); +} + +static void +discardtup_index(Tuplesortstate *state, int tapenum, SortTuple *stup) +{ + IndexTuple tuple = (IndexTuple) stup->tuple; FREEMEM(state, GetMemoryChunkSpace(tuple)); pfree(tuple); @@ -3463,11 +3669,17 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup) if (state->randomAccess) /* need trailing length word? 
*/ LogicalTapeWrite(state->tapeset, tapenum, (void *) &writtenlen, sizeof(writtenlen)); +} + +static void +discardtup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup) +{ if (stup->tuple) { FREEMEM(state, GetMemoryChunkSpace(stup->tuple)); pfree(stup->tuple); + stup->tuple = NULL; } } @@ -3524,4 +3736,5 @@ free_sort_tuple(Tuplesortstate *state, SortTuple *stup) { FREEMEM(state, GetMemoryChunkSpace(stup->tuple)); pfree(stup->tuple); + stup->tuple = NULL; } diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 846c31a..34de503 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -913,6 +913,7 @@ typedef struct SortGroupClause Oid sortop; /* the ordering operator ('<' op), or 0 */ bool nulls_first; /* do NULLs come before normal values? */ bool hashable; /* can eqop be implemented by hashing? */ + bool elideDuplicates; } SortGroupClause; /* diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index 38c039c..78ccc72 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -586,6 +586,7 @@ typedef struct Sort Oid *sortOperators; /* OIDs of operators to sort them by */ Oid *collations; /* OIDs of collations */ bool *nullsFirst; /* NULLS FIRST/LAST directions */ + bool elideDuplicates; } Sort; /* --------------- diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h index 05445f0..bdb73c6 100644 --- a/src/include/utils/tuplesort.h +++ b/src/include/utils/tuplesort.h @@ -62,7 +62,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc, int nkeys, AttrNumber *attNums, Oid *sortOperators, Oid *sortCollations, bool *nullsFirstFlags, - int workMem, bool randomAccess); + int workMem, bool randomAccess, bool elideDuplicates); extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc, Relation indexRel, int workMem, bool randomAccess); @@ -77,7 +77,7 @@ extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel, extern 
Tuplesortstate *tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, bool nullsFirstFlag, - int workMem, bool randomAccess); + int workMem, bool randomAccess, bool elideDuplicates); extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers