Greetings -hackers:
I have worked up a patch to PostgreSQL which elides duplicate tuples
during an external sort. The primary use case is sorted input feeding
a DISTINCT operation. The idea is to throw out tuples that compare as
identical whenever it's convenient, predicated on the assumption that
even a single I/O is more expensive than some number of (potentially
extra) comparisons. Obviously, this is where a cost model comes in;
that has not been implemented yet. This patch is a work in progress.
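To put rough numbers on that assumption (an illustration using the
planner's stock cost parameters, not anything the patch computes): with
seq_page_cost = 1.0 and cpu_operator_cost = 0.0025, one page of
temporary-file traffic is priced like 400 comparisons, and every elided
tuple saves both a write and a later read, so an extra comparison per
tuple needs only a small elision rate to pay for itself.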
Being very new to hacking PostgreSQL, I've probably done a bunch of
things wrong, but I've tried to test it thoroughly.
A rough summary of the patch follows:
- a GUC variable, enable_opportunistic_tuple_elide (defaulting to off),
enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
distinct columns is equal to the number of sort columns (and both are
greater than zero).
- in createplan.c, eliding duplicate tuples is enabled if we are
creating a unique plan which involves sorting first
- ditto planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
 + a new macro, DISCARDTUP, and a new structure member, discardtup, are
both defined and operate similarly to COMPARETUP, COPYTUP, etc...
 + in puttuple_common, when the state is TSS_BUILDRUNS, we *may* simply
throw out the new tuple if it compares as identical to the tuple at
the top of the heap. Since we're already performing this comparison,
this is essentially free.
 + in mergeonerun, we may discard a tuple if it compares as identical
to the *last written tuple*. This is a comparison that did not take
place before, so it's not free, but it saves a write I/O.
 + we perform the same logic in dumptuples; the shared write-or-discard
path is sketched just below
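To make the tuplesort.c changes concrete, here is a condensed sketch of
the write-or-discard path that mergeonerun and dumptuples share (a
restatement of the patch hunks below, with the non-eliding branch and
per-phase counters omitted; all names are from the patch):

    /*
     * state->last_dumped_tup remembers the tuple most recently written
     * to destTape; state->have_tup says whether it is valid.
     * memtuples[0] is the frontmost (smallest) tuple of the in-memory
     * heap.
     */
    if (!state->have_tup)
    {
        /* First tuple of this run: always write it, and remember it. */
        WRITETUP(state, destTape, &state->memtuples[0]);
        state->last_dumped_tup = state->memtuples[0];
        state->have_tup = true;
    }
    else if (COMPARETUP(state, &state->last_dumped_tup,
                        &state->memtuples[0]) == 0)
    {
        /* Identical to the last written tuple: discard, saving a write. */
        DISCARDTUP(state, destTape, &state->memtuples[0]);
        state->tuplesElided++;
    }
    else
    {
        /* A new distinct value: write it and make it the new anchor. */
        WRITETUP(state, destTape, &state->memtuples[0]);
        DISCARDTUP(state, destTape, &state->last_dumped_tup);
        state->last_dumped_tup = state->memtuples[0];
    }

The puttuple_common case needs no extra comparison at all: while
building runs it already compares each incoming tuple against the top
of the heap, so when that comparison returns equal (and eliding is on)
the tuple is simply freed instead of inserted.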
I have tested this patch with a bunch of different data, and the
results are summarized below.
Test 1:
I used a corpus of approximately 1.5TB of textual data, consisting of
7.4 billion input rows, 12.6% of which is unique. This is a 'real-world'
test.
Before the patch: 44.25 hours using 274GB of temporary files
After the patch: 36.44 hours using 125GB of temporary files
Difference: -7.81 hours, roughly 18% faster
Test 2:
Acquire the unique set of strings from a totally random set that
contains no duplicates. This is a 'worst case'.
Input: 700 million rows, 32GB.
Before: 18,796 seconds.
After: 19,358 seconds
Difference: +562 seconds, *slower* by 3%.
Across runs of this particular case, the performance varies from
slightly faster to slightly slower, and I'm not sure why; statistical
noise? (This is also the case where the extra comparisons in
mergeonerun and dumptuples never pay off, since there are no duplicates
to elide.)
Test 3:
Acquire the unique set of integers from a set of 100 million, all
happen to have the value '1'. This is a 'best case'.
Input: 100 million rows, 3.4GB on-disk
Before: 26.1 seconds using 1.3GB of temporary files
After: 8.9 seconds using 8KB of disk
Difference: -17.2 seconds, 65% faster
Test 4:
Similar to test 3, but using 1 billion rows.
Before: 1450 seconds, 13 GB of temporary files
After: 468 seconds, 8 KB of temporary files
Difference: -982 seconds, 68% faster
See below for details on test 4.
Lastly, 'make check' passes without issues (modulo the rangefuncs test
grumbling about the new entry in pg_settings).
Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of
early September 2013. I have rebased the patch against
PostgreSQL 9.4 HEAD as of 19 January 2014, and re-run tests
2, 3, and 4 with similar results.
Regarding the patch: I've tried to conform to the indentation and style
rules, including running pg_bsd_indent and pgindent, but I've not had
100% success.
The details on test 4:
A simple example, using 1 billion rows all with the value '1' (integer):
This is a fabricated *best case*.
postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
Total runtime: 468189.125 ms
(9 rows)
Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=1047226.447..1284922.616 rows=1000000000 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 13685248kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=29.832..450978.694 rows=1000000000 loops=1)
               Output: i
Total runtime: 1449644.717 ms
(9 rows)
There are some issues that remain.
For example, the following two queries behave very differently (they
*are* different queries, of course).
Compare this:
postgres=# explain ( analyze true, verbose true, costs true, buffers
true, timing true ) select count(distinct i.i) from i ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=894138.114..894138.115 rows=1 loops=1)
   Output: count(DISTINCT i)
   Buffers: shared hit=102 read=4424683, temp read=1466277 written=1466277
   ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.012..349142.072 rows=1000000000 loops=1)
         Output: i
         Buffers: shared hit=96 read=4424683
Total runtime: 894138.233 ms
(7 rows)
Time: 894190.392 ms
postgres=#
to this (with this code path enabled):
postgres=# explain ( analyze true, verbose true, costs true, buffers
true, timing true ) select count(1) from (select distinct i.i from i)
as sub ;
 Aggregate  (cost=182583418.28..182583418.29 rows=1 width=0) (actual time=451973.381..451973.417 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=192 read=4424587, temp read=1 written=1
   ->  Unique  (cost=177583418.27..182583418.27 rows=1 width=4) (actual time=451973.370..451973.372 rows=1 loops=1)
         Output: i.i
         Buffers: shared hit=192 read=4424587, temp read=1 written=1
         ->  Sort  (cost=177583418.27..180083418.27 rows=1000000000 width=4) (actual time=451973.366..451973.367 rows=1 loops=1)
               Output: i.i
               Sort Key: i.i
               Sort Method: external sort  Disk: 8kB
               Buffers: shared hit=192 read=4424587, temp read=1 written=1
               ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223587.173 rows=1000000000 loops=1)
                     Output: i.i
                     Buffers: shared hit=192 read=4424587
Total runtime: 451994.583 ms
(which also logged the following information)
LOG:  tuplesort_end: opportunistically elided 999999999 tuples (1864133 extra comparisons): 0 during initial accrual, 998135866 during build, 0 during merge, 1864133 during dump
For comparison purposes, here is the same query but with hash
aggregation enabled:
 Aggregate  (cost=16924779.02..16924779.03 rows=1 width=0) (actual time=559379.525..559379.525 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=224 read=4424555
   ->  HashAggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=559379.513..559379.513 rows=1 loops=1)
         Output: i.i
         Group Key: i.i
         Buffers: shared hit=224 read=4424555
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223704.468 rows=1000000000 loops=1)
               Output: i.i
               Buffers: shared hit=224 read=4424555
Total runtime: 559379.608 ms
(11 rows)
A note: this work was supported by my employer, Renesys Corporation (
http://www.renesys.com/ ).
The patch was formatted using 'unified' diff, per
http://wiki.postgresql.org/wiki/Submitting_a_Patch
--
Jon
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8eae43d..cfdab17 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2701,7 +2701,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator,
InvalidOid, false,
maintenance_work_mem,
-                                            false);
+                                            false, false);
state.htups = state.itups = state.tups_inserted = 0;
(void) index_bulk_delete(&ivinfo, NULL,
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7e4bca5..2c09dec 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -115,6 +115,7 @@
#include "utils/tuplesort.h"
#include "utils/datum.h"
+extern bool trace_sort;
/*
* AggStatePerAggData - per-aggregate working state for the Agg scan
@@ -343,6 +344,15 @@ initialize_aggregates(AggState *aggstate,
*/
if (peraggstate->numSortCols > 0)
{
+            bool        elideDuplicates = (peraggstate->numDistinctCols == peraggstate->numSortCols) &&
+                peraggstate->numDistinctCols > 0;
+
+#ifdef TRACE_SORT
+            if (trace_sort)
+                elog(LOG, "initialize_aggregates: elideDuplicates = %s, numDistinctCols = %d, numSortCols = %d",
+                     elideDuplicates ? "true" : "false",
+                     peraggstate->numDistinctCols, peraggstate->numSortCols);
+#endif
+
/*
             * In case of rescan, maybe there could be an uncompleted sort
             * operation?  Clean it up if so.
@@ -361,14 +371,14 @@ initialize_aggregates(AggState *aggstate,
peraggstate->sortOperators[0],
peraggstate->sortCollations[0],
peraggstate->sortNullsFirst[0],
-                                             work_mem, false) :
+                                             work_mem, false, elideDuplicates) :
tuplesort_begin_heap(peraggstate->evaldesc,
peraggstate->numSortCols,
peraggstate->sortColIdx,
peraggstate->sortOperators,
peraggstate->sortCollations,
peraggstate->sortNullsFirst,
-                                             work_mem, false);
+                                             work_mem, false, elideDuplicates);
}
/*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index b88571b..3afe8b3 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -89,7 +89,8 @@ ExecSort(SortState *node)
plannode->collations,
plannode->nullsFirst,
work_mem,
-                                          node->randomAccess);
+                                          node->randomAccess,
+                                          plannode->elideDuplicates);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
node->tuplesortstate = (void *) tuplesortstate;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 184d37a..1b5aaee 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1063,6 +1063,11 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path)
groupColPos++;
}
     plan = (Plan *) make_sort_from_sortclauses(root, sortList, subplan);
+ {
+ Sort *s = (Sort *) plan;
+
+ s->elideDuplicates = true;
+ }
plan = (Plan *) make_unique(plan, sortList);
}
@@ -3763,6 +3768,7 @@ make_sort(PlannerInfo *root, Plan *lefttree, int numCols,
node->sortOperators = sortOperators;
node->collations = collations;
node->nullsFirst = nullsFirst;
+ node->elideDuplicates = false;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 35bda67..fe01511 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1889,6 +1889,11 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
result_plan,
current_pathkeys,
-1.0);
+ {
+ Sort *s = (Sort *) result_plan;
+
+ s->elideDuplicates = true;
+ }
}
result_plan = (Plan *) make_unique(result_plan,
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index e5324e2..724f894 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -280,13 +280,13 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
qstate->sortOperators,
qstate->sortCollations,
qstate->sortNullsFirsts,
-                                           work_mem, false);
+                                           work_mem, false, false);
else
osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
qstate->sortOperator,
qstate->sortCollation,
qstate->sortNullsFirst,
-                                           work_mem, false);
+                                           work_mem, false, false);
osastate->number_of_rows = 0;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1217098..f5d616d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -130,6 +130,7 @@ extern char *SSLCipherSuites;
extern char *SSLECDHCurve;
extern bool SSLPreferServerCiphers;
+extern bool enable_opportunistic_tuple_elide;
#ifdef TRACE_SORT
extern bool trace_sort;
#endif
@@ -733,6 +734,15 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
{
+        {"enable_opportunistic_tuple_elide", PGC_USERSET, QUERY_TUNING_METHOD,
+            gettext_noop("Enables the sort code to opportunistically elide tuples."),
+ NULL
+ },
+ &enable_opportunistic_tuple_elide,
+ false,
+ NULL, NULL, NULL
+ },
+ {
{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
            gettext_noop("Enables the planner's use of materialization."),
NULL
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 8b520c1..047ac16 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -131,6 +131,7 @@ bool trace_sort = false;
#ifdef DEBUG_BOUNDED_SORT
bool optimize_bounded_sort = true;
#endif
+bool enable_opportunistic_tuple_elide = false;
/*
@@ -209,6 +210,17 @@ struct Tuplesortstate
bool randomAccess; /* did caller request random access? */
    bool        bounded;        /* did caller specify a maximum number of
                                 * tuples to return? */
+
+    bool        elideDuplicates;
+ int64 tuplesElided;
+ int64 tuplesElidedDuringInitial;
+ int64 tuplesElidedDuringBuild;
+ int64 tuplesElidedDuringMerge;
+ int64 tuplesElidedDuringDump;
+ int64 extraComparisons;
+ bool have_tup;
+ SortTuple last_dumped_tup;
+
    bool        boundUsed;      /* true if we made use of a bounded heap */
    int         bound;          /* if bounded, the maximum number of tuples */
    int64       availMem;       /* remaining memory available, in bytes */
@@ -248,6 +260,15 @@ struct Tuplesortstate
SortTuple *stup);
/*
+     * Function to discard a previously-stored tuple.  pfree() the out-of-line
+ * data (not the SortTuple struct!), and increase state->availMem by the
+ * amount of memory space thereby released.
+ */
+    void        (*discardtup) (Tuplesortstate *state, int tapenum,
+                               SortTuple *stup);
+
+
+ /*
* Function to read a stored tuple from tape back into memory. 'len' is
* the already-read length of the stored tuple. Create a palloc'd copy,
* initialize tuple/datum1/isnull1 in the target SortTuple struct, and
@@ -394,6 +415,7 @@ struct Tuplesortstate
#define COMPARETUP(state,a,b) ((*(state)->comparetup) (a, b, state))
#define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
#define WRITETUP(state,tape,stup)   ((*(state)->writetup) (state, tape, stup))
+#define DISCARDTUP(state,tape,stup) ((*(state)->discardtup) (state, tape, stup))
#define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
#define REVERSEDIRECTION(state) ((*(state)->reversedirection) (state))
#define LACKMEM(state) ((state)->availMem < 0)
@@ -449,7 +471,7 @@ struct Tuplesortstate
} while(0)
-static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
+static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess, bool elideDuplicates);
static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
static void inittapes(Tuplesortstate *state);
static void selectnewtape(Tuplesortstate *state);
@@ -471,6 +493,8 @@ static int comparetup_heap(const SortTuple *a, const SortTuple *b,
static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
static void writetup_heap(Tuplesortstate *state, int tapenum,
SortTuple *stup);
+static void discardtup_heap(Tuplesortstate *state, int tapenum,
+ SortTuple *stup);
static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_heap(Tuplesortstate *state);
@@ -479,6 +503,8 @@ static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
static void writetup_cluster(Tuplesortstate *state, int tapenum,
SortTuple *stup);
+static void discardtup_cluster(Tuplesortstate *state, int tapenum,
+ SortTuple *stup);
static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
@@ -488,6 +514,8 @@ static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
static void writetup_index(Tuplesortstate *state, int tapenum,
SortTuple *stup);
+static void discardtup_index(Tuplesortstate *state, int tapenum,
+ SortTuple *stup);
static void readtup_index(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_index_btree(Tuplesortstate *state);
@@ -497,6 +525,8 @@ static int comparetup_datum(const SortTuple *a, const SortTuple *b,
static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
static void writetup_datum(Tuplesortstate *state, int tapenum,
SortTuple *stup);
+static void discardtup_datum(Tuplesortstate *state, int tapenum,
+ SortTuple *stup);
static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_datum(Tuplesortstate *state);
@@ -532,7 +562,7 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
*/
static Tuplesortstate *
-tuplesort_begin_common(int workMem, bool randomAccess)
+tuplesort_begin_common(int workMem, bool randomAccess, bool elideDuplicates)
{
Tuplesortstate *state;
MemoryContext sortcontext;
@@ -564,6 +594,28 @@ tuplesort_begin_common(int workMem, bool randomAccess)
state->status = TSS_INITIAL;
state->randomAccess = randomAccess;
state->bounded = false;
+
+ if (!enable_opportunistic_tuple_elide && elideDuplicates)
+ {
+        elog(LOG, "tuplesort_begin_common: enable_opportunistic_tuple_elide is false but elideDuplicates is true. Changing to false.");
+ state->elideDuplicates = false;
+ }
+ else
+ {
+ state->elideDuplicates = elideDuplicates;
+ if (trace_sort)
+ {
+            elog(LOG, "tuplesort_begin_common: elideDuplicates: %s", elideDuplicates ? "true" : "false");
+ }
+ }
+ state->tuplesElided = 0;
+ state->tuplesElidedDuringInitial = 0;
+ state->tuplesElidedDuringBuild = 0;
+ state->tuplesElidedDuringMerge = 0;
+ state->tuplesElidedDuringDump = 0;
+ state->extraComparisons = 0;
+ state->have_tup = false;
+
state->boundUsed = false;
state->allowedMem = workMem * (int64) 1024;
state->availMem = state->allowedMem;
@@ -600,9 +652,9 @@ tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
                     Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess)
+                     int workMem, bool randomAccess, bool elideDuplicates)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+    Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, elideDuplicates);
MemoryContext oldcontext;
int i;
@@ -613,8 +665,8 @@ tuplesort_begin_heap(TupleDesc tupDesc,
#ifdef TRACE_SORT
if (trace_sort)
elog(LOG,
-             "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
-             nkeys, workMem, randomAccess ? 't' : 'f');
+             "tuplesort_begin_heap: nkeys = %d, workMem = %d, randomAccess = %c, elideDuplicates = %s",
+             nkeys, workMem, randomAccess ? 't' : 'f', state->elideDuplicates ? "true" : "false");
#endif
state->nKeys = nkeys;
@@ -628,6 +680,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
state->comparetup = comparetup_heap;
state->copytup = copytup_heap;
state->writetup = writetup_heap;
+ state->discardtup = discardtup_heap;
state->readtup = readtup_heap;
state->reversedirection = reversedirection_heap;
@@ -664,7 +717,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+    Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false);
MemoryContext oldcontext;
Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
@@ -674,7 +727,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
#ifdef TRACE_SORT
if (trace_sort)
elog(LOG,
-             "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+             "tuplesort_begin_cluster: nkeys = %d, workMem = %d, randomAccess = %c",
RelationGetNumberOfAttributes(indexRel),
workMem, randomAccess ? 't' : 'f');
#endif
@@ -690,6 +743,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
state->comparetup = comparetup_cluster;
state->copytup = copytup_cluster;
state->writetup = writetup_cluster;
+ state->discardtup = discardtup_cluster;
state->readtup = readtup_cluster;
state->reversedirection = reversedirection_index_btree;
@@ -726,7 +780,7 @@ tuplesort_begin_index_btree(Relation heapRel,
bool enforceUnique,
                        int workMem, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+    Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false);
MemoryContext oldcontext;
oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -750,6 +804,7 @@ tuplesort_begin_index_btree(Relation heapRel,
state->comparetup = comparetup_index_btree;
state->copytup = copytup_index;
state->writetup = writetup_index;
+ state->discardtup = discardtup_index;
state->readtup = readtup_index;
state->reversedirection = reversedirection_index_btree;
@@ -769,7 +824,7 @@ tuplesort_begin_index_hash(Relation heapRel,
uint32 hash_mask,
                       int workMem, bool randomAccess)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+    Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false);
MemoryContext oldcontext;
oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -787,6 +842,7 @@ tuplesort_begin_index_hash(Relation heapRel,
state->comparetup = comparetup_index_hash;
state->copytup = copytup_index;
state->writetup = writetup_index;
+ state->discardtup = discardtup_index;
state->readtup = readtup_index;
state->reversedirection = reversedirection_index_hash;
@@ -803,9 +859,9 @@ tuplesort_begin_index_hash(Relation heapRel,
Tuplesortstate *
tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
bool nullsFirstFlag,
- int workMem, bool randomAccess)
+                      int workMem, bool randomAccess, bool elideDuplicates)
{
- Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+    Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, elideDuplicates);
MemoryContext oldcontext;
int16 typlen;
bool typbyval;
@@ -815,8 +871,8 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
#ifdef TRACE_SORT
if (trace_sort)
elog(LOG,
- "begin datum sort: workMem = %d, randomAccess = %c",
- workMem, randomAccess ? 't' : 'f');
+             "tuplesort_begin_datum: workMem = %d, randomAccess = %c, elideDuplicates = %s",
+             workMem, randomAccess ? 't' : 'f', state->elideDuplicates ? "true" : "false");
#endif
state->nKeys = 1; /* always a one-column sort */
@@ -830,6 +886,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid
sortCollation,
state->comparetup = comparetup_datum;
state->copytup = copytup_datum;
state->writetup = writetup_datum;
+ state->discardtup = discardtup_datum;
state->readtup = readtup_datum;
state->reversedirection = reversedirection_datum;
@@ -886,6 +943,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
state->bounded = true;
state->bound = (int) bound;
+
+ if (state->elideDuplicates)
+ {
+        elog(LOG, "tuplesort_set_bound: elideDuplicates is true but this is a bounded sort. Changing to false.");
+ state->elideDuplicates = false;
+ }
}
/*
@@ -930,6 +993,16 @@ tuplesort_end(Tuplesortstate *state)
else
elog(LOG, "internal sort ended, %ld KB used: %s",
spaceUsed, pg_rusage_show(&state->ru_start));
+ if (state->elideDuplicates)
+ {
+            elog(LOG, "tuplesort_end: opportunistically elided %ld tuples "
+                 "(%ld extra comparisons): %ld during initial accrual, %ld during build, %ld during merge, %ld during dump",
+ state->tuplesElided, state->extraComparisons,
+ state->tuplesElidedDuringInitial,
+ state->tuplesElidedDuringBuild,
+ state->tuplesElidedDuringMerge,
+ state->tuplesElidedDuringDump);
+ }
}
TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
@@ -1297,10 +1370,25 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
* this point; see dumptuples.
*/
Assert(state->memtupcount > 0);
-            if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
-                tuplesort_heap_insert(state, tuple, state->currentRun, true);
-            else
-                tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+
+            {
+                int         ret = COMPARETUP(state, tuple, &state->memtuples[0]);
+
+                if (ret == 0 && state->elideDuplicates)
+                {
+                    free_sort_tuple(state, tuple);
+                    state->tuplesElided++;
+                    state->tuplesElidedDuringBuild++;
+                }
+                else if (ret >= 0)
+                {
+                    tuplesort_heap_insert(state, tuple, state->currentRun, true);
+                }
+                else
+                {
+                    tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+                }
+            }
/*
             * If we are over the memory limit, dump tuples till we're under.
@@ -1403,6 +1491,10 @@ tuplesort_performsort(Tuplesortstate *state)
pg_rusage_show(&state->ru_start));
}
#endif
+ if (trace_sort && state->elideDuplicates)
+ {
+        elog(LOG, "performsort: opportunistically elided %ld tuples.", state->tuplesElided);
+ }
MemoryContextSwitchTo(oldcontext);
}
@@ -1847,6 +1939,8 @@ inittapes(Tuplesortstate *state)
if (trace_sort)
elog(LOG, "switching to external sort with %d tapes: %s",
maxTapes, pg_rusage_show(&state->ru_start));
+ if (state->elideDuplicates)
+        elog(LOG, "inittapes: opportunistically elided %ld tuples so far.", state->tuplesElided);
#endif
/*
@@ -2124,7 +2218,39 @@ mergeonerun(Tuplesortstate *state)
/* write the tuple to destTape */
priorAvail = state->availMem;
srcTape = state->memtuples[0].tupindex;
-        WRITETUP(state, destTape, &state->memtuples[0]);
+
+        if (state->elideDuplicates && !state->have_tup)
+        {
+            WRITETUP(state, destTape, &state->memtuples[0]);
+            state->last_dumped_tup = state->memtuples[0];
+            /* don't discard it */
+            state->have_tup = true;
+        }
+        else
+        {                       /* have_tup || !elideDuplicates */
+            if (state->elideDuplicates)
+            {
+                if (COMPARETUP(state, &state->last_dumped_tup, &state->memtuples[0]) == 0)
+                {
+                    DISCARDTUP(state, destTape, &state->memtuples[0]);
+                    state->tuplesElided++;
+                    state->tuplesElidedDuringMerge++;
+                }
+                else
+                {
+                    WRITETUP(state, destTape, &state->memtuples[0]);
+                    DISCARDTUP(state, destTape, &state->last_dumped_tup);
+                    state->last_dumped_tup = state->memtuples[0];
+                }
+                state->extraComparisons++;
+            }
+            else
+            {
+                WRITETUP(state, destTape, &state->memtuples[0]);
+                DISCARDTUP(state, destTape, &state->memtuples[0]);
+            }
+        }
+
/* writetup adjusted total free space, now fix per-tape space */
spaceFreed = state->availMem - priorAvail;
state->mergeavailmem[srcTape] += spaceFreed;
@@ -2150,6 +2276,12 @@ mergeonerun(Tuplesortstate *state)
state->mergeavailslots[srcTape]++;
}
+ if (state->elideDuplicates && state->have_tup)
+ {
+        DISCARDTUP(state, state->tp_tapenum[state->destTape], &state->last_dumped_tup);
+ state->have_tup = false;
+ }
+
/*
* When the heap empties, we're done. Write an end-of-run marker on the
* output tape, and increment its count of real runs.
@@ -2159,8 +2291,14 @@ mergeonerun(Tuplesortstate *state)
#ifdef TRACE_SORT
if (trace_sort)
+ {
elog(LOG, "finished %d-way merge step: %s", state->activeTapes,
pg_rusage_show(&state->ru_start));
+ if (state->elideDuplicates)
+ {
+            elog(LOG, "mergeonerun: opportunistically elided %ld tuples.", state->tuplesElided);
+ }
+ }
#endif
}
@@ -2370,9 +2508,42 @@ dumptuples(Tuplesortstate *state, bool alltuples)
         * Dump the heap's frontmost entry, and sift up to remove it from the
         * heap.
         */
+        int         destTape = state->tp_tapenum[state->destTape];
+
Assert(state->memtupcount > 0);
- WRITETUP(state, state->tp_tapenum[state->destTape],
- &state->memtuples[0]);
+
+        if (state->elideDuplicates && !state->have_tup)
+        {
+            WRITETUP(state, destTape, &state->memtuples[0]);
+            state->last_dumped_tup = state->memtuples[0];
+            /* don't discard it */
+            state->have_tup = true;
+        }
+        else
+        {                       /* have_tup || !elideDuplicates */
+            if (state->elideDuplicates)
+            {
+                if (COMPARETUP(state, &state->last_dumped_tup, &state->memtuples[0]) == 0)
+                {
+                    DISCARDTUP(state, destTape, &state->memtuples[0]);
+                    state->tuplesElided++;
+                    state->tuplesElidedDuringDump++;
+                }
+                else
+                {
+                    WRITETUP(state, destTape, &state->memtuples[0]);
+                    DISCARDTUP(state, destTape, &state->last_dumped_tup);
+                    state->last_dumped_tup = state->memtuples[0];
+                }
+                state->extraComparisons++;
+            }
+            else
+            {
+                WRITETUP(state, destTape, &state->memtuples[0]);
+                DISCARDTUP(state, destTape, &state->memtuples[0]);
+            }
+        }
+
tuplesort_heap_siftup(state, true);
/*
@@ -2389,10 +2560,16 @@ dumptuples(Tuplesortstate *state, bool alltuples)
#ifdef TRACE_SORT
if (trace_sort)
+ {
                elog(LOG, "finished writing%s run %d to tape %d: %s",
                     (state->memtupcount == 0) ? " final" : "",
                     state->currentRun, state->destTape,
                     pg_rusage_show(&state->ru_start));
+ if (state->elideDuplicates)
+ {
+                    elog(LOG, "dumptuples: opportunistically elided %ld tuples.", state->tuplesElided);
+ }
+ }
#endif
/*
@@ -2404,6 +2581,16 @@ dumptuples(Tuplesortstate *state, bool alltuples)
selectnewtape(state);
}
}
+
+ if (alltuples)
+ {
+ /* last run */
+ if (state->elideDuplicates && state->have_tup)
+ {
+            DISCARDTUP(state, state->tp_tapenum[state->destTape], &state->last_dumped_tup);
+ state->have_tup = false;
+ }
+ }
}
/*
@@ -2935,6 +3122,12 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
if (state->randomAccess) /* need trailing length word? */
        LogicalTapeWrite(state->tapeset, tapenum,
                         (void *) &tuplen, sizeof(tuplen));
+}
+
+static void
+discardtup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
+ MinimalTuple tuple = (MinimalTuple) stup->tuple;
FREEMEM(state, GetMemoryChunkSpace(tuple));
heap_free_minimal_tuple(tuple);
@@ -3123,6 +3316,13 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
if (state->randomAccess) /* need trailing length word? */
LogicalTapeWrite(state->tapeset, tapenum,
&tuplen, sizeof(tuplen));
+}
+
+
+static void
+discardtup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
+ HeapTuple tuple = (HeapTuple) stup->tuple;
FREEMEM(state, GetMemoryChunkSpace(tuple));
heap_freetuple(tuple);
@@ -3365,6 +3565,12 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
if (state->randomAccess) /* need trailing length word? */
        LogicalTapeWrite(state->tapeset, tapenum,
                         (void *) &tuplen, sizeof(tuplen));
+}
+
+static void
+discardtup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
+ IndexTuple tuple = (IndexTuple) stup->tuple;
FREEMEM(state, GetMemoryChunkSpace(tuple));
pfree(tuple);
@@ -3463,11 +3669,17 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
if (state->randomAccess) /* need trailing length word? */
        LogicalTapeWrite(state->tapeset, tapenum,
                         (void *) &writtenlen, sizeof(writtenlen));
+}
+
+static void
+discardtup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
+{
if (stup->tuple)
{
FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
pfree(stup->tuple);
+ stup->tuple = NULL;
}
}
@@ -3524,4 +3736,5 @@ free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
{
FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
pfree(stup->tuple);
+ stup->tuple = NULL;
}
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 846c31a..34de503 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -913,6 +913,7 @@ typedef struct SortGroupClause
    Oid         sortop;         /* the ordering operator ('<' op), or 0 */
    bool        nulls_first;    /* do NULLs come before normal values? */
    bool        hashable;       /* can eqop be implemented by hashing? */
+ bool elideDuplicates;
} SortGroupClause;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 38c039c..78ccc72 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -586,6 +586,7 @@ typedef struct Sort
    Oid        *sortOperators;  /* OIDs of operators to sort them by */
Oid *collations; /* OIDs of collations */
bool *nullsFirst; /* NULLS FIRST/LAST directions */
+ bool elideDuplicates;
} Sort;
/* ---------------
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 05445f0..bdb73c6 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -62,7 +62,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int nkeys, AttrNumber *attNums,
                      Oid *sortOperators, Oid *sortCollations,
bool *nullsFirstFlags,
- int workMem, bool randomAccess);
+                      int workMem, bool randomAccess, bool elideDuplicates);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
int workMem, bool randomAccess);
@@ -77,7 +77,7 @@ extern Tuplesortstate *tuplesort_begin_index_hash(Relation
heapRel,
extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
Oid sortOperator, Oid sortCollation,
bool nullsFirstFlag,
- int workMem, bool randomAccess);
+                      int workMem, bool randomAccess, bool elideDuplicates);
extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);