Greetings, hackers:

I have worked up a patch to PostgreSQL which elides duplicate tuples during an external sort. The primary use case is when sorted input is being used to feed a DISTINCT operation. The idea is to throw out tuples that compare as identical whenever it's convenient, predicated on the assumption that even a single I/O is more expensive than some number of (potentially extra) comparisons. Obviously, this is where a cost model comes in, which has not been implemented. This patch is a work in progress.
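The eliding idea itself is simple enough to sketch outside of tuplesort.c. The following is a minimal standalone illustration, not code from the patch (the name write_run_dedup is mine): when a sorted run is written out, a tuple that compares equal to the previously written one can be dropped at the cost of a single comparison.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch: copy a sorted run to its destination (an output array here,
 * standing in for a tape), skipping any element equal to the last one
 * written.  One comparison per element buys a saved write per duplicate.
 */
static size_t
write_run_dedup(const int *sorted, size_t n, int *tape)
{
    size_t      written = 0;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (written > 0 && tape[written - 1] == sorted[i])
            continue;           /* elide: identical to last written value */
        tape[written++] = sorted[i];
    }
    return written;
}
```

With input {1, 1, 2, 3, 3}, only {1, 2, 3} reaches the tape.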
Being very new to hacking PostgreSQL, I've probably done a bunch of things wrong, but I've tried to test it thoroughly. A rough summary of the patch follows:

- A GUC variable enables or disables this capability.
- In nodeAgg.c, eliding duplicate tuples is enabled if the number of distinct columns is equal to the number of sort columns (and both are greater than zero).
- In createplan.c, eliding duplicate tuples is enabled if we are creating a unique plan which involves sorting first; ditto planner.c.
- All of the remaining changes are in tuplesort.c, and consist of:
  + A new macro, DISCARDTUP, and a new structure member, discardtup, are both defined and operate similarly to COMPARETUP, COPYTUP, etc.
  + In puttuple_common, when the state is TSS_BUILDRUNS, we *may* simply throw out the new tuple if it compares as identical to the tuple at the top of the heap. Since we're already performing this comparison, the elision is essentially free.
  + In mergeonerun, we may discard a tuple if it compares as identical to the *last written tuple*. This is a comparison that did not take place before, so it's not free, but it saves a write I/O.
  + We perform the same logic in dumptuples.

I have tested this patch with a bunch of different data; the results are summarized below.

Test 1: a corpus of approximately 1.5TB of textual data, consisting of 7.4 billion input rows, of which 12.6% are unique. This is a 'real-world' test.

  Before the patch: 44.25 hours using 274GB of temporary files
  After the patch:  36.44 hours using 125GB of temporary files
  Difference: -7.8 hours, 18% faster

Test 2: acquire the unique set of strings from a totally random set that contains no duplicates. This is a 'worst case'.

  Input: 700 million rows, 32GB
  Before: 18,796 seconds
  After:  19,358 seconds
  Difference: +562 seconds, *slower* by 3%

In this particular case, the performance varies from faster to slower across runs, and I'm not sure why. Statistical noise?
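Test 2's slowdown is consistent with a simple cost model: when every tuple is unique, the merge/dump paths pay one extra comparison per tuple and never save a write. A hedged sketch of that bookkeeping (plain ints in place of SortTuples; elide_cost_of is my name, not the patch's):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Back-of-envelope model of the worst case: scan a sorted run, counting
 * the extra comparisons the elide path performs (one per tuple after the
 * first) and the writes it would save (one per duplicate).  On
 * duplicate-free input the overhead is pure cost.
 */
struct elide_cost
{
    long        extra_comparisons;
    long        tuples_elided;
};

static struct elide_cost
elide_cost_of(const int *sorted, size_t n)
{
    struct elide_cost c = {0, 0};
    size_t      i;

    for (i = 1; i < n; i++)
    {
        c.extra_comparisons++;      /* compare against last written tuple */
        if (sorted[i] == sorted[i - 1])
            c.tuples_elided++;      /* duplicate: one write saved */
    }
    return c;
}
```

For unique input the elided count is zero while the comparison count grows with n, which is the overhead the worst-case test pays.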
Test 3: acquire the unique set of integers from a set of 100 million, all of which happen to have the value '1'. This is a 'best case'.

  Input: 100 million rows, 3.4GB on disk
  Before: 26.1 seconds using 1.3GB of temporary files
  After:  8.9 seconds using 8KB of disk
  Difference: -17.2 seconds, 66% faster

Test 4: similar to test 3, but using 1 billion rows.

  Before: 1450 seconds, 13GB of temporary files
  After:  468 seconds, 8KB of temporary files
  Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumbling about changes in pg_settings). Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of early September 2013. I have since rebased the patch against PostgreSQL 9.4 HEAD as of 19 January 2014 and re-run tests 2, 3, and 4, with similar results.

Regarding the patch: I've tried to conform to the indenting and style rules, including using pg_bsd_indent and pgindent, but I've not had 100% success.

The details on test 4, a simple example using 1 billion rows, all with the integer value '1' (a fabricated *best case*):
postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms

postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=1047226.447..1284922.616 rows=1000000000 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 13685248kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=29.832..450978.694 rows=1000000000 loops=1)
               Output: i
 Total runtime: 1449644.717 ms
(9 rows)

There are some issues that remain. For example, the following two queries behave very differently (they *are* different queries, of course).
Compare this:

postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(distinct i.i) from i ;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=894138.114..894138.115 rows=1 loops=1)
   Output: count(DISTINCT i)
   Buffers: shared hit=102 read=4424683, temp read=1466277 written=1466277
   ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.012..349142.072 rows=1000000000 loops=1)
         Output: i
         Buffers: shared hit=96 read=4424683
 Total runtime: 894138.233 ms
(7 rows)

Time: 894190.392 ms

to this (with this code path enabled):

postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(1) from (select distinct i.i from i) as sub ;
 Aggregate  (cost=182583418.28..182583418.29 rows=1 width=0) (actual time=451973.381..451973.417 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=192 read=4424587, temp read=1 written=1
   ->  Unique  (cost=177583418.27..182583418.27 rows=1 width=4) (actual time=451973.370..451973.372 rows=1 loops=1)
         Output: i.i
         Buffers: shared hit=192 read=4424587, temp read=1 written=1
         ->  Sort  (cost=177583418.27..180083418.27 rows=1000000000 width=4) (actual time=451973.366..451973.367 rows=1 loops=1)
               Output: i.i
               Sort Key: i.i
               Sort Method: external sort  Disk: 8kB
               Buffers: shared hit=192 read=4424587, temp read=1 written=1
               ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223587.173 rows=1000000000 loops=1)
                     Output: i.i
                     Buffers: shared hit=192 read=4424587
 Total runtime: 451994.583 ms

which also logged the following information:

LOG: tuplesort_end: opportunistically elided 999999999 tuples (1864133 extra comparisons): 0 during initial accrual, 998135866 during build, 0 during merge, 1864133 during dump

For comparison purposes, here is
the same query but with hash aggregation enabled:

 Aggregate  (cost=16924779.02..16924779.03 rows=1 width=0) (actual time=559379.525..559379.525 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=224 read=4424555
   ->  HashAggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=559379.513..559379.513 rows=1 loops=1)
         Output: i.i
         Group Key: i.i
         Buffers: shared hit=224 read=4424555
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223704.468 rows=1000000000 loops=1)
               Output: i.i
               Buffers: shared hit=224 read=4424555
 Total runtime: 559379.608 ms
(11 rows)

A note: this work was supported by my employer, Renesys Corporation ( http://www.renesys.com/ ). The patch was formatted using 'unified' diff, per http://wiki.postgresql.org/wiki/Submitting_a_Patch

--
Jon
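To make the diff below easier to follow, here is a simplified model of the bookkeeping shared by mergeonerun and dumptuples: remember the last tuple written to the destination tape (last_dumped_tup/have_tup in the patch) and skip any new tuple that compares equal to it. This is my own illustrative C, with ints standing in for SortTuples and the counters mirroring tuplesElided/extraComparisons:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model of the merge/dump elide logic: the first tuple is
 * always written; thereafter each candidate costs one extra comparison
 * against the last written tuple, and equal tuples are dropped instead
 * of written.
 */
struct elide_state
{
    int         last_written;
    int         have_tup;           /* is last_written valid yet? */
    long        tuples_elided;
    long        extra_comparisons;
};

static size_t
merge_write(struct elide_state *st, const int *sorted, size_t n, int *tape)
{
    size_t      written = 0;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (!st->have_tup)
        {
            tape[written++] = sorted[i];    /* first tuple: always write */
            st->last_written = sorted[i];
            st->have_tup = 1;
            continue;
        }
        st->extra_comparisons++;
        if (sorted[i] == st->last_written)
        {
            st->tuples_elided++;            /* duplicate: one write saved */
            continue;
        }
        tape[written++] = sorted[i];
        st->last_written = sorted[i];
    }
    return written;
}
```

On {1, 1, 1, 2, 2, 3} this writes three tuples, elides three, and performs five extra comparisons, matching the trade described above.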
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 8eae43d..cfdab17 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -2701,7 +2701,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot) state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator, InvalidOid, false, maintenance_work_mem, - false); + false, false); state.htups = state.itups = state.tups_inserted = 0; (void) index_bulk_delete(&ivinfo, NULL, diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 7e4bca5..2c09dec 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -115,6 +115,7 @@ #include "utils/tuplesort.h" #include "utils/datum.h" +extern bool trace_sort; /* * AggStatePerAggData - per-aggregate working state for the Agg scan @@ -343,6 +344,15 @@ initialize_aggregates(AggState *aggstate, */ if (peraggstate->numSortCols > 0) { + bool elideDuplicates = (peraggstate->numDistinctCols == peraggstate->numSortCols) && peraggstate->numDistinctCols > 0; + +#ifdef TRACE_SORT + if (trace_sort) + + elog(LOG, "initialize_aggregates: elideDuplicates = %s, numDistinctCols = %d, numSortCols = %d", + elideDuplicates ? "true" : "false", peraggstate->numDistinctCols, peraggstate->numSortCols); +#endif + /* * In case of rescan, maybe there could be an uncompleted sort * operation? Clean it up if so. 
@@ -361,14 +371,14 @@ initialize_aggregates(AggState *aggstate, peraggstate->sortOperators[0], peraggstate->sortCollations[0], peraggstate->sortNullsFirst[0], - work_mem, false) : + work_mem, false, elideDuplicates) : tuplesort_begin_heap(peraggstate->evaldesc, peraggstate->numSortCols, peraggstate->sortColIdx, peraggstate->sortOperators, peraggstate->sortCollations, peraggstate->sortNullsFirst, - work_mem, false); + work_mem, false, elideDuplicates); } /* diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c index b88571b..3afe8b3 100644 --- a/src/backend/executor/nodeSort.c +++ b/src/backend/executor/nodeSort.c @@ -89,7 +89,8 @@ ExecSort(SortState *node) plannode->collations, plannode->nullsFirst, work_mem, - node->randomAccess); + node->randomAccess, + plannode->elideDuplicates); if (node->bounded) tuplesort_set_bound(tuplesortstate, node->bound); node->tuplesortstate = (void *) tuplesortstate; diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 184d37a..1b5aaee 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -1063,6 +1063,11 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path) groupColPos++; } plan = (Plan *) make_sort_from_sortclauses(root, sortList, subplan); + { + Sort *s = (Sort *) plan; + + s->elideDuplicates = true; + } plan = (Plan *) make_unique(plan, sortList); } @@ -3763,6 +3768,7 @@ make_sort(PlannerInfo *root, Plan *lefttree, int numCols, node->sortOperators = sortOperators; node->collations = collations; node->nullsFirst = nullsFirst; + node->elideDuplicates = false; return node; } diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 35bda67..fe01511 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -1889,6 +1889,11 @@ grouping_planner(PlannerInfo *root, double tuple_fraction) result_plan, current_pathkeys, -1.0); + { + Sort 
*s = (Sort *) result_plan; + + s->elideDuplicates = true; + } } result_plan = (Plan *) make_unique(result_plan, diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c index e5324e2..724f894 100644 --- a/src/backend/utils/adt/orderedsetaggs.c +++ b/src/backend/utils/adt/orderedsetaggs.c @@ -280,13 +280,13 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples) qstate->sortOperators, qstate->sortCollations, qstate->sortNullsFirsts, - work_mem, false); + work_mem, false, false); else osastate->sortstate = tuplesort_begin_datum(qstate->sortColType, qstate->sortOperator, qstate->sortCollation, qstate->sortNullsFirst, - work_mem, false); + work_mem, false, false); osastate->number_of_rows = 0; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 1217098..f5d616d 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -130,6 +130,7 @@ extern char *SSLCipherSuites; extern char *SSLECDHCurve; extern bool SSLPreferServerCiphers; +extern bool enable_opportunistic_tuple_elide; #ifdef TRACE_SORT extern bool trace_sort; #endif @@ -733,6 +734,15 @@ static struct config_bool ConfigureNamesBool[] = NULL, NULL, NULL }, { + {"enable_opportunistic_tuple_elide", PGC_USERSET, QUERY_TUNING_METHOD, + gettext_noop("Enables the sort code to opportunistically elide tuples."), + NULL + }, + &enable_opportunistic_tuple_elide, + false, + NULL, NULL, NULL + }, + { {"enable_material", PGC_USERSET, QUERY_TUNING_METHOD, gettext_noop("Enables the planner's use of materialization."), NULL diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 8b520c1..047ac16 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -131,6 +131,7 @@ bool trace_sort = false; #ifdef DEBUG_BOUNDED_SORT bool optimize_bounded_sort = true; #endif +bool enable_opportunistic_tuple_elide = false; /* @@ -209,6 +210,17 @@ struct Tuplesortstate bool randomAccess; 
/* did caller request random access? */ bool bounded; /* did caller specify a maximum number of * tuples to return? */ + + int elideDuplicates; + int64 tuplesElided; + int64 tuplesElidedDuringInitial; + int64 tuplesElidedDuringBuild; + int64 tuplesElidedDuringMerge; + int64 tuplesElidedDuringDump; + int64 extraComparisons; + bool have_tup; + SortTuple last_dumped_tup; + bool boundUsed; /* true if we made use of a bounded heap */ int bound; /* if bounded, the maximum number of tuples */ int64 availMem; /* remaining memory available, in bytes */ @@ -248,6 +260,15 @@ struct Tuplesortstate SortTuple *stup); /* + * Function to discard a previously-stored tuple. pfree() the out-of-line + * data (not the SortTuple struct!), and increase state->availMem by the + * amount of memory space thereby released. + */ + void (*discardtup) (Tuplesortstate *state, int tapenum, + SortTuple *stup); + + + /* * Function to read a stored tuple from tape back into memory. 'len' is * the already-read length of the stored tuple. 
Create a palloc'd copy, * initialize tuple/datum1/isnull1 in the target SortTuple struct, and @@ -394,6 +415,7 @@ struct Tuplesortstate #define COMPARETUP(state,a,b) ((*(state)->comparetup) (a, b, state)) #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup)) #define WRITETUP(state,tape,stup) ((*(state)->writetup) (state, tape, stup)) +#define DISCARDTUP(state,tape,stup) ((*(state)->discardtup) (state, tape, stup)) #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len)) #define REVERSEDIRECTION(state) ((*(state)->reversedirection) (state)) #define LACKMEM(state) ((state)->availMem < 0) @@ -449,7 +471,7 @@ struct Tuplesortstate } while(0) -static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess); +static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess, bool elideDuplicates); static void puttuple_common(Tuplesortstate *state, SortTuple *tuple); static void inittapes(Tuplesortstate *state); static void selectnewtape(Tuplesortstate *state); @@ -471,6 +493,8 @@ static int comparetup_heap(const SortTuple *a, const SortTuple *b, static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup); +static void discardtup_heap(Tuplesortstate *state, int tapenum, + SortTuple *stup); static void readtup_heap(Tuplesortstate *state, SortTuple *stup, int tapenum, unsigned int len); static void reversedirection_heap(Tuplesortstate *state); @@ -479,6 +503,8 @@ static int comparetup_cluster(const SortTuple *a, const SortTuple *b, static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup); +static void discardtup_cluster(Tuplesortstate *state, int tapenum, + SortTuple *stup); static void readtup_cluster(Tuplesortstate *state, SortTuple *stup, int tapenum, unsigned int len); static int 
comparetup_index_btree(const SortTuple *a, const SortTuple *b, @@ -488,6 +514,8 @@ static int comparetup_index_hash(const SortTuple *a, const SortTuple *b, static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup); +static void discardtup_index(Tuplesortstate *state, int tapenum, + SortTuple *stup); static void readtup_index(Tuplesortstate *state, SortTuple *stup, int tapenum, unsigned int len); static void reversedirection_index_btree(Tuplesortstate *state); @@ -497,6 +525,8 @@ static int comparetup_datum(const SortTuple *a, const SortTuple *b, static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup); static void writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup); +static void discardtup_datum(Tuplesortstate *state, int tapenum, + SortTuple *stup); static void readtup_datum(Tuplesortstate *state, SortTuple *stup, int tapenum, unsigned int len); static void reversedirection_datum(Tuplesortstate *state); @@ -532,7 +562,7 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup); */ static Tuplesortstate * -tuplesort_begin_common(int workMem, bool randomAccess) +tuplesort_begin_common(int workMem, bool randomAccess, bool elideDuplicates) { Tuplesortstate *state; MemoryContext sortcontext; @@ -564,6 +594,28 @@ tuplesort_begin_common(int workMem, bool randomAccess) state->status = TSS_INITIAL; state->randomAccess = randomAccess; state->bounded = false; + + if (!enable_opportunistic_tuple_elide && elideDuplicates) + { + elog(LOG, "tuplesort_begin_common: enable_opportunistic_tuple_elide is false but elideDuplicates is true. Changing to false."); + state->elideDuplicates = false; + } + else + { + state->elideDuplicates = elideDuplicates; + if (trace_sort) + { + elog(LOG, "tuplesort_begin_common: elideDuplicates: %s", elideDuplicates ? 
"true" : "false"); + } + } + state->tuplesElided = 0; + state->tuplesElidedDuringInitial = 0; + state->tuplesElidedDuringBuild = 0; + state->tuplesElidedDuringMerge = 0; + state->tuplesElidedDuringDump = 0; + state->extraComparisons = 0; + state->have_tup = false; + state->boundUsed = false; state->allowedMem = workMem * (int64) 1024; state->availMem = state->allowedMem; @@ -600,9 +652,9 @@ tuplesort_begin_heap(TupleDesc tupDesc, int nkeys, AttrNumber *attNums, Oid *sortOperators, Oid *sortCollations, bool *nullsFirstFlags, - int workMem, bool randomAccess) + int workMem, bool randomAccess, bool elideDuplicates) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, elideDuplicates); MemoryContext oldcontext; int i; @@ -613,8 +665,8 @@ tuplesort_begin_heap(TupleDesc tupDesc, #ifdef TRACE_SORT if (trace_sort) elog(LOG, - "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c", - nkeys, workMem, randomAccess ? 't' : 'f'); + "tuplesort_begin_heap: nkeys = %d, workMem = %d, randomAccess = %c, elideDuplicates = %s", + nkeys, workMem, randomAccess ? 't' : 'f', state->elideDuplicates ? 
"true" : "false"); #endif state->nKeys = nkeys; @@ -628,6 +680,7 @@ tuplesort_begin_heap(TupleDesc tupDesc, state->comparetup = comparetup_heap; state->copytup = copytup_heap; state->writetup = writetup_heap; + state->discardtup = discardtup_heap; state->readtup = readtup_heap; state->reversedirection = reversedirection_heap; @@ -664,7 +717,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc, Relation indexRel, int workMem, bool randomAccess) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false); MemoryContext oldcontext; Assert(indexRel->rd_rel->relam == BTREE_AM_OID); @@ -674,7 +727,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc, #ifdef TRACE_SORT if (trace_sort) elog(LOG, - "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c", + "tuplesort_begin_cluster: nkeys = %d, workMem = %d, randomAccess = %c", RelationGetNumberOfAttributes(indexRel), workMem, randomAccess ? 't' : 'f'); #endif @@ -690,6 +743,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc, state->comparetup = comparetup_cluster; state->copytup = copytup_cluster; state->writetup = writetup_cluster; + state->discardtup = discardtup_cluster; state->readtup = readtup_cluster; state->reversedirection = reversedirection_index_btree; @@ -726,7 +780,7 @@ tuplesort_begin_index_btree(Relation heapRel, bool enforceUnique, int workMem, bool randomAccess) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false); MemoryContext oldcontext; oldcontext = MemoryContextSwitchTo(state->sortcontext); @@ -750,6 +804,7 @@ tuplesort_begin_index_btree(Relation heapRel, state->comparetup = comparetup_index_btree; state->copytup = copytup_index; state->writetup = writetup_index; + state->discardtup = discardtup_index; state->readtup = readtup_index; state->reversedirection = reversedirection_index_btree; @@ -769,7 +824,7 @@ 
tuplesort_begin_index_hash(Relation heapRel, uint32 hash_mask, int workMem, bool randomAccess) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, false); MemoryContext oldcontext; oldcontext = MemoryContextSwitchTo(state->sortcontext); @@ -787,6 +842,7 @@ tuplesort_begin_index_hash(Relation heapRel, state->comparetup = comparetup_index_hash; state->copytup = copytup_index; state->writetup = writetup_index; + state->discardtup = discardtup_index; state->readtup = readtup_index; state->reversedirection = reversedirection_index_hash; @@ -803,9 +859,9 @@ tuplesort_begin_index_hash(Relation heapRel, Tuplesortstate * tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, bool nullsFirstFlag, - int workMem, bool randomAccess) + int workMem, bool randomAccess, bool elideDuplicates) { - Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess); + Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess, elideDuplicates); MemoryContext oldcontext; int16 typlen; bool typbyval; @@ -815,8 +871,8 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, #ifdef TRACE_SORT if (trace_sort) elog(LOG, - "begin datum sort: workMem = %d, randomAccess = %c", - workMem, randomAccess ? 't' : 'f'); + "tuplesort_begin_datum: workMem = %d, randomAccess = %c, elideDuplicates = %s", + workMem, randomAccess ? 't' : 'f', state->elideDuplicates ? 
"true" : "false"); #endif state->nKeys = 1; /* always a one-column sort */ @@ -830,6 +886,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, state->comparetup = comparetup_datum; state->copytup = copytup_datum; state->writetup = writetup_datum; + state->discardtup = discardtup_datum; state->readtup = readtup_datum; state->reversedirection = reversedirection_datum; @@ -886,6 +943,12 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound) state->bounded = true; state->bound = (int) bound; + + if (state->elideDuplicates) + { + elog(LOG, "tuplesort_set_bound: elideDuplicates is true but this is a bounded sort. Changing to false."); + state->elideDuplicates = false; + } } /* @@ -930,6 +993,16 @@ tuplesort_end(Tuplesortstate *state) else elog(LOG, "internal sort ended, %ld KB used: %s", spaceUsed, pg_rusage_show(&state->ru_start)); + if (state->elideDuplicates) + { + elog(LOG, "tuplesort_end: opportunistically elided %ld tuples " \ + "(%ld extra comparisons): %ld during initial accrual, %ld during build, %ld during merge, %ld during dump", + state->tuplesElided, state->extraComparisons, + state->tuplesElidedDuringInitial, + state->tuplesElidedDuringBuild, + state->tuplesElidedDuringMerge, + state->tuplesElidedDuringDump); + } } TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed); @@ -1297,10 +1370,25 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple) * this point; see dumptuples. 
*/ Assert(state->memtupcount > 0); - if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0) - tuplesort_heap_insert(state, tuple, state->currentRun, true); - else - tuplesort_heap_insert(state, tuple, state->currentRun + 1, true); + + { + int ret = COMPARETUP(state, tuple, &state->memtuples[0]); + + if (ret == 0 && state->elideDuplicates) + { + free_sort_tuple(state, tuple); + state->tuplesElided++; + state->tuplesElidedDuringBuild++; + } + else if (ret >= 0) + { + tuplesort_heap_insert(state, tuple, state->currentRun, true); + } + else + { + tuplesort_heap_insert(state, tuple, state->currentRun + 1, true); + } + } /* * If we are over the memory limit, dump tuples till we're under. @@ -1403,6 +1491,10 @@ tuplesort_performsort(Tuplesortstate *state) pg_rusage_show(&state->ru_start)); } #endif + if (trace_sort && state->elideDuplicates) + { + elog(LOG, "performsort: opportunistically elided %ld tuples.", state->tuplesElided); + } MemoryContextSwitchTo(oldcontext); } @@ -1847,6 +1939,8 @@ inittapes(Tuplesortstate *state) if (trace_sort) elog(LOG, "switching to external sort with %d tapes: %s", maxTapes, pg_rusage_show(&state->ru_start)); + if (state->elideDuplicates) + elog(LOG, "inittapes: opportunistically elided %ld tuples so far.", state->tuplesElided); #endif /* @@ -2124,7 +2218,39 @@ mergeonerun(Tuplesortstate *state) /* write the tuple to destTape */ priorAvail = state->availMem; srcTape = state->memtuples[0].tupindex; - WRITETUP(state, destTape, &state->memtuples[0]); + + if (state->elideDuplicates && !state->have_tup) + { + WRITETUP(state, destTape, &state->memtuples[0]); + state->last_dumped_tup = state->memtuples[0]; + /* don't discard it */ + state->have_tup = true; + } + else + { /* have_tup || !elide duplicates */ + if (state->elideDuplicates) + { + if (COMPARETUP(state, &state->last_dumped_tup, &state->memtuples[0]) == 0) + { + DISCARDTUP(state, destTape, &state->memtuples[0]); + state->tuplesElided++; + state->tuplesElidedDuringMerge++; + } + else + 
{ + WRITETUP(state, destTape, &state->memtuples[0]); + DISCARDTUP(state, destTape, &state->last_dumped_tup); + state->last_dumped_tup = state->memtuples[0]; + } + state->extraComparisons++; + } + else + { + WRITETUP(state, destTape, &state->memtuples[0]); + DISCARDTUP(state, destTape, &state->memtuples[0]); + } + } + /* writetup adjusted total free space, now fix per-tape space */ spaceFreed = state->availMem - priorAvail; state->mergeavailmem[srcTape] += spaceFreed; @@ -2150,6 +2276,12 @@ mergeonerun(Tuplesortstate *state) state->mergeavailslots[srcTape]++; } + if (state->elideDuplicates && state->have_tup) + { + DISCARDTUP(state, state->tp_tapenum[state->destTape], &state->last_dumped_tup); + state->have_tup = false; + } + /* * When the heap empties, we're done. Write an end-of-run marker on the * output tape, and increment its count of real runs. @@ -2159,8 +2291,14 @@ mergeonerun(Tuplesortstate *state) #ifdef TRACE_SORT if (trace_sort) + { elog(LOG, "finished %d-way merge step: %s", state->activeTapes, pg_rusage_show(&state->ru_start)); + if (state->elideDuplicates) + { + elog(LOG, "mergeonerun: opportunistically elided %ld tuples.", state->tuplesElided); + } + } #endif } @@ -2370,9 +2508,42 @@ dumptuples(Tuplesortstate *state, bool alltuples) * Dump the heap's frontmost entry, and sift up to remove it from the * heap. 
*/ + int destTape = state->tp_tapenum[state->destTape]; + Assert(state->memtupcount > 0); - WRITETUP(state, state->tp_tapenum[state->destTape], - &state->memtuples[0]); + + if (state->elideDuplicates && !state->have_tup) + { + WRITETUP(state, destTape, &state->memtuples[0]); + state->last_dumped_tup = state->memtuples[0]; + /* don't discard it */ + state->have_tup = true; + } + else + { /* have_tup || !elide duplicates */ + if (state->elideDuplicates) + { + if (COMPARETUP(state, &state->last_dumped_tup, &state->memtuples[0]) == 0) + { + DISCARDTUP(state, destTape, &state->memtuples[0]); + state->tuplesElided++; + state->tuplesElidedDuringDump++; + } + else + { + WRITETUP(state, destTape, &state->memtuples[0]); + DISCARDTUP(state, destTape, &state->last_dumped_tup); + state->last_dumped_tup = state->memtuples[0]; + } + state->extraComparisons++; + } + else + { + WRITETUP(state, destTape, &state->memtuples[0]); + DISCARDTUP(state, destTape, &state->memtuples[0]); + } + } + tuplesort_heap_siftup(state, true); /* @@ -2389,10 +2560,16 @@ dumptuples(Tuplesortstate *state, bool alltuples) #ifdef TRACE_SORT if (trace_sort) + { elog(LOG, "finished writing%s run %d to tape %d: %s", (state->memtupcount == 0) ? " final" : "", state->currentRun, state->destTape, pg_rusage_show(&state->ru_start)); + if (state->elideDuplicates) + { + elog(LOG, "dumptuples: opportunistically elided %ld tuples.", state->tuplesElided); + } + } #endif /* @@ -2404,6 +2581,16 @@ dumptuples(Tuplesortstate *state, bool alltuples) selectnewtape(state); } } + + if (alltuples) + { + /* last run */ + if (state->elideDuplicates && state->have_tup) + { + DISCARDTUP(state, state->tp_tapenum[state->destTape], &state->last_dumped_tup); + state->have_tup = false; + } + } } /* @@ -2935,6 +3122,12 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup) if (state->randomAccess) /* need trailing length word? 
*/ LogicalTapeWrite(state->tapeset, tapenum, (void *) &tuplen, sizeof(tuplen)); +} + +static void +discardtup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup) +{ + MinimalTuple tuple = (MinimalTuple) stup->tuple; FREEMEM(state, GetMemoryChunkSpace(tuple)); heap_free_minimal_tuple(tuple); @@ -3123,6 +3316,13 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup) if (state->randomAccess) /* need trailing length word? */ LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen)); +} + + +static void +discardtup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup) +{ + HeapTuple tuple = (HeapTuple) stup->tuple; FREEMEM(state, GetMemoryChunkSpace(tuple)); heap_freetuple(tuple); @@ -3365,6 +3565,12 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup) if (state->randomAccess) /* need trailing length word? */ LogicalTapeWrite(state->tapeset, tapenum, (void *) &tuplen, sizeof(tuplen)); +} + +static void +discardtup_index(Tuplesortstate *state, int tapenum, SortTuple *stup) +{ + IndexTuple tuple = (IndexTuple) stup->tuple; FREEMEM(state, GetMemoryChunkSpace(tuple)); pfree(tuple); @@ -3463,11 +3669,17 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup) if (state->randomAccess) /* need trailing length word? 
*/ LogicalTapeWrite(state->tapeset, tapenum, (void *) &writtenlen, sizeof(writtenlen)); +} + +static void +discardtup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup) +{ if (stup->tuple) { FREEMEM(state, GetMemoryChunkSpace(stup->tuple)); pfree(stup->tuple); + stup->tuple = NULL; } } @@ -3524,4 +3736,5 @@ free_sort_tuple(Tuplesortstate *state, SortTuple *stup) { FREEMEM(state, GetMemoryChunkSpace(stup->tuple)); pfree(stup->tuple); + stup->tuple = NULL; } diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 846c31a..34de503 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -913,6 +913,7 @@ typedef struct SortGroupClause Oid sortop; /* the ordering operator ('<' op), or 0 */ bool nulls_first; /* do NULLs come before normal values? */ bool hashable; /* can eqop be implemented by hashing? */ + bool elideDuplicates; } SortGroupClause; /* diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index 38c039c..78ccc72 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -586,6 +586,7 @@ typedef struct Sort Oid *sortOperators; /* OIDs of operators to sort them by */ Oid *collations; /* OIDs of collations */ bool *nullsFirst; /* NULLS FIRST/LAST directions */ + bool elideDuplicates; } Sort; /* --------------- diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h index 05445f0..bdb73c6 100644 --- a/src/include/utils/tuplesort.h +++ b/src/include/utils/tuplesort.h @@ -62,7 +62,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc, int nkeys, AttrNumber *attNums, Oid *sortOperators, Oid *sortCollations, bool *nullsFirstFlags, - int workMem, bool randomAccess); + int workMem, bool randomAccess, bool elideDuplicates); extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc, Relation indexRel, int workMem, bool randomAccess); @@ -77,7 +77,7 @@ extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel, extern 
Tuplesortstate *tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation, bool nullsFirstFlag, - int workMem, bool randomAccess); + int workMem, bool randomAccess, bool elideDuplicates); extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers