Hi, I am starting a new thread that continues with the following point that was discussed in [1] ....
On Fri, 17 Jul 2020 at 09:05, Thomas Munro <thomas.mu...@gmail.com> wrote: > > On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty > <soumyadeep2...@gmail.com> wrote: > > Do you mean that we should have an implementation for > > get_minimal_tuple() for the heap AM and have it return a pointer to the > > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as > > tqueueReceiveSlot() will ensure that the heap tuple from which it wants > > to extract the minimal tuple was allocated in the tuple queue in the > > first place? If we consider that the node directly below a gather is a > > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the > > ss_ScanTupleSlot to point to memory in the shared tuple queue? > > Similarly, other ExecInit*() methods can do the same for other executor > > nodes that involve parallelism? Of course, things would be slightly > > different for > > the other use cases you mentioned (such as hash table population) > > What I mean is that where ExecHashTableInsert() and > tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a > freshly allocated copy, and then you copy that again, and free it. > There may be something similar going on in tuplestore and sort code. > Perhaps we could have something like > ExecFetchSlotMinimalTupleInPlace(slot, output_buffer, > output_buffer_size) that returns a value that indicates either success > or hey-that-buffer's-too-small-I-need-N-bytes, or something like that. > That silly extra copy is something Andres pointed out to me in some > perf results involving TPCH hash joins, a couple of years ago. I went ahead and tried doing this. I chose an approach where we can return the pointer to the in-place minimal tuple data if it's a heap/buffer/minimal tuple slot. A new function ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data. If it's neither heap, buffer or minimal tuple, it returns a copy as usual. The receiver should not assume the data is directly taken from MinimalTupleData, so it should set it's t_len to the number of bytes returned. Patch attached (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch) Thomas, I guess you had a different approach in mind when you said about "returning either success or hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to me is that avoiding the copy shows consistent improvement of 4 to 10% for simple parallel table scans. I tried my patch on both x86_64 and arm64, and observed this speedup on both. create table tab as select generate_series(1, 20000000) id, 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v; select pg_prewarm('tab'::regclass); explain analyze select * from tab where id %2 = 0; Times in milli-secs : HEAD : 1833.119 1816.522 1875.648 1872.153 1834.383 Patch'ed : 1763.786 1721.160 1714.665 1719.738 1704.478 This was with the default 2 parallel workers. With 3 or 4 workers, for the above testcase I didn't see a noticeable difference. I think, if I double the number of rows, the difference will be noticeable. In any case, the gain would go on reducing with the number of workers, because the tuple copy also gets parallelized. In some scenarios, parallel_leader_participation=off causes the difference to amplify. Haven't had a chance to see if this helps any of the TPC-H queries. Also attached is a patch guc_for_testing.patch that I used for testing the gain. This patch is only for testing. Without this, in order to compare the performance figures it requires server restart, and the figures anyway shift back and forth by 5-15 percent after each restart, which creates lot of noise when comparing figures with and without fix. Use this GUC enable_fix to enable/disable the fix. [1] https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com -- Thanks, -Amit Khandekar Huawei Technologies
From 5d19626e35e50a5630e5f1a042f7ecee6acb7c70 Mon Sep 17 00:00:00 2001 From: Amit Khandekar <amitdkhan...@gmail.com> Date: Wed, 9 Sep 2020 12:03:01 +0800 Subject: [PATCH 2/2] Add guc for ease of testing speedup. This is only for testing the performance gain. Otherwise, to compare the performance figures, it requires server restart, and the figures anyway shift back and forth by 5-15 percent after each restart, which creates lot of noise when comparing figures with and without fix. With this, we can easily see at least 4-10% difference in execution times by setting/unsetting the GUC enable_fix. --- src/backend/executor/tqueue.c | 12 ++++++++++++ src/backend/utils/misc/guc.c | 12 ++++++++++++ 2 files changed, 24 insertions(+) diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c index c70ee0f39a..7dd60b5c7e 100644 --- a/src/backend/executor/tqueue.c +++ b/src/backend/executor/tqueue.c @@ -45,6 +45,7 @@ struct TupleQueueReader shm_mq_handle *queue; /* shm_mq to receive from */ }; +extern bool enable_fix; /* * Receive a tuple from a query, and send it to the designated shm_mq. * @@ -60,12 +61,23 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self) bool should_free; /* Send the minimal tuple data. */ + if (enable_fix) + { minimal_tuple_data = ExecFetchSlotMinimalTupleData(slot, &len, &should_free); result = shm_mq_send(tqueue->queue, len, minimal_tuple_data, false); if (should_free) pfree(minimal_tuple_data); + } + else + { + MinimalTuple tuple; + tuple = ExecFetchSlotMinimalTuple(slot, &should_free); + result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false); + if (should_free) + pfree(tuple); + } /* Check for failure. */ if (result == SHM_MQ_DETACHED) return false; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index de87ad6ef7..8b3f1339cb 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -138,6 +138,8 @@ extern bool trace_syncscan; extern bool optimize_bounded_sort; #endif +bool enable_fix = true; + static int GUC_check_errcode_value; /* global variables for check hook support */ @@ -2036,6 +2038,16 @@ static struct config_bool ConfigureNamesBool[] = NULL, NULL, NULL }, + { + {"enable_fix", PGC_USERSET, RESOURCES_ASYNCHRONOUS, + gettext_noop("dummy"), + gettext_noop("dummy"), + GUC_EXPLAIN + }, + &enable_fix, + true, + NULL, NULL, NULL + }, /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL -- 2.17.1
From 286310e6a52af7883d20899163aaa349b479d614 Mon Sep 17 00:00:00 2001 From: Amit Khandekar <amitdkhan...@gmail.com> Date: Wed, 9 Sep 2020 11:52:59 +0800 Subject: [PATCH 1/2] Avoid redundant tuple copy while sending tuples to Gather If the tuple to be sent is a heap tuple, get the pointer to the minimal tuple data from the heap tuple, and use this as the source data for shm_mq_send(). This allows us to prevent a tuple copy when the slot is a heap tuple slot. Device a new function ExecFetchSlotMinimalTupleData() that tries to send an in-place minimal tuple data or returns a copy only if the slot is something other than a heap tuple or minimal tuple slot. This shows between 4 to 10% speed up for simple non-aggregate parallel queries. --- src/backend/executor/execTuples.c | 59 +++++++++++++++++++++++++++++++ src/backend/executor/tqueue.c | 17 +++++---- src/include/executor/tuptable.h | 2 ++ 3 files changed, 71 insertions(+), 7 deletions(-) diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c index 4c90ac5236..55a6652b4e 100644 --- a/src/backend/executor/execTuples.c +++ b/src/backend/executor/execTuples.c @@ -1682,6 +1682,65 @@ ExecFetchSlotMinimalTuple(TupleTableSlot *slot, } } +/* -------------------------------- + * ExecFetchSlotMinimalTupleData + * Return minimal tuple data, or the data belonging to minimal tuple + * section if it's a heap tuple slot. + * + * This function is similar to ExecFetchSlotMinimalTuple(), but it goes a + * step further in trying to avoid redundant tuple copy. It does this by + * returning the in-place minimal tuple data region if it's a heap tuple. + * If the slot is neither a minimal tuple nor heap tuple slot, a minimal + * tuple copy is returned, and should_free is set to true. Callers can use + * this if they only want the underlying minimal tuple data. + * One use is where minimal tuple data is sent through the gather queues. + * The receiver end can then treat the data as a MinimalTupleData, but + * they should update it's t_len field, because the data might have + * originally belonged to a heap tuple rather than minimal tuple. + */ +char * +ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len, + bool *shouldFree) +{ + /* + * sanity checks + */ + Assert(slot != NULL); + Assert(!TTS_EMPTY(slot)); + + if (slot->tts_ops->get_heap_tuple) + { + HeapTuple htuple; + + if (shouldFree) + *shouldFree = false; + htuple = slot->tts_ops->get_heap_tuple(slot); + *len = htuple->t_len - MINIMAL_TUPLE_OFFSET; + + return (char*) htuple->t_data + MINIMAL_TUPLE_OFFSET; + } + else + { + MinimalTuple mtuple; + + if (slot->tts_ops->get_minimal_tuple) + { + if (shouldFree) + *shouldFree = false; + mtuple = slot->tts_ops->get_minimal_tuple(slot); + } + else + { + if (shouldFree) + *shouldFree = true; + mtuple = slot->tts_ops->copy_minimal_tuple(slot); + } + *len = mtuple->t_len; + + return (char *) mtuple; + } +} + /* -------------------------------- * ExecFetchSlotHeapTupleDatum * Fetch the slot's tuple as a composite-type Datum. diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c index 30a264ebea..c70ee0f39a 100644 --- a/src/backend/executor/tqueue.c +++ b/src/backend/executor/tqueue.c @@ -54,16 +54,17 @@ static bool tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self) { TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self; - MinimalTuple tuple; + char *minimal_tuple_data; + uint32 len; shm_mq_result result; bool should_free; - /* Send the tuple itself. */ - tuple = ExecFetchSlotMinimalTuple(slot, &should_free); - result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false); + /* Send the minimal tuple data. */ + minimal_tuple_data = ExecFetchSlotMinimalTupleData(slot, &len, &should_free); + result = shm_mq_send(tqueue->queue, len, minimal_tuple_data, false); if (should_free) - pfree(tuple); + pfree(minimal_tuple_data); /* Check for failure. */ if (result == SHM_MQ_DETACHED) @@ -201,10 +202,12 @@ TupleQueueReaderNext(TupleQueueReader *reader, bool nowait, bool *done) /* * Return a pointer to the queue memory directly (which had better be - * sufficiently aligned). + * sufficiently aligned). Also, the sender might not have updated the t_len + * if the data belonged to a heap tuple rather than a minimal tuple. So + * update it now. */ tuple = (MinimalTuple) data; - Assert(tuple->t_len == nbytes); + tuple->t_len = nbytes; return tuple; } diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h index f7df70b5ab..da78929035 100644 --- a/src/include/executor/tuptable.h +++ b/src/include/executor/tuptable.h @@ -325,6 +325,8 @@ extern void ExecStoreHeapTupleDatum(Datum data, TupleTableSlot *slot); extern HeapTuple ExecFetchSlotHeapTuple(TupleTableSlot *slot, bool materialize, bool *shouldFree); extern MinimalTuple ExecFetchSlotMinimalTuple(TupleTableSlot *slot, bool *shouldFree); +extern char *ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len, + bool *shouldFree); extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot); extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum, int lastAttNum); -- 2.17.1