Redundant tuple copy in tqueueReceiveSlot()

Amit Khandekar Tue, 08 Sep 2020 22:24:32 -0700

Hi,

I am starting a new thread that continues with the following point
that was discussed in [1] ....

On Fri, 17 Jul 2020 at 09:05, Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty
> <soumyadeep2...@gmail.com> wrote:
> > Do you mean that we should have an implementation for
> > get_minimal_tuple() for the heap AM and have it return a pointer to the
> > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> > tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> > to extract the minimal tuple was allocated in the tuple queue in the
> > first place? If we consider that the node directly below a gather is a
> > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
> > ss_ScanTupleSlot to point to memory in the shared tuple queue?
> > Similarly, other ExecInit*() methods can do the same for other executor
> > nodes that involve parallelism? Of course, things would be slightly
> > different for
> > the other use cases you mentioned (such as hash table population)
>
> What I mean is that where ExecHashTableInsert() and
> tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a
> freshly allocated copy, and then you copy that again, and free it.
> There may be something similar going on in tuplestore and sort code.
> Perhaps we could have something like
> ExecFetchSlotMinimalTupleInPlace(slot, output_buffer,
> output_buffer_size) that returns a value that indicates either success
> or hey-that-buffer's-too-small-I-need-N-bytes, or something like that.
> That silly extra copy is something Andres pointed out to me in some
> perf results involving TPCH hash joins, a couple of years ago.

I went ahead and tried doing this. I chose an approach where we can
return the pointer to the in-place minimal tuple data if it's a
heap/buffer/minimal tuple slot. A new function
ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
If it's neither heap, buffer or minimal tuple, it returns a copy as
usual. The receiver should not assume the data is directly taken from
MinimalTupleData, so it should set it's t_len to the number of bytes
returned. Patch attached
(0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

Thomas, I guess you had a different approach in mind when you said
about "returning either success or
hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
me is that avoiding the copy shows consistent improvement of 4 to 10%
for simple parallel table scans. I tried my patch on both x86_64 and
arm64, and observed this speedup on both.

create table tab as select generate_series(1, 20000000) id,
'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
select pg_prewarm('tab'::regclass);
explain analyze select * from tab where id %2 = 0;
Times in milli-secs :
HEAD : 1833.119  1816.522  1875.648  1872.153  1834.383
Patch'ed : 1763.786  1721.160  1714.665  1719.738  1704.478
This was with the default 2 parallel workers. With 3 or 4 workers, for
the above testcase I didn't see a noticeable difference. I think, if I
double the number of rows, the difference will be noticeable. In any
case, the gain would go on reducing with the number of workers,
because the tuple copy also gets parallelized. In some scenarios,
parallel_leader_participation=off causes the difference to amplify.

Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch guc_for_testing.patch that I used for testing
the gain. This patch is only for testing. Without this, in order to
compare the performance figures it requires server restart, and the
figures anyway shift back and forth by 5-15 percent after each
restart, which creates lot of noise when comparing figures with and
without fix. Use this GUC enable_fix to enable/disable the fix.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
Huawei Technologies

From 5d19626e35e50a5630e5f1a042f7ecee6acb7c70 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amitdkhan...@gmail.com>
Date: Wed, 9 Sep 2020 12:03:01 +0800
Subject: [PATCH 2/2] Add guc for ease of testing speedup.

This is only for testing the performance gain. Otherwise, to compare
the performance figures, it requires server restart, and the figures
anyway shift back and forth by 5-15 percent after each restart, which
creates lot of noise when comparing figures with and without fix.

With this, we can easily see at least 4-10% difference in execution
times by setting/unsetting the GUC enable_fix.
---
 src/backend/executor/tqueue.c | 12 ++++++++++++
 src/backend/utils/misc/guc.c  | 12 ++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index c70ee0f39a..7dd60b5c7e 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -45,6 +45,7 @@ struct TupleQueueReader
 	shm_mq_handle *queue;		/* shm_mq to receive from */
 };
 
+extern bool		enable_fix;
 /*
  * Receive a tuple from a query, and send it to the designated shm_mq.
  *
@@ -60,12 +61,23 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 	bool		should_free;
 
 	/* Send the minimal tuple data. */
+	if (enable_fix)
+	{
 	minimal_tuple_data = ExecFetchSlotMinimalTupleData(slot, &len, &should_free);
 	result = shm_mq_send(tqueue->queue, len, minimal_tuple_data, false);
 
 	if (should_free)
 		pfree(minimal_tuple_data);
+	}
+	else
+	{
+	MinimalTuple tuple;
+	tuple = ExecFetchSlotMinimalTuple(slot, &should_free);
+	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);
 
+	if (should_free)
+		pfree(tuple);
+	}
 	/* Check for failure. */
 	if (result == SHM_MQ_DETACHED)
 		return false;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..8b3f1339cb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -138,6 +138,8 @@ extern bool trace_syncscan;
 extern bool optimize_bounded_sort;
 #endif
 
+bool		enable_fix = true;
+
 static int	GUC_check_errcode_value;
 
 /* global variables for check hook support */
@@ -2036,6 +2038,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"enable_fix", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
+			gettext_noop("dummy"),
+			gettext_noop("dummy"),
+			GUC_EXPLAIN
+		},
+		&enable_fix,
+		true,
+		NULL, NULL, NULL
+	},
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
-- 
2.17.1

From 286310e6a52af7883d20899163aaa349b479d614 Mon Sep 17 00:00:00 2001
From: Amit Khandekar <amitdkhan...@gmail.com>
Date: Wed, 9 Sep 2020 11:52:59 +0800
Subject: [PATCH 1/2] Avoid redundant tuple copy while sending tuples to Gather

If the tuple to be sent is a heap tuple, get the pointer to the minimal
tuple data from the heap tuple, and use this as the source data for
shm_mq_send(). This allows us to prevent a tuple copy when the slot
is a heap tuple slot.

Device a new function ExecFetchSlotMinimalTupleData() that tries to
send an in-place minimal tuple data or returns a copy only if the
slot is something other than a heap tuple or minimal tuple slot.

This shows between 4 to 10% speed up for simple non-aggregate parallel
queries.
---
 src/backend/executor/execTuples.c | 59 +++++++++++++++++++++++++++++++
 src/backend/executor/tqueue.c     | 17 +++++----
 src/include/executor/tuptable.h   |  2 ++
 3 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 4c90ac5236..55a6652b4e 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1682,6 +1682,65 @@ ExecFetchSlotMinimalTuple(TupleTableSlot *slot,
 	}
 }
 
+/* --------------------------------
+ *		ExecFetchSlotMinimalTupleData
+ *			Return minimal tuple data, or the data belonging to minimal tuple
+ *			section if it's a heap tuple slot.
+ *
+ *		This function is similar to ExecFetchSlotMinimalTuple(), but it goes a
+ *		step further in trying to avoid redundant tuple copy. It does this by
+ *		returning the in-place minimal tuple data region if it's a heap tuple.
+ *		If the slot is neither a minimal tuple nor heap tuple slot, a minimal
+ *		tuple copy is returned, and should_free is set to true. Callers can use
+ *		this if they only want the underlying minimal tuple data.
+ *		One use is where minimal tuple data is sent through the gather queues.
+ *		The receiver end can then treat the data as a MinimalTupleData, but
+ *		they should update it's t_len field, because the data might have
+ *		originally belonged to a heap tuple rather than minimal tuple.
+ */
+char *
+ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len,
+							  bool *shouldFree)
+{
+	/*
+	 * sanity checks
+	 */
+	Assert(slot != NULL);
+	Assert(!TTS_EMPTY(slot));
+
+	if (slot->tts_ops->get_heap_tuple)
+	{
+		HeapTuple	htuple;
+
+		if (shouldFree)
+			*shouldFree = false;
+		htuple = slot->tts_ops->get_heap_tuple(slot);
+		*len = htuple->t_len - MINIMAL_TUPLE_OFFSET;
+
+		return (char*) htuple->t_data + MINIMAL_TUPLE_OFFSET;
+	}
+	else
+	{
+		MinimalTuple	mtuple;
+
+		if (slot->tts_ops->get_minimal_tuple)
+		{
+			if (shouldFree)
+				*shouldFree = false;
+			mtuple = slot->tts_ops->get_minimal_tuple(slot);
+		}
+		else
+		{
+			if (shouldFree)
+				*shouldFree = true;
+			mtuple = slot->tts_ops->copy_minimal_tuple(slot);
+		}
+		*len = mtuple->t_len;
+
+		return (char *) mtuple;
+	}
+}
+
 /* --------------------------------
  *		ExecFetchSlotHeapTupleDatum
  *			Fetch the slot's tuple as a composite-type Datum.
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 30a264ebea..c70ee0f39a 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -54,16 +54,17 @@ static bool
 tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 {
 	TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
-	MinimalTuple tuple;
+	char	   *minimal_tuple_data;
+	uint32		len;
 	shm_mq_result result;
 	bool		should_free;
 
-	/* Send the tuple itself. */
-	tuple = ExecFetchSlotMinimalTuple(slot, &should_free);
-	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);
+	/* Send the minimal tuple data. */
+	minimal_tuple_data = ExecFetchSlotMinimalTupleData(slot, &len, &should_free);
+	result = shm_mq_send(tqueue->queue, len, minimal_tuple_data, false);
 
 	if (should_free)
-		pfree(tuple);
+		pfree(minimal_tuple_data);
 
 	/* Check for failure. */
 	if (result == SHM_MQ_DETACHED)
@@ -201,10 +202,12 @@ TupleQueueReaderNext(TupleQueueReader *reader, bool nowait, bool *done)
 
 	/*
 	 * Return a pointer to the queue memory directly (which had better be
-	 * sufficiently aligned).
+	 * sufficiently aligned). Also, the sender might not have updated the t_len
+	 * if the data belonged to a heap tuple rather than a minimal tuple. So
+	 * update it now.
 	 */
 	tuple = (MinimalTuple) data;
-	Assert(tuple->t_len == nbytes);
+	tuple->t_len = nbytes;
 
 	return tuple;
 }
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index f7df70b5ab..da78929035 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -325,6 +325,8 @@ extern void ExecStoreHeapTupleDatum(Datum data, TupleTableSlot *slot);
 extern HeapTuple ExecFetchSlotHeapTuple(TupleTableSlot *slot, bool materialize, bool *shouldFree);
 extern MinimalTuple ExecFetchSlotMinimalTuple(TupleTableSlot *slot,
 											  bool *shouldFree);
+extern char *ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len,
+										   bool *shouldFree);
 extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
 extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
 								 int lastAttNum);
-- 
2.17.1

Redundant tuple copy in tqueueReceiveSlot()

Reply via email to