Hi, On 2026-04-07 13:30:11 -0700, Lukas Fittl wrote: > 0001 is the change to make queryDesc->totaltime be allocated by > ExecutorStart instead of plugins themselves, and adds a > queryDesc->totaltime_options to have plugins request which level of > summary instrumentation they need. This change is pretty simple, and > could still make sense to get into 19. Because of the earlier > Instrumentation refactoring that was pushed (thanks!) we're already > asking extensions allocating queryDesc->totaltime to modify their use > of InstrAlloc, so I think we might as well clean this up now.
Hm. That's a fair argument. They indeed would have to again change next
release
It's not a complicated change and removes more lines than it adds.
I guess one thing I'm not sure is whether the fields shouldn't be renamed at
the same time:
a) To prevent extensions from continuing to set it, most of them do not test
against assert enabled builds. With a different name they would get a
compiler error.
b) "totaltime" and "totaltime_options" are pretty poor descriptors of tracking
query level statistics. If everyone has to change anyway, this is a good
occasion.
'query_instr[_options]'?
Any opinions?
> 0002 is just ExecProcNodeInstr moved to instrument.c, as Andres had
> suggested previously. We still get some quick performance wins from
> doing that (see end of email), and again, its a simple change, so
> could be considered if someone has bandwidth remaining. I've added a
> later patch that then does the more complex inlining and gets us the
> full speed up.
Here it needs a few more inlines to get the full performance, otherwise it
doesn't inline all the helpers. I think on balance I didn't like the
prototype in instrument.h, that's too widely included, and it might even cause
some circularity issues. It seems better to do the somewhat ugly thing and
have the prototype be in executor.h.
> 0002 measurements (with current master and TSC clock source used for
> timing, best of three):
>
> CREATE TABLE lotsarows(key int not null);
> INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> VACUUM FREEZE lotsarows;
With the somewhat more extreme benchmark I used in the rdtsc thread and the
added inline mentioned above I see a bit bigger wins. See the attached
explainbench.sql - it doesn't quite cover all the combinations, but I think it
gives a good enough overview.
c=1 pgbench -f ~/tmp/explainbench.sql -P5 -r -t 10
master:
statement latencies in milliseconds and failures:
200.800 0 SELECT pg_prewarm('pgbench_accounts');
0.098 0 PREPARE query AS SELECT * FROM pgbench_accounts
OFFSET 5000000 LIMIT 1;
212.010 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
268.648 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
232.421 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
283.531 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
0.030 0 DEALLOCATE query;
0002:
statement latencies in milliseconds and failures:
201.558 0 SELECT pg_prewarm('pgbench_accounts');
0.103 0 PREPARE query AS SELECT * FROM pgbench_accounts
OFFSET 5000000 LIMIT 1;
188.696 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
244.479 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
223.773 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
266.947 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
0.034 0 DEALLOCATE query;
That's something like 4-12%.
Pretty nice for a patch that just adds a few lines around and adds a few
inlines.
> At this point I'd say its safe to say that we should push out later
> changes to PG20, because it needs another good look over, and I don't
> think Andres or Heikki have the capacity for that today (but I really
> appreciate all the effort put in by both of you!).
Indeed.
> @@ -334,6 +334,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
>
> if (auto_explain_enabled())
> {
> + /* We're always interested in runtime */
> + queryDesc->totaltime_options |= INSTRUMENT_TIMER;
> - queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
Not that it's going to make a significant difference, but it is nice that this
now would need to track less.
Kinda wonder about having
EXPLAIN (ANALYZE BUFFERS totals_only, WAL totals_only) ...;
in plenty cases that'd be all one needs, at substantially lower cost.
Greetings,
Andres Freund
explainbench.sql
Description: application/sql
>From f2f6ef266bdaa2670f08e061bf6a94f121be76d5 Mon Sep 17 00:00:00 2001 From: Lukas Fittl <[email protected]> Date: Tue, 7 Apr 2026 12:32:36 -0700 Subject: [PATCH v17a] instrumentation: Move ExecProcNodeInstr to allow inlining This moves the implementation of ExecProcNodeInstr, the ExecProcNode variant that gets used when instrumentation is on, to be defined in instrument.c instead of execProcNode.c, and marks functions it uses as inline. This allows compilers to generate an optimized implementation, and shows a 2 to 5% reduction in instrumentation overhead for queries that move lots of rows. Author: Lukas Fittl <[email protected]> Suggested-by: Andres Freund <[email protected]> Reviewed-by: Discussion: https://www.postgresql.org/message-id/flat/cap53pkzdbk8vj1fs4az481lgmn8f9mjic39zrhqkfusyq6k...@mail.gmail.com --- src/include/executor/executor.h | 7 ++++++ src/backend/executor/execProcnode.c | 20 ---------------- src/backend/executor/instrument.c | 37 ++++++++++++++++++++++++----- 3 files changed, 38 insertions(+), 26 deletions(-) diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index 491c4886506..6980c6dceda 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -303,6 +303,13 @@ extern void ExecEndNode(PlanState *node); extern void ExecShutdownNode(PlanState *node); extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node); +/* + * ExecProcNodeInstr() is implemented in instrument.c, as that allows for + * inlining of the instrumentation functions, but thematically it ought to be + * in execProcnode.c. + */ +extern TupleTableSlot *ExecProcNodeInstr(PlanState *node); + /* ---------------------------------------------------------------- * ExecProcNode diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c index 132fe37ef60..7c4c66e323f 100644 --- a/src/backend/executor/execProcnode.c +++ b/src/backend/executor/execProcnode.c @@ -121,7 +121,6 @@ #include "nodes/nodeFuncs.h" static TupleTableSlot *ExecProcNodeFirst(PlanState *node); -static TupleTableSlot *ExecProcNodeInstr(PlanState *node); static bool ExecShutdownNode_walker(PlanState *node, void *context); @@ -471,25 +470,6 @@ ExecProcNodeFirst(PlanState *node) } -/* - * ExecProcNode wrapper that performs instrumentation calls. By keeping - * this a separate function, we avoid overhead in the normal case where - * no instrumentation is wanted. - */ -static TupleTableSlot * -ExecProcNodeInstr(PlanState *node) -{ - TupleTableSlot *result; - - InstrStartNode(node->instrument); - - result = node->ExecProcNodeReal(node); - - InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0); - - return result; -} - /* ---------------------------------------------------------------- * MultiExecProcNode diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c index 4c3aec7fdee..ffbcd572133 100644 --- a/src/backend/executor/instrument.c +++ b/src/backend/executor/instrument.c @@ -15,7 +15,10 @@ #include <unistd.h> +#include "executor/executor.h" #include "executor/instrument.h" +#include "executor/tuptable.h" +#include "nodes/execnodes.h" #include "portability/instr_time.h" #include "utils/guc_hooks.h" @@ -46,7 +49,7 @@ InstrInitOptions(Instrumentation *instr, int instrument_options) instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0; } -void +inline void InstrStart(Instrumentation *instr) { if (instr->need_timer) @@ -125,14 +128,14 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options, bool async_mod } /* Entry to a plan node */ -void +inline void InstrStartNode(NodeInstrumentation *instr) { InstrStart(&instr->instr); } /* Exit from a plan node */ -void +inline void InstrStopNode(NodeInstrumentation *instr, double nTuples) { double save_tuplecount = instr->tuplecount; @@ -166,6 +169,28 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples) } } +/* + * ExecProcNode wrapper that performs instrumentation calls. By keeping + * this a separate function, we avoid overhead in the normal case where + * no instrumentation is wanted. + * + * This is implemented in instrument.c as all the functions it calls directly + * are here, allowing them to be inlined even when not using LTO. + */ +TupleTableSlot * +ExecProcNodeInstr(PlanState *node) +{ + TupleTableSlot *result; + + InstrStartNode(node->instrument); + + result = node->ExecProcNodeReal(node); + + InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0); + + return result; +} + /* Update tuple count */ void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples) @@ -298,7 +323,7 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add) } /* dst += add - sub */ -void +inline void BufferUsageAccumDiff(BufferUsage *dst, const BufferUsage *add, const BufferUsage *sub) @@ -328,7 +353,7 @@ BufferUsageAccumDiff(BufferUsage *dst, } /* helper functions for WAL usage accumulation */ -static void +static inline void WalUsageAdd(WalUsage *dst, WalUsage *add) { dst->wal_bytes += add->wal_bytes; @@ -338,7 +363,7 @@ WalUsageAdd(WalUsage *dst, WalUsage *add) dst->wal_buffers_full += add->wal_buffers_full; } -void +inline void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add, const WalUsage *sub) { dst->wal_bytes += add->wal_bytes - sub->wal_bytes; -- 2.53.0.1.gb2826b52eb
