Hi Zsolt,

Thanks for reviewing!

On Mon, Mar 9, 2026 at 2:55 PM Zsolt Parragi <[email protected]> wrote:
> + if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
> + {
> + ExecFinalizeNodeInstrumentation(queryDesc->planstate);
> +
> + ExecFinalizeTriggerInstrumentation(estate);
> + }
> +
>   if (queryDesc->totaltime)
> - InstrStop(queryDesc->totaltime);
> + queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
>
> In ExecFinalizeNodeInstrumentation InstrFinalizeNode pfrees the
> original instrumentation, but doesn't remove it from the
> unfinalized_children list. In normal execution in
> InstrQueryStopFinalize ResourceOwnerForgetInstrumentation handles
> this, but what about the error path, if something happens between the
> two?
> Won't we end up in ResOwnerReleaseInstrumentation and do use after
> free reads and then a double free on the now invalid pointer?

ResourceOwnerForgetInstrumentation directly follows the call to
ExecFinalizeNodeInstrumentation in standard_ExecutorFinish, so I'm not
sure which error case you're thinking of?

We could explicitly zap the unfinalized_children list at the end of
ExecFinalizeNodeInstrumentation (and have it take a
QueryInstrumentation argument) to protect against this, but I don't
think that makes much of a difference with the code as currently
written.

Maybe I'm misunderstanding which situation you're thinking of?

> + if (myState->es->timing || myState->es->buffers)
> + instr = InstrQueryStopFinalize(instr);
> +
>
> Is it okay to leak 1 instrumentation copy per tuple in the query
> context? This freshly palloced object will be out of scope a few lines
> after this.

I don't think that's a permanent leak, since the copy is allocated in
the caller's memory context, i.e. the per-query memory context - but
yeah, it doesn't seem ideal to keep that memory around for each tuple
processed.

The simple fix would be a pfree in serializeAnalyzeReceive right after
we've added the stats. That said, this extra instrumentation for
EXPLAIN (SERIALIZE) is a bit of a curious edge case, since
serializeAnalyzeReceive gets called once per tuple - so doing the
ResOwner dance for each tuple is probably not ideal either.

I wonder if we should instead treat this as a case of
NodeInstrumentation, and attach it to the query's totaltime
instrumentation for abort safety. The only slightly inconvenient part
is that calling InstrQueryRememberNode/InstrFinalizeNode requires
passing in the parent QueryInstrumentation, which the
SerializeDestReceiver doesn't currently have easy access to. We could
brute-force that by adding an extra method that sets it after
CreateQueryDesc is called, but:

Stepping back, from an overall API design perspective, we could also
track the current QueryInstrumentation in a global variable. That
would make it easier to attach/finalize extra instrumentation even
when we don't have direct access to queryDesc->totaltime, or when
we're in a non-executor context that uses QueryInstrumentation (like
some utility commands) - useful for future cases of extra
sub-query-level instrumentation. For example, it could come in handy
if we wanted VACUUM VERBOSE to break out the buffer accesses it does
on indexes vs. tables.

> @@ -128,8 +130,22 @@ IndexNext(IndexScanState *node)
>
> IndexNextWithReorder doesn't need the same handling?

Yes - good point, that's an unintentional omission.

For context, I hadn't spent much time on this part of the series yet
(the focus was getting the design of the stack mechanism right), but
I'll fix this in the next revision to also handle
IndexNextWithReorder.

I've also been wondering whether we should teach index-only scans to
break out the per-table buffer stats for the (rare) heap fetches. I
guess we should? It'd be fairly easy to instrument the
index_fetch_heap call there now.

> + if (scandesc->xs_heap_continue)
> + elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
>
> Shouldn't this say index scans? (there's another preexisting
> indexonylscan mention in this file, but that also seems wrong)

Hmm. Looking at this again, I think the handling of xs_heap_continue
isn't quite right. I think it should instead be this, so that we
correctly call the table AM's table_index_fetch_tuple again if
call_again gets set:
    while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
    {
        if (node->iss_InstrumentTable)
            InstrPushStack(&node->iss_InstrumentTable->instr);

        for (;;)
        {
            found = index_fetch_heap(scandesc, slot);
            if (found || !scandesc->xs_heap_continue)
                break;
        }

        if (node->iss_InstrumentTable)
            InstrPopStack(&node->iss_InstrumentTable->instr);

        if (unlikely(!found))
            continue;

Which matches the loop in index_getnext_slot (i.e. keep calling
index_fetch_heap while xs_heap_continue is true). Based on that, we
wouldn't need the elog at all.

> +#if HAVE_INSTR_STACK
> + usage = &instr_top.bufusage;
> +#else
> + usage = &pgBufferUsage;
> +#endif
>
> I don't see pgBufferUsage anywhere else in the current code, it was
> probably removed between rebases?

That's intentional, to allow testing pg_session_buffer_usage both on
master (HAVE_INSTR_STACK=0) and with the patched version
(HAVE_INSTR_STACK=1) - mainly to ensure we're matching behavior for
anyone interested in the top-line buffer or WAL numbers. I don't think
we should commit pg_session_buffer_usage (at least not as-is); it's
just there for testing the earlier changes.

> + char    *prefix = title ? psprintf("%s ", title) : pstrdup("");
> +
> + ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
>      usage->shared_blks_hit, es);
>
> (And many similar ExplainPropery after this)
>
> title is NULL most of the time, and this results in 16 allocations for
> that common case - isn't there a better solution like using
> ExplainOpenGroup or something?

Yeah, I guess that's fair - I don't know if the extra allocations
really matter, but I can see your point.

If we used ExplainOpenGroup that would make it a nested structure in
the JSON/etc representation, so it'd look like this:

"Shared Read Blocks": 123.0,
...
"Temp I/O Write Time": 42.0,
"Table": {
  "Shared Read Blocks": 123.0,
  ...
  "Temp I/O Write Time": 42.0
}

compared to text representation (simplified):

Buffers: shared read=123.0
I/O Timings: temp write=42.0
Table Buffers: shared read=123.0
Table I/O Timings: temp write=42.0

I think that can work, and we have precedent for using groups in a
node like this, e.g. with "Workers".

If we go with this, I do wonder if we should clarify the JSON/etc
group key as "Table Access" (one group) or "Table Buffers" / "Table
I/O" (two groups), instead of just "Table"?

> + * Callers must ensure that no intermediate stack entries are skipped, to
> + * handle aborts correctly. If you're thinking of calling this in a 
> PG_FINALLY
> + * block, instead call InstrPopAndFinalizeStack which can skip intermediate
> + * stack entries, or instead use InstrStart/InstrStop.
>
> InstrPopAndFinalizeStack doesn't exists in the latest patch version

Ah, yeah, that should say InstrStopFinalize (direct use of
InstrPushStack/InstrPopStack is discouraged in favor of the
Instrumentation functions) - I'll fix it in the next revision.

Thanks,
Lukas

--
Lukas Fittl

