Hi Zsolt,

Thanks for reviewing!
On Mon, Mar 9, 2026 at 2:55 PM Zsolt Parragi <[email protected]> wrote:
> +	if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
> +	{
> +		ExecFinalizeNodeInstrumentation(queryDesc->planstate);
> +
> +		ExecFinalizeTriggerInstrumentation(estate);
> +	}
> +
>  	if (queryDesc->totaltime)
> -		InstrStop(queryDesc->totaltime);
> +		queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
>
> In ExecFinalizeNodeInstrumentation InstrFinalizeNode pfrees the
> original instrumentation, but doesn't remove it from the
> unfinalized_children list. In normal execution in
> InstrQueryStopFinalize ResourceOwnerForgetInstrumentation handles
> this, but what about the error path, if something happens between the
> two?
> Won't we end up in ResOwnerReleaseInstrumentation and do use after
> free reads and then a double free on the now invalid pointer?

ResourceOwnerForgetInstrumentation directly follows the call to
ExecFinalizeNodeInstrumentation in standard_ExecutorFinish, so I'm not
sure which error case you're thinking of?

We could explicitly zap the unfinalized_children list at the end of
ExecFinalizeNodeInstrumentation (and have it take a
QueryInstrumentation argument) to protect against this, but I don't
think that makes much of a difference with the code as currently
written. Maybe I'm misunderstanding which situation you're thinking
of?

> +	if (myState->es->timing || myState->es->buffers)
> +		instr = InstrQueryStopFinalize(instr);
> +
>
> Is it okay to leak 1 instrumentation copy per tuple in the query
> context? This freshly palloced object will be out of scope a few
> lines after this.

I don't think that's a permanent leak, since it would be in the memory
context of the caller, i.e. the per-query memory context, but yeah,
this doesn't seem ideal, since we're keeping this memory around for
each tuple processed.
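To spell out what I think you're describing (heavily paraphrased from
the patch's serializeAnalyzeReceive, so treat this as a sketch rather
than the actual code):

    /* per tuple, in serializeAnalyzeReceive: */
    if (myState->es->timing || myState->es->buffers)
        instr = InstrQueryStopFinalize(instr);	/* pallocs a finalized copy */

    /* ... counters from *instr get added to the receiver's totals ... */

    /*
     * instr then goes out of scope, but the palloc'd copy sticks around
     * in the per-query memory context until that context is reset.
     */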
First of all, we could just do a pfree in serializeAnalyzeReceive
after we've added the stats. That said, this extra instrumentation for
EXPLAIN (SERIALIZE) is a bit of a curious edge case, since
serializeAnalyzeReceive gets called once per tuple - and so doing the
ResOwner dance for each tuple is probably not ideal.

I wonder if we should instead treat this as a case of
NodeInstrumentation, and attach it to the query's totaltime
instrumentation for abort safety. The only thing that's a bit
inconvenient about that is that calling
InstrQueryRememberNode/InstrFinalizeNode requires passing in that
parent QueryInstrumentation, and the SerializeDestReceiver doesn't
currently have easy access to it. If we brute-force solve that, we
could just add an extra method to set it after CreateQueryDesc is
called, but:

Stepping back, from an overall API design perspective, we could also
track the current QueryInstrumentation in a global variable - that
would make it easier to attach/finalize extra instrumentation on it
even if we don't have direct access to querydesc->totaltime, or we're
in a non-executor context that uses QueryInstrumentation (like some
utility commands), for future use cases of extra sub-query-level
instrumentation. For example, that could also come in handy if we
wanted VACUUM VERBOSE to break out the buffer accesses it does on
indexes vs. tables.

> @@ -128,8 +130,22 @@ IndexNext(IndexScanState *node)
>
> IndexNextWithReorder doesn't need the same handling?

Yes - good point, that's an unintentional omission. For context, I
hadn't spent substantial time on this part of the series (rather on
getting the design of the stack mechanism right), but I'll fix this in
the next revision to also handle IndexNextWithReorder.

I've also been thinking about whether we should teach index-only scans
to break out the per-table buffer stats for the (rare) heap fetches. I
guess we should?
(it'd be fairly easy to instrument that index_fetch_heap there now)

> +	if (scandesc->xs_heap_continue)
> +		elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
>
> Shouldn't this say index scans? (there's another preexisting
> indexonylscan mention in this file, but that also seems wrong)

Hmm. Looking at this again, I think the handling of xs_heap_continue
isn't right altogether. Basically it should be this instead, I think,
so we correctly call the table AM's table_index_fetch_tuple again if
call_again gets set:

    while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
    {
        if (node->iss_InstrumentTable)
            InstrPushStack(&node->iss_InstrumentTable->instr);

        for (;;)
        {
            found = index_fetch_heap(scandesc, slot);
            if (found || !scandesc->xs_heap_continue)
                break;
        }

        if (node->iss_InstrumentTable)
            InstrPopStack(&node->iss_InstrumentTable->instr);

        if (unlikely(!found))
            continue;

That matches the loop in index_getnext_slot (i.e. keep calling
index_fetch_heap while xs_heap_continue=true). Based on that, we
wouldn't need the elog at all.

> +#if HAVE_INSTR_STACK
> +	usage = &instr_top.bufusage;
> +#else
> +	usage = &pgBufferUsage;
> +#endif
>
> I don't see pgBufferUsage anywhere else in the current code, it was
> probably removed between rebases?

That's intentional, to allow testing pg_session_buffer_usage both on
master (HAVE_INSTR_STACK=0) and with the patched version
(HAVE_INSTR_STACK=1) - mainly to ensure we're matching behavior for
anyone interested in the top-line buffer or WAL numbers. I don't think
we should commit pg_session_buffer_usage (at least not as-is), it's
just there for testing the earlier changes.

> +	char	   *prefix = title ? psprintf("%s ", title) : pstrdup("");
> +
> +	ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
> 						   usage->shared_blks_hit, es);
>
> (And many similar ExplainProperty calls after this)
>
> title is NULL most of the time, and this results in 16 allocations
> for that common case - isn't there a better solution like using
> ExplainOpenGroup or something?

Yeah, I guess that's fair - I don't know if the extra allocations
really matter, but I can see your point. If we used ExplainOpenGroup,
that would make it a nested structure in the JSON/etc. representation,
so it'd look like this:

    "Shared Read Blocks": 123.0,
    ...
    "Temp I/O Write Time": 42.0,
    "Table": {
      "Shared Read Blocks": 123.0,
      ...
      "Temp I/O Write Time": 42.0
    }

compared to the text representation (simplified):

    Buffers: shared read=123.0
    I/O Timings: temp write=42.0
    Table Buffers: shared read=123.0
    Table I/O Timings: temp write=42.0

I think that can work, and we have precedent for using groups in a
node like this, e.g. with "Workers". If we go with this, I do wonder
whether we should name the JSON/etc. group key "Table Access" (one
group) or "Table Buffers" / "Table I/O" (two groups), instead of just
"Table"?

> + * Callers must ensure that no intermediate stack entries are skipped, to
> + * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
> + * block, instead call InstrPopAndFinalizeStack which can skip intermediate
> + * stack entries, or instead use InstrStart/InstrStop.
>
> InstrPopAndFinalizeStack doesn't exist in the latest patch version

Ah, yeah, that should say InstrStopFinalize (basically, direct use of
InstrPushStack/InstrPopStack is discouraged in favor of the
Instrumentation functions) - I'll fix it in the next revision.

Thanks,
Lukas

--
Lukas Fittl
