Hi,

On Thu, Jul 24, 2025 at 10:38:49AM -0400, Robert Haas wrote:
> On Thu, Jul 24, 2025 at 7:52 AM Bertrand Drouvot
> <bertranddrouvot...@gmail.com> wrote:
> > Well, the idea was more: as we speak about "wait" events then it would
> > make sense to add their duration. And then, to have meaningful data to
> > interpret the durations then it would make sense to add the counters.
> > So that one could answer questions like:
> >
> > * Is the engine’s wait pattern the same over time?
> > * Is application’s "A" wait pattern the same over time?
> > * I observe a peak in wait event "W": is it because "W" is now waiting
> >   longer or is it because it is hit more frequently?
>
> The third one -- are the waits longer or are we waiting more often --
> is a much better argument. I acknowledge that's something that a user
> might want to know. But I think it's worth drilling down into that a
> little bit more -- when and why would the user want to know that? I
> might be wrong here, but I feel like the most likely case where you
> would care is something having to do with I/O, which gets back to
> Andres's point about the AIO instrumentation.
Yeah, I/O is definitely an area where exposing the duration makes sense.

> If it's strictly
> processes fighting over in-memory LWLocks, I don't expect the lock
> hold times to vary widely, and I'm not sure what the user is supposed
> to think about it or do about it if they do.

Yeah, I also think not all of them are worth getting the duration for.

> I actually wrote some
> instrumentation kind of like this many years ago and it was useful for
> understanding some of the internal locking mechanics of PostgreSQL, so
> that I could think about whether things could be improved in the code,
> but I never had the sense that what I wrote on that occasion would
> have been useful to end-users. I don't know if you have any more
> detailed thoughts about this to share.

Focusing on LWLocks, I think that OidGenLock, for example, is one for which it
would be interesting to have the duration. That way one could start to
investigate whether there is a very long run of used OID values with no gaps
in TOAST tables or whether there is simply higher concurrency.

I do agree that it might not be useful for all the LWLocks (those for which
users have no actionable workaround/solution), but for some of them I think it
could be useful (like the one mentioned above).

Outside of LWLocks, I think it makes sense for heavyweight locks, to answer
questions like: is the locker holding the lock for longer?

> > On my lab it added 60 cycles, I'm not sure that is too expensive. But even
> > if we think this is, maybe we could provide an option to turn this
> > "overhead" off/on with a GUC or compilation flag.
>
> I don't know exactly what you tested,

I counted the cycles with:

static inline uint64_t
rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}

calling this function in 2 ways:

W1: before and after incrementing the counters and collecting their durations
W2: in wait_start() and in wait_end()

Each of the above has been done without the other one in place, and the deltas
have been computed while running check-world. That way I generated percentiles
for the values measured in W1 (shared in [1]) and also per wait event (also
shared in [1]).

P50 for the counter increments and durations is 146 cycles; if we remove the
estimated rdtsc() call overhead (say 80 cycles), that gives 66 cycles of
overhead (see the sketch below for what W1 times).

[1]: https://www.postgresql.org/message-id/aH%2BDDoiLujt%2Bb9cV%40ip-10-97-1-34.eu-west-3.compute.internal

> but there's a huge difference
> between running a query normally and running it under EXPLAIN ANALYZE,
> and a lot of that is timing overhead. There are probably cases where
> queries never actually have to wait very much - e.g. all data in
> shared_buffers, no contention, maybe a bunch of CPU cycles going into
> computation rather than data access - but I think there will also be
> cases where there are a LOT of wait events.

I see: you are saying that a LOT of wait events could add a LOT of those
little per-event cycles, which could lead to much more overhead overall than
those 66 cycles. That's right, and that's why the idea was to add the counters
and timings only for wait classes for which the counter and duration overhead
is relatively small compared to the waits themselves.
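To make W1 a bit more concrete, here is a minimal, self-contained sketch of
what is being timed. Note that wait_stats and record_wait() below are made-up
names for illustration only, not the patch's actual code:

/*
 * Sketch of the "W1" measurement: rdtsc() is read immediately before and
 * after the (hypothetical) counter/duration bookkeeping, and the delta is
 * what the percentiles were computed from.  x86 only.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct wait_stats
{
    uint64_t counter;       /* number of times the wait event was hit */
    uint64_t total_cycles;  /* accumulated wait duration, in cycles */
} wait_stats;

static inline uint64_t
rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t) hi << 32) | lo;
}

/* the bookkeeping whose own cost is being measured */
static inline void
record_wait(wait_stats *ws, uint64_t wait_cycles)
{
    ws->counter++;
    ws->total_cycles += wait_cycles;
}

int
main(void)
{
    wait_stats ws = {0};
    uint64_t before, after;

    before = rdtsc();
    record_wait(&ws, 1234);     /* pretend a wait just finished */
    after = rdtsc();

    /*
     * "after - before" still includes the cost of the two rdtsc() calls
     * themselves (estimated at ~80 cycles above), which is why that estimate
     * is subtracted from the measured P50 of 146 cycles.
     */
    printf("bookkeeping cost: %llu cycles (rdtsc overhead included)\n",
           (unsigned long long) (after - before));
    return 0;
}

In the actual measurements, deltas like this were accumulated over a
check-world run to produce the percentiles mentioned above.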
Coming back to the per-class overhead numbers, from [1]:

"
Overhead on the lock class is about 0.03%
Overhead on the timeout class is less than 0.01%

and now we can also see that:

Overhead on the lwlock class is about 1%
Overhead on the client class is about 0.5%
Overhead on the bufferpin class is about 0.2%
"

> For instance, if data fits
> well into the OS cache but poorly into shared_buffers and you
> construct a workload that has to access a lot of buffers in quick
> succession, like say a nested loop over a parameterized inner index
> scan, then I would think we would just be doing a ton of extra timing
> operations compared to right now, and since the waits would be short
> (since the data is coming from the OS rather than the disk) I would
> expect it to hurt quite a bit. If not, then why does EXPLAIN need a
> TIMING OFF option?

That's right, and that's why the reasoning shared in [1] was based on
filtering the wait classes by focusing on very small overhead (relative
overhead, not absolute).

> > If you add the duration to the mix, that's more useful. And if you add the
> > buffer relfile information to the mix, that's even more insightful.
>
> But also costly.

Right, everything that adds cycles adds costs. I think that comparing the
relative costs as done in [1] makes sense though: it helps to categorize the
costs.

> > One could spot hot buffers with this data in hand.
>
> I actually don't believe we need duration to spot hot buffers. Some
> kind of sampling approach seems like it should work fine. If you
> expose some kind of information about buffer access in shared memory
> such that it can be sampled (deliberate hand-waving here) and somebody
> takes 100 samples, with each sample covering all backends in the
> system, over a period of 50 or 100 or 200 or 500 seconds, the hot
> buffers are going to pop right up to the top of the list.

That's right, the ones you would spot would be hot buffers, but I think it's
also possible to miss some (the smaller the sampling interval is, the better,
though). But still, without the durations you can't say whether the ones
holding the buffers are holding them for longer than before or whether there
is an increase in concurrency. Between 2 samples that produced the same wait
on a buffer: has the wait event changed in between, or was it the same wait
the whole time? If it's the latter we could increment the duration by the
sampling interval, but we just don't know.

> People often say things to me like "well what
> if I just get really unlucky and always miss what is truly the hottest
> buffer," but the law of large numbers says you don't really need to
> worry about that as long as you collect a sufficiently large number of
> samples, and that's not really difficult to do.

Yeah, probably, but still, without the durations it's almost impossible to
know whether there is an increase in concurrency or whether the lock is held
for longer. Indeed, if the lock is held for longer, one might see an increase
in the number of waiters and "wrongly" interpret that as an increase in
concurrency.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com