On Thu, Jul 24, 2025 at 7:52 AM Bertrand Drouvot
<bertranddrouvot...@gmail.com> wrote:
> Well, the idea was more: as we speak about "wait" events then it would make
> sense to add their duration. And then, to have meaningful data to interpret
> the durations then it would make sense to add the counters. So that one
> could answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application "A"’s wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting
> longer or is it because it is hit more frequently?

Forgive me if I'm beating a dead horse here, but I just don't believe
the first two of these at all. If you want to know if the wait pattern
is the same over time, take samples over various intervals and compare
the samples. It's true that you might notice some differences by
comparing counts that you wouldn't notice by comparing sample
profiles, but that's true of ANY additional data that you choose to
collect. It's not a good argument for collecting
number-of-times-we-waited-on-each-wait-event specifically. In my
experience, counts of the number of times we hit each wait event point
are nearly meaningless: if you detect a change by comparing them
across time periods, all you will really know is that something
changed. You won't really get any understanding of what it was. I
think that with a more careful choice of what data to gather, we can
do better.

The third one -- are the waits longer or are we waiting more often --
is a much better argument. I acknowledge that's something that a user
might want to know. But I think it's worth drilling down into that a
little bit more -- when and why would the user want to know that? I
might be wrong here, but I feel like the most likely case where you
would care is something having to do with I/O, which gets back to
Andres's point about the AIO instrumentation. If it's strictly
processes fighting over in-memory LWLocks, I don't expect the lock
hold times to vary widely, and I'm not sure what the user is supposed
to think about it or do about it if they do. I actually wrote some
instrumentation kind of like this many years ago. It was useful for
understanding some of the internal locking mechanics of PostgreSQL, so
that I could think about whether things could be improved in the code,
but I never had the sense that what I wrote on that occasion would
have been useful to end users. I don't know if you have any more
detailed thoughts about this to share.

> > I'm almost sure that measuring LWLock wait
> > times is going to be too expensive to be practical,
>
> In my lab it added 60 cycles; I'm not sure that is too expensive. But even
> if we think it is, maybe we could provide an option to turn this "overhead"
> on/off with a GUC or compilation flag.

I don't know exactly what you tested, but there's a huge difference
between running a query normally and running it under EXPLAIN ANALYZE,
and a lot of that is timing overhead. There are probably cases where
queries never actually have to wait very much - e.g. all data in
shared_buffers, no contention, maybe a bunch of CPU cycles going into
computation rather than data access - but I think there will also be
cases where there are a LOT of wait events. For instance, suppose data
fits well into the OS cache but poorly into shared_buffers, and you
construct a workload that has to access a lot of buffers in quick
succession, like say a nested loop over a parameterized inner index
scan. Then I would think we would be doing a ton of extra timing
operations compared to right now, and since the waits would be short
(the data is coming from the OS rather than the disk) I would expect
the overhead to hurt quite a bit. If not, then why does EXPLAIN need a
TIMING OFF option?
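
To put a rough number on that concern in isolation, here is a minimal
standalone sketch (ordinary POSIX C, not PostgreSQL code; the loop count
and the figures it prints are just illustrative) that measures the cost of
bracketing an operation with a start/stop pair of clock reads, which is
the part that per-wait timing would add on every wait. It is similar in
spirit to what pg_test_timing measures:

#include <stdio.h>
#include <time.h>

/*
 * Measure the cost of a start/stop clock_gettime() pair with nothing in
 * between, i.e. the pure timing overhead paid once per instrumented wait.
 */
int
main(void)
{
	struct timespec start;
	struct timespec end;
	long long	total_ns = 0;
	const int	loops = 1000000;

	for (int i = 0; i < loops; i++)
	{
		clock_gettime(CLOCK_MONOTONIC, &start);
		/* the wait being timed would happen here */
		clock_gettime(CLOCK_MONOTONIC, &end);

		total_ns += (end.tv_sec - start.tv_sec) * 1000000000LL +
			(end.tv_nsec - start.tv_nsec);
	}

	printf("average cost per start/stop pair: %lld ns\n",
		   total_ns / loops);
	return 0;
}

On a machine with a fast clocksource that pair costs a few tens of
nanoseconds; with a slow one it can be orders of magnitude worse. Either
way it is paid on every wait, including the very short ones served from
the OS cache, which is exactly the case I'm worried about above.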

> > Measuring counts doesn't seem
> > very useful either: knowing the number of times that somebody tried to
> > acquire a relation lock or a tuple lock arguably tells you something
> > about your workload that you might want to know, whereas I would argue
> > that knowing the number of times that somebody tried to acquire a
> > buffer lock doesn't really tell you anything at all.
>
> If you add the duration to the mix, that's more useful. And if you add the
> buffer relfile information to the mix, that's even more insightful.

But also costly. If we create a system that has significantly more
overhead than wait events, we probably won't be able to have it
enabled all the time, and busy systems may not even be able to afford
turning it on temporarily. The cost of a system of this kind has a
huge impact on how usable it actually is.

> One could spot hot buffers with this data in hand.

I actually don't believe we need duration to spot hot buffers. Some
kind of sampling approach seems like it should work fine. If you
expose some kind of information about buffer access in shared memory
such that it can be sampled (deliberate hand-waving here) and somebody
takes 100 samples, with each sample covering all backends in the
system, over a period of 50 or 100 or 200 or 500 seconds, the hot
buffers are going to pop right up to the top of the list. You won't
necessarily be able to see buffers that are just a tiny bit hot, but
with a decent number of samples you should be able to clearly see the
stuff that's really hot. People often say things to me like "well what
if I just get really unlucky and always miss what is truly the hottest
buffer," but the law of large numbers says you don't really need to
worry about that as long as you collect a sufficiently large number of
samples, and that's not really difficult to do.
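
To put a rough number on the "unlucky" worry (the 5% figure here is an
assumption I'm making purely for illustration): if a genuinely hot buffer
shows up in 5% of individual samples, the probability that 100 independent
samples all miss it is (1 - 0.05)^100 = 0.95^100, which is roughly 0.006,
i.e. well under 1%, and it keeps shrinking geometrically as you take more
samples.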

-- 
Robert Haas
EDB: http://www.enterprisedb.com

