On Thu, Jul 24, 2025 at 7:52 AM Bertrand Drouvot
<bertranddrouvot...@gmail.com> wrote:
> Well, the idea was more: as we speak about "wait" events then it would make
> sense to add their duration. And then, to have meaningful data to interpret
> the durations then it would make sense to add the counters. So that one
> could answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application’s "A" wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting
>   longer or is it because it is hit more frequently?
Forgive me if I'm beating a dead horse here, but I just don't believe the
first two of these at all. If you want to know if the wait pattern is the
same over time, take samples over various intervals and compare the samples.
It's true that you might notice some differences by comparing counts that
you wouldn't notice by comparing sample profiles, but that's true of ANY
additional data that you choose to collect. It's not a good argument for
collecting number-of-times-we-waited-on-each-wait-event specifically.
Because (in my experience) counts of the number of times we hit each wait
event point are nearly meaningless, if you detect a change by comparing
those across time periods, all you will really know is that something
changed. You won't really get any understanding of what it was. I think
with a more careful choice of what data to gather, we can do better.

The third one -- are the waits longer or are we waiting more often -- is a
much better argument. I acknowledge that's something that a user might want
to know. But I think it's worth drilling down into that a little bit more --
when and why would the user want to know that? I might be wrong here, but I
feel like the most likely case where you would care is something having to
do with I/O, which gets back to Andres's point about the AIO
instrumentation. If it's strictly processes fighting over in-memory LWLocks,
I don't expect the lock hold times to vary widely, and I'm not sure what the
user is supposed to think about it or do about it if they do. I actually
wrote some instrumentation kind of like this many years ago and it was
useful for understanding some of the internal locking mechanics of
PostgreSQL, so that I could think about whether things could be improved in
the code, but I never had the sense that what I wrote on that occasion would
have been useful to end-users. I don't know if you have any more detailed
thoughts about this to share.

> > I'm almost sure that measuring LWLock wait
> > times is going to be too expensive to be practical,
>
> On my lab it added 60 cycles, I'm not sure that is too expensive. But even
> if we think this is, maybe we could provide an option to turn this
> "overhead" off/on with a GUC or compilation flag.

I don't know exactly what you tested, but there's a huge difference between
running a query normally and running it under EXPLAIN ANALYZE, and a lot of
that is timing overhead. There are probably cases where queries never
actually have to wait very much - e.g. all data in shared_buffers, no
contention, maybe a bunch of CPU cycles going into computation rather than
data access - but I think there will also be cases where there are a LOT of
wait events. For instance, if data fits well into the OS cache but poorly
into shared_buffers and you construct a workload that has to access a lot
of buffers in quick succession, like say a nested loop over a parameterized
inner index scan, then I would think we would just be doing a ton of extra
timing operations compared to right now, and since the waits would be short
(since the data is coming from the OS rather than the disk) I would expect
it to hurt quite a bit. If not, then why does EXPLAIN need a TIMING OFF
option?
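To make the per-wait cost concrete, here is a rough, self-contained
micro-benchmark in plain C -- not code from the patch under discussion, and
the iteration count is arbitrary -- that just measures what one monotonic
clock read costs. A timed wait-event implementation would pay roughly two
such reads per wait (start and end), plus whatever bookkeeping it does with
the result:

/*
 * Illustrative micro-benchmark, not PostgreSQL code: estimate the cost
 * of a single monotonic clock read.
 */
#include <stdio.h>
#include <time.h>

int
main(void)
{
    const int   iters = 10 * 1000 * 1000;   /* arbitrary */
    struct timespec start, end, tmp;
    double      elapsed_ns;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iters; i++)
        clock_gettime(CLOCK_MONOTONIC, &tmp);   /* the call being measured */
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
        (end.tv_nsec - start.tv_nsec);

    printf("average clock read: %.1f ns\n", elapsed_ns / iters);
    return 0;
}

On many systems that comes out to a few tens of nanoseconds per call, which
is nothing for a wait that actually blocks on disk, but the worry above is
precisely the workloads that generate millions of very short waits.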
> > Measuring counts doesn't seem
> > very useful either: knowing the number of times that somebody tried to
> > acquire a relation lock or a tuple lock arguably tells you something
> > about your workload that you might want to know, whereas I would argue
> > that knowing the number of times that somebody tried to acquire a
> > buffer lock doesn't really tell you anything at all.
>
> If you add the duration to the mix, that's more useful. And if you add the
> buffer relfile information to the mix, that's even more insightful. But
> also costly.

If we create a system that has significantly more overhead than wait events,
we probably won't be able to have it enabled all the time, and busy systems
may not even be able to afford turning it on temporarily. The cost of a
system of this type has a huge impact on how usable it actually is.

> One could spot hot buffers with this data in hand.

I actually don't believe we need duration to spot hot buffers. Some kind of
sampling approach seems like it should work fine. If you expose some kind of
information about buffer access in shared memory such that it can be sampled
(deliberate hand-waving here) and somebody takes 100 samples, with each
sample covering all backends in the system, over a period of 50 or 100 or
200 or 500 seconds, the hot buffers are going to pop right up to the top of
the list. You won't necessarily be able to see buffers that are just a tiny
bit hot, but with a decent number of samples you should be able to clearly
see the stuff that's really hot. People often say things to me like "well
what if I just get really unlucky and always miss what is truly the hottest
buffer," but the law of large numbers says you don't really need to worry
about that as long as you collect a sufficiently large number of samples,
and that's not really difficult to do.

--
Robert Haas
EDB: http://www.enterprisedb.com
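P.S. A toy simulation of the law-of-large-numbers point above, in plain C
with entirely made-up numbers (10,000 buffers, one hypothetical hot buffer
receiving about 20% of accesses, only 100 samples):

/*
 * Toy simulation, not PostgreSQL code: each sample observes one buffer
 * access; the hot buffer dominates the per-buffer sample counts even
 * with a small number of samples.
 */
#include <stdio.h>
#include <stdlib.h>

#define NBUFFERS    10000
#define NSAMPLES    100
#define HOT_BUFFER  42      /* hypothetical hot buffer id */
#define HOT_PERCENT 20      /* share of accesses hitting it */

int
main(void)
{
    static int  hits[NBUFFERS];
    int         best = 0;

    srandom(12345);

    for (int i = 0; i < NSAMPLES; i++)
    {
        int     buf;

        if (random() % 100 < HOT_PERCENT)
            buf = HOT_BUFFER;
        else
            buf = random() % NBUFFERS;
        hits[buf]++;
    }

    for (int b = 0; b < NBUFFERS; b++)
        if (hits[b] > hits[best])
            best = b;

    printf("most-sampled buffer: %d (%d of %d samples)\n",
           best, hits[best], NSAMPLES);
    return 0;
}

With 100 samples the hot buffer shows up around 20 times, while any
individual cold buffer is unlikely to show up more than once -- that's the
"pops right up to the top" effect; buffers that are only slightly hot need
more samples to separate from the noise, as noted above.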