I've been thinking some more about scalability and what we need to measure in order to locate and remove the next set of bottlenecks.
EXCLUSIVE LOCKS

The lock wait time distribution and the sum of lock held time are of interest in understanding contention.

SHARED LOCKS

Shared locks present some complexities for analysing contention stats. If we look at the sum of lock held time we will get the wrong answer, because many backends can hold an LW_SHARED mode lock at the same time. Moreover, LW_SHARED locks have queue-jumping characteristics that can make LW_EXCLUSIVE lockers wait for substantial lengths of time. The worst of those situations was the old CheckpointStartLock, which could starve a starting checkpoint for many minutes on a busy server.

For locks that can be both shared and exclusive we should measure the lock wait time for shared and exclusive modes separately, and measure the lock hold time only for exclusive mode.

We've discussed the possibility of a third type of lock, a queued shared lock. I've not found any benefit in prototypes so far, but one day...

RARE EVENTS AND TRAFFIC JAMS

For queued exclusive locks the queue length over time is an interesting measurement, because we may find that certain rare events cause effects out of proportion to their actual duration. If the mean interval between new lock requests approaches the lock hold time (service time), then when a traffic jam forms it can take a long time to clear again.

e.g. if a lock is requested every 11us and the lock service time is 10us, then the lock seems like it will mostly be clear. Should the lock ever be held for an extended time, e.g. 1ms (= 1000us), then a long queue will form, roughly 1000/11 ≈ 90 requests deep. But from then on, in every 110us we serve 11 lock requestors while 10 more arrive, so the queue drains at only about one request per 110us. After the traffic jam forms it will take roughly 9,900us to clear, i.e. the traffic jam takes about ten times as long to clear as the original event that caused it. Taken to the extreme, very rare events can still be the major source of contention in a dynamic system.
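The arithmetic above can be checked with a small simulation. This is only a sketch of the scenario as described (deterministic arrivals every 11us rather than truly random ones, a single 1000us stall); the constants and function names are mine, not anything in the server:

```python
ARRIVAL = 11      # us between lock requests
SERVICE = 10      # us to service one request
STALL = 1000      # us the lock is held during the rare event

def clear_time():
    """Return how long after the stall ends the queue takes to drain."""
    queue = STALL // ARRIVAL              # requests piled up during the stall: 90
    t = STALL                             # the stall has just ended
    next_arrival = (queue + 1) * ARRIVAL  # first request after the stall
    while queue:
        t += SERVICE                      # serve the request at the head
        queue -= 1
        while next_arrival <= t:          # requests that arrived meanwhile
            queue += 1
            next_arrival += ARRIVAL
    return t - STALL

print(clear_time())   # ~9900us to drain a 1000us stall: ten times the event
```

Running it, the 1000us stall leaves a 90-deep queue that needs 9,900us to drain, matching the "ten times as long" figure.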
Now introduce non-random effects into the arrival rate distribution and you can see that flash queues can form easily and yet take a long time to clear. The maths for this is fairly hard...

WHY ARE WE WAITING?

Up to now we've looked at contention on single well-known LWLocks, such as BufMappingLock etc. There will be times when we need to return to those contention points, but I'm thinking we may need to begin looking at other points of contention in the server.

The single well-known locks behave in different ways, because each lock has different lock service times and also different access frequencies in different lock modes (shared or exclusive). We should be careful not to treat all of these locks alike in any analysis.

The second source of contention issues I see is where we hold multiple well-known locks. For example, holding WALInsertLock is normal, as is holding WALWriteLock, but holding WALInsertLock while we perform a write with WALWriteLock held is a bad thing, and we would want to avoid that condition. So I'd like to look at what combinations of locks we hold, and why they were taken.

The third source of contention is data block events. These are much harder to spot because they are spread across the whole buffer space. An example might be index block splits. These will occur at the same logical place in the index, though because of the way we split, the new right page is always a new data block and so sits in a different buffer. So contention on the value "123" in an index could actually move across different buffer locks and not be visible for what it really is. Recursive block splits can cause very long waits. We need ways to be able to track those types of event.

So our sources of contention are at least
1. single well-known locks
2. multiple well-known locks
3. data block contention events
???

I've thought about ways of understanding the root cause of a lock wait, and there are some.
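For the second source, one way to spot bad combinations would be to record, whenever a backend blocks on a lock, the set of locks it already holds. A minimal sketch of that accounting follows; the class, its hooks, and the per-backend usage are all hypothetical (real instrumentation would have to live in the LWLock code itself):

```python
from collections import Counter

class LockCombinationTracker:
    """Hypothetical per-backend tracker: which lock sets are held when we wait."""
    def __init__(self):
        self.held = set()               # locks this backend currently holds
        self.wait_profile = Counter()   # (frozenset of held locks, wanted lock) -> waits

    def record_wait(self, wanted):
        # Called when acquiring `wanted` blocks: note what we already held.
        self.wait_profile[(frozenset(self.held), wanted)] += 1

    def acquired(self, lock):
        self.held.add(lock)

    def released(self, lock):
        self.held.discard(lock)

# The dangerous combination from above: waiting on WALWriteLock
# while WALInsertLock is already held.
t = LockCombinationTracker()
t.acquired("WALInsertLock")
t.record_wait("WALWriteLock")
t.acquired("WALWriteLock")
t.released("WALWriteLock")
t.released("WALInsertLock")
```

Sorting `wait_profile` by count would then show directly which held-lock combinations most often accompany a wait, and on which lock.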
But because of what we said earlier about traffic jams lasting much longer than the original event, it's hard to accurately explain why certain tasks wait. Are we waiting because an earlier event caused a traffic jam, or are we waiting because a sudden rush of lock requests occurred before the original traffic jam cleared?

-- 
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com