Hello everyone.

> However ... I tried to reproduce the original complaint, and
> failed entirely.  I do see KnownAssignedXidsGetAndSetXmin
> eating a bit of time in the standby backends, but it's under 1%
> and doesn't seem to be rising over time.  Perhaps we've already
> applied some optimization that ameliorates the problem?  But
> I tested v13 as well as HEAD, and got the same results.
> Hmm.  I wonder if my inability to detect a problem is because the startup
> process does keep ahead of the workload on my machine, while it fails
> to do so on the OP's machine.  I've only got a 16-CPU machine at hand,
> which probably limits the ability of the primary to saturate the standby's
> startup process.

Yes, the optimization by Andres Freund made things much better, but the
impact is still noticeable. I was also using 16-CPU machines - but two of
them (primary and standby). Here are the scripts [1] I was using for the
benchmark - maybe they could help.

> Nowadays we've *got* those primitives.  Can we get rid of
> known_assigned_xids_lck, and if so would it make a meaningful
> difference in this scenario?

I already tried that, but was unable to find any real benefit from it. A WIP
patch is attached. Hmm, I see I did send it to the list, but it is absent
from the archives... Quoting from it:

> First potential positive effect I could see is
> (TransactionIdIsInProgress -> KnownAssignedXidsSearch) locking but
> seems like it is not on standby hotpath.

> Second one - locking for KnownAssignedXidsGetAndSetXmin (build
> snapshot). But I was unable to measure impact. It wasn't visible
> separately in (3) test.

> Maybe someone knows scenario causing known_assigned_xids_lck or
> TransactionIdIsInProgress become bottleneck on standby?

That last question still stands :)

> I think it might be a bigger effect than one might immediately think. Because
> the spinlock will typically be on the same cacheline as head/tail, and because
> every spinlock acquisition requires the cacheline to be modified (and thus
> owned exclusively) by the current core, uses of head/tail will very commonly
> be cache misses even in workloads without a lot of KAX activity.

I tried to find some way to demonstrate any noticeable impact here in
practice, but without success.

>> But yeah, it does feel like the proposed
>> approach is only going to be optimal over a small range of conditions.
> In particular, it doesn't adapt at all to workloads that don't replay all that
> much, but do compute a lot of snapshots.

The approach in [2] was optimized to avoid any additional work for
everything except the startup process (an approach with offsets to skip
gaps while building a snapshot).

[1]: https://gist.github.com/michail-nikolaev/e1dfc70bdd7cfd1b902523dbb3db2f28
[2]: https://www.postgresql.org/message-id/flat/CANtu0ogzo4MsR7My9%2BNhu3to5%3Dy7G9zSzUbxfWYOn9W5FfHjTA%40mail.gmail.com#341a3c3b033f69b260120b3173a66382

--
Michail Nikolaev
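To make the barrier pairing concrete, here is a minimal, hypothetical C sketch (not the PostgreSQL code itself) of the single-writer publication pattern the attached patch relies on: the writer fills array slots, issues a write barrier, and only then advances head, so any reader that observes the new head is guaranteed to observe the slot contents too. PostgreSQL's pg_write_barrier()/pg_read_barrier() are modeled here with C11 atomic_thread_fence; the names ring_add/ring_contains are made up for illustration.

```c
#include <stdatomic.h>

#define RING_SIZE 64

/* Hypothetical single-writer array, loosely modeled on KnownAssignedXids:
 * only one process (the "startup process") ever advances head. */
static unsigned xids[RING_SIZE];
static int head;                /* index of newest element, + 1 */

/* Writer side: store the element first, fence, then publish the new head.
 * Real code additionally relies on aligned int stores being atomic. */
static void
ring_add(unsigned xid)
{
    xids[head] = xid;
    atomic_thread_fence(memory_order_release);  /* pg_write_barrier() analogue */
    head = head + 1;
}

/* Reader side: snapshot head once, fence, then scan only the published
 * range [0, h).  Entries past h may still be in flight. */
static int
ring_contains(unsigned xid)
{
    int h = head;

    atomic_thread_fence(memory_order_acquire);  /* pg_read_barrier() analogue */
    for (int i = 0; i < h; i++)
        if (xids[i] == xid)
            return 1;
    return 0;
}
```

The key design point is that correctness needs no lock at all when there is exactly one writer: ordering, not mutual exclusion, is what the spinlock was providing.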
From 9759ea174db5e140415e4ce80a62f63075e91fe3 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev <michail.nikol...@gmail.com>
Date: Sun, 21 Nov 2021 21:23:02 +0300
Subject: [PATCH] memory barrier instead of spinlock

---
 src/backend/storage/ipc/procarray.c | 42 ++++++++----------------------------
 1 file changed, 9 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4da53a5b3f..bca336a80d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -60,7 +60,6 @@
 #include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
-#include "storage/spin.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
@@ -81,7 +80,6 @@ typedef struct ProcArrayStruct
 	int			numKnownAssignedXids;	/* current # of valid entries */
 	int			tailKnownAssignedXids;	/* index of oldest valid element */
 	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-	slock_t		known_assigned_xids_lck;	/* protects head/tail pointers */
 
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
@@ -424,7 +422,6 @@ CreateSharedProcArray(void)
 		procArray->numKnownAssignedXids = 0;
 		procArray->tailKnownAssignedXids = 0;
 		procArray->headKnownAssignedXids = 0;
-		SpinLockInit(&procArray->known_assigned_xids_lck);
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
@@ -4552,10 +4549,9 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * pointer.  This wouldn't require any lock at all, except that on machines
  * with weak memory ordering we need to be careful that other processors
  * see the array element changes before they see the head pointer change.
- * We handle this by using a spinlock to protect reads and writes of the
- * head/tail pointers.  (We could dispense with the spinlock if we were to
- * create suitable memory access barrier primitives and use those instead.)
- * The spinlock must be taken to read or write the head/tail pointers unless
+ * We handle this by using a memory barrier to protect writes of the
+ * head pointer.
+ * The memory barrier is issued before writing the head pointer unless
  * the caller holds ProcArrayLock exclusively.
  *
  * Algorithmic analysis:
@@ -4599,7 +4595,6 @@ KnownAssignedXidsCompress(bool force)
 	int			compress_index;
 	int			i;
 
-	/* no spinlock required since we hold ProcArrayLock exclusively */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
 
@@ -4685,7 +4680,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 
 	/*
 	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
+	 * can safely read them here.
 	 */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
@@ -4743,21 +4738,20 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 	pArray->numKnownAssignedXids += nxids;
 
 	/*
-	 * Now update the head pointer.  We use a spinlock to protect this
+	 * Now update the head pointer.  We use a memory barrier to protect this
 	 * pointer, not because the update is likely to be non-atomic, but to
 	 * ensure that other processors see the above array updates before they
 	 * see the head pointer change.
 	 *
 	 * If we're holding ProcArrayLock exclusively, there's no need to take the
-	 * spinlock.
+	 * barrier.
 	 */
 	if (exclusive_lock)
 		pArray->headKnownAssignedXids = head;
 	else
 	{
-		SpinLockAcquire(&pArray->known_assigned_xids_lck);
+		pg_write_barrier();
 		pArray->headKnownAssignedXids = head;
-		SpinLockRelease(&pArray->known_assigned_xids_lck);
 	}
 }
 
@@ -4780,20 +4774,8 @@ KnownAssignedXidsSearch(TransactionId xid, bool remove)
 	int			tail;
 	int			result_index = -1;
 
-	if (remove)
-	{
-		/* we hold ProcArrayLock exclusively, so no need for spinlock */
-		tail = pArray->tailKnownAssignedXids;
-		head = pArray->headKnownAssignedXids;
-	}
-	else
-	{
-		/* take spinlock to ensure we see up-to-date array contents */
-		SpinLockAcquire(&pArray->known_assigned_xids_lck);
-		tail = pArray->tailKnownAssignedXids;
-		head = pArray->headKnownAssignedXids;
-		SpinLockRelease(&pArray->known_assigned_xids_lck);
-	}
+	tail = pArray->tailKnownAssignedXids;
+	head = pArray->headKnownAssignedXids;
 
 	/*
 	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
@@ -5031,13 +5013,9 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
 	 * might miss newly-added xids, but they should be >= xmax so irrelevant
 	 * anyway.
-	 *
-	 * Must take spinlock to ensure we see up-to-date array contents.
 	 */
-	SpinLockAcquire(&procArray->known_assigned_xids_lck);
 	tail = procArray->tailKnownAssignedXids;
 	head = procArray->headKnownAssignedXids;
-	SpinLockRelease(&procArray->known_assigned_xids_lck);
 
 	for (i = tail; i < head; i++)
 	{
@@ -5084,10 +5062,8 @@ KnownAssignedXidsGetOldestXmin(void)
 	/*
 	 * Fetch head just once, since it may change while we loop.
 	 */
-	SpinLockAcquire(&procArray->known_assigned_xids_lck);
 	tail = procArray->tailKnownAssignedXids;
 	head = procArray->headKnownAssignedXids;
-	SpinLockRelease(&procArray->known_assigned_xids_lck);
 
 	for (i = tail; i < head; i++)
 	{
-- 
2.16.2.windows.1