Hello everyone.

> However ... I tried to reproduce the original complaint, and
> failed entirely.  I do see KnownAssignedXidsGetAndSetXmin
> eating a bit of time in the standby backends, but it's under 1%
> and doesn't seem to be rising over time.  Perhaps we've already
> applied some optimization that ameliorates the problem?  But
> I tested v13 as well as HEAD, and got the same results.
> Hmm.  I wonder if my inability to detect a problem is because the startup
> process does keep ahead of the workload on my machine, while it fails
> to do so on the OP's machine.  I've only got a 16-CPU machine at hand,
> which probably limits the ability of the primary to saturate the standby's
> startup process.

Yes, the optimization by Andres Freund made things much better, but the
impact is still noticeable. I was also using 16-CPU machines - but two of
them (primary and standby). Here are the scripts [1] I was using for the
benchmark - maybe they could help.

> Nowadays we've *got* those primitives.  Can we get rid of
> known_assigned_xids_lck, and if so would it make a meaningful
> difference in this scenario?

I already tried that, but was unable to find any real benefit from it. A WIP
patch is attached. Hmm, I see I did send it to the list, but it is absent
from the archives... Quoting from it:

> First potential positive effect I could see is
> (TransactionIdIsInProgress -> KnownAssignedXidsSearch) locking but
> seems like it is not on standby hotpath.

> Second one - locking for KnownAssignedXidsGetAndSetXmin (build
> snapshot). But I was unable to measure impact. It wasn't visible
> separately in (3) test.

> Maybe someone knows scenario causing known_assigned_xids_lck or
> TransactionIdIsInProgress become bottleneck on standby?

That last question still stands :)

> I think it might be a bigger effect than one might immediately think. Because
> the spinlock will typically be on the same cacheline as head/tail, and because
> every spinlock acquisition requires the cacheline to be modified (and thus
> owned exclusively) by the current core, uses of head/tail will very commonly
> be cache misses even in workloads without a lot of KAX activity.

I tried to find some way to demonstrate any noticeable impact here in
practice, but without success.

>> But yeah, it does feel like the proposed
>> approach is only going to be optimal over a small range of conditions.
> In particular, it doesn't adapt at all to workloads that don't replay all that
> much, but do compute a lot of snapshots.

The approach in [2] was optimized to avoid any additional work for
everything except the startup process (an approach with offsets to skip
gaps while building a snapshot).

[1]: https://gist.github.com/michail-nikolaev/e1dfc70bdd7cfd1b902523dbb3db2f28
[2]: https://www.postgresql.org/message-id/flat/CANtu0ogzo4MsR7My9%2BNhu3to5%3Dy7G9zSzUbxfWYOn9W5FfHjTA%40mail.gmail.com#341a3c3b033f69b260120b3173a66382

--
Michail Nikolaev
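To make the barrier pairing concrete, here is a minimal, hypothetical C sketch (not the PostgreSQL code itself) of the single-writer publication pattern the attached patch relies on: the writer fills array slots, issues a write barrier, and only then advances head, so any reader that observes the new head is guaranteed to observe the slot contents too. PostgreSQL's pg_write_barrier()/pg_read_barrier() are modeled here with C11 atomic_thread_fence; the names ring_add/ring_contains are made up for illustration.

```c
#include <stdatomic.h>

#define RING_SIZE 64

/* Hypothetical single-writer array, loosely modeled on KnownAssignedXids:
 * only one process (the "startup process") ever advances head. */
static unsigned xids[RING_SIZE];
static int head;                /* index of newest element, + 1 */

/* Writer side: store the element first, fence, then publish the new head.
 * Real code additionally relies on aligned int stores being atomic. */
static void
ring_add(unsigned xid)
{
    xids[head] = xid;
    atomic_thread_fence(memory_order_release);  /* pg_write_barrier() analogue */
    head = head + 1;
}

/* Reader side: snapshot head once, fence, then scan only the published
 * range [0, h).  Entries past h may still be in flight. */
static int
ring_contains(unsigned xid)
{
    int h = head;

    atomic_thread_fence(memory_order_acquire);  /* pg_read_barrier() analogue */
    for (int i = 0; i < h; i++)
        if (xids[i] == xid)
            return 1;
    return 0;
}
```

The key design point is that correctness needs no lock at all when there is exactly one writer: ordering, not mutual exclusion, is what the spinlock was providing.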
From 9759ea174db5e140415e4ce80a62f63075e91fe3 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev <michail.nikol...@gmail.com>
Date: Sun, 21 Nov 2021 21:23:02 +0300
Subject: [PATCH] memory barrier instead of spinlock

---
 src/backend/storage/ipc/procarray.c | 42 ++++++++----------------------------
 1 file changed, 9 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4da53a5b3f..bca336a80d 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -60,7 +60,6 @@
 #include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
-#include "storage/spin.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
@@ -81,7 +80,6 @@ typedef struct ProcArrayStruct
 	int			numKnownAssignedXids;	/* current # of valid entries */
 	int			tailKnownAssignedXids;	/* index of oldest valid element */
 	int			headKnownAssignedXids;	/* index of newest element, + 1 */
-	slock_t		known_assigned_xids_lck;	/* protects head/tail pointers */
 
 	/*
 	 * Highest subxid that has been removed from KnownAssignedXids array to
@@ -424,7 +422,6 @@ CreateSharedProcArray(void)
 		procArray->numKnownAssignedXids = 0;
 		procArray->tailKnownAssignedXids = 0;
 		procArray->headKnownAssignedXids = 0;
-		SpinLockInit(&procArray->known_assigned_xids_lck);
 		procArray->lastOverflowedXid = InvalidTransactionId;
 		procArray->replication_slot_xmin = InvalidTransactionId;
 		procArray->replication_slot_catalog_xmin = InvalidTransactionId;
@@ -4552,10 +4549,9 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
  * pointer.  This wouldn't require any lock at all, except that on machines
  * with weak memory ordering we need to be careful that other processors
  * see the array element changes before they see the head pointer change.
- * We handle this by using a spinlock to protect reads and writes of the
- * head/tail pointers.  (We could dispense with the spinlock if we were to
- * create suitable memory access barrier primitives and use those instead.)
- * The spinlock must be taken to read or write the head/tail pointers unless
+ * We handle this by using a memory barrier to protect writes of the
+ * head pointer.
+ * The memory barrier is issued before writing the head pointer unless
  * the caller holds ProcArrayLock exclusively.
  *
  * Algorithmic analysis:
@@ -4599,7 +4595,6 @@ KnownAssignedXidsCompress(bool force)
 	int			compress_index;
 	int			i;
 
-	/* no spinlock required since we hold ProcArrayLock exclusively */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
 
@@ -4685,7 +4680,7 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 
 	/*
 	 * Since only the startup process modifies the head/tail pointers, we
-	 * don't need a lock to read them here.
+	 * can safely read them here.
 	 */
 	head = pArray->headKnownAssignedXids;
 	tail = pArray->tailKnownAssignedXids;
@@ -4743,21 +4738,20 @@ KnownAssignedXidsAdd(TransactionId from_xid, TransactionId to_xid,
 	pArray->numKnownAssignedXids += nxids;
 
 	/*
-	 * Now update the head pointer.  We use a spinlock to protect this
+	 * Now update the head pointer.  We use a memory barrier to protect this
 	 * pointer, not because the update is likely to be non-atomic, but to
 	 * ensure that other processors see the above array updates before they
 	 * see the head pointer change.
 	 *
 	 * If we're holding ProcArrayLock exclusively, there's no need to take the
-	 * spinlock.
+	 * barrier.
 	 */
 	if (exclusive_lock)
 		pArray->headKnownAssignedXids = head;
 	else
 	{
-		SpinLockAcquire(&pArray->known_assigned_xids_lck);
+		pg_write_barrier();
 		pArray->headKnownAssignedXids = head;
-		SpinLockRelease(&pArray->known_assigned_xids_lck);
 	}
 }
 
@@ -4780,20 +4774,8 @@ KnownAssignedXidsSearch(TransactionId xid, bool remove)
 	int			tail;
 	int			result_index = -1;
 
-	if (remove)
-	{
-		/* we hold ProcArrayLock exclusively, so no need for spinlock */
-		tail = pArray->tailKnownAssignedXids;
-		head = pArray->headKnownAssignedXids;
-	}
-	else
-	{
-		/* take spinlock to ensure we see up-to-date array contents */
-		SpinLockAcquire(&pArray->known_assigned_xids_lck);
-		tail = pArray->tailKnownAssignedXids;
-		head = pArray->headKnownAssignedXids;
-		SpinLockRelease(&pArray->known_assigned_xids_lck);
-	}
+	tail = pArray->tailKnownAssignedXids;
+	head = pArray->headKnownAssignedXids;
 
 	/*
 	 * Standard binary search.  Note we can ignore the KnownAssignedXidsValid
@@ -5031,13 +5013,9 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 	 * cannot enter and then leave the array while we hold ProcArrayLock.  We
 	 * might miss newly-added xids, but they should be >= xmax so irrelevant
 	 * anyway.
-	 *
-	 * Must take spinlock to ensure we see up-to-date array contents.
 	 */
-	SpinLockAcquire(&procArray->known_assigned_xids_lck);
 	tail = procArray->tailKnownAssignedXids;
 	head = procArray->headKnownAssignedXids;
-	SpinLockRelease(&procArray->known_assigned_xids_lck);
 
 	for (i = tail; i < head; i++)
 	{
@@ -5084,10 +5062,8 @@ KnownAssignedXidsGetOldestXmin(void)
 	/*
 	 * Fetch head just once, since it may change while we loop.
 	 */
-	SpinLockAcquire(&procArray->known_assigned_xids_lck);
 	tail = procArray->tailKnownAssignedXids;
 	head = procArray->headKnownAssignedXids;
-	SpinLockRelease(&procArray->known_assigned_xids_lck);
 
 	for (i = tail; i < head; i++)
 	{
-- 
2.16.2.windows.1