Hi,

On 2023-11-02 22:38:38 +0530, Bharath Rupireddy wrote:
> From 5b5469d7dcd8e98bfcaf14227e67356bbc1f5fe8 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com>
> Date: Thu, 2 Nov 2023 15:10:51 +0000
> Subject: [PATCH v14] Track oldest initialized WAL buffer page
>
> ---
>  src/backend/access/transam/xlog.c | 170 ++++++++++++++++++++++++++++++
>  src/include/access/xlog.h         |   1 +
>  2 files changed, 171 insertions(+)
>
> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c
> index b541be8eec..fdf2ef310b 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -504,6 +504,45 @@ typedef struct XLogCtlData
>       XLogRecPtr *xlblocks;           /* 1st byte ptr-s + XLOG_BLCKSZ */
>       int                     XLogCacheBlck;  /* highest allocated xlog 
> buffer index */
>
> +     /*
> +      * Start address of oldest initialized page in XLog buffers.
> +      *
> +      * We mainly track oldest initialized page explicitly to quickly tell 
> if a
> +      * given WAL record is available in XLog buffers. It also can be used 
> for
> +      * other purposes, see notes below.
> +      *
> +      * OldestInitializedPage gives XLog buffers following properties:
> +      *
> +      * 1) At any given point of time, pages in XLog buffers array are sorted
> +      * in an ascending order from OldestInitializedPage till 
> InitializedUpTo.
> +      * Note that we verify this property for assert-only builds, see
> +      * IsXLogBuffersArraySorted() for more details.

This is true - but also not quite, if you read it a bit too literally.  The
entries in xlblocks itself obviously aren't in ascending order once the
buffer-index mapping wraps around between
XLogRecPtrToBufIdx(OldestInitializedPage) and
XLogRecPtrToBufIdx(InitializedUpTo).
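
To make the wrap-around concrete, here's a tiny standalone illustration
(hypothetical numbers, not code from the patch): with four buffers, the page
numbers stored in xlblocks are ascending only if you start the scan at the
index the oldest page maps to and wrap around, not in raw array order.

#include <stdio.h>

#define NUM_BUFFERS 4

int
main(void)
{
	/* hypothetical WAL page numbers currently held in xlblocks[] */
	int		xlblocks[NUM_BUFFERS] = {4, 5, 2, 3};
	int		oldest_idx = 2;		/* index the oldest initialized page maps to */

	/* raw array order: prints "4 5 2 3" - not ascending */
	for (int i = 0; i < NUM_BUFFERS; i++)
		printf("%d ", xlblocks[i]);
	printf("\n");

	/* starting at oldest_idx and wrapping: prints "2 3 4 5" - ascending */
	for (int i = 0; i < NUM_BUFFERS; i++)
		printf("%d ", xlblocks[(oldest_idx + i) % NUM_BUFFERS]);
	printf("\n");

	return 0;
}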


> +      * 2) OldestInitializedPage is monotonically increasing (by virtue of 
> how
> +      * postgres generates WAL records), that is, its value never decreases.
> +      * This property lets someone read its value without a lock. There's no
> +      * problem even if its value is slightly stale i.e. concurrently being
> +      * updated. One can still use it for finding if a given WAL record is
> +      * available in XLog buffers. At worst, one might get false positives
> +      * (i.e. OldestInitializedPage may tell that the WAL record is available
> +      * in XLog buffers, but when one actually looks at it, it isn't really
> +      * available). This is more efficient and performant than acquiring a 
> lock
> +      * for reading. Note that we may not need a lock to read
> +      * OldestInitializedPage but we need to update it holding
> +      * WALBufMappingLock.

I'd
s/may not need/do not need/

But perhaps rephrase it a bit more, to something like:

To update OldestInitializedPage, WALBufMappingLock needs to be held
exclusively; for reading it, no lock is required.


> +      *
> +      * 3) One can start traversing XLog buffers from OldestInitializedPage
> +      * till InitializedUpTo to list out all valid WAL records and stats, and
> +      * expose them via SQL-callable functions to users.
> +      *
> +      * 4) XLog buffers array is inherently organized as a circular, sorted 
> and
> +      * rotated array with OldestInitializedPage as pivot with the property
> +      * where LSN of previous buffer page (if valid) is greater than
> +      * OldestInitializedPage and LSN of next buffer page (if valid) is 
> greater
> +      * than OldestInitializedPage.
> +      */
> +     XLogRecPtr      OldestInitializedPage;

It seems a bit odd to name an LSN-containing variable *Page...


>       /*
>        * InsertTimeLineID is the timeline into which new WAL is being inserted
>        * and flushed. It is zero during recovery, and does not change once 
> set.
> @@ -590,6 +629,10 @@ static ControlFileData *ControlFile = NULL;
>  #define NextBufIdx(idx)              \
>               (((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
>
> +/* Macro to retreat to previous buffer index. */
> +#define PreviousBufIdx(idx)          \
> +             (((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))

I think it might be worth making these inline functions and adding assertions
that idx is not bigger than XLogCtl->XLogCacheBlck?
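
Something along these lines perhaps (untested sketch, same semantics as the
existing macros):

static inline int
NextBufIdx(int idx)
{
	Assert(idx >= 0 && idx <= XLogCtl->XLogCacheBlck);

	return (idx == XLogCtl->XLogCacheBlck) ? 0 : idx + 1;
}

static inline int
PreviousBufIdx(int idx)
{
	Assert(idx >= 0 && idx <= XLogCtl->XLogCacheBlck);

	return (idx == 0) ? XLogCtl->XLogCacheBlck : idx - 1;
}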


>  /*
>   * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
>   * would hold if it was in cache, the page containing 'recptr'.
> @@ -708,6 +751,10 @@ static void WALInsertLockAcquireExclusive(void);
>  static void WALInsertLockRelease(void);
>  static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
>
> +#ifdef USE_ASSERT_CHECKING
> +static bool IsXLogBuffersArraySorted(void);
> +#endif
> +
>  /*
>   * Insert an XLOG record represented by an already-constructed chain of data
>   * chunks.  This is a low-level routine; to construct the WAL record header
> @@ -1992,6 +2039,52 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, 
> bool opportunistic)
>               XLogCtl->InitializedUpTo = NewPageEndPtr;
>
>               npages++;
> +
> +             /*
> +              * Try updating oldest initialized XLog buffer page.
> +              *
> +              * Update it if we are initializing an XLog buffer page for the 
> first
> +              * time or if XLog buffers are full and we are wrapping around.
> +              */
> +             if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
> +                     XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
> +             {
> +                     Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
> +
> +                     XLogCtl->OldestInitializedPage = NewPageBeginPtr;
> +             }

Wait, isn't this too late?  At this point the buffer can already be used by
GetXLogBuffer().  I think this needs to happen at the latest just before
		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
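
I.e. the update block quoted above would have to move up, roughly like this
(just a sketch, keeping the patch's own condition and assignment, Assert
elided):

		/*
		 * Make the OldestInitializedPage update visible before the new page
		 * is published to readers via xlblocks.
		 */
		if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
			XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
			XLogCtl->OldestInitializedPage = NewPageBeginPtr;

		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;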


Why is it legal to get here with XLogCtl->OldestInitializedPage being invalid?


> +
> +/*
> + * Returns whether or not a given WAL record is available in XLog buffers.
> + *
> + * Note that we don't read OldestInitializedPage under a lock, see 
> description
> + * near its definition in xlog.c for more details.
> + *
> + * Note that caller needs to pass in an LSN known to the server, not a future
> + * or unwritten or unflushed LSN.
> + */
> +bool
> +IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn)
> +{
> +     if (!XLogRecPtrIsInvalid(lsn) &&
> +             !XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) &&
> +             lsn >= XLogCtl->OldestInitializedPage &&
> +             lsn < XLogCtl->InitializedUpTo)
> +     {
> +             return true;
> +     }
> +
> +     return false;
> +}
> diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
> index a14126d164..35235010e6 100644
> --- a/src/include/access/xlog.h
> +++ b/src/include/access/xlog.h
> @@ -261,6 +261,7 @@ extern void ReachedEndOfBackup(XLogRecPtr EndRecPtr, 
> TimeLineID tli);
>  extern void SetInstallXLogFileSegmentActive(void);
>  extern bool IsInstallXLogFileSegmentActive(void);
>  extern void XLogShutdownWalRcv(void);
> +extern bool IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn);
>
>  /*
>   * Routines to start, stop, and get status of a base backup.
> --
> 2.34.1
>

> From db027d8f1dcb53ebceef0135287f120acf67cc21 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com>
> Date: Thu, 2 Nov 2023 15:36:11 +0000
> Subject: [PATCH v14] Allow WAL reading from WAL buffers
>
> This commit adds WALRead() the capability to read WAL from WAL
> buffers when possible. When requested WAL isn't available in WAL
> buffers, the WAL is read from the WAL file as usual. It relies on
> WALBufMappingLock so that no one replaces the WAL buffer page that
> we're reading from. It skips reading from WAL buffers if
> WALBufMappingLock can't be acquired immediately. In other words,
> it doesn't wait for WALBufMappingLock to be available. This helps
> reduce the contention on WALBufMappingLock.
>
> This commit benefits the callers of WALRead(), that are walsenders
> and pg_walinspect. They can now avoid reading WAL from the WAL
> file (possibly avoiding disk IO). Tests show that the WAL buffers
> hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
> standby, with pgbench --scale=300 --client=32 --time=900. In other
> words, the walsenders avoided 95% of the time reading from the
> file/avoided pread system calls:
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
>
> This commit also benefits when direct IO is enabled for WAL.
> Reading WAL from WAL buffers puts back the performance close to
> that of without direct IO for WAL:
> https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com
>
> This commit also paves the way for the following features in
> future:
> - Improves synchronous replication performance by replicating
> directly from WAL buffers.
> - An opt-in way for the walreceivers to receive unflushed WAL.
> More details here:
> https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de
>
> Author: Bharath Rupireddy
> Reviewed-by: Dilip Kumar, Andres Freund
> Reviewed-by: Nathan Bossart, Kuntal Ghosh
> Discussion: 
> https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
> ---
>  src/backend/access/transam/xlog.c       | 205 ++++++++++++++++++++++++
>  src/backend/access/transam/xlogreader.c |  41 ++++-
>  src/include/access/xlog.h               |   6 +
>  3 files changed, 250 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c
> index fdf2ef310b..ff5dccaaa7 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -1753,6 +1753,211 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
>       return cachedPos + ptr % XLOG_BLCKSZ;
>  }

>

> +/*
> + * Read WAL from WAL buffers.
> + *
> + * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at 
> location
> + * 'startptr', on timeline 'tli' and return total read bytes.
> + *
> + * This function returns quickly in the following cases:
> + * - When passed-in timeline is different than server's current insertion
> + * timeline as WAL is always inserted into WAL buffers on insertion timeline.
> + * - When server is in recovery as WAL buffers aren't currently used in
> + * recovery.
> + *
> + * Note that this function reads as much as it can from WAL buffers, meaning,
> + * it may not read all the requested 'count' bytes. Caller must be aware of
> + * this and deal with it.
> + *
> + * Note that this function is not available for frontend code as WAL buffers 
> is
> + * an internal mechanism to the server.
>
> + */
> +Size
> +XLogReadFromBuffers(XLogReaderState *state,
> +                                     XLogRecPtr startptr,
> +                                     TimeLineID tli,
> +                                     Size count,
> +                                     char *buf)
> +{
> +     XLogRecPtr      ptr;
> +     XLogRecPtr      cur_lsn;
> +     Size            nbytes;
> +     Size            ntotal;
> +     Size            nbatch;
> +     char       *batchstart;
> +
> +     if (RecoveryInProgress())
> +             return 0;
>
> +     if (tli != GetWALInsertionTimeLine())
> +             return 0;
> +
> +     Assert(!XLogRecPtrIsInvalid(startptr));
> +
> +     cur_lsn = GetFlushRecPtr(NULL);
> +     if (unlikely(startptr > cur_lsn))
> +             elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
> +                      LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));

Hm, why does this check belong here? For some tools it might be legitimate to
read the WAL before it was fully flushed.



> +     /*
> +      * Holding WALBufMappingLock ensures inserters don't overwrite this 
> value
> +      * while we are reading it. We try to acquire it in shared mode so that
> +      * the concurrent WAL readers are also allowed. We try to do as less 
> work
> +      * as possible while holding the lock to not impact concurrent WAL 
> writers
> +      * much. We quickly exit to not cause any contention, if the lock isn't
> +      * immediately available.
> +      */
> +     if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> +             return 0;

That seems problematic - that lock is often heavily contended.  We could
instead check IsWALRecordAvailableInXLogBuffers() once before reading the
page, then read the page contents *without* holding a lock, and then check
IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
interim we will have read bogus data, but since the recheck detects that and
the read is discarded, the only cost is a bit of wasted effort.
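
In pseudo-C, inside the per-page loop, something like this (just a sketch: it
copies page-by-page rather than batching as the patch does, nread is a
hypothetical count of bytes wanted from the current page, and the memory
barriers needed around the recheck are ignored):

		/* cheap pre-check, no lock taken */
		if (!IsWALRecordAvailableInXLogBuffers(ptr))
			break;

		/* copy what we want from this page without holding any lock */
		memcpy(buf + ntotal, page + ptr % XLOG_BLCKSZ, nread);

		/*
		 * Recheck after the copy: if the buffer page was replaced in the
		 * meantime, what we copied is bogus, so don't count it and let the
		 * caller fall back to reading from the WAL file.
		 */
		if (!IsWALRecordAvailableInXLogBuffers(ptr))
			break;

		ntotal += nread;
		ptr += nread;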


> +     ptr = startptr;
> +     nbytes = count;                         /* Total bytes requested to be read by caller. */
> +     ntotal = 0;                                     /* Total bytes read. */
> +     nbatch = 0;                                     /* Bytes to be read in single batch. */
> +     batchstart = NULL;                      /* Location to read from for single batch. */

What does "batch" mean?


> +     while (nbytes > 0)
> +     {
> +             XLogRecPtr      expectedEndPtr;
> +             XLogRecPtr      endptr;
> +             int                     idx;
> +             char       *page;
> +             char       *data;
> +             XLogPageHeader phdr;
> +
> +             idx = XLogRecPtrToBufIdx(ptr);
> +             expectedEndPtr = ptr;
> +             expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
> +             endptr = XLogCtl->xlblocks[idx];
> +
> +             /* Requested WAL isn't available in WAL buffers. */
> +             if (expectedEndPtr != endptr)
> +                     break;
> +
> +             /*
> +              * We found WAL buffer page containing given XLogRecPtr. Get 
> starting
> +              * address of the page and a pointer to the right location of 
> given
> +              * XLogRecPtr in that page.
> +              */
> +             page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
> +             data = page + ptr % XLOG_BLCKSZ;
> +
> +             /*
> +              * The fact that we acquire WALBufMappingLock while reading the 
> WAL
> +              * buffer page itself guarantees that no one else initializes 
> it or
> +              * makes it ready for next use in AdvanceXLInsertBuffer(). 
> However, we
> +              * need to ensure that we are not reading a page that just got
> +              * initialized. For this, we look at the needed page header.
> +              */
> +             phdr = (XLogPageHeader) page;
> +
> +             /* Return, if WAL buffer page doesn't look valid. */
> +             if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> +                       phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> +                       phdr->xlp_tli == tli))
> +                     break;

I don't think this code should ever encounter a page where this is not the
case?  And we particularly shouldn't skip such a page silently - that seems
like it could hide all kinds of problems.
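
If anything, a mismatch here should probably be loud, something like (sketch,
the exact message is just an example):

		if (phdr->xlp_magic != XLOG_PAGE_MAGIC ||
			phdr->xlp_pageaddr != ptr - (ptr % XLOG_BLCKSZ) ||
			phdr->xlp_tli != tli)
			elog(ERROR, "unexpected page header in WAL buffer %d while reading WAL at %X/%X",
				 idx, LSN_FORMAT_ARGS(ptr));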



> +             /*
> +              * Note that we don't perform all page header checks here to 
> avoid
> +              * extra work in production builds; callers will anyway do those
> +              * checks extensively. However, in an assert-enabled build, we 
> perform
> +              * all the checks here and raise an error if failed.
> +              */

Why?


> +             /* Count what is wanted, not the whole page. */
> +             if ((data + nbytes) <= (page + XLOG_BLCKSZ))
> +             {
> +                     /* All the bytes are in one page. */
> +                     nbatch += nbytes;
> +                     ntotal += nbytes;
> +                     nbytes = 0;
> +             }
> +             else
> +             {
> +                     Size            navailable;
> +
> +                     /*
> +                      * All the bytes are not in one page. Deduce available bytes on
> +                      * the current page, count them and continue to look for remaining
> +                      * bytes.
> +                      */
s/deduce/deduct/? Perhaps better "subtract"?


Greetings,

Andres Freund

