Re: Improve WALRead() to suck data directly from WAL buffers when possible

Bharath Rupireddy Tue, 07 Feb 2023 20:27:53 -0800

On Tue, Feb 7, 2023 at 4:12 PM Dilip Kumar <[email protected]> wrote:
>
> On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
> <[email protected]> wrote:
>
> I have gone through this patch, I have some comments (mostly cosmetic
> and comments)


Thanks a lot for reviewing.

> From the above comments, it appears that we are reading from the exact
> pointer we are interested to read, but actually, we are reading
> the complete page.  I think this comment needs to be fixed and we can
> also explain why we read the complete page here.

I modified it. Please see the attached v3 patch.

> Generally, we use the name 'recptr' to represent XLogRecPtr type of
> variable, but in your case, it is actually data at that recptr, so
> better use some other name like 'buf' or 'buffer'.

Changed it to use 'data' as it seemed more appropriate than just a
buffer to not confuse with the WAL buffer page.

> 3.
> + if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
> + {
> + }
> + else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
> + {
>
> Why do you have this 'else if ((recptr + nbytes) > (page +
> XLOG_BLCKSZ))' check in the else part? why it is not directly else
> without a condition in 'if'?

Changed.

> I think we do not need 2 separate variables 'count' and '*read_bytes',
> just one variable for input/output is sufficient.  The original value
> can always be stored in some temp variable
> instead of the function argument.

We could do that, but for the sake of readability and not cluttering
the API, I kept it as-is.

Besides addressing the above review comments, I've made some more
changes - 1) I optimized the patch a bit by removing an extra memcpy.
Up until v2 patch, the entire WAL buffer page is returned and the
caller takes what is wanted from it. This adds an extra memcpy, so I
changed it to avoid extra memcpy and just copy what is wanted. 2) I
improved the comments.

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and WAL files. Given the fact
that we won't wait for a lock or do a time-taking task while reading
from buffers, it seems unnecessary.
2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.
3. Use the oldest initialized WAL buffer page to quickly tell if the
given LSN is present in WAL buffers without taking any lock - right
now, WALBufMappingLock is acquired to do so. While this doesn't seem
to impact much, it's good to optimize it away. But, the oldest
initialized WAL buffer page isn't tracked, so I've put up a patch and
sent in another thread [2]. Irrespective of [2], we are still good
with what we have in this patch.

[1]
recovery tests:
PATCHED: WAL buffers hit - 14759, misses - 3371

subscription tests:
PATCHED: WAL buffers hit - 1972, misses - 32616

pg_walinspect tests:
PATCHED: WAL buffers hit - 8, misses - 8

[2] 
https://www.postgresql.org/message-id/CALj2ACVgi6LirgLDZh%3DFdfdvGvKAD%3D%3DWTOSWcQy%3DAtNgPDVnKw%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From e6670c4098930f87e9b4dce2f21e73e0c0ce9361 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <[email protected]>
Date: Wed, 8 Feb 2023 03:48:32 +0000
Subject: [PATCH v3] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 178 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  47 ++++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 229 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..e4679d42f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,10 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static Size XLogReadFromBuffersGuts(XLogRecPtr ptr,
+									TimeLineID tli,
+									Size count,
+									char *page);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1643,180 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and return the read bytes.
+ */
+static Size
+XLogReadFromBuffersGuts(XLogRecPtr ptr,
+						TimeLineID tli,
+						Size count,
+						char *buf)
+{
+	XLogRecPtr	expectedEndPtr;
+	XLogRecPtr	endptr;
+	int 	idx;
+	Size	nread = 0;
+
+	idx = XLogRecPtrToBufIdx(ptr);
+	expectedEndPtr = ptr;
+	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as less work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return nread;
+
+	endptr = XLogCtl->xlblocks[idx];
+
+	if (expectedEndPtr == endptr)
+	{
+		char	*page;
+		char    *data;
+		XLogPageHeader	phdr;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + count) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(buf, data, count);
+			nread = count;
+		}
+		else
+		{
+			Size nremaining;
+
+			/*
+			 * All the bytes are not in one page. Compute remaining bytes on
+			 * the current page, copy them over to output buffer.
+			 */
+			nremaining = XLOG_BLCKSZ - (data - page);
+			memcpy(buf, data, nremaining);
+			nread = nremaining;
+		}
+
+		/*
+		 * Release the lock as early as possible to avoid creating any possible
+		 * contention.
+		 */
+		LWLockRelease(WALBufMappingLock);
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer().
+		 *
+		 * However, we perform basic page header checks for ensuring that we
+		 * are not reading a page that just got initialized. Callers will
+		 * anyway perform extensive page-level and record-level checks.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+		{
+			/*
+			 * WAL buffer page doesn't look valid, so return as if nothing was
+			 * read.
+			 */
+			nread = 0;
+		}
+	}
+	else
+	{
+		/* We have found nothing. */
+		LWLockRelease(WALBufMappingLock);
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(nread <= count);
+
+	return nread;
+}
+
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and set the read bytes to
+ * 'read_bytes'.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. The caller must be aware of
+ * this and deal with it.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf,
+					Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char    *dst;
+	Size    nbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	while (nbytes > 0)
+	{
+		Size	nread;
+
+		nread = XLogReadFromBuffersGuts(ptr, tli, nbytes, dst);
+
+		if (nread == 0)
+		{
+			/* We read nothing. */
+			break;
+		}
+		else if (nread == nbytes)
+		{
+			/* We read all the requested bytes. */
+			*read_bytes += nread;
+			break;
+		}
+		else if (nread < nbytes)
+		{
+			/*
+			 * We read some of the requested bytes. Continue to read remaining
+			 * bytes.
+			 */
+			ptr += nread;
+			nbytes -= nread;
+			dst += nread;
+			*read_bytes += nread;
+		}
+	}
+
+	elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+		 *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aa6c929477..244e92908c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1497,6 +1496,50 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        read_bytes;
+
+	/*
+	 * When possible, read WAL from WAL buffers. We skip this step and continue
+	 * the usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
+		pgstat_report_wait_end();
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == read_bytes)
+		{
+			/* Buffer hit, so return. */
+			return true;
+		}
+		else if (read_bytes > 0 && count > read_bytes)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += read_bytes;
+			startptr += read_bytes;
+			count -= read_bytes;
+		}
+
+		/* Buffer miss i.e., read_bytes = 0, so continue */
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..c9941aa001 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf,
+								Size *read_bytes);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

Re: Improve WALRead() to suck data directly from WAL buffers when possible

Reply via email to