Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-04-24 Thread Michael Paquier
On Wed, Apr 24, 2024 at 09:46:20PM +0530, Bharath Rupireddy wrote:
> Thanks. I started a new thread
> https://www.postgresql.org/message-id/CALj2ACVfF2Uj9NoFy-5m98HNtjHpuD17EDE9twVeJng-jTAe7A%40mail.gmail.com.

Cool, thanks.
--
Michael


signature.asc
Description: PGP signature


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-04-24 Thread Bharath Rupireddy
On Thu, Apr 11, 2024 at 6:31 AM Michael Paquier  wrote:
>
> On Tue, Apr 09, 2024 at 09:33:49AM +0300, Andrey M. Borodin wrote:
> > As far as I understand CF entry [0] is committed? I understand that
> > there are some open followups, but I just want to determine correct
> > CF item status...
>
> So much work has happened on this thread with things that has been
> committed, so switching the entry to committed makes sense to me.  I
> have just done that.
>
> Bharath, could you create a new thread with the new things you are
> proposing?  All that should be v18 work, particularly v27-0002:
> https://www.postgresql.org/message-id/CALj2ACWCibnX2jcnRreBHFesFeQ6vbKiFstML=w-jvtvukd...@mail.gmail.com

Thanks. I started a new thread
https://www.postgresql.org/message-id/CALj2ACVfF2Uj9NoFy-5m98HNtjHpuD17EDE9twVeJng-jTAe7A%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-04-10 Thread Michael Paquier
On Tue, Apr 09, 2024 at 09:33:49AM +0300, Andrey M. Borodin wrote:
> As far as I understand CF entry [0] is committed? I understand that
> there are some open followups, but I just want to determine correct
> CF item status... 

So much work has happened on this thread with things that has been
committed, so switching the entry to committed makes sense to me.  I
have just done that.

Bharath, could you create a new thread with the new things you are
proposing?  All that should be v18 work, particularly v27-0002:
https://www.postgresql.org/message-id/CALj2ACWCibnX2jcnRreBHFesFeQ6vbKiFstML=w-jvtvukd...@mail.gmail.com
--
Michael


signature.asc
Description: PGP signature


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-04-09 Thread Andrey M. Borodin



> On 8 Apr 2024, at 08:17, Bharath Rupireddy 
>  wrote:

Hi Bharath!

As far as I understand CF entry [0] is committed? I understand that there are 
some open followups, but I just want to determine correct CF item status...

Thanks!


Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/47/4060/



Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-04-07 Thread Bharath Rupireddy
On Thu, Mar 21, 2024 at 11:33 PM Bharath Rupireddy
 wrote:
>
> Please find the v26 patches after rebasing.

Commit f3ff7bf83b added a check in WALReadFromBuffers to ensure the
requested WAL is not past the WAL that's copied to WAL buffers. So,
I've dropped v26-0001 patch.

I've attached v27 patches for further review.

0001 adds a test module to demonstrate reading from WAL buffers
patterns like the caller ensuring the requested WAL is fully copied to
WAL buffers using WaitXLogInsertionsToFinish and an implementation of
xlogreader page_read
callback to read unflushed/not-yet-flushed WAL directly from WAL buffers.

0002 Use WALReadFromBuffers in more places like for logical
walsenders, logical decoding functions, backends reading WAL.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 77c4cc3320a9c2982b5fdac9bd51a892ca644fae Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 8 Apr 2024 05:02:07 +
Subject: [PATCH v27 1/2] Add test module to demonstrate reading from WAL 
 buffers patterns

This commit adds a test module to demonstrate a few patterns for
reading from WAL buffers using WALReadFromBuffers added by commit
91f2cae7a4e.

1. This module contains a test function to read the WAL that's
fully copied to WAL buffers. Whether or not the WAL is fully
copied to WAL buffers is ensured by WaitXLogInsertionsToFinish
before WALReadFromBuffers.

2. This module contains an implementation of xlogreader page_read
callback to read unflushed/not-yet-flushed WAL directly from WAL
buffers.
---
 src/backend/access/transam/xlog.c |   3 +-
 src/backend/access/transam/xlogreader.c   |   3 +-
 src/include/access/xlog.h |   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 .../modules/read_wal_from_buffers/.gitignore  |   4 +
 .../modules/read_wal_from_buffers/Makefile|  23 ++
 .../modules/read_wal_from_buffers/meson.build |  33 ++
 .../read_wal_from_buffers--1.0.sql|  37 ++
 .../read_wal_from_buffers.c   | 318 ++
 .../read_wal_from_buffers.control |   4 +
 .../read_wal_from_buffers/t/001_basic.pl  | 111 ++
 12 files changed, 536 insertions(+), 3 deletions(-)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e3fb26f5ab..4d4a840c8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -700,7 +700,6 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 	  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 			  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1496,7 +1495,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * uninitialized page), and the inserter might need to evict an old WAL buffer
  * to make room for a new one, which in turn requires WALWriteLock.
  */
-static XLogRecPtr
+XLogRecPtr
 WaitXLogInsertionsToFinish(XLogRecPtr upto)
 {
 	uint64		bytepos;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 37d2a57961..12dddf64cc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1033,7 +1033,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..74606a6846 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,7 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 extern Size WALReadFromBuffers(char *dstbuf, XLogRecPtr 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-03-21 Thread Bharath Rupireddy
On Tue, Feb 20, 2024 at 11:40 AM Bharath Rupireddy
 wrote:
>
> Ran pgperltidy on the new TAP test file added. Please see the attached
> v25 patch set.

Please find the v26 patches after rebasing.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v26-0001-Add-check-in-WALReadFromBuffers-against-requeste.patch
Description: Binary data


v26-0002-Add-test-module-for-verifying-read-from-WAL-buff.patch
Description: Binary data


v26-0003-Use-WALReadFromBuffers-in-more-places.patch
Description: Binary data


v26-0004-Demonstrate-reading-unflushed-WAL-directly-from-.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-19 Thread Bharath Rupireddy
On Sat, Feb 17, 2024 at 10:27 AM Bharath Rupireddy
 wrote:
>
> On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis  wrote:
> >
> > > Here, I'm with v23 patch set:
> >
> > Thank you, I'll look at these.
>
> Thanks. Here's the v24 patch set after rebasing.

Ran pgperltidy on the new TAP test file added. Please see the attached
v25 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v25-0001-Add-check-in-WALReadFromBuffers-against-requeste.patch
Description: Binary data


v25-0002-Add-test-module-for-verifying-read-from-WAL-buff.patch
Description: Binary data


v25-0003-Use-WALReadFromBuffers-in-more-places.patch
Description: Binary data


v25-0004-Demonstrate-reading-unflushed-WAL-directly-from-.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-16 Thread Bharath Rupireddy
On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis  wrote:
>
> > Here, I'm with v23 patch set:
>
> Thank you, I'll look at these.

Thanks. Here's the v24 patch set after rebasing.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From d0317ed91b1483a5556c87388e0186462711e022 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 17 Feb 2024 04:41:29 +
Subject: [PATCH v24 4/4] Demonstrate reading unflushed WAL directly from WAL
 buffers

---
 src/backend/access/transam/xlogreader.c   |   3 +-
 .../read_wal_from_buffers--1.0.sql|  23 ++
 .../read_wal_from_buffers.c   | 266 +-
 .../read_wal_from_buffers/t/001_basic.pl  |  35 +++
 4 files changed, 325 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ae9904e7e4..4658a86997 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1035,7 +1035,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
index 82fa097d10..72d05522fc 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -12,3 +12,26 @@ CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
 bytes_read OUT int)
 AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
 LANGUAGE C STRICT;
+
+--
+-- get_wal_records_info_from_buffers()
+--
+-- SQL function to get info of WAL records available in WAL buffers.
+--
+CREATE FUNCTION get_wal_records_info_from_buffers(IN start_lsn pg_lsn,
+IN end_lsn pg_lsn,
+OUT start_lsn pg_lsn,
+OUT end_lsn pg_lsn,
+OUT prev_lsn pg_lsn,
+OUT xid xid,
+OUT resource_manager text,
+OUT record_type text,
+OUT record_length int4,
+OUT main_data_length int4,
+OUT fpi_length int4,
+OUT description text,
+OUT block_ref text
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'get_wal_records_info_from_buffers'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
index 9df5c07b4b..ed33a14127 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -14,11 +14,27 @@
 #include "postgres.h"
 
 #include "access/xlog.h"
-#include "fmgr.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 
 PG_MODULE_MAGIC;
 
+static int	read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+  int reqLen, XLogRecPtr targetRecPtr,
+  char *cur_page);
+
+static XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+static void GetWALRecordInfo(XLogReaderState *record, Datum *values,
+			 bool *nulls, uint32 ncols);
+static void GetWALRecordsInfo(FunctionCallInfo fcinfo,
+			  XLogRecPtr start_lsn,
+			  XLogRecPtr end_lsn);
+
 /*
  * SQL function to read WAL from WAL buffers. Returns number of bytes read.
  */
@@ -52,3 +68,251 @@ read_wal_from_buffers(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(read);
 }
+
+/*
+ * XLogReaderRoutine->page_read callback for reading WAL from WAL buffers.
+ */
+static int
+read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+	  int reqLen, XLogRecPtr targetRecPtr,
+	  char *cur_page)
+{
+	XLogRecPtr	read_upto,
+loc;
+	TimeLineID	tli = GetWALInsertionTimeLine();
+	Size		count;
+	Size		read = 0;
+
+	loc = targetPagePtr + reqLen;
+
+	/* Loop waiting for xlog to be available if necessary */
+	while (1)
+	{
+		read_upto = GetXLogInsertRecPtr();
+
+		if (loc <= read_upto)
+			break;
+
+		WaitXLogInsertionsToFinish(loc);
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1000L);
+	}
+
+	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+	{
+		/*
+		 * more than one block available; read only that block, have caller
+		 * come back if they need more.
+		 */
+		count = XLOG_BLCKSZ;
+	}
+	else if (targetPagePtr + reqLen > read_upto)
+	{
+		/* not enough data there */
+		return -1;
+	}
+	else
+	{
+		/* enough bytes available to satisfy the request */
+		count = read_upto - targetPagePtr;
+	}
+
+	/* read WAL from WAL 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-16 Thread Jeff Davis
On Fri, 2024-02-16 at 13:08 +0530, Bharath Rupireddy wrote:
> I'd suggest we strike a balance here - error out in assert builds if
> startptr+count is past the current insert position and trust the
> callers for production builds.

It's not reasonable to have divergent behavior between assert-enabled
builds and production. I think for now I will just commit the Assert as
Andres suggested until we work out a few more details.

One idea is to use Álvaro's work to eliminate the spinlock, and then
add a variable to represent the last known point returned by
WaitXLogInsertionsToFinish(). Then we can cheaply Assert that the
caller requested something before that point.

> Here, I'm with v23 patch set:

Thank you, I'll look at these.

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-15 Thread Bharath Rupireddy
On Wed, Feb 14, 2024 at 6:59 AM Jeff Davis  wrote:
>
> Attached 2 patches.
>
> Per Andres's suggestion, 0001 adds an:
>   Assert(startptr + count <= LogwrtResult.Write)
>
> Though if we want to allow the caller (e.g. in an extension) to
> determine the valid range, perhaps using WaitXLogInsertionsToFinish(),
> then the check is wrong.

Right.

> Maybe we should just get rid of that code
> entirely and trust the caller to request a reasonable range?

I'd suggest we strike a balance here - error out in assert builds if
startptr+count is past the current insert position and trust the
callers for production builds. It has a couple of advantages over
doing just Assert(startptr + count <= LogwrtResult.Write):
1) It allows the caller to read unflushed WAL directly from WAL
buffers, see the attached 0005 for an example.
2) All the existing callers where WALReadFromBuffers() is thought to
be used are ensuring WAL availability by reading upto the flush
position so no problem with it.

Also, a note before WALRead() stating the caller must request the WAL
at least that's written out (upto LogwrtResult.Write). I'm not so sure
about this, perhaps, we don't need this comment at all.

Here, I'm with v23 patch set:

0001 - Adds assertion in WALReadFromBuffers() to ensure the requested
WAL isn't beyond the current insert position.
0002 - Adds a new test module to demonstrate how one can use
WALReadFromBuffers() ensuring WaitXLogInsertionsToFinish() if need be.
0003 - Uses WALReadFromBuffers in more places like logical walsenders
and backends.
0004 - Removes zero-padding related stuff as discussed in
https://www.postgresql.org/message-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3=aeezpjmqhpeobg...@mail.gmail.com.
This is needed in this patch set otherwise the assertion added in 0001
fails after 0003.
0005 - Adds a page_read callback for reading from WAL buffers in the
new test module added in 0002. Also, adds tests.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 5c0f95acda494904d02593e6ba305717b61c44b5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 16 Feb 2024 06:54:53 +
Subject: [PATCH v23 2/5] Add test module for verifying read from WAL buffers

---
 src/test/modules/Makefile |  1 +
 src/test/modules/meson.build  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 ++
 .../modules/read_wal_from_buffers/Makefile| 23 ++
 .../modules/read_wal_from_buffers/meson.build | 33 +
 .../read_wal_from_buffers--1.0.sql| 14 
 .../read_wal_from_buffers.c   | 54 ++
 .../read_wal_from_buffers.control |  4 ++
 .../read_wal_from_buffers/t/001_basic.pl  | 72 +++
 9 files changed, 206 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..864a3dd72b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..4f3dd69e58 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -33,6 +33,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 00..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-13 Thread Andres Freund
Hi,

On 2024-02-12 17:33:24 -0800, Jeff Davis wrote:
> On Mon, 2024-02-12 at 15:36 -0800, Andres Freund wrote:
> >
> > It doesn't really seem like a necessary, or even particularly useful,
> > part. You couldn't just call WALRead() for that, since the caller
> > would need
> > to know the range up to which WAL is valid but not yet flushed as
> > well. Thus
> > the caller would need to first use WaitXLogInsertionsToFinish() or
> > something
> > like it anyway - and then there's no point in doing the WALRead()
> > anymore.
>
> I follow until the last part. Did you mean "and then there's no point
> in doing the WaitXLogInsertionsToFinish() in WALReadFromBuffers()
> anymore"?

Yes, not sure what happened in my brain there.


> For now, should I assert that the requested WAL data is before the
> Flush pointer or assert that it's before the Write pointer?

Yes, I think that'd be good.


> > Note that for replicating unflushed data, we *still* might need to
> > fall back
> > to reading WAL data from disk. In which case not asserting in
> > WALRead() would
> > just make it hard to find bugs, because not using
> > WaitXLogInsertionsToFinish()
> > would appear to work as long as data is in wal buffers, but as soon
> > as we'd
> > fall back to on-disk (but unflushed) data, we'd send bogus WAL.
>
> That makes me wonder whether my previous idea[1] might matter: when
> some buffers have been evicted, should WALReadFromBuffers() keep going
> through the loop and return the end portion of the requested data
> rather than the beginning?

I still doubt that that will help very often, but it'll take some
experimentation to figure it out, I guess.


> We can sort that out when we get closer to replicating unflushed WAL.

+1

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-13 Thread Jeff Davis
On Tue, 2024-02-13 at 22:47 +0530, Bharath Rupireddy wrote:
> c) If planning to replicate unflushed data, then ensure all the
> WALRead() callers wait until startptr+count is past the current
> insert
> position with WaitXLogInsertionsToFinish().

WALRead() can't read past the Write pointer, so there's no point in
calling WaitXLogInsertionsToFinish(), right?


Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-13 Thread Jeff Davis
Attached 2 patches.

Per Andres's suggestion, 0001 adds an:
  Assert(startptr + count <= LogwrtResult.Write)

Though if we want to allow the caller (e.g. in an extension) to
determine the valid range, perhaps using WaitXLogInsertionsToFinish(),
then the check is wrong. Maybe we should just get rid of that code
entirely and trust the caller to request a reasonable range?

On Mon, 2024-02-12 at 17:33 -0800, Jeff Davis wrote:
> That makes me wonder whether my previous idea[1] might matter: when
> some buffers have been evicted, should WALReadFromBuffers() keep
> going
> through the loop and return the end portion of the requested data
> rather than the beginning?
> [1] 
> https://www.postgresql.org/message-id/2b36bf99e762e65db0dafbf8d338756cf5fa6ece.ca...@j-davis.com

0002 is to illustrate the above idea. It's a strange API so I don't
intend to commit it in this form, but I think we will ultimately need
to do something like it when we want to replicate unflushed data.

The idea is that data past the Write pointer is always (and only)
available in the WAL buffers, so WALReadFromBuffers() should always
return it. That way we can always safely fall through to ordinary
WALRead(), which can only see before the Write pointer. There's also
data before the Write pointer that could be in the WAL buffers, and we
might as well copy that, too, if it's not evicted.

If some buffers are evicted, it will fill in the *end* of the buffer,
leaving a gap at the beginning. The nice thing is that if there is any
gap, it will be before the Write pointer, so we can always fall back to
WALRead() to fill the gap and it should always succeed.

Regards,
Jeff Davis

From f890362e9f5cefd04ee3f9406c7fcafb6a277e45 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 13 Feb 2024 11:17:08 -0800
Subject: [PATCH 1/2] Add assert to WALReadFromBuffers().

Per suggestion from Andres.
---
 src/backend/access/transam/xlog.c | 24 ++--
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e14c242b1..50c347a679 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1710,12 +1710,13 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * of bytes read successfully.
  *
  * Fewer than 'count' bytes may be read if some of the requested WAL data has
- * already been evicted from the WAL buffers, or if the caller requests data
- * that is not yet available.
+ * already been evicted.
  *
  * No locks are taken.
  *
- * The 'tli' argument is only used as a convenient safety check so that
+ * Caller should ensure that it reads no further than LogwrtResult.Write
+ * (which should have been updated by the caller when determining how far to
+ * read). The 'tli' argument is only used as a convenient safety check so that
  * callers do not read from WAL buffers on a historical timeline.
  */
 Size
@@ -1724,26 +1725,13 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 {
 	char	   *pdst = dstbuf;
 	XLogRecPtr	recptr = startptr;
-	XLogRecPtr	upto;
-	Size		nbytes;
+	Size		nbytes = count;
 
 	if (RecoveryInProgress() || tli != GetWALInsertionTimeLine())
 		return 0;
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
-
-	/*
-	 * Don't read past the available WAL data.
-	 *
-	 * Check using local copy of LogwrtResult. Ordinarily it's been updated by
-	 * the caller when determining how far to read; but if not, it just means
-	 * we'll read less data.
-	 *
-	 * XXX: the available WAL could be extended to the WAL insert pointer by
-	 * calling WaitXLogInsertionsToFinish().
-	 */
-	upto = Min(startptr + count, LogwrtResult.Write);
-	nbytes = upto - startptr;
+	Assert(startptr + count <= LogwrtResult.Write);
 
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
-- 
2.34.1

From 6fc92fa74a033c881624177365afbbffc37ed873 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 13 Feb 2024 16:56:46 -0800
Subject: [PATCH 2/2] WALReadFromBuffers: read end of the requested range

---
 src/backend/access/transam/xlog.c   | 109 
 src/backend/replication/walsender.c |  14 ++--
 src/include/access/xlog.h   |   2 +-
 3 files changed, 70 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 50c347a679..ea55e1b77b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1706,11 +1706,21 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 }
 
 /*
- * Read WAL data directly from WAL buffers, if available. Returns the number
- * of bytes read successfully.
+ * Read WAL data directly from WAL buffers, if available.
  *
- * Fewer than 'count' bytes may be read if some of the requested WAL data has
- * already been evicted.
+ * Some pages in the requested range may already be evicted from the WAL
+ * buffers, in which case this function continues on and reads the *end* of
+ * 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-13 Thread Bharath Rupireddy
On Tue, Feb 13, 2024 at 5:06 AM Andres Freund  wrote:
>
> I doubt there's a sane way to use WALRead() without *first* ensuring that the
> range of data is valid. I think we're better of moving that responsibility
> explicitly to the caller and adding an assertion verifying that.
>
> It doesn't really seem like a necessary, or even particularly useful,
> part. You couldn't just call WALRead() for that, since the caller would need
> to know the range up to which WAL is valid but not yet flushed as well. Thus
> the caller would need to first use WaitXLogInsertionsToFinish() or something
> like it anyway - and then there's no point in doing the WALRead() anymore.
>
> Note that for replicating unflushed data, we *still* might need to fall back
> to reading WAL data from disk. In which case not asserting in WALRead() would
> just make it hard to find bugs, because not using WaitXLogInsertionsToFinish()
> would appear to work as long as data is in wal buffers, but as soon as we'd
> fall back to on-disk (but unflushed) data, we'd send bogus WAL.

Callers of WALRead() do a good amount of work to figure out what's
been flushed out but they read the un-flushed and/or invalid data see
the comment [1] around WALRead() call sites as well as a recent thread
[2] for more details.

IIUC, here's the summary of the discussion that has happened so far:
a) If only replicating flushed data, then ensure all the WALRead()
callers read how much ever is valid out of startptr+count. Fix
provided in [2] can help do that.
b) If only replicating flushed data, then ensure all the
WALReadFromBuffers() callers read how much ever is valid out of
startptr+count. Current and expected WALReadFromBuffers() callers will
anyway determine how much of it is flushed and can validly be read.
c) If planning to replicate unflushed data, then ensure all the
WALRead() callers wait until startptr+count is past the current insert
position with WaitXLogInsertionsToFinish().
d) If planning to replicate unflushed data, then ensure all the
WALReadFromBuffers() callers wait until startptr+count is past the
current insert position with WaitXLogInsertionsToFinish().

Adding an assertion or error in WALReadFromBuffers() for ensuring the
callers do follow the above set of rules is easy. We can just do
Assert(startptr+count <= LogwrtResult.Flush).

However, adding a similar assertion or error in WALRead() gets
trickier as it's being called from many places - walsenders, backends,
external tools etc. even when the server is in recovery. Therefore,
determining the actual valid LSN is a bit of a challenge.

What I think is the best way:
- Try and get the fix provided for (a) at [2].
- Implement both (c) and (d).
- Have the assertion in WALReadFromBuffers() ensuring the callers wait
until startptr+count is past the current insert position with
WaitXLogInsertionsToFinish().
- Have a comment around WALRead() to ensure the callers are requesting
the WAL that's written to the disk because it's hard to determine
what's written to disk as this gets called in many scenarios - when
server is in recovery, for walsummarizer etc.
- In the new test module, demonstrate how one can implement reading
unflushed data with WALReadFromBuffers() and/or WALRead() +
WaitXLogInsertionsToFinish().

Thoughts?

[1]
/*
 * Even though we just determined how much of the page can be validly read
 * as 'count', read the whole page anyway. It's guaranteed to be
 * zero-padded up to the page boundary if it's incomplete.
 */
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
 ))

[2] 
https://www.postgresql.org/message-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3%3DAEEZPJmqhpEOBGExg%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-13 Thread Bharath Rupireddy
On Tue, Feb 13, 2024 at 1:03 AM Jeff Davis  wrote:
>
> For 0004, we need to resolve why callers are using XLOG_BLCKSZ and we
> can fix that independently, as discussed here:
>
> https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hla_hsgwnx2y5-g+mswd...@mail.gmail.com

Thanks. I started a new thread for this -
https://www.postgresql.org/message-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3%3DAEEZPJmqhpEOBGExg%40mail.gmail.com.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-12 Thread Jeff Davis
On Mon, 2024-02-12 at 15:36 -0800, Andres Freund wrote:
> 
> It doesn't really seem like a necessary, or even particularly useful,
> part. You couldn't just call WALRead() for that, since the caller
> would need
> to know the range up to which WAL is valid but not yet flushed as
> well. Thus
> the caller would need to first use WaitXLogInsertionsToFinish() or
> something
> like it anyway - and then there's no point in doing the WALRead()
> anymore.

I follow until the last part. Did you mean "and then there's no point
in doing the WaitXLogInsertionsToFinish() in WALReadFromBuffers()
anymore"?

For now, should I assert that the requested WAL data is before the
Flush pointer or assert that it's before the Write pointer?

> Note that for replicating unflushed data, we *still* might need to
> fall back
> to reading WAL data from disk. In which case not asserting in
> WALRead() would
> just make it hard to find bugs, because not using
> WaitXLogInsertionsToFinish()
> would appear to work as long as data is in wal buffers, but as soon
> as we'd
> fall back to on-disk (but unflushed) data, we'd send bogus WAL.

That makes me wonder whether my previous idea[1] might matter: when
some buffers have been evicted, should WALReadFromBuffers() keep going
through the loop and return the end portion of the requested data
rather than the beginning?

We can sort that out when we get closer to replicating unflushed WAL.

Regards,
Jeff Davis

[1] 
https://www.postgresql.org/message-id/2b36bf99e762e65db0dafbf8d338756cf5fa6ece.ca...@j-davis.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-12 Thread Andres Freund
Hi,

On 2024-02-12 15:56:19 -0800, Jeff Davis wrote:
> On Mon, 2024-02-12 at 11:33 -0800, Jeff Davis wrote:
> > For 0002 & 0003, I'd like more clarity on how they will actually be
> > used by an extension.
>
> In patch 0002, I'm concerned about calling
> WaitXLogInsertionsToFinish(). It loops through all the locks, but
> doesn't have any early return path or advance any state.

I doubt it'd be too bad - we call that at much much higher frequency during
write heavy OLTP workloads (c.f. XLogFlush()).  It can be a performance issue
there, but only after increasing NUM_XLOGINSERT_LOCKS - before that the
limited number of writers is the limit.  Compared to that walsender shouldn't
be a significant factor.

However, I think it's a very bad idea to call WALReadFromBuffers() from
WALReadFromBuffers(). This needs to be at the caller, not down in
WALReadFromBuffers().

I don't see why we would want to weaken the error condition in
WaitXLogInsertionsToFinish() - I suspect it'd not work correctly to wait for
insertions that aren't yet in progress and it just seems like an API misuse.


> So if it's repeatedly called with the same or similar values it seems like
> it would be doing a lot of extra work.
>
> I'm not sure of the best fix. We could add something to LogwrtResult to
> track a new LSN that represents the highest known point where all
> inserters are finished (in other words, the latest return value of
> WaitXLogInsertionsToFinish()). That seems invasive, though.

FWIW, I think LogwrtResult is an anti-pattern, perhaps introduced due to
misunderstanding how cache coherency works. It's not fundamentally faster to
access non-shared memory. It'd make far more sense to allow lock-free access
to the shared LogwrtResult and

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-12 Thread Jeff Davis
On Mon, 2024-02-12 at 11:33 -0800, Jeff Davis wrote:
> For 0002 & 0003, I'd like more clarity on how they will actually be
> used by an extension.

In patch 0002, I'm concerned about calling
WaitXLogInsertionsToFinish(). It loops through all the locks, but
doesn't have any early return path or advance any state.

So if it's repeatedly called with the same or similar values it seems
like it would be doing a lot of extra work.

I'm not sure of the best fix. We could add something to LogwrtResult to
track a new LSN that represents the highest known point where all
inserters are finished (in other words, the latest return value of
WaitXLogInsertionsToFinish()). That seems invasive, though.

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-12 Thread Andres Freund
Hi,

On 2024-02-12 12:46:00 -0800, Jeff Davis wrote:
> On Mon, 2024-02-12 at 12:18 -0800, Andres Freund wrote:
> > +    upto = Min(startptr + count, LogwrtResult.Write);
> > +    nbytes = upto - startptr;
> >
> > Shouldn't it pretty much be a bug to ever encounter this?
>
> In the current code it's impossible, though Bharath hinted at an
> extension which could reach that path.
>
> What I committed was a bit of a compromise -- earlier versions of the
> patch supported reading right up to the Insert pointer (which requires
> a call to WaitXLogInsertionsToFinish()). I wasn't ready to commit that
> code without seeing a more about how that would be used, but I thought
> it was reasonable to have some simple code in there to allow reading up
> to the Write pointer.

I doubt there's a sane way to use WALRead() without *first* ensuring that the
range of data is valid. I think we're better of moving that responsibility
explicitly to the caller and adding an assertion verifying that.


> It seems closer to the structure that we will ultimately need to
> replicate unflushed data, right?

It doesn't really seem like a necessary, or even particularly useful,
part. You couldn't just call WALRead() for that, since the caller would need
to know the range up to which WAL is valid but not yet flushed as well. Thus
the caller would need to first use WaitXLogInsertionsToFinish() or something
like it anyway - and then there's no point in doing the WALRead() anymore.

Note that for replicating unflushed data, we *still* might need to fall back
to reading WAL data from disk. In which case not asserting in WALRead() would
just make it hard to find bugs, because not using WaitXLogInsertionsToFinish()
would appear to work as long as data is in wal buffers, but as soon as we'd
fall back to on-disk (but unflushed) data, we'd send bogus WAL.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-12 Thread Jeff Davis
On Mon, 2024-02-12 at 12:18 -0800, Andres Freund wrote:
> +    upto = Min(startptr + count, LogwrtResult.Write);
> +    nbytes = upto - startptr;
> 
> Shouldn't it pretty much be a bug to ever encounter this?

In the current code it's impossible, though Bharath hinted at an
extension which could reach that path.

What I committed was a bit of a compromise -- earlier versions of the
patch supported reading right up to the Insert pointer (which requires
a call to WaitXLogInsertionsToFinish()). I wasn't ready to commit that
code without seeing a more about how that would be used, but I thought
it was reasonable to have some simple code in there to allow reading up
to the Write pointer.

It seems closer to the structure that we will ultimately need to
replicate unflushed data, right?

Regards,
Jeff Davis

[1]
https://www.postgresql.org/message-id/calj2acw65mqn6ukv57sqdtmzajgd1n_adqtdgy+gmdqu6v6...@mail.gmail.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-12 Thread Andres Freund
Hi,

On 2024-02-12 11:33:24 -0800, Jeff Davis wrote:
> On Wed, 2024-01-31 at 14:30 +0530, Bharath Rupireddy wrote:
> > Please see the attached v22 patch set.
>
> Committed 0001.

Yay, I think this is very cool. There are plenty other improvements than can
be based on this...


One thing I'm a bit confused in the code is the following:

+/*
+ * Don't read past the available WAL data.
+ *
+ * Check using local copy of LogwrtResult. Ordinarily it's been updated by
+ * the caller when determining how far to read; but if not, it just means
+ * we'll read less data.
+ *
+ * XXX: the available WAL could be extended to the WAL insert pointer by
+ * calling WaitXLogInsertionsToFinish().
+ */
+upto = Min(startptr + count, LogwrtResult.Write);
+nbytes = upto - startptr;

Shouldn't it pretty much be a bug to ever encounter this? There aren't
equivalent checks in WALRead(), so any user of WALReadFromBuffers() that then
falls back to WALRead() is just going to send unwritten data.

ISTM that this should be an assertion or error.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-02-12 Thread Jeff Davis
On Wed, 2024-01-31 at 14:30 +0530, Bharath Rupireddy wrote:
> Please see the attached v22 patch set.

Committed 0001.

For 0002 & 0003, I'd like more clarity on how they will actually be
used by an extension.

For 0004, we need to resolve why callers are using XLOG_BLCKSZ and we
can fix that independently, as discussed here:

https://www.postgresql.org/message-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hla_hsgwnx2y5-g+mswd...@mail.gmail.com

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-31 Thread Bharath Rupireddy
On Wed, Jan 31, 2024 at 3:01 PM Alvaro Herrera  wrote:
>
> Looking at 0003, where an XXX comment is added about taking a spinlock
> to read LogwrtResult, I suspect the answer is probably not, because it
> is likely to slow down the other uses of LogwrtResult.

We avoided keeping LogwrtResult latest as the current callers for
WALReadFromBuffers() all determine the flush LSN using
GetFlushRecPtr(), see comment #4 from
https://www.postgresql.org/message-id/CALj2ACV%3DC1GZT9XQRm4iN1NV1T%3DhLA_hsGWNx2Y5-G%2BmSwdhNg%40mail.gmail.com.

> But I wonder if
> a better path forward would be to base further work on my older
> uncommitted patch to make LogwrtResult use atomics.  With that, you
> wouldn't have to block others in order to read the value.  I last posted
> that patch in [1] in case you're curious.
>
> [1] https://postgr.es/m/20220728065920.oleu2jzsatchakfj@alvherre.pgsql
>
> The reason I abandoned that patch is that the performance problem that I
> was fixing no longer existed -- it was fixed in a different way.

Nice. I'll respond in that thread.  FWIW, there's been a recent
attempt at turning unloggedLSN to 64-bit atomic -
https://commitfest.postgresql.org/46/4330/ and that might need
pg_atomic_monotonic_advance_u64. I guess we would have to bring your
patch and the unloggedLSN into a single thread to have a better
discussion.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-31 Thread Alvaro Herrera
Looking at 0003, where an XXX comment is added about taking a spinlock
to read LogwrtResult, I suspect the answer is probably not, because it
is likely to slow down the other uses of LogwrtResult.  But I wonder if
a better path forward would be to base further work on my older
uncommitted patch to make LogwrtResult use atomics.  With that, you
wouldn't have to block others in order to read the value.  I last posted
that patch in [1] in case you're curious.

[1] https://postgr.es/m/20220728065920.oleu2jzsatchakfj@alvherre.pgsql

The reason I abandoned that patch is that the performance problem that I
was fixing no longer existed -- it was fixed in a different way.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle."  (Larry Wall, Apocalypse 6)




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-31 Thread Bharath Rupireddy
On Tue, Jan 30, 2024 at 11:01 PM Alvaro Herrera  wrote:
>
> Hmm, this looks quite nice and simple.

Thanks for looking at it.

> My only comment is that a
> sequence like this
>
>/* Read from WAL buffers, if available. */
>rbytes = XLogReadFromBuffers(_message.data[output_message.len],
> startptr, nbytes, xlogreader->seg.ws_tli);
>output_message.len += rbytes;
>startptr += rbytes;
>nbytes -= rbytes;
>
>if (!WALRead(xlogreader,
> _message.data[output_message.len],
> startptr,
>
> leaves you wondering if WALRead() should be called at all or not, in the
> case when all bytes were read by XLogReadFromBuffers.  I think in many
> cases what's going to happen is that nbytes is going to be zero, and
> then WALRead is going to return having done nothing in its inner loop.
> I think this warrants a comment somewhere.  Alternatively, we could
> short-circuit the 'if' expression so that WALRead() is not called in
> that case (but I'm not sure it's worth the loss of code clarity).

It might help avoid a function call in case reading from WAL buffers
satisfies the read fully. And, it's not that clumsy with the change,
see following. I've changed it in the attached v22 patch set.

if (nbytes > 0 &&
!WALRead(xlogreader,

> Also, but this is really quite minor, it seems sad to add more functions
> with the prefix XLog, when we have renamed things to use the prefix WAL,
> and we have kept the old names only to avoid backpatchability issues.
> I mean, if we have WALRead() already, wouldn't it make perfect sense to
> name the new routine WALReadFromBuffers?

WALReadFromBuffers looks better. Used that in v22 patch.

Please see the attached v22 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v22-0004-Use-WALReadFromBuffers-in-more-places.patch
Description: Binary data


v22-0002-Allow-WALReadFromBuffers-to-wait-for-in-progress.patch
Description: Binary data


v22-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch
Description: Binary data


v22-0001-Add-WALReadFromBuffers.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-30 Thread Alvaro Herrera
Hmm, this looks quite nice and simple.  My only comment is that a
sequence like this

   /* Read from WAL buffers, if available. */
   rbytes = XLogReadFromBuffers(_message.data[output_message.len],
startptr, nbytes, xlogreader->seg.ws_tli);
   output_message.len += rbytes;
   startptr += rbytes;
   nbytes -= rbytes;

   if (!WALRead(xlogreader,
_message.data[output_message.len],
startptr,

leaves you wondering if WALRead() should be called at all or not, in the
case when all bytes were read by XLogReadFromBuffers.  I think in many
cases what's going to happen is that nbytes is going to be zero, and
then WALRead is going to return having done nothing in its inner loop.
I think this warrants a comment somewhere.  Alternatively, we could
short-circuit the 'if' expression so that WALRead() is not called in
that case (but I'm not sure it's worth the loss of code clarity).

Also, but this is really quite minor, it seems sad to add more functions
with the prefix XLog, when we have renamed things to use the prefix WAL,
and we have kept the old names only to avoid backpatchability issues.
I mean, if we have WALRead() already, wouldn't it make perfect sense to
name the new routine WALReadFromBuffers?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-27 Thread Bharath Rupireddy
On Sat, Jan 27, 2024 at 1:04 AM Jeff Davis  wrote:
>
> All of these things are functionally equivalent -- the same thing is
> happening at the end. This is just a discussion about API style and how
> that will interact with hypothetical callers that don't exist today.
> And it can also be easily changed later, so we aren't stuck with
> whatever decision happens here.

I'll leave that up to you. I'm okay either ways - 1) ensure the caller
doesn't use XLogReadFromBuffers, 2) XLogReadFromBuffers returning
as-if nothing was read when in recovery or on a different timeline.

> > Imagine, implementing an extension (may be for fun or learning or
> > educational or production purposes) to read unflushed WAL directly
> > from WAL buffers using XLogReadFromBuffers as page_read callback with
> > xlogreader facility.
>
> That makes sense, I didn't realize you intended to use this fron an
> extension. I'm fine considering that as a separate patch that could
> potentially be committed soon after this one.

Yes, I've turned that into 0002 patch.

> I'd like some more details, but can I please just commit the basic
> functionality now-ish?

+1.

> > Tried to keep wal_writer quiet with wal_writer_delay=1ms and
> > wal_writer_flush_after = 1GB to not to flush WAL in the background.
> > Also, disabled autovacuum, and set checkpoint_timeout to a higher
> > value. All of this is done to generate minimal WAL so that WAL
> > buffers
> > don't get overwritten. Do you see any problems with it?
>
> Maybe check it against pg_current_wal_lsn(), and see if the Write
> pointer moved ahead? Perhaps even have a (limited) loop that tries
> again to catch it at the right time?

Adding a loop seems to be reasonable here and done in v21-0003. Also,
I've added wal_level = minimal per
src/test/recovery/t/039_end_of_wal.pl introduced by commit bae868caf22
which also tries to keep WAL activity to minimum.

> > Can the WAL summarizer ever read the WAL on current TLI? I'm not so
> > sure about it, I haven't explored it in detail.
>
> Let's just not call XLogReadFromBuffers from there.

Removed.

PSA v21 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v21-0001-Add-XLogReadFromBuffers.patch
Description: Binary data


v21-0002-Allow-XLogReadFromBuffers-to-wait-for-in-progres.patch
Description: Binary data


v21-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch
Description: Binary data


v21-0004-Use-XLogReadFromBuffers-in-more-places.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-26 Thread Jeff Davis
On Fri, 2024-01-26 at 19:31 +0530, Bharath Rupireddy wrote:
> Are you suggesting to error out instead of returning 0?

We'd do neither of those things, because no caller should actually call
it while RecoveryInProgress() or on a different timeline.

>  How about
> returning a negative value instead of just 0 or returning true/false
> just like WALRead?

All of these things are functionally equivalent -- the same thing is
happening at the end. This is just a discussion about API style and how
that will interact with hypothetical callers that don't exist today.
And it can also be easily changed later, so we aren't stuck with
whatever decision happens here.

> 
> Imagine, implementing an extension (may be for fun or learning or
> educational or production purposes) to read unflushed WAL directly
> from WAL buffers using XLogReadFromBuffers as page_read callback with
> xlogreader facility.

That makes sense, I didn't realize you intended to use this fron an
extension. I'm fine considering that as a separate patch that could
potentially be committed soon after this one.

I'd like some more details, but can I please just commit the basic
functionality now-ish?

> Tried to keep wal_writer quiet with wal_writer_delay=1ms and
> wal_writer_flush_after = 1GB to not to flush WAL in the background.
> Also, disabled autovacuum, and set checkpoint_timeout to a higher
> value. All of this is done to generate minimal WAL so that WAL
> buffers
> don't get overwritten. Do you see any problems with it?

Maybe check it against pg_current_wal_lsn(), and see if the Write
pointer moved ahead? Perhaps even have a (limited) loop that tries
again to catch it at the right time?

> Can the WAL summarizer ever read the WAL on current TLI? I'm not so
> sure about it, I haven't explored it in detail.

Let's just not call XLogReadFromBuffers from there.

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-26 Thread Bharath Rupireddy
On Fri, Jan 26, 2024 at 8:31 AM Jeff Davis  wrote:
>
> > PSA v20 patch set.
>
> 0001 is very close. I have the following suggestions:
>
>   * Don't just return zero. If the caller is doing something we don't
> expect, we want to fix the caller. I understand you'd like this to be
> more like a transparent optimization, and we may do that later, but I
> don't think it's a good idea to do that now.

+if (RecoveryInProgress() ||
+tli != GetWALInsertionTimeLine())
+return ntotal;
+
+Assert(!XLogRecPtrIsInvalid(startptr));

Are you suggesting to error out instead of returning 0? If yes, I
disagree with it. Because, failure to read due to unmet pre-conditions
doesn't necessarily have to be to error out. If we error out, the
immediate failure we see is in the src/bin/psql TAP test for calling
XLogReadFromBuffers when the server is in recovery. How about
returning a negative value instead of just 0 or returning true/false
just like WALRead?

>   * There's currently no use for reading LSNs between Write and Insert,
> so remove the WaitXLogInsertionsToFinish() code path. That also means
> we don't need the extra emitLog parameter, so we can remove that. When
> we have a use case, we can bring it all back.

I disagree with this. I don't see anything wrong with
XLogReadFromBuffers having the capability to wait for in-progress
insertions to finish. In fact, it makes the function near-complete.
Imagine, implementing an extension (may be for fun or learning or
educational or production purposes) to read unflushed WAL directly
from WAL buffers using XLogReadFromBuffers as page_read callback with
xlogreader facility. AFAICT, I don't see a problem with
WaitXLogInsertionsToFinish logic in XLogReadFromBuffers.

FWIW, one important aspect of XLogReadFromBuffers is its ability to
read the unflushed WAL from WAL buffers. Also, see a note from Andres
here 
https://www.postgresql.org/message-id/20231109205836.zjoawdrn4q77yemv%40awork3.anarazel.de.

> If you agree, I can just make those adjustments (and do some final
> checking) and commit 0001. Otherwise let me know what you think.

Thanks. Please see my responses above.

> 0002: How does the test control whether the data requested is before
> the Flush pointer, the Write pointer, or the Insert pointer? What if
> the walwriter comes in and moves one of those pointers before the next
> statement is executed?

Tried to keep wal_writer quiet with wal_writer_delay=1ms and
wal_writer_flush_after = 1GB to not to flush WAL in the background.
Also, disabled autovacuum, and set checkpoint_timeout to a higher
value. All of this is done to generate minimal WAL so that WAL buffers
don't get overwritten. Do you see any problems with it?

> Also, do you think a test module is required for
> the basic functionality in 0001, or only when we start doing more
> complex things like reading past the Flush pointer?

With WaitXLogInsertionsToFinish in XLogReadFromBuffers, we have that
capability already in. Having a separate test module ensures the code
is tested properly.

As far as the test is concerned, it verifies 2 cases:
1. Check if WAL is successfully read from WAL buffers. For this, the
test generates minimal WAL and reads from WAL buffers from the start
LSN = current insert LSN captured before the WAL generation.
2. Check with a WAL that doesn't yet exist. For this, the test reads
from WAL buffers from the start LSN = current flush LSN+16MB (a
randomly chosen higher value).

> 0003: can you explain why this is useful for wal summarizer to read
> from the buffers?

Can the WAL summarizer ever read the WAL on current TLI? I'm not so
sure about it, I haven't explored it in detail.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-25 Thread Jeff Davis
On Thu, 2024-01-25 at 14:35 +0530, Bharath Rupireddy wrote:
> 
> "expecting zeros at the end" - this can't always be true as the WAL
> 
...

> I think this needs to be discussed separately. If okay, I'll start a
> new thread.

Thank you for investigating. When the above issue is handled, I'll be
more comfortable expanding the call sites for XLogReadFromBuffers().

> > Also, if we've detected that the first requested buffer has been
> > evicted, is there any value in continuing the loop to see if more
> > recent buffers are available? For example, if the requested LSNs
> > range
> > over buffers 4, 5, and 6, and 4 has already been evicted, should we
> > try
> > to return LSN data from 5 and 6 at the proper offset in the dest
> > buffer? If so, we'd need to adjust the API so the caller knows what
> > parts of the dest buffer were filled in.
> 
> I'd second this capability for now to keep the API simple and clear,
> but we can consider expanding it as needed.

Agreed. This case doesn't seem important; I just thought I'd ask about
it.

> If the callers use GetFlushRecPtr() to determine how far to read,
> LogwrtResult will be *reasonably* latest

It will be up-to-date enough that we'd never go through
WaitXLogInsertionsToFinish(), which is all we care about.

> As far as the current WAL readers are concerned, we don't need an
> explicit spinlock to determine LogwrtResult because all of them use
> GetFlushRecPtr() to determine how far to read. If there's any caller
> that's not updating LogwrtResult at all, we can consider reading
> LogwrtResult it ourselves in future.

So we don't actually need that path yet, right?

> 5. I think the two requirements specified at
> https://www.postgresql.org/message-id/20231109205836.zjoawdrn4q77yemv%40awork3.anarazel.de
> still hold with the v19j.

Agreed.

> PSA v20 patch set.

0001 is very close. I have the following suggestions:

  * Don't just return zero. If the caller is doing something we don't
expect, we want to fix the caller. I understand you'd like this to be
more like a transparent optimization, and we may do that later, but I
don't think it's a good idea to do that now.

  * There's currently no use for reading LSNs between Write and Insert,
so remove the WaitXLogInsertionsToFinish() code path. That also means
we don't need the extra emitLog parameter, so we can remove that. When
we have a use case, we can bring it all back.

If you agree, I can just make those adjustments (and do some final
checking) and commit 0001. Otherwise let me know what you think.

0002: How does the test control whether the data requested is before
the Flush pointer, the Write pointer, or the Insert pointer? What if
the walwriter comes in and moves one of those pointers before the next
statement is executed? Also, do you think a test module is required for
the basic functionality in 0001, or only when we start doing more
complex things like reading past the Flush pointer?

0003: can you explain why this is useful for wal summarizer to read
from the buffers?

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-25 Thread Bharath Rupireddy
On Tue, Jan 23, 2024 at 9:37 AM Jeff Davis  wrote:
>
> On Mon, 2024-01-22 at 12:12 -0800, Andres Freund wrote:
> > I still think that anything that requires such checks shouldn't be
> > merged. It's completely bogus to check page contents for validity
> > when we
> > should have metadata telling us which range of the buffers is valid
> > and which
> > not.
>
> The check seems entirely unnecessary, to me. A leftover from v18?
>
> I have attached a new patch (version "19j") to illustrate some of my
> previous suggestions. I didn't spend a lot of time on it so it's not
> ready for commit, but I believe my suggestions are easier to understand
> in code form.

> Note that, right now, it only works for XLogSendPhysical(). I believe
> it's best to just make it work for 1-3 callers that we understand well,
> and we can generalize later if it makes sense.

+1 to do it for XLogSendPhysical() first. Enabling it for others can
just be done as something like the attached v20-0003.

> I'm still not clear on why some callers are reading XLOG_BLCKSZ
> (expecting zeros at the end), and if it's OK to just change them to use
> the exact byte count.

"expecting zeros at the end" - this can't always be true as the WAL
can get flushed after determining the flush ptr before reading it from
the WAL file. FWIW, here's what I've tried previoulsy -
https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
the tests hit the Assert(false); added. Which means, the zero-padding
comment around WALRead() call-sites isn't quite right.

/*
 * Even though we just determined how much of the page can be validly read
 * as 'count', read the whole page anyway. It's guaranteed to be
 * zero-padded up to the page boundary if it's incomplete.
 */
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,

I think this needs to be discussed separately. If okay, I'll start a new thread.

> Also, if we've detected that the first requested buffer has been
> evicted, is there any value in continuing the loop to see if more
> recent buffers are available? For example, if the requested LSNs range
> over buffers 4, 5, and 6, and 4 has already been evicted, should we try
> to return LSN data from 5 and 6 at the proper offset in the dest
> buffer? If so, we'd need to adjust the API so the caller knows what
> parts of the dest buffer were filled in.

I'd second this capability for now to keep the API simple and clear,
but we can consider expanding it as needed.

I reviewed the v19j and attached v20 patch set:

1.
 * The caller must ensure that it's reasonable to read from the WAL buffers,
 * i.e. that the requested data is from the current timeline, that we're not
 * in recovery, etc.

I still think the XLogReadFromBuffers can just return in any of the
above cases instead of comments. I feel we must assume the caller is
going to ask the WAL from a different timeline and/or in recovery and
design the API to deal with it. Done that way in v20 patch.

2. Fixed some typos, reworded a few comments (i.e. used "current
insert/write position" instead of "Insert/Write pointer" like
elsewhere), ran pgindent.

3.
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.

Removed the above comment before WALRead() since we have that facility
now. Perhaps, we can say the callers can suck data directly from the
WAL buffers using XLogReadFromBuffers. But I have no strong opinion on
this.

4.
+ * Most callers will have already updated LogwrtResult when determining
+ * how far to read, but it's OK if it's out of date. (XXX: is it worth
+ * taking a spinlock to update LogwrtResult and check again before calling
+ * WaitXLogInsertionsToFinish()?)

If the callers use GetFlushRecPtr() to determine how far to read,
LogwrtResult will be *reasonably* latest, otherwise not. If
LogwrtResult is a bit old, XLogReadFromBuffers will call
WaitXLogInsertionsToFinish which will just loop over all insertion
locks and return.

As far as the current WAL readers are concerned, we don't need an
explicit spinlock to determine LogwrtResult because all of them use
GetFlushRecPtr() to determine how far to read. If there's any caller
that's not updating LogwrtResult at all, we can consider reading
LogwrtResult it ourselves in future.

5. I think the two requirements specified at
https://www.postgresql.org/message-id/20231109205836.zjoawdrn4q77yemv%40awork3.anarazel.de
still hold with the v19j.

5.1 Never allow WAL being read that's past
XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos) as it does not
exist.
5.2 If the to-be-read LSN is between XLogCtl->LogwrtResult->Write and
XLogBytePosToRecPtr(Insert->CurrBytePos) we need to call
WaitXLogInsertionsToFinish() before copying the data.

+if (upto > LogwrtResult.Write)
+{
+XLogRecPtr writtenUpto = WaitXLogInsertionsToFinish(upto, false);
+
+upto = Min(upto, writtenUpto);
+nbytes = upto - 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-22 Thread Jeff Davis
On Mon, 2024-01-22 at 12:12 -0800, Andres Freund wrote:
> I still think that anything that requires such checks shouldn't be
> merged. It's completely bogus to check page contents for validity
> when we
> should have metadata telling us which range of the buffers is valid
> and which
> not.

The check seems entirely unnecessary, to me. A leftover from v18?

I have attached a new patch (version "19j") to illustrate some of my
previous suggestions. I didn't spend a lot of time on it so it's not
ready for commit, but I believe my suggestions are easier to understand
in code form.

Note that, right now, it only works for XLogSendPhysical(). I believe
it's best to just make it work for 1-3 callers that we understand well,
and we can generalize later if it makes sense.

I'm still not clear on why some callers are reading XLOG_BLCKSZ
(expecting zeros at the end), and if it's OK to just change them to use
the exact byte count.

Also, if we've detected that the first requested buffer has been
evicted, is there any value in continuing the loop to see if more
recent buffers are available? For example, if the requested LSNs range
over buffers 4, 5, and 6, and 4 has already been evicted, should we try
to return LSN data from 5 and 6 at the proper offset in the dest
buffer? If so, we'd need to adjust the API so the caller knows what
parts of the dest buffer were filled in.

Regards,
Jeff Davis

From 34d89d6b869a454fd15097d8a0f0b6d3997a74da Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Mon, 22 Jan 2024 16:21:53 -0800
Subject: [PATCH v19j 1/2] Add XLogReadFromBuffers().

Allows reading directly from WAL buffers without a lock, avoiding the
need to wait for WAL flushing and read from the filesystem.

For now, the only caller is physical replication, but we can consider
expanding it to other callers as needed.

Author: Bharath Rupireddy
---
 src/backend/access/transam/xlog.c   | 145 ++--
 src/backend/replication/walsender.c |  14 +++
 src/include/access/xlog.h   |   2 +
 3 files changed, 152 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c9619139af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -698,7 +698,7 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 	  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 			  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1494,7 +1494,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * to make room for a new one, which in turn requires WALWriteLock.
  */
 static XLogRecPtr
-WaitXLogInsertionsToFinish(XLogRecPtr upto)
+WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog)
 {
 	uint64		bytepos;
 	XLogRecPtr	reservedUpto;
@@ -1521,9 +1521,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	 */
 	if (upto > reservedUpto)
 	{
-		ereport(LOG,
-(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
-		LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto;
+		if (emitLog)
+			ereport(LOG,
+	(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
+			LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto;
 		upto = reservedUpto;
 	}
 
@@ -1705,6 +1706,132 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL directly from WAL buffers, if available.
+ *
+ * This function reads 'count' bytes of WAL from WAL buffers into 'buf'
+ * starting at location 'startptr' and returns total bytes read.
+ *
+ * The bytes read may be fewer than requested if any of the WAL buffers in the
+ * requested range have been evicted, or if the last requested byte is beyond
+ * the Insert pointer.
+ *
+ * If reading beyond the Write pointer, this function will wait for concurent
+ * inserters to finish. Otherwise, it does not wait at all.
+ *
+ * The caller must ensure that it's reasonable to read from the WAL buffers,
+ * i.e. that the requested data is from the current timeline, that we're not
+ * in recovery, etc.
+ */
+Size
+XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count)
+{
+	XLogRecPtr	 ptr	= startptr;
+	XLogRecPtr	 upto	= startptr + count;
+	Size		 nbytes = count;
+	Size		 ntotal = 0;
+	char		*dst	= buf;
+
+	Assert(!RecoveryInProgress());
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	/*
+	 * Caller requested very recent WAL data. Wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 *
+	 * Most callers will have already updated LogwrtResult when 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-22 Thread Andres Freund
Hi,

On 2024-01-10 19:59:29 +0530, Bharath Rupireddy wrote:
> + /*
> +  * Typically, we must not read a WAL buffer page that just got
> +  * initialized. Because we waited enough for the in-progress WAL
> +  * insertions to finish above. However, there can exist a slight
> +  * window after the above wait finishes in which the read 
> buffer page
> +  * can get replaced especially under high WAL generation rates. 
> After
> +  * all, we are reading from WAL buffers without any locks here. 
> So,
> +  * let's not count such a page in.
> +  */
> + phdr = (XLogPageHeader) page;
> + if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> +   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> +   phdr->xlp_tli == tli))
> + break;

I still think that anything that requires such checks shouldn't be
merged. It's completely bogus to check page contents for validity when we
should have metadata telling us which range of the buffers is valid and which
not.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-22 Thread Melih Mutlu
Hi Bharath,

Thanks for working on this. It seems like a nice improvement to have.

Here are some comments on 0001 patch.

1-  xlog.c
+ /*
+ * Fast paths for the following reasons: 1) WAL buffers aren't in use when
+ * server is in recovery. 2) WAL is inserted into WAL buffers on current
+ * server's insertion TLI. 3) Invalid starting WAL location.
+ */

Shouldn't the comment be something like "2) WAL is *not* inserted into WAL
buffers on current server's insertion TLI" since the condition to break is tli
!= GetWALInsertionTimeLine()

2-
+ /*
+ * WAL being read doesn't yet exist i.e. past the current insert position.
+ */
+ if ((startptr + count) > reservedUpto)
+ return ntotal;

This question may not even make sense but I wonder whether we can read from
startptr only to reservedUpto in case of startptr+count exceeds
reservedUpto?

3-
+ /* Wait for any in-progress WAL insertions to WAL buffers to finish. */
+ if ((startptr + count) > LogwrtResult.Write &&
+ (startptr + count) <= reservedUpto)
+ WaitXLogInsertionsToFinish(startptr + count);

Do we need to check if (startptr + count) <= reservedUpto as we already
verified this condition a few lines above?

4-
+ Assert(nread > 0);
+ memcpy(dst, data, nread);
+
+ /*
+ * Make sure we don't read xlblocks down below before the page
+ * contents up above.
+ */
+ pg_read_barrier();
+
+ /* Recheck if the read page still exists in WAL buffers. */
+ endptr = pg_atomic_read_u64(>xlblocks[idx]);
+
+ /* Return if the page got initalized while we were reading it. */
+ if (expectedEndPtr != endptr)
+ break;
+
+ /*
+ * Typically, we must not read a WAL buffer page that just got
+ * initialized. Because we waited enough for the in-progress WAL
+ * insertions to finish above. However, there can exist a slight
+ * window after the above wait finishes in which the read buffer page
+ * can get replaced especially under high WAL generation rates. After
+ * all, we are reading from WAL buffers without any locks here. So,
+ * let's not count such a page in.
+ */
+ phdr = (XLogPageHeader) page;
+ if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+   phdr->xlp_tli == tli))
+ break;

I see that you recheck if the page still exists and so at the end. What
would you think about memcpy'ing only after being sure that we will need
and use the recently read data? If we break the loop during the recheck, we
simply discard the data read in the latest attempt. I guess that this may
not be a big deal but the data would be unnecessarily copied into the
destination in such a case.

5- xlogreader.c
+ nread = XLogReadFromBuffers(startptr, tli, count, buf);
+
+ if (nread > 0)
+ {
+ /*
+ * Check if its a full read, short read or no read from WAL buffers.
+ * For short read or no read, continue to read the remaining bytes
+ * from WAL file.
+ *
+ * XXX: It might be worth to expose WAL buffer read stats.
+ */
+ if (nread == count) /* full read */
+ return true;
+ else if (nread < count) /* short read */
+ {
+ buf += nread;
+ startptr += nread;
+ count -= nread;
+ }

Typo in the comment. Should be like "Check if *it's* a full read, short
read or no read from WAL buffers."

Also I don't think XLogReadFromBuffers() returns anything less than 0 and
more than count. Is verifying nread > 0 necessary? I think if nread does
not equal to count, we can simply assume that it's a short read. (or no
read at all in case nread is 0 which we don't need to handle specifically)

6-
+ /*
+ * We determined how much of the page can be validly read as 'count', read
+ * that much only, not the entire page. Since WALRead() can read the page
+ * from WAL buffers, in which case, the page is not guaranteed to be
+ * zero-padded up to the page boundary because of the concurrent
+ * insertions.
+ */

I'm not sure about pasting this into the most places we call WalRead().
Wouldn't it be better if we mention this somewhere around WALRead() only
once?


Best,
-- 
Melih Mutlu
Microsoft


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-19 Thread Jeff Davis
On Wed, 2024-01-10 at 19:59 +0530, Bharath Rupireddy wrote:
> I've addressed the above review comments and attached v19 patch-set.

Regarding:

-   if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-))
+   if (!WALRead(state, cur_page, targetPagePtr, count, tli,
))

I'd like to understand the reason it was using XLOG_BLCKSZ before. Was
it a performance optimization? Or was it to zero the remainder of the
caller's buffer (readBuf)? Or something else?

If it was to zero the remainder of the caller's buffer, then we should
explicitly make that the caller's responsibility.

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-10 Thread Bharath Rupireddy
On Fri, Jan 5, 2024 at 7:20 AM Jeff Davis  wrote:
>
> On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote:
> > Thanks. Attaching remaining patches as v18 patch-set after commits
> > c3a8e2a7cb16 and 766571be1659.
>
> Comments:

Thanks for reviewing.

> I still think the right thing for this patch is to call
> XLogReadFromBuffers() directly from the callers who need it, and not
> change WALRead(). I am open to changing this later, but for now that
> makes sense to me so that we can clearly identify which callers benefit
> and why. I have brought this up a few times before[1][2], so there must
> be some reason that I don't understand -- can you explain it?

IMO, WALRead() is the best place to have XLogReadFromBuffers() for 2
reasons: 1) All of the WALRead() callers (except FRONTEND tools) will
benefit if WAL is read from WAL buffers. I don't see any reason for a
caller to skip reading from WAL buffers. If there's a caller (in
future) wanting to skip reading from WAL buffers, I'm open to adding a
flag in XLogReaderState to skip.  2) The amount of code is reduced if
XLogReadFromBuffers() sits in WALRead().

> The XLogReadFromBuffersResult is never used. I can see how it might be
> useful for testing or asserts, but it's not used even in the test
> module. I don't think we should clutter the API with that kind of thing
> -- let's just return the nread.

Removed.

> I also do not like the terminology "partial hit" to be used in this
> way. Perhaps "short read" or something about hitting the end of
> readable WAL would be better?

"short read" seems good. Done that way in the new patch.

> I like how the callers of WALRead() are being more precise about the
> bytes they are requesting.
>
> You've added several spinlock acquisitions to the loop. Two explicitly,
> and one implicitly in WaitXLogInsertionsToFinish(). These may allow you
> to read slightly further, but introduce performance risk. Was this
> discussed?

I opted to read slightly further thinking that the loops aren't going
to get longer for spinlocks to appear costly. Basically, I wasn't sure
which approach was the best. Now that there's an opinion to keep them
outside, I'd agree with it. Done that way in the new patch.

> The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it
> seems like there's a risk of getting partially-written data? And it's
> not clear to me the check of the wal page headers is the right one
> anyway.
>
> It seems like all of this would be simpler if you checked first how far
> you can safely read data, and then just loop and read that far.  I'm not
> sure that it's worth it to try to mix the validity checks with the
> reading of the data.

XLogReadFromBuffers needs the page header check in after reading the
page from WAL buffers. Typically, we must not read a WAL buffer page
that just got initialized. Because we waited enough for the
in-progress WAL insertions to finish above. However, there can exist a
slight window after the above wait finishes in which the read buffer
page can get replaced especially under high WAL generation rates.
After all, we are reading from WAL buffers without any locks here. So,
let's not count such a page in.

I've addressed the above review comments and attached v19 patch-set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From e03af5726957437c15361bdb1b373fe8982f5c7c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 10 Jan 2024 14:12:17 +
Subject: [PATCH v19] Allow WAL reading from WAL buffers

This commit adds postgres the capability to read WAL from WAL
buffers. When requested WAL isn't available in WAL buffers, the
WAL is read from the WAL file as usual.

This commit benefits the callers of WALRead(), that are
walsenders, pg_walinspect etc. They all can now avoid reading WAL
from the WAL file (possibly avoiding disk IO). Tests show that the
WAL buffers hit ratio stood at 95% for 1 primary, 1 sync standby,
1 async standby, with pgbench --scale=300 --client=32 --time=900.
In other words, the walsenders avoided 95% of the time reading from
the file/avoided pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also benefits when direct IO is enabled for WAL.
Reading WAL from WAL buffers puts back the performance close to
that of without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit paves the way for the following features in future:
- Improves synchronous replication performance by replicating
directly from WAL buffers.
- A opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-04 Thread Jeff Davis
On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote:
> Thanks. Attaching remaining patches as v18 patch-set after commits
> c3a8e2a7cb16 and 766571be1659.

Comments:

I still think the right thing for this patch is to call
XLogReadFromBuffers() directly from the callers who need it, and not
change WALRead(). I am open to changing this later, but for now that
makes sense to me so that we can clearly identify which callers benefit
and why. I have brought this up a few times before[1][2], so there must
be some reason that I don't understand -- can you explain it?

The XLogReadFromBuffersResult is never used. I can see how it might be
useful for testing or asserts, but it's not used even in the test
module. I don't think we should clutter the API with that kind of thing
-- let's just return the nread.

I also do not like the terminology "partial hit" to be used in this
way. Perhaps "short read" or something about hitting the end of
readable WAL would be better?

I like how the callers of WALRead() are being more precise about the
bytes they are requesting.

You've added several spinlock acquisitions to the loop. Two explicitly,
and one implicitly in WaitXLogInsertionsToFinish(). These may allow you
to read slightly further, but introduce performance risk. Was this
discussed?

The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it
seems like there's a risk of getting partially-written data? And it's
not clear to me the check of the wal page headers is the right one
anyway.

It seems like all of this would be simpler if you checked first how far
you can safely read data, and then just loop and read that far. I'm not
sure that it's worth it to try to mix the validity checks with the
reading of the data.

Regards,
Jeff Davis

[1] 
https://www.postgresql.org/message-id/4132fe48f831ed6f73a9eb191af5fe475384969c.camel%40j-davis.com
[2]
https://www.postgresql.org/message-id/2ef04861c0f77e7ae78b703770cc2bbbac3d85e6.ca...@j-davis.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-12-20 Thread Bharath Rupireddy
On Fri, Dec 8, 2023 at 6:04 AM Jeff Davis  wrote:
>
> On Thu, 2023-12-07 at 15:59 +0530, Bharath Rupireddy wrote:
> > In the attached v17 patch
>
> The code for 0001 itself looks good. These are minor concerns and I am
> inclined to commit something like it fairly soon.

Thanks. Attaching remaining patches as v18 patch-set after commits
c3a8e2a7cb16 and 766571be1659.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v18-0001-Allow-WAL-reading-from-WAL-buffers.patch
Description: Binary data


v18-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-12-07 Thread Jeff Davis
On Thu, 2023-12-07 at 15:59 +0530, Bharath Rupireddy wrote:
> In the attached v17 patch

0001 could impact performance could be impacted in a few ways:

   * There's one additional write barrier inside
 AdvanceXLInsertBuffer()
   * AdvanceXLInsertBuffer() already holds WALBufMappingLock, so
 the atomic access inside of it is somewhat redundant
   * On some platforms, the XLogCtlData structure size will change

The patch has been out for a while and nobody seems concerned about
those things, and they look fine to me, so I assume these are not real
problems. I just wanted to highlight them.

Also, the description and the comments seem off. The patch does two
things: (a) make it possible to read a page without a lock, which means
we need to mark with InvalidXLogRecPtr while it's being initialized;
and (b) use 64-bit atomics to make it safer (or at least more
readable).

(a) feels like the most important thing, and it's a hard requirement
for the rest of the work, right?

(b) seems like an implementation choice, and I agree with it on
readability grounds.

Also:

+  * But it means that when we do this
+  * unlocked read, we might see a value that appears to be ahead of
the
+  * page we're looking for. Don't PANIC on that, until we've verified
the
+  * value while holding the lock.

Is that still true even without a torn read?

The code for 0001 itself looks good. These are minor concerns and I am
inclined to commit something like it fairly soon.

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-12-07 Thread Bharath Rupireddy
On Mon, Nov 13, 2023 at 7:02 PM Bharath Rupireddy
 wrote:
>
> On Fri, Nov 10, 2023 at 2:28 AM Andres Freund  wrote:
>
> > I think the code needs to make sure that *never* happens. That seems 
> > unrelated
> > to holding or not holding WALBufMappingLock. Even if the page header is
> > already valid, I don't think it's ok to just read/parse WAL data that's
> > concurrently being modified.
> >
> > We can never allow WAL being read that's past
> >   XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
> > as it does not exist.
>
> Agreed. Erroring out in XLogReadFromBuffers() if passed in WAL is past
> the CurrBytePos is an option. Another cleaner way is to just let the
> caller decide what it needs to do (retry or error out) - fill an error
> message in XLogReadFromBuffers() and return as-if nothing was read or
> return a special negative error code like XLogDecodeNextRecord so that
> the caller can deal with it.

In the attached v17 patch, I've ensured that the XLogReadFromBuffers
returns when the caller requests a WAL that's past the current insert
position at the moment.

> Also, reading CurrBytePos with insertpos_lck spinlock can come in the
> way of concurrent inserters. A possible way is to turn both
> CurrBytePos and PrevBytePos 64-bit atomics so that
> XLogReadFromBuffers() can read CurrBytePos without any lock atomically
> and leave it to the caller to deal with non-existing WAL reads.
>
> > And if the to-be-read LSN is between
> > XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
> > we need to call WaitXLogInsertionsToFinish() before copying the data.
>
> Agree to wait for all in-flight insertions to the pages we're about to
> read to finish. But, reading XLogCtl->LogwrtRqst.Write requires either
> XLogCtl->info_lck spinlock or WALWriteLock. Maybe turn
> XLogCtl->LogwrtRqst.Write a 64-bit atomic and read it without any
> lock, rely on
> WaitXLogInsertionsToFinish()'s return value i.e. if
> WaitXLogInsertionsToFinish() returns a value >= Insert->CurrBytePos,
> then go read that page from WAL buffers.

In the attached v17 patch, I've ensured that the XLogReadFromBuffers
waits for all in-progress insertions to finish when the caller
requests WAL that's past the current write position and before the
current insert position.

I've also ensured that the XLogReadFromBuffers returns special return
codes for various scenarios (when asked to read in recovery, read on a
different TLI, read a non-existent WAL and so on.) instead of it
erroring out. This gives flexibility to the caller to decide what to
do.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v17-0001-Use-64-bit-atomics-for-xlblocks-array-elements.patch
Description: Binary data


v17-0002-Allow-WAL-reading-from-WAL-buffers.patch
Description: Binary data


v17-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-13 Thread Bharath Rupireddy
On Fri, Nov 10, 2023 at 2:28 AM Andres Freund  wrote:
>
> On 2023-11-08 13:10:34 +0530, Bharath Rupireddy wrote:
> > > > + /*
> > > > +  * The fact that we acquire WALBufMappingLock while 
> > > > reading the WAL
> > > > +  * buffer page itself guarantees that no one else 
> > > > initializes it or
> > > > +  * makes it ready for next use in 
> > > > AdvanceXLInsertBuffer(). However, we
> > > > +  * need to ensure that we are not reading a page that 
> > > > just got
> > > > +  * initialized. For this, we look at the needed page 
> > > > header.
> > > > +  */
> > > > + phdr = (XLogPageHeader) page;
> > > > +
> > > > + /* Return, if WAL buffer page doesn't look valid. */
> > > > + if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > > > +   phdr->xlp_pageaddr == (ptr - (ptr % 
> > > > XLOG_BLCKSZ)) &&
> > > > +   phdr->xlp_tli == tli))
> > > > + break;
> > >
> > > I don't think this code should ever encounter a page where this is not the
> > > case?  We particularly shouldn't do so silently, seems that could hide all
> > > kinds of problems.
> >
> > I think it's possible to read a "just got initialized" page with the
> > new approach to read WAL buffer pages without WALBufMappingLock if the
> > page is read right after it is initialized and xlblocks is filled in
> > AdvanceXLInsertBuffer() but before actual WAL is written.
>
> I think the code needs to make sure that *never* happens. That seems unrelated
> to holding or not holding WALBufMappingLock. Even if the page header is
> already valid, I don't think it's ok to just read/parse WAL data that's
> concurrently being modified.
>
> We can never allow WAL being read that's past
>   XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
> as it does not exist.

Agreed. Erroring out in XLogReadFromBuffers() if passed in WAL is past
the CurrBytePos is an option. Another cleaner way is to just let the
caller decide what it needs to do (retry or error out) - fill an error
message in XLogReadFromBuffers() and return as-if nothing was read or
return a special negative error code like XLogDecodeNextRecord so that
the caller can deal with it.

Also, reading CurrBytePos with insertpos_lck spinlock can come in the
way of concurrent inserters. A possible way is to turn both
CurrBytePos and PrevBytePos 64-bit atomics so that
XLogReadFromBuffers() can read CurrBytePos without any lock atomically
and leave it to the caller to deal with non-existing WAL reads.

> And if the to-be-read LSN is between
> XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
> we need to call WaitXLogInsertionsToFinish() before copying the data.

Agree to wait for all in-flight insertions to the pages we're about to
read to finish. But, reading XLogCtl->LogwrtRqst.Write requires either
XLogCtl->info_lck spinlock or WALWriteLock. Maybe turn
XLogCtl->LogwrtRqst.Write a 64-bit atomic and read it without any
lock, rely on
WaitXLogInsertionsToFinish()'s return value i.e. if
WaitXLogInsertionsToFinish() returns a value >= Insert->CurrBytePos,
then go read that page from WAL buffers.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-09 Thread Andres Freund
Hi,

On 2023-11-08 13:10:34 +0530, Bharath Rupireddy wrote:
> > > + /*
> > > +  * The fact that we acquire WALBufMappingLock while reading 
> > > the WAL
> > > +  * buffer page itself guarantees that no one else 
> > > initializes it or
> > > +  * makes it ready for next use in AdvanceXLInsertBuffer(). 
> > > However, we
> > > +  * need to ensure that we are not reading a page that just 
> > > got
> > > +  * initialized. For this, we look at the needed page header.
> > > +  */
> > > + phdr = (XLogPageHeader) page;
> > > +
> > > + /* Return, if WAL buffer page doesn't look valid. */
> > > + if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > > +   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) 
> > > &&
> > > +   phdr->xlp_tli == tli))
> > > + break;
> >
> > I don't think this code should ever encounter a page where this is not the
> > case?  We particularly shouldn't do so silently, seems that could hide all
> > kinds of problems.
> 
> I think it's possible to read a "just got initialized" page with the
> new approach to read WAL buffer pages without WALBufMappingLock if the
> page is read right after it is initialized and xlblocks is filled in
> AdvanceXLInsertBuffer() but before actual WAL is written.

I think the code needs to make sure that *never* happens. That seems unrelated
to holding or not holding WALBufMappingLock. Even if the page header is
already valid, I don't think it's ok to just read/parse WAL data that's
concurrently being modified.

We can never allow WAL being read that's past
  XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
as it does not exist.

And if the to-be-read LSN is between
XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
we need to call WaitXLogInsertionsToFinish() before copying the data.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-07 Thread Bharath Rupireddy
On Sat, Nov 4, 2023 at 6:13 AM Andres Freund  wrote:
>
> > + cur_lsn = GetFlushRecPtr(NULL);
> > + if (unlikely(startptr > cur_lsn))
> > + elog(ERROR, "WAL start LSN %X/%X specified for reading from 
> > WAL buffers must be less than current database system WAL LSN %X/%X",
> > +  LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));
>
> Hm, why does this check belong here? For some tools it might be legitimate to
> read the WAL before it was fully flushed.

Agreed and removed the check.

> > + /*
> > +  * Holding WALBufMappingLock ensures inserters don't overwrite this 
> > value
> > +  * while we are reading it. We try to acquire it in shared mode so 
> > that
> > +  * the concurrent WAL readers are also allowed. We try to do as less 
> > work
> > +  * as possible while holding the lock to not impact concurrent WAL 
> > writers
> > +  * much. We quickly exit to not cause any contention, if the lock 
> > isn't
> > +  * immediately available.
> > +  */
> > + if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
> > + return 0;
>
> That seems problematic - that lock is often heavily contended.  We could
> instead check IsWALRecordAvailableInXLogBuffers() once before reading the
> page, then read the page contents *without* holding a lock, and then check
> IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
> interim we read bogus data, but that's a bit of a wasted effort.

In the new approach described upthread here
https://www.postgresql.org/message-id/c3455ab9da42e09ca9d059879b5c512b2d1f9681.camel%40j-davis.com,
there's no lock required for reading from WAL buffers. PSA patches for
more details.

> > + /*
> > +  * The fact that we acquire WALBufMappingLock while reading 
> > the WAL
> > +  * buffer page itself guarantees that no one else initializes 
> > it or
> > +  * makes it ready for next use in AdvanceXLInsertBuffer(). 
> > However, we
> > +  * need to ensure that we are not reading a page that just got
> > +  * initialized. For this, we look at the needed page header.
> > +  */
> > + phdr = (XLogPageHeader) page;
> > +
> > + /* Return, if WAL buffer page doesn't look valid. */
> > + if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
> > +   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
> > +   phdr->xlp_tli == tli))
> > + break;
>
> I don't think this code should ever encounter a page where this is not the
> case?  We particularly shouldn't do so silently, seems that could hide all
> kinds of problems.

I think it's possible to read a "just got initialized" page with the
new approach to read WAL buffer pages without WALBufMappingLock if the
page is read right after it is initialized and xlblocks is filled in
AdvanceXLInsertBuffer() but before actual WAL is written.

> > + /*
> > +  * Note that we don't perform all page header checks here to 
> > avoid
> > +  * extra work in production builds; callers will anyway do 
> > those
> > +  * checks extensively. However, in an assert-enabled build, 
> > we perform
> > +  * all the checks here and raise an error if failed.
> > +  */
>
> Why?

Minimal page header checks are performed to ensure we don't read the
page that just got initialized unlike what
XLogReaderValidatePageHeader(). Are you suggesting to remove page
header checks with XLogReaderValidatePageHeader() for assert-enabled
builds? Or are you suggesting to do page header checks with
XLogReaderValidatePageHeader() for production builds too?

PSA v16 patch set. Note that 0004 patch adds support for WAL read
stats (both from WAL file and WAL buffers) to walsenders and may not
necessarily the best approach to capture WAL read stats in light of
https://www.postgresql.org/message-id/CALj2ACU_f5_c8F%2BxyNR4HURjG%3DJziiz07wCpQc%3DAqAJUFh7%2B8w%40mail.gmail.com
which adds WAL read/write/fsync stats to pg_stat_io.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b678edfccf7ecf490cb792391249cbf85ba0db29 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 7 Nov 2023 19:20:00 +
Subject: [PATCH v16] Use 64-bit atomics for xlblocks array elements

In AdvanceXLInsertBuffer(), xlblocks value of a WAL buffer page is
updated only at the end after the page is initialized with all
zeros. A problem with this approach is that anyone reading
xlblocks and WAL buffer page without holding WALBufMappingLock
will see the wrong page contents if the read happens before the
xlblocks is marked with a new entry in AdvanceXLInsertBuffer() at
the end.

To fix this issue, xlblocks is made to use 64-bit atomics instead
of XLogRecPtr and the xlblocks value is marked with

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-04 Thread Bharath Rupireddy
On Sun, Nov 5, 2023 at 2:57 AM Jeff Davis  wrote:
>
> On Sat, 2023-11-04 at 20:55 +0530, Bharath Rupireddy wrote:
> > + XLogRecPtr  EndPtr =
> > pg_atomic_read_u64(>xlblocks[curridx]);
> > +
> > + /*
> > +  * xlblocks value can be InvalidXLogRecPtr before
> > the new WAL buffer
> > +  * page gets initialized in AdvanceXLInsertBuffer.
> > In such a case
> > +  * re-read the xlblocks value under the lock to
> > ensure the correct
> > +  * value is read.
> > +  */
> > + if (unlikely(XLogRecPtrIsInvalid(EndPtr)))
> > + {
> > + LWLockAcquire(WALBufMappingLock,
> > LW_EXCLUSIVE);
> > + EndPtr = pg_atomic_read_u64(
> > >xlblocks[curridx]);
> > + LWLockRelease(WALBufMappingLock);
> > + }
> > +
> > + Assert(!XLogRecPtrIsInvalid(EndPtr));
>
> Can that really happen? If the EndPtr is invalid, that means the page
> is in the process of being cleared, so the contents of the page are
> undefined at that time, right?

My initial thoughts were this way - xlblocks is being read without
holding WALBufMappingLock in XLogWrite() and since we write
InvalidXLogRecPtr to xlblocks array elements temporarily before
MemSet-ting the page in AdvanceXLInsertBuffer(), the PANIC "xlog write
request %X/%X is past end of log %X/%X" might get hit if EndPtr read
from xlblocks is InvalidXLogRecPtr. FWIW, an Assert(false); within the
if (unlikely(XLogRecPtrIsInvalid(EndPtr))) block didn't hit in make
check-world.

It looks like my above understanding isn't correct because it can
never happen that the page that's being written to the WAL file gets
initialized in AdvanceXLInsertBuffer(). I'll remove this piece of code
in next version of the patch unless there are any other thoughts.

[1]
/*
 * Within the loop, curridx is the cache block index of the page to
 * consider writing.  Begin at the buffer containing the next unwritten
 * page, or last partially written page.
 */
curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);

while (LogwrtResult.Write < WriteRqst.Write)
{
/*
 * Make sure we're not ahead of the insert process.  This could happen
 * if we're passed a bogus WriteRqst.Write that is past the end of the
 * last page that's been initialized by AdvanceXLInsertBuffer.
 */
XLogRecPtr  EndPtr = pg_atomic_read_u64(>xlblocks[curridx]);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-04 Thread Jeff Davis
On Sat, 2023-11-04 at 20:55 +0530, Bharath Rupireddy wrote:
> + XLogRecPtr  EndPtr =
> pg_atomic_read_u64(>xlblocks[curridx]);
> +
> + /*
> +  * xlblocks value can be InvalidXLogRecPtr before
> the new WAL buffer
> +  * page gets initialized in AdvanceXLInsertBuffer.
> In such a case
> +  * re-read the xlblocks value under the lock to
> ensure the correct
> +  * value is read.
> +  */
> + if (unlikely(XLogRecPtrIsInvalid(EndPtr)))
> + {
> + LWLockAcquire(WALBufMappingLock,
> LW_EXCLUSIVE);
> + EndPtr = pg_atomic_read_u64(
> >xlblocks[curridx]);
> + LWLockRelease(WALBufMappingLock);
> + }
> +
> + Assert(!XLogRecPtrIsInvalid(EndPtr));

Can that really happen? If the EndPtr is invalid, that means the page
is in the process of being cleared, so the contents of the page are
undefined at that time, right?

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-04 Thread Bharath Rupireddy
On Sat, Nov 4, 2023 at 1:17 AM Jeff Davis  wrote:
>
> > > I think it needs something like:
> > >
> > >   pg_atomic_write_u64(>xlblocks[nextidx],
> > > InvalidXLogRecPtr);
> > >   pg_write_barrier();
> > >
> > > before the MemSet.
> >
> > I think it works. First, xlblocks needs to be turned to an array of
> > 64-bit atomics and then the above change.
>
> Does anyone see a reason we shouldn't move to atomics here?
>
> >
> > pg_write_barrier();
> >
> > *((volatile XLogRecPtr *) >xlblocks[nextidx]) =
> > NewPageEndPtr;
>
> I am confused why the "volatile" is required on that line (not from
> your patch). I sent a separate message about that:
>
> https://www.postgresql.org/message-id/784f72ac09061fe5eaa5335cc347340c367c73ac.ca...@j-davis.com
>
> > I think the 3 things that helps read from WAL buffers without
> > WALBufMappingLock are: 1) couple of the read barriers in
> > XLogReadFromBuffers, 2) atomically initializing xlblocks[idx] to
> > InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
> > the following sanity check to see if the read page is valid in
> > XLogReadFromBuffers(). If it sounds sensible, I'll work towards
> > coding
> > it up. Thoughts?
>
> I like it. I think it will ultimately be a fairly simple loop. And by
> moving to atomics, we won't need the delicate comment in
> GetXLogBuffer().

I'm attaching the v15 patch set implementing the above ideas. Please
have a look.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v15-0001-Use-64-bit-atomics-for-xlblocks-array-elements.patch
Description: Binary data


v15-0002-Allow-WAL-reading-from-WAL-buffers.patch
Description: Binary data


v15-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-03 Thread Andres Freund
Hi,

On 2023-11-02 22:38:38 +0530, Bharath Rupireddy wrote:
> From 5b5469d7dcd8e98bfcaf14227e67356bbc1f5fe8 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy 
> Date: Thu, 2 Nov 2023 15:10:51 +
> Subject: [PATCH v14] Track oldest initialized WAL buffer page
>
> ---
>  src/backend/access/transam/xlog.c | 170 ++
>  src/include/access/xlog.h |   1 +
>  2 files changed, 171 insertions(+)
>
> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c
> index b541be8eec..fdf2ef310b 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -504,6 +504,45 @@ typedef struct XLogCtlData
>   XLogRecPtr *xlblocks;   /* 1st byte ptr-s + XLOG_BLCKSZ */
>   int XLogCacheBlck;  /* highest allocated xlog 
> buffer index */
>
> + /*
> +  * Start address of oldest initialized page in XLog buffers.
> +  *
> +  * We mainly track oldest initialized page explicitly to quickly tell 
> if a
> +  * given WAL record is available in XLog buffers. It also can be used 
> for
> +  * other purposes, see notes below.
> +  *
> +  * OldestInitializedPage gives XLog buffers following properties:
> +  *
> +  * 1) At any given point of time, pages in XLog buffers array are sorted
> +  * in an ascending order from OldestInitializedPage till 
> InitializedUpTo.
> +  * Note that we verify this property for assert-only builds, see
> +  * IsXLogBuffersArraySorted() for more details.

This is true - but also not, if you look at it a bit too literally. The
buffers in xlblocks itself obviously aren't ordered when wrapping around
between XLogRecPtrToBufIdx(OldestInitializedPage) and
XLogRecPtrToBufIdx(InitializedUpTo).


> +  * 2) OldestInitializedPage is monotonically increasing (by virtue of 
> how
> +  * postgres generates WAL records), that is, its value never decreases.
> +  * This property lets someone read its value without a lock. There's no
> +  * problem even if its value is slightly stale i.e. concurrently being
> +  * updated. One can still use it for finding if a given WAL record is
> +  * available in XLog buffers. At worst, one might get false positives
> +  * (i.e. OldestInitializedPage may tell that the WAL record is available
> +  * in XLog buffers, but when one actually looks at it, it isn't really
> +  * available). This is more efficient and performant than acquiring a 
> lock
> +  * for reading. Note that we may not need a lock to read
> +  * OldestInitializedPage but we need to update it holding
> +  * WALBufMappingLock.

I'd
s/may not need/do not need/

But perhaps rephrase it a bit more, to something like:

To update OldestInitializedPage, WALBufMappingLock needs to be held
exclusively, for reading no lock is required.


> +  *
> +  * 3) One can start traversing XLog buffers from OldestInitializedPage
> +  * till InitializedUpTo to list out all valid WAL records and stats, and
> +  * expose them via SQL-callable functions to users.
> +  *
> +  * 4) XLog buffers array is inherently organized as a circular, sorted 
> and
> +  * rotated array with OldestInitializedPage as pivot with the property
> +  * where LSN of previous buffer page (if valid) is greater than
> +  * OldestInitializedPage and LSN of next buffer page (if valid) is 
> greater
> +  * than OldestInitializedPage.
> +  */
> + XLogRecPtr  OldestInitializedPage;

It seems a bit odd to name a LSN containing variable *Page...


>   /*
>* InsertTimeLineID is the timeline into which new WAL is being inserted
>* and flushed. It is zero during recovery, and does not change once 
> set.
> @@ -590,6 +629,10 @@ static ControlFileData *ControlFile = NULL;
>  #define NextBufIdx(idx)  \
>   (((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
>
> +/* Macro to retreat to previous buffer index. */
> +#define PreviousBufIdx(idx)  \
> + (((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))

I think it might be worth making these inlines and adding assertions that idx
is not bigger than XLogCtl->XLogCacheBlck?


>  /*
>   * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
>   * would hold if it was in cache, the page containing 'recptr'.
> @@ -708,6 +751,10 @@ static void WALInsertLockAcquireExclusive(void);
>  static void WALInsertLockRelease(void);
>  static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
>
> +#ifdef USE_ASSERT_CHECKING
> +static bool IsXLogBuffersArraySorted(void);
> +#endif
> +
>  /*
>   * Insert an XLOG record represented by an already-constructed chain of data
>   * chunks.  This is a low-level routine; to construct the WAL record header
> @@ -1992,6 +2039,52 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, 
> bool opportunistic)
>   

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-03 Thread Andres Freund
Hi,

On 2023-11-03 20:23:30 +0530, Bharath Rupireddy wrote:
> On Fri, Nov 3, 2023 at 12:35 PM Jeff Davis  wrote:
> >
> > On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:
> > > > I suppose the question is: should reading from the WAL buffers an
> > > > intentional thing that the caller does explicitly by specific
> > > > callers?
> > > > Or is it an optimization that should be hidden from the caller?
> > > >
> > > > I tend toward the former, at least for now.
> > >
> > > Yes, it's an optimization that must be hidden from the caller.
> >
> > As I said, I tend toward the opposite: that specific callers should
> > read from the buffers explicitly in the cases where it makes sense.
> 
> How about adding a bool flag (read_from_wal_buffers) to
> XLogReaderState so that the callers can set it if they want this
> facility via XLogReaderAllocate()?

That seems wrong architecturally - why should xlogreader itself know about any
of this? What would it mean in frontend code if read_from_wal_buffers were
set? IMO this is something that should happen purely within the read function.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-03 Thread Jeff Davis
On Fri, 2023-11-03 at 20:23 +0530, Bharath Rupireddy wrote:
> > 
> OldestInitializedPage is introduced in v14-0001 patch. Please have a
> look.

I don't see why that's necessary if we move to the algorithm I
suggested below that doesn't require a lock.

> > 
> Okay. Current patch doesn't support this [partial hit of newer pages]
> case.

OK, no need to support it until you see a reason.
> > 

> > 
> > I think it needs something like:
> > 
> >   pg_atomic_write_u64(>xlblocks[nextidx],
> > InvalidXLogRecPtr);
> >   pg_write_barrier();
> > 
> > before the MemSet.
> 
> I think it works. First, xlblocks needs to be turned to an array of
> 64-bit atomics and then the above change.

Does anyone see a reason we shouldn't move to atomics here?

> 
>     pg_write_barrier();
> 
>     *((volatile XLogRecPtr *) >xlblocks[nextidx]) =
> NewPageEndPtr;

I am confused why the "volatile" is required on that line (not from
your patch). I sent a separate message about that:

https://www.postgresql.org/message-id/784f72ac09061fe5eaa5335cc347340c367c73ac.ca...@j-davis.com

> I think the 3 things that helps read from WAL buffers without
> WALBufMappingLock are: 1) couple of the read barriers in
> XLogReadFromBuffers, 2) atomically initializing xlblocks[idx] to
> InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
> the following sanity check to see if the read page is valid in
> XLogReadFromBuffers(). If it sounds sensible, I'll work towards
> coding
> it up. Thoughts?

I like it. I think it will ultimately be a fairly simple loop. And by
moving to atomics, we won't need the delicate comment in
GetXLogBuffer().


Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-03 Thread Bharath Rupireddy
On Fri, Nov 3, 2023 at 12:35 PM Jeff Davis  wrote:
>
> On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:
> > > I suppose the question is: should reading from the WAL buffers an
> > > intentional thing that the caller does explicitly by specific
> > > callers?
> > > Or is it an optimization that should be hidden from the caller?
> > >
> > > I tend toward the former, at least for now.
> >
> > Yes, it's an optimization that must be hidden from the caller.
>
> As I said, I tend toward the opposite: that specific callers should
> read from the buffers explicitly in the cases where it makes sense.

How about adding a bool flag (read_from_wal_buffers) to
XLogReaderState so that the callers can set it if they want this
facility via XLogReaderAllocate()?

> > At any given point of time, WAL buffer pages are maintained as a
> > circularly sorted array in an ascending order from
> > OldestInitializedPage to InitializedUpTo (new pages are inserted at
> > this end).
>
> I don't see any reference to OldestInitializedPage or anything like it,
> with or without your patch. Am I missing something?

OldestInitializedPage is introduced in v14-0001 patch. Please have a look.

> > - Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
> > as the page at LSN 80 doesn't exist despite other 5 pages starting
> > from LSN 90 exist.
>
> This is what I imagined a "partial hit" was: read the 5 pages starting
> at 90. The caller would then need to figure out how to read the page at
> LSN 80 from the segment files.
>
> I am not saying we should support this case; perhaps it doesn't matter.
> I'm just describing why that term was confusing for me.

Okay. Current patch doesn't support this case.

> > If a caller asks for an unflushed WAL read
> > intentionally or unintentionally, XLogReadFromBuffers() reads only 4
> > pages starting from LSN 150 to LSN 180 and will leave the remaining 2
> > pages for the caller to deal with. This is the partial hit that can
> > happen.
>
> To me that's more like an EOF case. "Partial hit" sounds to me like the
> data exists but is not available in the cache (i.e. go to the segment
> files); whereas if it encountered the end, the data is not available at
> all.

Right. We can tweak the comments around "partial hit" if required.

> > WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
> > I'm not sure how using the memory barrier, not WALBufMappingLock,
> > prevents writers from replacing WAL buffer pages while readers
> > reading
> > the pages.
>
> It doesn't *prevent* that case, but it does *detect* that case. We
> don't want to prevent writers from replacing WAL buffers, because that
> would mean we are slowing down the critical WAL writing path.
>
> Let me explain the potential problem cases, and how the barriers
> prevent them:
>
> Potential problem 1: the page is not yet resident in the cache at the
> time the memcpy begins. In this case, the first read barrier would
> ensure that the page is also not yet resident at the time xlblocks[idx]
> is read into endptr1, and we'd break out of the loop.
>
> Potential problem 2: the page is evicted before the memcpy finishes. In
> this case, the second read barrier would ensure that the page was also
> evicted before xlblocks[idx] is read into endptr2, and again we'd
> detect the problem and break out of the loop.

Understood.

> I assume here that, if xlblocks[idx] holds the endPtr of the desired
> page, all of the bytes for that page are resident at that moment. I
> don't think that's true right now: AdvanceXLInsertBuffers() zeroes the
> old page before updating xlblocks[nextidx].

Right.

> I think it needs something like:
>
>   pg_atomic_write_u64(>xlblocks[nextidx], InvalidXLogRecPtr);
>   pg_write_barrier();
>
> before the MemSet.

I think it works. First, xlblocks needs to be turned to an array of
64-bit atomics and then the above change. With this, all those who
reads xlblocks with or without WALBufMappingLock also need to check if
xlblocks[idx] is ever InvalidXLogRecPtr and take appropriate action.

I'm sure you have seen the following. It looks like I'm leaning
towards the claim that it's safe to read xlblocks without
WALBufMappingLock. I'll put up a patch for these changes separately.

/*
 * Make sure the initialization of the page becomes visible to others
 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
 * holding a lock.
 */
pg_write_barrier();

*((volatile XLogRecPtr *) >xlblocks[nextidx]) = NewPageEndPtr;

I think the 3 things that helps read from WAL buffers without
WALBufMappingLock are: 1) couple of the read barriers in
XLogReadFromBuffers, 2) atomically initializing xlblocks[idx] to
InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
the following sanity check to see if the read page is valid in
XLogReadFromBuffers(). If it sounds sensible, I'll work towards coding
it up. Thoughts?

+ , we
+ * 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-03 Thread Jeff Davis
On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:
> > I suppose the question is: should reading from the WAL buffers an
> > intentional thing that the caller does explicitly by specific
> > callers?
> > Or is it an optimization that should be hidden from the caller?
> > 
> > I tend toward the former, at least for now.
> 
> Yes, it's an optimization that must be hidden from the caller.

As I said, I tend toward the opposite: that specific callers should
read from the buffers explicitly in the cases where it makes sense.

I don't think this is the most important point right now though, let's
sort out the other details.

> > 
> At any given point of time, WAL buffer pages are maintained as a
> circularly sorted array in an ascending order from
> OldestInitializedPage to InitializedUpTo (new pages are inserted at
> this end).

I don't see any reference to OldestInitializedPage or anything like it,
with or without your patch. Am I missing something?

> - Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
> as the page at LSN 80 doesn't exist despite other 5 pages starting
> from LSN 90 exist.

This is what I imagined a "partial hit" was: read the 5 pages starting
at 90. The caller would then need to figure out how to read the page at
LSN 80 from the segment files.

I am not saying we should support this case; perhaps it doesn't matter.
I'm just describing why that term was confusing for me.

> If a caller asks for an unflushed WAL read
> intentionally or unintentionally, XLogReadFromBuffers() reads only 4
> pages starting from LSN 150 to LSN 180 and will leave the remaining 2
> pages for the caller to deal with. This is the partial hit that can
> happen.

To me that's more like an EOF case. "Partial hit" sounds to me like the
data exists but is not available in the cache (i.e. go to the segment
files); whereas if it encountered the end, the data is not available at
all.

> > 
> WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
> I'm not sure how using the memory barrier, not WALBufMappingLock,
> prevents writers from replacing WAL buffer pages while readers
> reading
> the pages.

It doesn't *prevent* that case, but it does *detect* that case. We
don't want to prevent writers from replacing WAL buffers, because that
would mean we are slowing down the critical WAL writing path.

Let me explain the potential problem cases, and how the barriers
prevent them:

Potential problem 1: the page is not yet resident in the cache at the
time the memcpy begins. In this case, the first read barrier would
ensure that the page is also not yet resident at the time xlblocks[idx]
is read into endptr1, and we'd break out of the loop.

Potential problem 2: the page is evicted before the memcpy finishes. In
this case, the second read barrier would ensure that the page was also
evicted before xlblocks[idx] is read into endptr2, and again we'd
detect the problem and break out of the loop.

I assume here that, if xlblocks[idx] holds the endPtr of the desired
page, all of the bytes for that page are resident at that moment. I
don't think that's true right now: AdvanceXLInsertBuffers() zeroes the
old page before updating xlblocks[nextidx]. I think it needs something
like:

  pg_atomic_write_u64(>xlblocks[nextidx], InvalidXLogRecPtr);
  pg_write_barrier();

before the MemSet.

I didn't review your latest v14 patch yet.

Regards,
Jeff Davis








Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-11-02 Thread Bharath Rupireddy
On Sat, Oct 28, 2023 at 2:22 AM Jeff Davis  wrote:
>
> I think I see what you are saying: WALRead() is at a lower level than
> the XLogReaderRoutine callbacks, because it's used by the .page_read
> callbacks.
>
> That makes sense, but my first interpretation was that WALRead() is
> above the XLogReaderRoutine callbacks because it calls .segment_open
> and .segment_close. To me that sounds like a layering violation, but it
> exists today without your patch.

Right. WALRead() is a common function used by most if not all
page_read callbacks. Typically, the page_read callbacks code has 2
parts - first determine the target/start LSN and second read WAL (via
WALRead() for instance).

> I suppose the question is: should reading from the WAL buffers an
> intentional thing that the caller does explicitly by specific callers?
> Or is it an optimization that should be hidden from the caller?
>
> I tend toward the former, at least for now.

Yes, it's an optimization that must be hidden from the caller.

> I suspect that when we do
> some more interesting things, like replicating unflushed data, we will
> want reading from buffers to be a separate step, not combined with
> WALRead(). After things in this area settle a bit then we might want to
> refactor and combine them again.

As said previously, the new XLogReadFromBuffers() function is generic
and extensible in the way that anyone with a target/start LSN
(corresponding to flushed or written-but-not-yet-flushed WAL) and TLI
can call it to read from WAL buffers. It's just that the patch
currently uses it where it makes sense i.e. in WALRead(). But, it's
usable in, say, a page_read callback reading unflushed WAL from WAL
buffers.

> > If someone wants to read unflushed WAL, the typical way to implement
> > it is to write a new page_read callback
> > read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
> > similar without WALRead() but just the new function
> > XLogReadFromBuffers to read from WAL buffers and return.
>
> Then why is it being called from WALRead() at all?

The patch focuses on reading flushed WAL from WAL buffers if
available, not the unflushed WAL at all; that's why it's in WALRead()
before reading from the WAL file using pg_pread().

I'm trying to make a point that the XLogReadFromBuffers() enables one
to read unflushed WAL from WAL buffers (if really wanted for future
features like replicate from WAL buffers as a new opt-in feature to
improve the replication performance).

> > I prefer to keep the partial hit handling as-is just
> > in case:
>
> So a "partial hit" is essentially a narrow race condition where one
> page is read from buffers, and it's valid; and by the time it gets to
> the next page, it has already been evicted (along with the previously
> read page)?
>
> In other words, I think you are describing a case where
> eviction is happening faster than the memcpy()s in a loop, which is
> certainly possible due to scheduling or whatever, but doesn't seem like
> the common case.
>
> The case I'd expect for a partial read is when the startptr points to
> an evicted page, but some later pages in the requested range are still
> present in the buffers.
>
> I'm not really sure whether either of these cases matters, but if we
> implement one and not the other, there should be some explanation.

At any given point of time, WAL buffer pages are maintained as a
circularly sorted array in an ascending order from
OldestInitializedPage to InitializedUpTo (new pages are inserted at
this end). Also, the current patch holds WALBufMappingLock while
reading the buffer pages, meaning, no one can replace the buffer pages
until reading is finished. Therefore, the above two described partial
hit cases can't happen - when reading multiple pages if the first page
is found to be existing in the buffer pages, it means the other pages
must exist too because of the circular and sortedness of the WAL
buffer page array.

Here's an illustration with WAL buffers circular array (idx, LSN) of
size 10 elements with contents as {(0, 160), (1, 170), (2, 180), (3,
90), (4, 100), (5, 110), (6, 120), (7, 130), (8, 140), (9, 150)} and
current InitializedUpTo pointing to page at LSN 180, idx 2.
- Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
as the page at LSN 80 doesn't exist despite other 5 pages starting
from LSN 90 exist.
- Read 6 pages starting from LSN 90. All the pages exist and are read
from WAL buffers.
- Read 6 pages starting from LSN 150. Note that WAL is currently
flushed only up to page at LSN 180 and the callers won't ask for
unflushed WAL read. If a caller asks for an unflushed WAL read
intentionally or unintentionally, XLogReadFromBuffers() reads only 4
pages starting from LSN 150 to LSN 180 and will leave the remaining 2
pages for the caller to deal with. This is the partial hit that can
happen. Therefore, retaining the partial hit code in WALRead() as-is
in the current patch is needed IMV.

> > Yes, I proposed that idea in 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-27 Thread Jeff Davis
On Fri, 2023-10-27 at 03:46 +0530, Bharath Rupireddy wrote:
> 
> Almost all the available backend
> page_read callbacks read_local_xlog_page_no_wait,
> read_local_xlog_page, logical_read_xlog_page except XLogPageRead
> (which is used for recovery when WAL buffers aren't used at all) have
> one thing in common, that is, WALRead(). Therefore, it seemed a
> natural choice for me to call XLogReadFromBuffers. In other words,
> I'd
> say it's the responsibility of page_read callback implementers to
> decide if they want to read from WAL buffers or not and hence I don't
> think we need a separate XLogReaderRoutine.

I think I see what you are saying: WALRead() is at a lower level than
the XLogReaderRoutine callbacks, because it's used by the .page_read
callbacks.

That makes sense, but my first interpretation was that WALRead() is
above the XLogReaderRoutine callbacks because it calls .segment_open
and .segment_close. To me that sounds like a layering violation, but it
exists today without your patch.

I suppose the question is: should reading from the WAL buffers an
intentional thing that the caller does explicitly by specific callers?
Or is it an optimization that should be hidden from the caller?

I tend toward the former, at least for now. I suspect that when we do
some more interesting things, like replicating unflushed data, we will
want reading from buffers to be a separate step, not combined with
WALRead(). After things in this area settle a bit then we might want to
refactor and combine them again.

> If someone wants to read unflushed WAL, the typical way to implement
> it is to write a new page_read callback
> read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
> similar without WALRead() but just the new function
> XLogReadFromBuffers to read from WAL buffers and return.

Then why is it being called from WALRead() at all?

> 
> I prefer to keep the partial hit handling as-is just
> in case:
> 

So a "partial hit" is essentially a narrow race condition where one
page is read from buffers, and it's valid; and by the time it gets to
the next page, it has already been evicted (along with the previously
read page)? In other words, I think you are describing a case where
eviction is happening faster than the memcpy()s in a loop, which is
certainly possible due to scheduling or whatever, but doesn't seem like
the common case.

The case I'd expect for a partial read is when the startptr points to
an evicted page, but some later pages in the requested range are still
present in the buffers.

I'm not really sure whether either of these cases matters, but if we
implement one and not the other, there should be some explanation.

> Yes, I proposed that idea in another thread -
> https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com
> .
> If that looks okay, I can add it to the next version of this patch
> set.

The code in the email above still shows a call to:

  /*
   * Requested WAL is available in WAL buffers, so recheck the
existence
   * under the WALBufMappingLock and read if the page still exists,
otherwise
   * return.
   */
  LWLockAcquire(WALBufMappingLock, LW_SHARED);

and I don't think that's required. How about something like:

  endptr1 = XLogCtl->xlblocks[idx];
  /* Requested WAL isn't available in WAL buffers. */
  if (expectedEndPtr != endptr1)
  break;

  pg_read_barrier();
  ...
  memcpy(buf, data, bytes_read_this_loop);
  ...
  pg_read_barrier();
  endptr2 = XLogCtl->xlblocks[idx];
  if (expectedEndPtr != endptr2)
  break;

  ntotal += bytes_read_this_loop;
  /* success; move on to next page */

I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64().
I suppose because xlblocks are not guaranteed to be 64-bit aligned?
Should we just align it to 64 bits so we can use atomics? (I don't
think it matters in this case, but atomics would avoid the need to
think about it.)
> 

> 
> FWIW, I found heapam.c using unlikely() extensively for safety
> checks.

OK, I won't object to the use of unlikely(), though I typically don't
use it without a fairly strong reason to think I should override what
the compiler thinks and/or what branch predictors can handle.

In this case, I think some of those errors are not really necessary
anyway, though:

  * XLogReadFromBuffers shouldn't take a timeline argument just to
demand that it's always equal to the wal insertion timeline.
  * Why check that startptr is earlier than the flush pointer, but not
startptr+count? Also, given that we intend to support reading unflushed
data, it would be good to comment that the function still works past
the flush pointer, and that it will be safe to remove later (right?).
  * An "Assert(!RecoveryInProgress())" would be more appropriate than
an error. Perhaps we will remove even that check in the future to
achieve cascaded replication of unflushed data.

Regards,
Jeff Davis








Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-26 Thread Bharath Rupireddy
On Wed, Oct 25, 2023 at 5:45 AM Jeff Davis  wrote:
>
> Comments:

Thanks for reviewing.

> * It would be good to document that this is partially an optimization
> (read from memory first) and partially an API difference that allows
> reading unflushed data. For instance, walsender may benefit
> performance-wise (and perhaps later with the ability to read unflushed
> data) whereas pg_walinspect benefits primarily from reading unflushed
> data.

Commit message has these things covered in detail. However, I think
adding some info in the code comments is a good idea and done around
the WALRead() function in the attached v13 patch set.

> * Shouldn't there be a new method in XLogReaderRoutine (e.g.
> read_unflushed_data), rather than having logic in WALRead()? The
> callers can define the method if it makes sense (and that would be a
> good place to document why); or leave it NULL if not.

I've designed the new function XLogReadFromBuffers to read from WAL
buffers in such a way that one can easily embed it in page_read
callbacks if it makes sense. Almost all the available backend
page_read callbacks read_local_xlog_page_no_wait,
read_local_xlog_page, logical_read_xlog_page except XLogPageRead
(which is used for recovery when WAL buffers aren't used at all) have
one thing in common, that is, WALRead(). Therefore, it seemed a
natural choice for me to call XLogReadFromBuffers. In other words, I'd
say it's the responsibility of page_read callback implementers to
decide if they want to read from WAL buffers or not and hence I don't
think we need a separate XLogReaderRoutine.

If someone wants to read unflushed WAL, the typical way to implement
it is to write a new page_read callback
read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
similar without WALRead() but just the new function
XLogReadFromBuffers to read from WAL buffers and return.

> * I'm not clear on the "partial hit" case. Wouldn't that mean you found
> the earliest byte in the buffers, but not the latest byte requested? Is
> that possible, and if so under what circumstances? I added an
> "Assert(nread == 0 || nread == count)" in WALRead() after calling
> XLogReadFromBuffers(), and it wasn't hit.
>
> * If the partial hit case is important, wouldn't XLogReadFromBuffers()
> fill in the end of the buffer rather than the beginning?

Partial hit was possible when the requested WAL pages are read one
page at a time from WAL buffers with WALBufMappingLock
acquisition-release for each page as the requested page can be
replaced by the time the lock is released and reacquired. This was the
case up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com.
Now that the approach has been changed to read multiple pages at once
under one WALBufMappingLock acquisition-release. .
We can either keep the partial hit handling (just to not miss
anything) or turn the following partial hit case to an error or an
Assert(false);. I prefer to keep the partial hit handling as-is just
in case:
+else if (count > nread)
+{
+/*
+ * Buffer partial hit, so reset the state to count the read bytes
+ * and continue.
+ */
+buf += nread;
+startptr += nread;
+count -= nread;
+}

> * Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
> effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
> too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
> expectedEndPtr both before and after the memcpy(), with appropriate
> barriers. That could mitigate concerns expressed by Kyotaro Horiguchi
> and Masahiko Sawada.

Yes, I proposed that idea in another thread -
https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com.
If that looks okay, I can add it to the next version of this patch
set.

> * Are you sure that reducing the number of calls to memcpy() is a win?
> I would expect that to be true only if the memcpy()s are tiny, but here
> they are around XLOG_BLCKSZ. I believe this was done based on a comment
> from Nathan Bossart, but I didn't really follow why that's important.
> Also, if we try to use one memcpy for all of the data, it might not
> interact well with my idea above to avoid taking the lock.

Up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com,
the requested WAL was being read one page at a time from WAL buffers
into output buffer with one memcpy call for each page. Now that the
approach has been changed to read multiple pages at once under one
WALBufMappingLock acquisition-release with comparatively lesser number
of memcpy calls. I honestly haven't seen any difference between the
two approaches -
https://www.postgresql.org/message-id/CALj2ACUpQGiwQTzmoSMOFk5%3DWiJc06FcYpxzBX0SEej4ProRzg%40mail.gmail.com.

The new approach 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-24 Thread Nathan Bossart
On Tue, Oct 24, 2023 at 05:15:19PM -0700, Jeff Davis wrote:
> * Are you sure that reducing the number of calls to memcpy() is a win?
> I would expect that to be true only if the memcpy()s are tiny, but here
> they are around XLOG_BLCKSZ. I believe this was done based on a comment
> from Nathan Bossart, but I didn't really follow why that's important.
> Also, if we try to use one memcpy for all of the data, it might not
> interact well with my idea above to avoid taking the lock.

I don't recall exactly why I suggested this, but if additional memcpy()s
help in some way and don't negatively impact performance, then I retract my
previous comment.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-24 Thread Jeff Davis
On Sat, 2023-10-21 at 23:59 +0530, Bharath Rupireddy wrote:
> I'm attaching v12 patch set with just pgperltidy ran on the new TAP
> test added in 0002. No other changes from that of v11 patch set.

Thank you.

Comments:

* It would be good to document that this is partially an optimization
(read from memory first) and partially an API difference that allows
reading unflushed data. For instance, walsender may benefit
performance-wise (and perhaps later with the ability to read unflushed
data) whereas pg_walinspect benefits primarily from reading unflushed
data.

* Shouldn't there be a new method in XLogReaderRoutine (e.g.
read_unflushed_data), rather than having logic in WALRead()? The
callers can define the method if it makes sense (and that would be a
good place to document why); or leave it NULL if not.

* I'm not clear on the "partial hit" case. Wouldn't that mean you found
the earliest byte in the buffers, but not the latest byte requested? Is
that possible, and if so under what circumstances? I added an
"Assert(nread == 0 || nread == count)" in WALRead() after calling
XLogReadFromBuffers(), and it wasn't hit.

* If the partial hit case is important, wouldn't XLogReadFromBuffers()
fill in the end of the buffer rather than the beginning?

* Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
expectedEndPtr both before and after the memcpy(), with appropriate
barriers. That could mitigate concerns expressed by Kyotaro Horiguchi
and Masahiko Sawada.

* Are you sure that reducing the number of calls to memcpy() is a win?
I would expect that to be true only if the memcpy()s are tiny, but here
they are around XLOG_BLCKSZ. I believe this was done based on a comment
from Nathan Bossart, but I didn't really follow why that's important.
Also, if we try to use one memcpy for all of the data, it might not
interact well with my idea above to avoid taking the lock.

* Style-wise, the use of "unlikely" seems excessive, unless there's a
reason to think it matters.

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-21 Thread Bharath Rupireddy
On Fri, Oct 20, 2023 at 10:19 PM Bharath Rupireddy
 wrote:
>
> On Thu, Oct 12, 2023 at 4:13 AM Andres Freund  wrote:
> >
> > On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> > > On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> > > > One benefit would be that it'd make it more realistic to use direct
> > > > IO for WAL
> > > > - for which I have seen significant performance benefits. But when we
> > > > afterwards have to re-read it from disk to replicate, it's less
> > > > clearly a win.
> > >
> > > Does this patch still look like a good fit for your (or someone else's)
> > > plans for direct IO here? If so, would committing this soon make it
> > > easier to make progress on that, or should we wait until it's actually
> > > needed?
> >
> > I think it'd be quite useful to have. Even with the code as of 16, I see
> > better performance in some workloads with debug_io_direct=wal,
> > wal_sync_method=open_datasync compared to any other configuration. Except of
> > course that it makes walsenders more problematic, as they suddenly require
> > read IO. Thus having support for walsenders to send directly from wal 
> > buffers
> > would be beneficial, even without further AIO infrastructure.
>
> I'm attaching the v11 patch set with the following changes:
> - Improved input validation in the function that reads WAL from WAL
> buffers in 0001 patch.
> - Improved test module's code in 0002 patch.
> - Modernized meson build file in 0002 patch.
> - Added commit messages for both the patches.
> - Ran pgindent on both the patches.
>
> Any thoughts are welcome.

I'm attaching v12 patch set with just pgperltidy ran on the new TAP
test added in 0002. No other changes from that of v11 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From c240d914967261e462290b714ec6ae2803d72442 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sat, 21 Oct 2023 13:50:15 +
Subject: [PATCH v12] Allow WAL reading from WAL buffers

This commit adds WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual. It relies on
WALBufMappingLock so that no one replaces the WAL buffer page that
we're reading from. It skips reading from WAL buffers if
WALBufMappingLock can't be acquired immediately. In other words,
it doesn't wait for WALBufMappingLock to be available. This helps
reduce the contention on WALBufMappingLock.

This commit benefits the callers of WALRead(), that are walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, the walsenders avoided 95% of the time reading from the
file/avoided pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also benefits when direct IO is enabled for WAL.
Reading WAL from WAL buffers puts back the performance close to
that of without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit also paves the way for the following features in
future:
- Improves synchronous replication performance by replicating
directly from WAL buffers.
- A opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c   | 208 
 src/backend/access/transam/xlogreader.c |  45 -
 src/include/access/xlog.h   |   6 +
 3 files changed, 257 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cea13e3d58..9553a880f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1706,6 +1706,214 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that this function is not available for frontend code as WAL buffers is
+ * an internal mechanism to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+	

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-20 Thread Bharath Rupireddy
On Thu, Oct 12, 2023 at 4:13 AM Andres Freund  wrote:
>
> On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> > On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> > > One benefit would be that it'd make it more realistic to use direct
> > > IO for WAL
> > > - for which I have seen significant performance benefits. But when we
> > > afterwards have to re-read it from disk to replicate, it's less
> > > clearly a win.
> >
> > Does this patch still look like a good fit for your (or someone else's)
> > plans for direct IO here? If so, would committing this soon make it
> > easier to make progress on that, or should we wait until it's actually
> > needed?
>
> I think it'd be quite useful to have. Even with the code as of 16, I see
> better performance in some workloads with debug_io_direct=wal,
> wal_sync_method=open_datasync compared to any other configuration. Except of
> course that it makes walsenders more problematic, as they suddenly require
> read IO. Thus having support for walsenders to send directly from wal buffers
> would be beneficial, even without further AIO infrastructure.

I'm attaching the v11 patch set with the following changes:
- Improved input validation in the function that reads WAL from WAL
buffers in 0001 patch.
- Improved test module's code in 0002 patch.
- Modernized meson build file in 0002 patch.
- Added commit messages for both the patches.
- Ran pgindent on both the patches.

Any thoughts are welcome.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 6615590d795a6068897a8aa348d9e699442bb07b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 20 Oct 2023 15:49:23 +
Subject: [PATCH v11] Allow WAL reading from WAL buffers

This commit adds WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual. It relies on
WALBufMappingLock so that no one replaces the WAL buffer page that
we're reading from. It skips reading from WAL buffers if
WALBufMappingLock can't be acquired immediately. In other words,
it doesn't wait for WALBufMappingLock to be available. This helps
reduce the contention on WALBufMappingLock.

This commit benefits the callers of WALRead(), that are walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, the walsenders avoided 95% of the time reading from the
file/avoided pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also benefits when direct IO is enabled for WAL.
Reading WAL from WAL buffers puts back the performance close to
that of without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit also paves the way for the following features in
future:
- Improves synchronous replication performance by replicating
directly from WAL buffers.
- A opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c   | 208 
 src/backend/access/transam/xlogreader.c |  45 -
 src/include/access/xlog.h   |   6 +
 3 files changed, 257 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cea13e3d58..9553a880f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1706,6 +1706,214 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that this function is not available for frontend code as WAL buffers is
+ * an internal mechanism to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+	XLogRecPtr startptr,
+	TimeLineID tli,
+	Size count,
+	char *buf)
+{
+	XLogRecPtr	ptr;
+	Size		nbytes;
+	Size		ntotal;
+	Size		nbatch;
+	char	   *batchstart;
+	TimeLineID	current_timeline;
+
+	/*
+	 * Do some input parameter validations to fail 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-17 Thread Bharath Rupireddy
On Thu, Oct 12, 2023 at 4:13 AM Andres Freund  wrote:
>
> On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> >
> > Does this patch still look like a good fit for your (or someone else's)
> > plans for direct IO here? If so, would committing this soon make it
> > easier to make progress on that, or should we wait until it's actually
> > needed?
>
> I think it'd be quite useful to have. Even with the code as of 16, I see
> better performance in some workloads with debug_io_direct=wal,
> wal_sync_method=open_datasync compared to any other configuration. Except of
> course that it makes walsenders more problematic, as they suddenly require
> read IO. Thus having support for walsenders to send directly from wal buffers
> would be beneficial, even without further AIO infrastructure.

Right. Tests show the benefit with WAL DIO + this patch -
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com.

Also, irrespective of WAL DIO, the WAL buffers hit ratio with the
patch stood at 95% for 1 primary, 1 sync standby, 1 async standby,
pgbench --scale=300 --client=32 --time=900. In other words, the
walsenders avoided 95% of the time reading from the file/avoided pread
system calls - 
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com.

> I also think there are other quite desirable features that are made easier by
> this patch. One of the primary problems with using synchronous replication is
> the latency increase, obviously. We can't send out WAL before it has locally
> been wirten out and flushed to disk. For some workloads, we could
> substantially lower synchronous commit latency if we were able to send WAL to
> remote nodes *before* WAL has been made durable locally, even if the receiving
> systems wouldn't be allowed to write that data to disk yet: It takes less time
> to send just "write LSN: %X/%X, flush LSNL: %X/%X" than also having to send
> all the not-yet-durable WAL.
>
> In many OLTP workloads there won't be WAL flushes between generating WAL for
> DML and commit, which means that the amount of WAL that needs to be sent out
> at commit can be of nontrivial size.
>
> E.g. for pgbench, normally a transaction is about ~550 bytes (fitting in a
> single tcp/ip packet), but a pgbench transaction that needs to emit FPIs for
> everything is a lot larger: ~45kB (not fitting in a single packet). Obviously
> many real world workloads OLTP workloads actually do more writes than
> pgbench. Making the commit latency of the latter be closer to the commit
> latency of the former when using syncrep would obviously be great.
>
> Of course this patch is just a relatively small step towards that: We'd also
> need in-memory buffering on the receiving side, the replication protocol would
> need to be improved, we'd likely need an option to explicitly opt into
> receiving unflushed data. But it's still a pretty much required step.

Yes, this patch can pave the way for all of the above features in
future. However, I'm looking forward to getting this in for now.
Later, I'll come up with more concrete thoughts on the above.

Having said above, the latest v10 patch after addressing some of the
review comments is at
https://www.postgresql.org/message-id/CALj2ACU3ZYzjOv4vZTR%2BLFk5PL4ndUnbLS6E1vG2dhDBjQGy2A%40mail.gmail.com.
Any further thoughts on the patch is welcome.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-11 Thread Andres Freund
Hi,

On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> > One benefit would be that it'd make it more realistic to use direct
> > IO for WAL
> > - for which I have seen significant performance benefits. But when we
> > afterwards have to re-read it from disk to replicate, it's less
> > clearly a win.
> 
> Does this patch still look like a good fit for your (or someone else's)
> plans for direct IO here? If so, would committing this soon make it
> easier to make progress on that, or should we wait until it's actually
> needed?

I think it'd be quite useful to have. Even with the code as of 16, I see
better performance in some workloads with debug_io_direct=wal,
wal_sync_method=open_datasync compared to any other configuration. Except of
course that it makes walsenders more problematic, as they suddenly require
read IO. Thus having support for walsenders to send directly from wal buffers
would be beneficial, even without further AIO infrastructure.


I also think there are other quite desirable features that are made easier by
this patch. One of the primary problems with using synchronous replication is
the latency increase, obviously. We can't send out WAL before it has locally
been wirten out and flushed to disk. For some workloads, we could
substantially lower synchronous commit latency if we were able to send WAL to
remote nodes *before* WAL has been made durable locally, even if the receiving
systems wouldn't be allowed to write that data to disk yet: It takes less time
to send just "write LSN: %X/%X, flush LSNL: %X/%X" than also having to send
all the not-yet-durable WAL.

In many OLTP workloads there won't be WAL flushes between generating WAL for
DML and commit, which means that the amount of WAL that needs to be sent out
at commit can be of nontrivial size.

E.g. for pgbench, normally a transaction is about ~550 bytes (fitting in a
single tcp/ip packet), but a pgbench transaction that needs to emit FPIs for
everything is a lot larger: ~45kB (not fitting in a single packet). Obviously
many real world workloads OLTP workloads actually do more writes than
pgbench. Making the commit latency of the latter be closer to the commit
latency of the former when using syncrep would obviously be great.

Of course this patch is just a relatively small step towards that: We'd also
need in-memory buffering on the receiving side, the replication protocol would
need to be improved, we'd likely need an option to explicitly opt into
receiving unflushed data. But it's still a pretty much required step.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-10-03 Thread Jeff Davis
On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> One benefit would be that it'd make it more realistic to use direct
> IO for WAL
> - for which I have seen significant performance benefits. But when we
> afterwards have to re-read it from disk to replicate, it's less
> clearly a win.

Does this patch still look like a good fit for your (or someone else's)
plans for direct IO here? If so, would committing this soon make it
easier to make progress on that, or should we wait until it's actually
needed?

If I recall, this patch does not provide a perforance benefit as-is
(correct me if things have changed) and I don't know if a reduction in
syscalls alone is enough to justify it. But if it paves the way for
direct IO for WAL, that does seem worth it.

Regards,
Jeff Davis





Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-14 Thread Bharath Rupireddy
On Tue, Mar 7, 2023 at 11:14 PM Nathan Bossart  wrote:
>
> On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote:
> > On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart  
> > wrote:
> >> Is it possible to memcpy more than a page at a time?
> >
> > It would complicate things a lot there; the logic to figure out the
> > last page bytes that may or may not fit in the whole page gets
> > complicated. Also, the logic to verify each page's header gets
> > complicated. We might lose out if we memcpy all the pages at once and
> > start verifying each page's header in another loop.
>
> Doesn't the complicated logic you describe already exist to some extent in
> the patch?  You are copying a page at a time, which involves calculating
> various addresses and byte counts.

Okay here I am with the v10 patch set attached that avoids multiple
memcpy calls which must benefit the callers who want to read more than
1 WAL buffer page (streaming replication WAL sender for instance).

> >> +elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for 
> >> given LSN %X/%X, Timeline ID %u",
> >> + *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
> >>
> >> I definitely don't think we should put an elog() in this code path.
> >> Perhaps this should be guarded behind WAL_DEBUG.
> >
> > Placing it behind WAL_DEBUG doesn't help users/developers. My
> > intention was to let users know that the WAL read hit the buffers,
> > it'll help them report if any issue occurs and also help developers to
> > debug that issue.
>
> I still think an elog() is mighty expensive for this code path, even when
> it doesn't actually produce any messages.  And when it does, I think it has
> the potential to be incredibly noisy.

Well, my motive was to have a way for the user to know WAL buffer hits
and misses to report any found issues. However, I have a plan later to
add WAL buffer stats (hits/misses). I understand that even if someone
enables DEBUG1, this message can bloat server log files and make
recovery slower, especially on a standby. Hence, I agree to keep these
logs behind the WAL_DEBUG macro like others and did so in the attached
v10 patch set.

Please review the attached v10 patch set further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From aa6454d9abb9a70b728dbba7f40279108486a3e4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 14 Mar 2023 07:30:09 +
Subject: [PATCH v10] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c   | 171 
 src/backend/access/transam/xlogreader.c |  42 +-
 src/include/access/xlog.h   |   6 +
 3 files changed, 217 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 543d4d897a..d40b9562e1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,177 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state PG_USED_FOR_ASSERTS_ONLY,
+	XLogRecPtr startptr,
+	TimeLineID tli,
+	Size count,
+	char *buf)
+{
+	XLogRecPtr	ptr = startptr;
+	Size	nbytes = count;	/* total bytes requested to be read by caller */
+	Size	ntotal = 0;	/* total bytes read */
+	Size	nbatch = 0;	/* bytes to be read in single batch */
+	char	*batchstart = NULL;	/* location to read from for single batch */
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as less work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return ntotal;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+		char	*page;
+		char*data;
+		XLogPageHeader	phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-13 Thread Bharath Rupireddy
On Sun, Mar 12, 2023 at 11:00 PM Bharath Rupireddy
 wrote:
>
> I have some review comments to fix on
> v8-0001, so, I'll be sending out v9 patch-set soon.

Please find the attached v9 patch set for further review. I moved the
check for just-initialized WAL buffer pages before reading the page.
Up until now, it's the other way around, meaning, read the page and
then check the header if it is just-initialized, which is wrong. The
attached v9 patch set corrects it.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v9-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch
Description: Binary data


v9-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-12 Thread Bharath Rupireddy
On Sun, Mar 12, 2023 at 12:52 AM Nitin Jadhav
 wrote:
>
> I went through the v8 patch.

Thanks for looking at it. Please post the responses in-line, not above
the entire previous message for better readability.

> Following are my thoughts to improve the
> WAL buffer hit ratio.

Note that the motive of this patch is to read WAL from WAL buffers
*when possible* without affecting concurrent WAL writers.

> Currently the no-longer-needed WAL data present in WAL buffers gets
> cleared in XLogBackgroundFlush() which is called based on the
> wal_writer_delay config setting. Once the data is flushed to the disk,
> it is treated as no-longer-needed and it will be cleared as soon as
> possible based on some config settings.

Being opportunistic in pre-initializing as many possible WAL buffer
pages as is there for a purpose. There's an illuminating comment [1],
so that's done for a purpose, so removing it fully is a no-go IMO. For
instance, it'll make WAL buffer pages available for concurrent writers
so there will be less work for writers in GetXLogBuffer. I'm sure
removing the opportunistic pre-initialization of the WAL buffer pages
will hurt performance in a highly concurrent-write workload.

/*
 * Great, done. To take some work off the critical path, try to initialize
 * as many of the no-longer-needed WAL buffers for future use as we can.
 */
AdvanceXLInsertBuffer(InvalidXLogRecPtr, insertTLI, true);

> Second, In WALRead(), we try to read the data from disk whenever we
> don't find the data from WAL buffers. We don't store this data in the
> WAL buffer. We just read the data, use it and leave it. If we store
> this data to the WAL buffer, then we may avoid a few disk reads.

Again this is going to hurt concurrent writers. Note that wal_buffers
aren't used as full cache per-se, there'll be multiple writers to it,
*when possible* readers will try to read from it without hurting
writers.

> The patch attached takes care of this.

Please post the new proposal as a text file (not a .patch file) or as
a plain text in the email itself if the change is small or attach all
the patches if the patch is over-and-above the proposed patches.
Attaching a single over-and-above patch will make CFBot unhappy and
will force authors to repost the original patches. Typically, we
follow this. Having said, I have some review comments to fix on
v8-0001, so, I'll be sending out v9 patch-set soon.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-11 Thread Nitin Jadhav
> [1]
> subscription tests:
> PATCHED: WAL buffers hit - 1972, misses - 32616

Can you share more details about the test here?

I went through the v8 patch. Following are my thoughts to improve the
WAL buffer hit ratio.

Currently the no-longer-needed WAL data present in WAL buffers gets
cleared in XLogBackgroundFlush() which is called based on the
wal_writer_delay config setting. Once the data is flushed to the disk,
it is treated as no-longer-needed and it will be cleared as soon as
possible based on some config settings. I have done some testing by
tweaking the wal_writer_delay config setting to confirm the behaviour.
We can see that the WAL buffer hit ratio is good when the
wal_writer_delay is big enough [2] compared to smaller
wal_writer_delay [1]. So irrespective of the wal_writer_delay
settings, we should keep the WAL data in the WAL buffers as long as
possible so that all the readers (Mainly WAL senders) can take
advantage of this. The WAL page should be evicted from the WAL buffers
only when the WAL buffer is full and we need room for the new page.
The patch attached takes care of this. We can see the improvements in
WAL buffer hit ratio even when the wal_writer_delay is set to lower
value [3].

Second, In WALRead(), we try to read the data from disk whenever we
don't find the data from WAL buffers. We don't store this data in the
WAL buffer. We just read the data, use it and leave it. If we store
this data to the WAL buffer, then we may avoid a few disk reads.

[1]:
wal_buffers=1GB
wal_writer_delay=1ms
./pgbench --initialize --scale=300 postgres

WAL buffers hit=5046
WAL buffers miss=56767

[2]:
wal_buffers=1GB
wal_writer_delay=10s
./pgbench --initialize --scale=300 postgres

WAL buffers hit=45454
WAL buffers miss=14064

[3]:
wal_buffers=1GB
wal_writer_delay=1ms
./pgbench --initialize --scale=300 postgres

WAL buffers hit=37214
WAL buffers miss=844

Please share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Tue, Mar 7, 2023 at 12:39 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart  
> wrote:
> >
> > +void
> > +XLogReadFromBuffers(XLogRecPtr startptr,
> >
> > Since this function presently doesn't return anything, can we have it
> > return the number of bytes read instead of storing it in a pointer
> > variable?
>
> Done.
>
> > +ptr = startptr;
> > +nbytes = count;
> > +dst = buf;
> >
> > These variables seem superfluous.
>
> Needed startptr and count for DEBUG1 message and assertion at the end.
> Removed dst and used buf in the new patch now.
>
> > +/*
> > + * Requested WAL isn't available in WAL buffers, so return with
> > + * what we have read so far.
> > + */
> > +break;
> >
> > nitpick: I'd move this to the top so that you can save a level of
> > indentation.
>
> Done.
>
> > +/*
> > + * All the bytes are not in one page. Read available bytes 
> > on
> > + * the current page, copy them over to output buffer and
> > + * continue to read remaining bytes.
> > + */
> >
> > Is it possible to memcpy more than a page at a time?
>
> It would complicate things a lot there; the logic to figure out the
> last page bytes that may or may not fit in the whole page gets
> complicated. Also, the logic to verify each page's header gets
> complicated. We might lose out if we memcpy all the pages at once and
> start verifying each page's header in another loop.
>
> I would like to keep it simple - read a single page from WAL buffers,
> verify it and continue.
>
> > +/*
> > + * The fact that we acquire WALBufMappingLock while reading 
> > the WAL
> > + * buffer page itself guarantees that no one else initializes 
> > it or
> > + * makes it ready for next use in AdvanceXLInsertBuffer().
> > + *
> > + * However, we perform basic page header checks for ensuring 
> > that
> > + * we are not reading a page that just got initialized. Callers
> > + * will anyway perform extensive page-level and record-level
> > + * checks.
> > + */
> >
> > Hm.  I wonder if we should make these assertions instead.
>
> Okay. I added XLogReaderValidatePageHeader for assert-only builds
> which will help catch any issues there. But we can't perform record
> level checks here because this function doesn't know where the record
> starts from, it knows only pages. This change required us to pass in
> XLogReaderState to XLogReadFromBuffers. I marked it as
> PG_USED_FOR_ASSERTS_ONLY and did page header checks only when it is
> passed as non-null so that someone who doesn't have XLogReaderState
> can still read from buffers.
>
> > +elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for 
> > given LSN %X/%X, Timeline ID %u",
> > + *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
> >
> > I definitely don't think we 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-07 Thread Nathan Bossart
On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart  
> wrote:
>> Is it possible to memcpy more than a page at a time?
> 
> It would complicate things a lot there; the logic to figure out the
> last page bytes that may or may not fit in the whole page gets
> complicated. Also, the logic to verify each page's header gets
> complicated. We might lose out if we memcpy all the pages at once and
> start verifying each page's header in another loop.

Doesn't the complicated logic you describe already exist to some extent in
the patch?  You are copying a page at a time, which involves calculating
various addresses and byte counts.

>> +elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for 
>> given LSN %X/%X, Timeline ID %u",
>> + *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
>>
>> I definitely don't think we should put an elog() in this code path.
>> Perhaps this should be guarded behind WAL_DEBUG.
> 
> Placing it behind WAL_DEBUG doesn't help users/developers. My
> intention was to let users know that the WAL read hit the buffers,
> it'll help them report if any issue occurs and also help developers to
> debug that issue.

I still think an elog() is mighty expensive for this code path, even when
it doesn't actually produce any messages.  And when it does, I think it has
the potential to be incredibly noisy.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-06 Thread Bharath Rupireddy
On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart  wrote:
>
> +void
> +XLogReadFromBuffers(XLogRecPtr startptr,
>
> Since this function presently doesn't return anything, can we have it
> return the number of bytes read instead of storing it in a pointer
> variable?

Done.

> +ptr = startptr;
> +nbytes = count;
> +dst = buf;
>
> These variables seem superfluous.

Needed startptr and count for DEBUG1 message and assertion at the end.
Removed dst and used buf in the new patch now.

> +/*
> + * Requested WAL isn't available in WAL buffers, so return with
> + * what we have read so far.
> + */
> +break;
>
> nitpick: I'd move this to the top so that you can save a level of
> indentation.

Done.

> +/*
> + * All the bytes are not in one page. Read available bytes on
> + * the current page, copy them over to output buffer and
> + * continue to read remaining bytes.
> + */
>
> Is it possible to memcpy more than a page at a time?

It would complicate things a lot there; the logic to figure out the
last page bytes that may or may not fit in the whole page gets
complicated. Also, the logic to verify each page's header gets
complicated. We might lose out if we memcpy all the pages at once and
start verifying each page's header in another loop.

I would like to keep it simple - read a single page from WAL buffers,
verify it and continue.

> +/*
> + * The fact that we acquire WALBufMappingLock while reading the 
> WAL
> + * buffer page itself guarantees that no one else initializes it 
> or
> + * makes it ready for next use in AdvanceXLInsertBuffer().
> + *
> + * However, we perform basic page header checks for ensuring that
> + * we are not reading a page that just got initialized. Callers
> + * will anyway perform extensive page-level and record-level
> + * checks.
> + */
>
> Hm.  I wonder if we should make these assertions instead.

Okay. I added XLogReaderValidatePageHeader for assert-only builds
which will help catch any issues there. But we can't perform record
level checks here because this function doesn't know where the record
starts from, it knows only pages. This change required us to pass in
XLogReaderState to XLogReadFromBuffers. I marked it as
PG_USED_FOR_ASSERTS_ONLY and did page header checks only when it is
passed as non-null so that someone who doesn't have XLogReaderState
can still read from buffers.

> +elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given 
> LSN %X/%X, Timeline ID %u",
> + *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
>
> I definitely don't think we should put an elog() in this code path.
> Perhaps this should be guarded behind WAL_DEBUG.

Placing it behind WAL_DEBUG doesn't help users/developers. My
intention was to let users know that the WAL read hit the buffers,
it'll help them report if any issue occurs and also help developers to
debug that issue.

On a different note - I was recently looking at the code around
WAL_DEBUG macro and the wal_debug GUC. It looks so complex that one
needs to build source code with the WAL_DEBUG macro and enable the GUC
to see the extended logs for WAL. IMO, the best way there is either:
1) unify all the code under WAL_DEBUG macro and get rid of wal_debug GUC, or
2) unify all the code under wal_debug GUC (it is developer-only and
superuser-only so there shouldn't be a problem even if we ship it out
of the box).

If someone is concerned about the GUC being enabled on production
servers knowingly or unknowingly with option (2), we can go ahead with
option (1). I will discuss this separately to see what others think.

> I think we can simplify this.  We effectively take the same action any time
> "count" doesn't equal "read_bytes", so there's no need for the "else if".
>
> if (count == read_bytes)
> return true;
>
> buf += read_bytes;
> startptr += read_bytes;
> count -= read_bytes;

I wanted to avoid setting these unnecessarily for buffer misses.

Thanks a lot for reviewing. I'm attaching the v8 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v8-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch
Description: Binary data


v8-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-06 Thread Nathan Bossart
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+TimeLineID tli,
+Size count,
+char *buf,
+Size *read_bytes)

Since this function presently doesn't return anything, can we have it
return the number of bytes read instead of storing it in a pointer
variable?

+ptr = startptr;
+nbytes = count;
+dst = buf;

These variables seem superfluous.

+/*
+ * Requested WAL isn't available in WAL buffers, so return with
+ * what we have read so far.
+ */
+break;

nitpick: I'd move this to the top so that you can save a level of
indentation.

if (expectedEndPtr != endptr)
break;

... logic for when the data is found in the WAL buffers ...

+/*
+ * All the bytes are not in one page. Read available bytes on
+ * the current page, copy them over to output buffer and
+ * continue to read remaining bytes.
+ */

Is it possible to memcpy more than a page at a time?

+/*
+ * The fact that we acquire WALBufMappingLock while reading the WAL
+ * buffer page itself guarantees that no one else initializes it or
+ * makes it ready for next use in AdvanceXLInsertBuffer().
+ *
+ * However, we perform basic page header checks for ensuring that
+ * we are not reading a page that just got initialized. Callers
+ * will anyway perform extensive page-level and record-level
+ * checks.
+ */

Hm.  I wonder if we should make these assertions instead.

+elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given 
LSN %X/%X, Timeline ID %u",
+ *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);

I definitely don't think we should put an elog() in this code path.
Perhaps this should be guarded behind WAL_DEBUG.

+/*
+ * Check if we have read fully (hit), partially (partial hit) or
+ * nothing (miss) from WAL buffers. If we have read either partially or
+ * nothing, then continue to read the remaining bytes the usual way,
+ * that is, read from WAL file.
+ */
+if (count == read_bytes)
+{
+/* Buffer hit, so return. */
+return true;
+}
+else if (read_bytes > 0 && count > read_bytes)
+{
+/*
+ * Buffer partial hit, so reset the state to count the read bytes
+ * and continue.
+ */
+buf += read_bytes;
+startptr += read_bytes;
+count -= read_bytes;
+}
+
+/* Buffer miss i.e., read_bytes = 0, so continue */

I think we can simplify this.  We effectively take the same action any time
"count" doesn't equal "read_bytes", so there's no need for the "else if".

if (count == read_bytes)
return true;

buf += read_bytes;
startptr += read_bytes;
count -= read_bytes;

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-03 Thread Bharath Rupireddy
On Wed, Mar 1, 2023 at 9:45 AM Nathan Bossart  wrote:
>
> On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote:
> > On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart  
> > wrote:
> >> Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
> >> preventing us from copying all the data we need in one go?
> >
> > Note that most of the WALRead() callers request a single page of
> > XLOG_BLCKSZ bytes even if the server has less or more available WAL
> > pages. It's the streaming replication wal sender that can request less
> > than XLOG_BLCKSZ bytes and upto MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And,
> > if we read, say, MAX_SEND_SIZE at once while holding
> > WALBufMappingLock, that might impact concurrent inserters (at least, I
> > can say it in theory) - one of the main intentions of this patch is
> > not to impact inserters much.
>
> Perhaps we should test both approaches to see if there is a noticeable
> difference.  It might not be great for concurrent inserts to repeatedly
> take the lock, either.  If there's no real difference, we might be able to
> simplify the code a bit.

I took a stab at this - acquire WALBufMappingLock separately for each
requested WAL buffer page vs acquire WALBufMappingLock once for all
requested WAL buffer pages. I chose the pgbench tpcb-like benchmark
that has 3 UPDATE statements and 1 INSERT statement. I ran pgbench for
30min with scale factor 100 and 4096 clients with primary and 1 async
standby, see [1]. I captured wait_events to see the contention on
WALBufMappingLock. I haven't noticed any contention on the lock and no
difference in TPS too, see [2] for results on HEAD, see [3] for
results on v6 patch which has "acquire WALBufMappingLock separately
for each requested WAL buffer page" strategy and see [4] for results
on v7 patch (attached herewith) which has "acquire WALBufMappingLock
once for all requested WAL buffer pages" strategy. Another thing to
note from the test results is that reduction in WALRead IO wait events
from 136 on HEAD to 1 on v6 or v7 patch. So, the read from WAL buffers
is really  helping here.

With these observations, I'd like to use the approach that acquires
WALBufMappingLock once for all requested WAL buffer pages unlike v6
and the previous patches.

I'm attaching the v7 patch set with this change for further review.

[1]
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
./pgbench --initialize --scale=100 postgres
./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c4096 -j4096 -T1800

[2]
HEAD:
done in 20.03 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.53 s, vacuum 0.19 s, primary keys 4.30 s).
tps = 11654.475345 (without initial connection time)

50950253  Lock| transactionid
16472447  Lock| tuple
3869523  LWLock  | LockManager
 739283  IPC | ProcArrayGroupUpdate
 718549  |
 439877  LWLock  | WALWrite
 130737  Client  | ClientRead
 121113  LWLock  | BufferContent
  70778  LWLock  | WALInsert
  43346  IPC | XactGroupUpdate
  18547
  18546  Activity| LogicalLauncherMain
  18545  Activity| AutoVacuumMain
  18272  Activity| ArchiverMain
  17627  Activity| WalSenderMain
  17207  Activity| WalWriterMain
  15455  IO  | WALSync
  14963  LWLock  | ProcArray
  14747  LWLock  | XactSLRU
  13943  Timeout | CheckpointWriteDelay
  10519  Activity| BgWriterHibernate
   8022  Activity| BgWriterMain
   4486  Timeout | SpinDelay
   4443  Activity| CheckpointerMain
   1435  Lock| extend
670  LWLock  | XidGen
373  IO  | WALWrite
283  Timeout | VacuumDelay
268  IPC | ArchiveCommand
249  Timeout | VacuumTruncate
136  IO  | WALRead
115  IO  | WALInitSync
 74  IO  | DataFileWrite
 67  IO  | WALInitWrite
 36  IO  | DataFileFlush
 35  IO  | DataFileExtend
 17  IO  | DataFileRead
  4  IO  | SLRUWrite
  3  IO  | BufFileWrite
  2  IO  | DataFileImmediateSync
  1 Tuples only is on.
  1  LWLock  | SInvalWrite
  1  LWLock  | LockFastPath
  1  IO  | ControlFileSyncUpdate

[3]
done in 19.99 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.52 s, vacuum 0.18 s, primary keys 4.28 s).
tps = 11689.584538 (without initial connection time)

50678977  Lock| transactionid
16252048  Lock| tuple
4146827  LWLock  | LockManager
 768256  |
 719923  IPC | ProcArrayGroupUpdate
 432836  LWLock  | WALWrite
 140354  Client  | ClientRead
 124203  

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-02 Thread Bharath Rupireddy
On Wed, Mar 1, 2023 at 2:39 PM Bharath Rupireddy
 wrote:
>
> Please find the attached v5 patch set for further review.

I simplified the code largely by moving the logic of reading the WAL
buffer page from a separate function to the main while loop. This
enabled me to get rid of XLogReadFromBuffersGuts() that v5 and other
previous patches have.

Please find the attached v6 patch set for further review. Meanwhile,
I'll continue to work on the review comment raised upthread -
https://www.postgresql.org/message-id/20230301041523.GA1453450%40nathanxps13.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From afb57cd61955f7fc0dba315f3f7c81604a0b47c9 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 2 Mar 2023 07:54:06 +
Subject: [PATCH v6] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c   | 144 
 src/backend/access/transam/xlogreader.c |  45 +++-
 src/include/access/xlog.h   |   6 +
 3 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..b29dc67c38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,150 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and set the read bytes to 'read_bytes'.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. The caller must be aware of
+ * this and deal with it.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+	TimeLineID tli,
+	Size count,
+	char *buf,
+	Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char*dst;
+	Sizenbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr origptr;
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+
+		origptr = ptr;
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Holding WALBufMappingLock ensures inserters don't overwrite this
+		 * value while we are reading it. We try to acquire it in shared mode
+		 * so that the concurrent WAL readers are also allowed. We try to do as
+		 * less work as possible while holding the lock to not impact
+		 * concurrent WAL writers much. We quickly exit to not cause any
+		 * contention, if the lock isn't immediately available.
+		 */
+		if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+			return;
+
+		endptr = XLogCtl->xlblocks[idx];
+
+		if (expectedEndPtr == endptr)
+		{
+			char	*page;
+			char*data;
+			XLogPageHeader	phdr;
+
+			/*
+			 * We found WAL buffer page containing given XLogRecPtr. Get
+			 * starting address of the page and a pointer to the right location
+			 * of given XLogRecPtr in that page.
+			 */
+			page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+			data = page + ptr % XLOG_BLCKSZ;
+
+			/* Read what is wanted, not the whole page. */
+			if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+			{
+/* All the bytes are in one page. */
+memcpy(dst, data, nbytes);
+*read_bytes += nbytes;
+nbytes = 0;
+			}
+			else
+			{
+Size	nread;
+
+/*
+ * All the bytes are not in one page. Read available bytes on
+ * the current page, copy them over to output buffer and
+ * continue to read remaining bytes.
+ */
+nread = XLOG_BLCKSZ - (data - page);
+Assert(nread > 0 && nread <= nbytes);
+memcpy(dst, data, nread);
+ptr += nread;
+nbytes -= nread;
+dst += nread;
+*read_bytes += nread;
+			}
+
+			/*
+			 * Release the lock as early as possible to avoid creating any
+			 * possible contention.
+			 */
+			LWLockRelease(WALBufMappingLock);
+
+			/*
+			 * The fact that we acquire WALBufMappingLock while reading the WAL
+			 * buffer page itself guarantees that no one else initializes it or
+			 * makes it ready for next use in AdvanceXLInsertBuffer().
+			 *
+			 * However, we perform basic page header checks for ensuring that
+			 * we are not reading a page that just got initialized. Callers
+			 * will anyway perform extensive page-level and record-level
+			 * checks.
+			 */
+			phdr = (XLogPageHeader) page;
+
+			if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+  phdr->xlp_pageaddr == (origptr - (origptr % XLOG_BLCKSZ)) &&
+  phdr->xlp_tli == tli))
+			{
+/*
+ * WAL buffer page doesn't look valid, so return with what we
+ * have read so far.
+ 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-03-01 Thread Bharath Rupireddy
On Wed, Mar 1, 2023 at 12:06 AM Kuntal Ghosh  wrote:
>
> On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy
>  wrote:
> >
> +/*
> + * Guts of XLogReadFromBuffers().
> + *
> + * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
> + * fetched WAL buffers on timeline 'tli' and return the read bytes.
> + */
> s/fetched WAL buffers/fetched from WAL buffers

Modified that comment a bit and moved it to the XLogReadFromBuffers.

> + else if (nread < nbytes)
> + {
> + /*
> + * We read some of the requested bytes. Continue to read remaining
> + * bytes.
> + */
> + ptr += nread;
> + nbytes -= nread;
> + dst += nread;
> + *read_bytes += nread;
> + }
>
> The 'if' condition should always be true. You can replace the same
> with an assertion instead.

Yeah, added an assert and changed that else if (nread < nbytes) to
else only condition.

> s/Continue to read remaining/Continue to read the remaining

Done.

> The good thing about this patch is that it reduces read IO calls
> without impacting the write performance (at least not that
> noticeable). It also takes us one step forward towards the
> enhancements mentioned in the thread.

Right.

> If performance is a concern, we
> can introduce a GUC to enable/disable this feature.

I didn't see any performance issues from my testing so far with 3
different pgbench cases
https://www.postgresql.org/message-id/CALj2ACWXHP6Ha1BfDB14txm%3DXP272wCbOV00mcPg9c6EXbnp5A%40mail.gmail.com.

While adding a GUC to enable/disable a feature sounds useful, IMHO it
isn't good for the user. Because we already have too many GUCs for the
user and we may not want all features to be defensive and add their
own GUCs. If at all, any bugs arise due to some corner-case we missed
to count in, we can surely help fix them. Having said this, I'm open
to suggestions here.

Please find the attached v5 patch set for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 121f98633541c340a89dc628cc3f02711abdee02 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 1 Mar 2023 08:40:30 +
Subject: [PATCH v5] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c   | 175 
 src/backend/access/transam/xlogreader.c |  45 +-
 src/include/access/xlog.h   |   6 +
 3 files changed, 224 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..98bd588ea0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,10 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 			  XLogRecPtr *PrevPtr);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static Size XLogReadFromBuffersGuts(XLogRecPtr ptr,
+	TimeLineID tli,
+	Size count,
+	char *page);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1643,177 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Guts of XLogReadFromBuffers().
+ */
+static Size
+XLogReadFromBuffersGuts(XLogRecPtr ptr,
+		TimeLineID tli,
+		Size count,
+		char *buf)
+{
+	XLogRecPtr	expectedEndPtr;
+	XLogRecPtr	endptr;
+	int 	idx;
+	Size	nread = 0;
+
+	idx = XLogRecPtrToBufIdx(ptr);
+	expectedEndPtr = ptr;
+	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as less work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return nread;
+
+	endptr = XLogCtl->xlblocks[idx];
+
+	if (expectedEndPtr == endptr)
+	{
+		char	*page;
+		char*data;
+		XLogPageHeader	phdr;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + count) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(buf, data, count);
+			nread = count;
+		}
+		else
+		{
+			Size nremaining;
+
+			/*
+			 * All the bytes are not in one page. Compute remaining bytes on
+			 * the current page, copy them over to output buffer.
+			 */
+			nremaining = XLOG_BLCKSZ - (data - page);
+			

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-28 Thread Nathan Bossart
On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote:
> On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart  
> wrote:
>> Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
>> preventing us from copying all the data we need in one go?
> 
> Note that most of the WALRead() callers request a single page of
> XLOG_BLCKSZ bytes even if the server has less or more available WAL
> pages. It's the streaming replication wal sender that can request less
> than XLOG_BLCKSZ bytes and upto MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And,
> if we read, say, MAX_SEND_SIZE at once while holding
> WALBufMappingLock, that might impact concurrent inserters (at least, I
> can say it in theory) - one of the main intentions of this patch is
> not to impact inserters much.

Perhaps we should test both approaches to see if there is a noticeable
difference.  It might not be great for concurrent inserts to repeatedly
take the lock, either.  If there's no real difference, we might be able to
simplify the code a bit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-28 Thread Kuntal Ghosh
On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy
 wrote:
>
+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and return the read bytes.
+ */
s/fetched WAL buffers/fetched from WAL buffers


+ else if (nread < nbytes)
+ {
+ /*
+ * We read some of the requested bytes. Continue to read remaining
+ * bytes.
+ */
+ ptr += nread;
+ nbytes -= nread;
+ dst += nread;
+ *read_bytes += nread;
+ }

The 'if' condition should always be true. You can replace the same
with an assertion instead.
s/Continue to read remaining/Continue to read the remaining

The good thing about this patch is that it reduces read IO calls
without impacting the write performance (at least not that
noticeable). It also takes us one step forward towards the
enhancements mentioned in the thread. If performance is a concern, we
can introduce a GUC to enable/disable this feature.

-- 
Thanks & Regards,
Kuntal Ghosh




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-27 Thread Bharath Rupireddy
On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart  wrote:
>
> On Wed, Feb 08, 2023 at 08:00:00PM +0530, Bharath Rupireddy wrote:
> > + /*
> > +  * We read some of the requested bytes. Continue to 
> > read remaining
> > +  * bytes.
> > +  */
> > + ptr += nread;
> > + nbytes -= nread;
> > + dst += nread;
> > + *read_bytes += nread;
>
> Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
> preventing us from copying all the data we need in one go?

Note that most of the WALRead() callers request a single page of
XLOG_BLCKSZ bytes even if the server has less or more available WAL
pages. It's the streaming replication wal sender that can request less
than XLOG_BLCKSZ bytes and upto MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And,
if we read, say, MAX_SEND_SIZE at once while holding
WALBufMappingLock, that might impact concurrent inserters (at least, I
can say it in theory) - one of the main intentions of this patch is
not to impact inserters much.

Therefore, I feel reading one WAL buffer page at a time, which works
for most of the cases, without impacting concurrent inserters much is
better - 
https://www.postgresql.org/message-id/CALj2ACWXHP6Ha1BfDB14txm%3DXP272wCbOV00mcPg9c6EXbnp5A%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-27 Thread Nathan Bossart
On Wed, Feb 08, 2023 at 08:00:00PM +0530, Bharath Rupireddy wrote:
> + /*
> +  * We read some of the requested bytes. Continue to 
> read remaining
> +  * bytes.
> +  */
> + ptr += nread;
> + nbytes -= nread;
> + dst += nread;
> + *read_bytes += nread;

Why do we only read a page at a time in XLogReadFromBuffersGuts()?  What is
preventing us from copying all the data we need in one go?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-08 Thread Bharath Rupireddy
On Wed, Feb 8, 2023 at 10:33 AM Dilip Kumar  wrote:
>
> On Wed, Feb 8, 2023 at 9:57 AM Bharath Rupireddy
>  wrote:
> >
> > I can also do a few other things, but before working on them, I'd like
> > to hear from others:
> > 1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
> > reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
> > used both for reading from WAL buffers and WAL files. Given the fact
> > that we won't wait for a lock or do a time-taking task while reading
> > from buffers, it seems unnecessary.
>
> Yes, we do not need this separate wait event and we also don't need
> WAIT_EVENT_WAL_READ wait event while reading from the buffer.  Because
> we are not performing any IO so no specific wait event is needed and
> for reading from the WAL buffer we are acquiring WALBufMappingLock so
> that lwlock event will be tracked under that lock.

Nope, LWLockConditionalAcquire doesn't wait, so no lock wait event (no
LWLockReportWaitStart) there. I agree to not have any wait event for
reading from WAL buffers as no IO is involved there. I removed it in
the attached v4 patch.

> > 2. A separate TAP test for verifying that the WAL is actually read
> > from WAL buffers - right now, existing tests for recovery,
> > subscription, pg_walinspect already cover the code, see [1]. However,
> > if needed, I can add a separate TAP test.
>
> Can we write a test that can actually validate that we have read from
> a WAL Buffer? If so then it would be good to have such a test to avoid
> any future breakage in that logic.  But if it is just for hitting the
> code but no guarantee that whether we can validate as part of the test
> whether it has hit the WAL buffer or not then I think the existing
> cases are sufficient.

We could set up a standby or a logical replication subscriber or
pg_walinspect extension and verify if the code got hit with the help
of the server log (DEBUG1) message added by the patch. However, this
can make the test volatile.

Therefore, I came up with a simple and small test module/extension
named test_wal_read_from_buffers under src/test/module. It basically
exposes a SQL-function given an LSN, it calls XLogReadFromBuffers()
and returns true if it hits WAL buffers, otherwise false. And the
simple TAP test of this module verifies if the function returns true.
I attached the test module as v4-0002 here. The test module looks
specific and also helps as demonstration of how one can possibly use
the new XLogReadFromBuffers().

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v4-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch
Description: Binary data


v4-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-07 Thread Dilip Kumar
On Wed, Feb 8, 2023 at 9:57 AM Bharath Rupireddy
 wrote:
>
> I can also do a few other things, but before working on them, I'd like
> to hear from others:
> 1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
> reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
> used both for reading from WAL buffers and WAL files. Given the fact
> that we won't wait for a lock or do a time-taking task while reading
> from buffers, it seems unnecessary.

Yes, we do not need this separate wait event and we also don't need
WAIT_EVENT_WAL_READ wait event while reading from the buffer.  Because
we are not performing any IO so no specific wait event is needed and
for reading from the WAL buffer we are acquiring WALBufMappingLock so
that lwlock event will be tracked under that lock.

> 2. A separate TAP test for verifying that the WAL is actually read
> from WAL buffers - right now, existing tests for recovery,
> subscription, pg_walinspect already cover the code, see [1]. However,
> if needed, I can add a separate TAP test.

Can we write a test that can actually validate that we have read from
a WAL Buffer? If so then it would be good to have such a test to avoid
any future breakage in that logic.  But if it is just for hitting the
code but no guarantee that whether we can validate as part of the test
whether it has hit the WAL buffer or not then I think the existing
cases are sufficient.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-07 Thread Bharath Rupireddy
On Tue, Feb 7, 2023 at 4:12 PM Dilip Kumar  wrote:
>
> On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
>  wrote:
>
> I have gone through this patch, I have some comments (mostly cosmetic
> and comments)

Thanks a lot for reviewing.

> From the above comments, it appears that we are reading from the exact
> pointer we are interested to read, but actually, we are reading
> the complete page.  I think this comment needs to be fixed and we can
> also explain why we read the complete page here.

I modified it. Please see the attached v3 patch.

> Generally, we use the name 'recptr' to represent XLogRecPtr type of
> variable, but in your case, it is actually data at that recptr, so
> better use some other name like 'buf' or 'buffer'.

Changed it to use 'data' as it seemed more appropriate than just a
buffer to not confuse with the WAL buffer page.

> 3.
> + if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
> + {
> + }
> + else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
> + {
>
> Why do you have this 'else if ((recptr + nbytes) > (page +
> XLOG_BLCKSZ))' check in the else part? why it is not directly else
> without a condition in 'if'?

Changed.

> I think we do not need 2 separate variables 'count' and '*read_bytes',
> just one variable for input/output is sufficient.  The original value
> can always be stored in some temp variable
> instead of the function argument.

We could do that, but for the sake of readability and not cluttering
the API, I kept it as-is.

Besides addressing the above review comments, I've made some more
changes - 1) I optimized the patch a bit by removing an extra memcpy.
Up until v2 patch, the entire WAL buffer page is returned and the
caller takes what is wanted from it. This adds an extra memcpy, so I
changed it to avoid extra memcpy and just copy what is wanted. 2) I
improved the comments.

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and WAL files. Given the fact
that we won't wait for a lock or do a time-taking task while reading
from buffers, it seems unnecessary.
2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.
3. Use the oldest initialized WAL buffer page to quickly tell if the
given LSN is present in WAL buffers without taking any lock - right
now, WALBufMappingLock is acquired to do so. While this doesn't seem
to impact much, it's good to optimize it away. But, the oldest
initialized WAL buffer page isn't tracked, so I've put up a patch and
sent in another thread [2]. Irrespective of [2], we are still good
with what we have in this patch.

[1]
recovery tests:
PATCHED: WAL buffers hit - 14759, misses - 3371

subscription tests:
PATCHED: WAL buffers hit - 1972, misses - 32616

pg_walinspect tests:
PATCHED: WAL buffers hit - 8, misses - 8

[2] 
https://www.postgresql.org/message-id/CALj2ACVgi6LirgLDZh%3DFdfdvGvKAD%3D%3DWTOSWcQy%3DAtNgPDVnKw%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From e6670c4098930f87e9b4dce2f21e73e0c0ce9361 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 8 Feb 2023 03:48:32 +
Subject: [PATCH v3] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c   | 178 
 src/backend/access/transam/xlogreader.c |  47 ++-
 src/include/access/xlog.h   |   6 +
 3 files changed, 229 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..e4679d42f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,10 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 			  XLogRecPtr *PrevPtr);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static Size XLogReadFromBuffersGuts(XLogRecPtr ptr,
+	TimeLineID tli,
+	Size count,
+	char *page);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1643,180 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and return the read bytes.
+ */
+static Size
+XLogReadFromBuffersGuts(XLogRecPtr ptr,
+		TimeLineID tli,
+		Size count,
+		char *buf)
+{
+	XLogRecPtr	expectedEndPtr;
+	

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-02-07 Thread Dilip Kumar
On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
 wrote:
>

I have gone through this patch, I have some comments (mostly cosmetic
and comments)

1.
+ /*
+ * We have found the WAL buffer page holding the given LSN. Read from a
+ * pointer to the right offset within the page.
+ */
+ memcpy(page, (XLogCtl->pages + idx * (Size) XLOG_BLCKSZ),
+(Size) XLOG_BLCKSZ);


>From the above comments, it appears that we are reading from the exact
pointer we are interested to read, but actually, we are reading
the complete page.  I think this comment needs to be fixed and we can
also explain why we read the complete page here.

2.
+static char *
+GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
+{
+ XLogRecPtr expectedEndPtr;
+ XLogRecPtr endptr;
+ int idx;
+ char*recptr = NULL;

Generally, we use the name 'recptr' to represent XLogRecPtr type of
variable, but in your case, it is actually data at that recptr, so
better use some other name like 'buf' or 'buffer'.


3.
+ if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are in one page. */
+ memcpy(dst, recptr, nbytes);
+ dst += nbytes;
+ *read_bytes += nbytes;
+ ptr += nbytes;
+ nbytes = 0;
+ }
+ else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are not in one page. */
+ Size bytes_remaining;

Why do you have this 'else if ((recptr + nbytes) > (page +
XLOG_BLCKSZ))' check in the else part? why it is not directly else
without a condition in 'if'?

4.
+XLogReadFromBuffers(XLogRecPtr startptr,
+ TimeLineID tli,
+ Size count,
+ char *buf,
+ Size *read_bytes)

I think we do not need 2 separate variables 'count' and '*read_bytes',
just one variable for input/output is sufficient.  The original value
can always be stored in some temp variable
instead of the function argument.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-27 Thread Bharath Rupireddy
On Fri, Jan 27, 2023 at 12:16 PM Masahiko Sawada  wrote:
>
> I'd like to confirm whether there is any performance regression caused
> by this patch in some cases, especially when not using DIO.

Thanks. I ran some insert tests with primary and 1 async standby.
Please see the numbers below and attached graphs. I've not noticed a
regression as such, in fact, with patch, there's a slight improvement.
Note that there's no WAL DIO involved here.

test-case 1:
clientsHEADPATCHED
1139156
2624599
431133410
861946433
161125511722
322245521658
644607247103
1288025585970
256110067111488
512114043118094
768109588111892
1024106144109361
20488580890745
40965591153755

test-case 2:
clientsHEADPATCHED
1177128
2186425
421142946
858355840
161065411199
321407113959
641809217519
1282729828274
2562460024843
5121713919450
7681677820473
10241829420209
20481289813920
409663996815

test-case 3:
clientsHEADPATCHED
1148191
2302317
434153243
858646193
16957310267
321406915819
641742418453
1282449329192
2563318038250
5123556836551
7682973130317
10243229132124
20482796428933
40961370215034

[1]
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
EOF

test-case 1:
./pgbench -i -s 300 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 10 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

test-case 2:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

test-case 3:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b simple-update
-c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-26 Thread Masahiko Sawada
On Fri, Jan 27, 2023 at 3:17 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:
> > If I'm understanding this result correctly, it seems to me that your
> > patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
> > BUFFERS READ), but there seems no visible performance gain with only
> > your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
> > patch should be included in the WAL DIO patch rather than applying it
> > alone. Am I missing something?
>
> We already support using DIO for WAL - it's just restricted in a way that
> makes it practically not usable. And the reason for that is precisely that
> walsenders need to read the WAL. See get_sync_bit():
>
> /*
>  * Optimize writes by bypassing kernel cache with O_DIRECT when using
>  * O_SYNC and O_DSYNC.  But only if archiving and streaming are 
> disabled,
>  * otherwise the archive command or walsender process will read the 
> WAL
>  * soon after writing it, which is guaranteed to cause a physical 
> read if
>  * we bypassed the kernel cache. We also skip the
>  * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the 
> same
>  * reason.
>  *
>  * Never use O_DIRECT in walreceiver process for similar reasons; the 
> WAL
>  * written by walreceiver is normally read by the startup process soon
>  * after it's written. Also, walreceiver performs unaligned writes, 
> which
>  * don't work with O_DIRECT, so it is required for correctness too.
>  */
> if (!XLogIsNeeded() && !AmWalReceiverProcess())
> o_direct_flag = PG_O_DIRECT;
>
>
> Even if that weren't the case, splitting up bigger commits in incrementally
> committable chunks is a good idea.

Agreed. I was wondering about the fact that the test result doesn't
show things to satisfy the first motivation of this patch, which is to
improve performance by reducing disk I/O and system calls regardless
of the DIO patch. But it makes sense to me that this patch is a part
of the DIO patch series.

I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-26 Thread Andres Freund
Hi,

On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:
> If I'm understanding this result correctly, it seems to me that your
> patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
> BUFFERS READ), but there seems no visible performance gain with only
> your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
> patch should be included in the WAL DIO patch rather than applying it
> alone. Am I missing something?

We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit():

/*
 * Optimize writes by bypassing kernel cache with O_DIRECT when using
 * O_SYNC and O_DSYNC.  But only if archiving and streaming are 
disabled,
 * otherwise the archive command or walsender process will read the WAL
 * soon after writing it, which is guaranteed to cause a physical read 
if
 * we bypassed the kernel cache. We also skip the
 * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the 
same
 * reason.
 *
 * Never use O_DIRECT in walreceiver process for similar reasons; the 
WAL
 * written by walreceiver is normally read by the startup process soon
 * after it's written. Also, walreceiver performs unaligned writes, 
which
 * don't work with O_DIRECT, so it is required for correctness too.
 */
if (!XLogIsNeeded() && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;


Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-26 Thread Masahiko Sawada
On Thu, Jan 26, 2023 at 2:33 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 26, 2023 at 2:45 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> > > On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > > > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > > > Please review the attached v2 patch further.
> > > >
> > > > I'm still unclear on the performance goals of this patch. I see that it
> > > > will reduce syscalls, which sounds good, but to what end?
> > > >
> > > > Does it allow a greater number of walsenders? Lower replication
> > > > latency? Less IO bandwidth? All of the above?
> > >
> > > One benefit would be that it'd make it more realistic to use direct IO 
> > > for WAL
> > > - for which I have seen significant performance benefits. But when we
> > > afterwards have to re-read it from disk to replicate, it's less clearly a 
> > > win.
> >
> > Satya's email just now reminded me of another important reason:
> >
> > Eventually we should add the ability to stream out WAL *before* it has 
> > locally
> > been written out and flushed. Obviously the relevant positions would have to
> > be noted in the relevant message in the streaming protocol, and we couldn't
> > generally allow standbys to apply that data yet.
> >
> > That'd allow us to significantly reduce the overhead of synchronous
> > replication, because instead of commonly needing to send out all the pending
> > WAL at commit, we'd just need to send out the updated flush position. The
> > reason this would lower the overhead is that:
> >
> > a) The reduced amount of data to be transferred reduces latency - it's easy 
> > to
> >accumulate a few TCP packets worth of data even in a single small OLTP
> >transaction
> > b) The remote side can start to write out data earlier
> >
> >
> > Of course this would require additional infrastructure on the receiver
> > side. E.g. some persistent state indicating up to where WAL is allowed to be
> > applied, to avoid the standby getting ahead of th eprimary, in case the
> > primary crash-restarts (or has more severe issues).
> >
> >
> > With a bit of work we could perform WAL replay on standby without waiting 
> > for
> > the fdatasync of the received WAL - that only needs to happen when a) we 
> > need
> > to confirm a flush position to the primary b) when we need to write back 
> > pages
> > from the buffer pool (and some other things).
>
> Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
> is right, the eventual plan is to do a bunch of other stuff as
> described above and we've discussed this in another thread (see
> below). I would like to once again clarify motivation behind this
> feature:
>
> 1. It enables WAL readers (callers of WALRead() - wal senders,
> pg_walinspect etc.) to use WAL buffers as first level cache which
> might reduce number of IOPS at a peak load especially when the pread()
> results in a disk read (WAL isn't available in OS page cache). I had
> earlier presented the buffer hit ratio/amount of pread() system calls
> reduced with wal senders in the first email of this thread (95% of the
> time wal senders are able to read from WAL buffers without impacting
> anybody). Now, here are the results with the WAL DIO patch [1] - where
> WAL pread() turns into a disk read, see the results [2] and attached
> graph.
>
> 2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
> page cache, using WAL buffers as read cache helps a lot. It is clearly
> evident from my experiment with WAL DIO patch [1], see the results [2]
> and attached graph. As expected, WAL DIO brings down the TPS, whereas
> WAL buffers read i.e. this patch brings it up.
>
> [2] Test case is an insert pgbench workload.
> clientsHEADWAL DIOWAL DIO & WAL BUFFERS READWAL BUFFERS READ
> 11404107014241375
> 2148779614541517
> 43064174330113019
> 86114355660265954
> 161156070511221612132
> 3223181130792344923561
> 6443607269834399745636
> 12880723451698151581911
> 25611092590185107332114046
> 512119354109817110287117506
> 768112435105795106853111605
> 1024107554105541105942109370
> 204888552790248069990555
> 409661323548145870461743

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
BUFFERS READ), but there seems no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-25 Thread Bharath Rupireddy
On Thu, Jan 26, 2023 at 2:45 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> > On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > > Please review the attached v2 patch further.
> > >
> > > I'm still unclear on the performance goals of this patch. I see that it
> > > will reduce syscalls, which sounds good, but to what end?
> > >
> > > Does it allow a greater number of walsenders? Lower replication
> > > latency? Less IO bandwidth? All of the above?
> >
> > One benefit would be that it'd make it more realistic to use direct IO for 
> > WAL
> > - for which I have seen significant performance benefits. But when we
> > afterwards have to re-read it from disk to replicate, it's less clearly a 
> > win.
>
> Satya's email just now reminded me of another important reason:
>
> Eventually we should add the ability to stream out WAL *before* it has locally
> been written out and flushed. Obviously the relevant positions would have to
> be noted in the relevant message in the streaming protocol, and we couldn't
> generally allow standbys to apply that data yet.
>
> That'd allow us to significantly reduce the overhead of synchronous
> replication, because instead of commonly needing to send out all the pending
> WAL at commit, we'd just need to send out the updated flush position. The
> reason this would lower the overhead is that:
>
> a) The reduced amount of data to be transferred reduces latency - it's easy to
>accumulate a few TCP packets worth of data even in a single small OLTP
>transaction
> b) The remote side can start to write out data earlier
>
>
> Of course this would require additional infrastructure on the receiver
> side. E.g. some persistent state indicating up to where WAL is allowed to be
> applied, to avoid the standby getting ahead of th eprimary, in case the
> primary crash-restarts (or has more severe issues).
>
>
> With a bit of work we could perform WAL replay on standby without waiting for
> the fdatasync of the received WAL - that only needs to happen when a) we need
> to confirm a flush position to the primary b) when we need to write back pages
> from the buffer pool (and some other things).

Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
is right, the eventual plan is to do a bunch of other stuff as
described above and we've discussed this in another thread (see
below). I would like to once again clarify motivation behind this
feature:

1. It enables WAL readers (callers of WALRead() - wal senders,
pg_walinspect etc.) to use WAL buffers as first level cache which
might reduce number of IOPS at a peak load especially when the pread()
results in a disk read (WAL isn't available in OS page cache). I had
earlier presented the buffer hit ratio/amount of pread() system calls
reduced with wal senders in the first email of this thread (95% of the
time wal senders are able to read from WAL buffers without impacting
anybody). Now, here are the results with the WAL DIO patch [1] - where
WAL pread() turns into a disk read, see the results [2] and attached
graph.

2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
WAL buffers read i.e. this patch brings it up.

3. As Andres rightly mentioned above, it enables flushing WAL in
parallel on primary and all standbys [3]. I haven't yet started work
on this, I will aim for PG 17.

4. It will make the work on - disallow async standbys or subscribers
getting ahead of the sync standbys [3] possible. I haven't yet started
work on this, I will aim for PG 17.

5. It implements the following TODO item specified near WALRead():
 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(XLogReaderState *state,

That said, this feature is separately reviewable and perhaps can go
separately as it has its own benefits.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_%3DX7HztA%40mail.gmail.com

[2] Test case is an insert pgbench workload.
clientsHEADWAL DIOWAL DIO & WAL BUFFERS READWAL BUFFERS READ
11404107014241375
2148779614541517
43064174330113019
86114355660265954
161156070511221612132
3223181130792344923561
6443607269834399745636
12880723451698151581911
25611092590185107332114046
512119354109817110287117506
768112435105795106853111605
1024107554105541105942109370
204888552790248069990555
409661323548145870461743

[3]

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-25 Thread SATYANARAYANA NARLAPURAM
On Sat, Jan 14, 2023 at 12:34 PM Andres Freund  wrote:

> Hi,
>
> On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > Please review the attached v2 patch further.
> >
> > I'm still unclear on the performance goals of this patch. I see that it
> > will reduce syscalls, which sounds good, but to what end?
> >
> > Does it allow a greater number of walsenders? Lower replication
> > latency? Less IO bandwidth? All of the above?
>
> One benefit would be that it'd make it more realistic to use direct IO for
> WAL
> - for which I have seen significant performance benefits. But when we
> afterwards have to re-read it from disk to replicate, it's less clearly a
> win.
>

 +1. Archive modules rely on reading the wal files for PITR. Direct IO for
WAL requires reading these files from disk anyways for archival. However,
Archiving using utilities like pg_receivewal can take advantage of this
patch together with direct IO for WAL.

Thanks,
Satya


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-25 Thread Andres Freund
Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
> On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> > On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > > Please review the attached v2 patch further.
> > 
> > I'm still unclear on the performance goals of this patch. I see that it
> > will reduce syscalls, which sounds good, but to what end?
> > 
> > Does it allow a greater number of walsenders? Lower replication
> > latency? Less IO bandwidth? All of the above?
> 
> One benefit would be that it'd make it more realistic to use direct IO for WAL
> - for which I have seen significant performance benefits. But when we
> afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
   accumulate a few TCP packets worth of data even in a single small OLTP
   transaction
b) The remote side can start to write out data earlier


Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of th eprimary, in case the
primary crash-restarts (or has more severe issues).


With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).


Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-14 Thread Andres Freund
Hi,

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
> On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> > Please review the attached v2 patch further.
> 
> I'm still unclear on the performance goals of this patch. I see that it
> will reduce syscalls, which sounds good, but to what end?
> 
> Does it allow a greater number of walsenders? Lower replication
> latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Greetings,

Andres Freund




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2023-01-14 Thread Jeff Davis
On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
> Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-26 Thread Bharath Rupireddy
On Sun, Dec 25, 2022 at 4:55 PM Dilip Kumar  wrote:
>
> > > This adds copying of the whole page (at least) at every WAL *record*
> > > read,
> >
> > In the worst case yes, but that may not always be true. On a typical
> > production server with decent write traffic, it happens that the
> > callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
> > MAX_SEND_SIZE bytes.
>
> I agree with this.
>
> > > This patch copies the bleeding edge WAL page without recording the
> > > (next) insertion point nor checking whether all in-progress insertion
> > > behind the target LSN have finished. Thus the copied page may have
> > > holes.  That being said, the sequential-reading nature and the fact
> > > that WAL buffers are zero-initialized may make it work for recovery,
> > > but I don't think this also works for replication.
> >
> > WALRead() callers are smart enough to take the flushed bytes only.
> > Although they read the whole WAL page, they calculate the valid bytes.
>
> Right
>
> On first read the patch looks good, although it needs some more
> thoughts on 'XXX' comments in the patch.

Thanks a lot for reviewing.

Here are some open points that I mentioned in v1 patch:

1.
+ * XXX: Perhaps, measuring the immediate lock availability and its impact
+ * on concurrent WAL writers is a good idea here.

It was shown in my testng upthread [1] that the patch does no harm in
this regard. It will be great if other members try testing in their
respective environments and use cases.

2.
+ * XXX: Perhaps, returning if lock is not immediately available a good idea
+ * here. The caller can then go ahead with reading WAL from WAL file.

After thinking a bit more on this, ISTM that doing the above is right
to not cause any contention when the lock is busy. I've done so in the
v2 patch.

3.
+ * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers
+ * a good idea here. This avoids unnecessary lock acquire-release cycles.
+ * One way to do that is by maintaining oldest WAL record that's currently
+ * present in WAL buffers.

I think by doing the above we might end up creating a new point of
contention. Because shared variables to track min and max available
LSNs in the WAL buffers will need to be protected against all the
concurrent writers. Also, with the change that's done in (2) above,
that is, quickly exiting if the lock was busy, this comment seems
unnecessary to worry about. Hence, I decided to leave it there.

4.
+ * XXX: Perhaps, we can further go and validate the found page header,
+ * record header and record at least in assert builds, something like
+ * the xlogreader.c does and return if any of those validity checks
+ * fail. Having said that, we stick to the minimal checks for now.

I was being over-cautious initially. The fact that we acquire
WALBufMappingLock while reading the needed WAL buffer page itself
guarantees that no one else initializes it/makes it ready for next use
in AdvanceXLInsertBuffer(). The checks that we have for page header
(xlp_magic, xlp_pageaddr and xlp_tli) in the patch are enough for us
to ensure that we're not reading a page that got just initialized. The
callers will anyway perform extensive checks on page and record in
XLogReaderValidatePageHeader() and ValidXLogRecordHeader()
respectively. If any such failures occur after reading WAL from WAL
buffers, then that must be treated as a bug IMO. Hence, I don't think
we need to do the above.

> And also I do not like that XLogReadFromBuffers() is using 3 bools
> hit/partial hit/miss, instead of this we can use an enum or some
> tristate variable, I think that will be cleaner.

Yeah, that seems more verbose, all that information can be deduced
from requested bytes and read bytes, I've done so in the v2 patch.

Please review the attached v2 patch further.

I'm also attaching two helper patches (as .txt files) herewith for
testing that basically adds WAL read stats -
USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt - apply on HEAD and
monitor pg_stat_replication for per-walsender WAL read from WAL file
stats. USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt -
apply on v2 patch and monitor pg_stat_replication for per-walsender
WAL read from WAL buffers and WAL file stats.

[1] 
https://www.postgresql.org/message-id/CALj2ACXUbvON86vgwTkum8ab3bf1%3DHkMxQ5hZJZS3ZcJn8NEXQ%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 6517e50f482f88ea5185609ff4dcf0e0256475d5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 26 Dec 2022 08:14:11 +
Subject: [PATCH v2] Collect WAL read from file stats for WAL senders

---
 doc/src/sgml/monitoring.sgml| 31 +++
 src/backend/access/transam/xlogreader.c | 33 ++--
 src/backend/access/transam/xlogutils.c  |  2 +-
 src/backend/catalog/system_views.sql|  5 +-
 

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-25 Thread Dilip Kumar
On Fri, Dec 23, 2022 at 3:46 PM Bharath Rupireddy
 wrote:
>
> On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
>  wrote:
> >
>
> Thanks for providing thoughts.
>
> > At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy 
> >  wrote in
> > > The patch introduces concurrent readers for the WAL buffers, so far
> > > only there are concurrent writers. In the patch, WALRead() takes just
> > > one lock (WALBufMappingLock) in shared mode to enable concurrent
> > > readers and does minimal things - checks if the requested WAL page is
> > > present in WAL buffers, if so, copies the page and releases the lock.
> > > I think taking just WALBufMappingLock is enough here as the concurrent
> > > writers depend on it to initialize and replace a page in WAL buffers.
> > >
> > > I'll add this to the next commitfest.
> > >
> > > Thoughts?
> >
> > This adds copying of the whole page (at least) at every WAL *record*
> > read,
>
> In the worst case yes, but that may not always be true. On a typical
> production server with decent write traffic, it happens that the
> callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
> MAX_SEND_SIZE bytes.

I agree with this.

> > This patch copies the bleeding edge WAL page without recording the
> > (next) insertion point nor checking whether all in-progress insertion
> > behind the target LSN have finished. Thus the copied page may have
> > holes.  That being said, the sequential-reading nature and the fact
> > that WAL buffers are zero-initialized may make it work for recovery,
> > but I don't think this also works for replication.
>
> WALRead() callers are smart enough to take the flushed bytes only.
> Although they read the whole WAL page, they calculate the valid bytes.

Right

On first read the patch looks good, although it needs some more
thoughts on 'XXX' comments in the patch.

And also I do not like that XLogReadFromBuffers() is using 3 bools
hit/partial hit/miss, instead of this we can use an enum or some
tristate variable, I think that will be cleaner.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-23 Thread Bharath Rupireddy
On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
 wrote:
>

Thanks for providing thoughts.

> At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy 
>  wrote in
> > The patch introduces concurrent readers for the WAL buffers, so far
> > only there are concurrent writers. In the patch, WALRead() takes just
> > one lock (WALBufMappingLock) in shared mode to enable concurrent
> > readers and does minimal things - checks if the requested WAL page is
> > present in WAL buffers, if so, copies the page and releases the lock.
> > I think taking just WALBufMappingLock is enough here as the concurrent
> > writers depend on it to initialize and replace a page in WAL buffers.
> >
> > I'll add this to the next commitfest.
> >
> > Thoughts?
>
> This adds copying of the whole page (at least) at every WAL *record*
> read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.

> fighting all WAL writers by taking WALBufMappingLock on a very
> busy page while the copying. I'm a bit doubtful that it results in an
> overall improvement.

Well, the tests don't reflect that [1], I've run an insert work load
[2]. The WAL is being read from WAL buffers 99% of the time, which is
pretty cool. If you have any use-cases in mind, please share them
and/or feel free to run at your end.

> It seems to me almost all pread()s here happens
> on file buffer so it is unclear to me that copying a whole WAL page
> (then copying the target record again) wins over a pread() call that
> copies only the record to read.

That's not always guaranteed. Imagine a typical production server with
decent write traffic and heavy analytical queries (which fills OS page
cache with the table pages accessed for the queries), the WAL pread()
calls turn to IOPS. Despite the WAL being present in WAL buffers,
customers will be paying unnecessarily for these IOPS too. With the
patch, we are basically avoiding the pread() system calls which may
turn into IOPS on production servers (99% of the time for the insert
use case [1][2], 95% of the time for pgbench use case specified
upthread). With the patch, WAL buffers can act as L1 cache, if one
calls OS page cache as L2 cache (of course this illustration is not
related to the typical processor L1 and L2 ... caches).

> Do you have an actual number of how
> frequent WAL reads go to disk, or the actual number of performance
> gain or real I/O reduction this patch offers?

It might be a bit tough to generate such heavy traffic. An idea is to
ensure the WAL page/file goes out of the OS page cache before
WALRead() - these might help here - 0002 patch from
https://www.postgresql.org/message-id/CA%2BhUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_%3DX7HztA%40mail.gmail.com
and tool https://github.com/klando/pgfincore.

> This patch copies the bleeding edge WAL page without recording the
> (next) insertion point nor checking whether all in-progress insertion
> behind the target LSN have finished. Thus the copied page may have
> holes.  That being said, the sequential-reading nature and the fact
> that WAL buffers are zero-initialized may make it work for recovery,
> but I don't think this also works for replication.

WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.

> I remember that the one of the advantage of reading the on-memory WAL
> records is that that allows walsender to presend the unwritten
> records. So perhaps we should manage how far the buffer is filled with
> valid content (or how far we can presend) in this feature.

Yes, the non-flushed WAL can be read and sent across if one wishes to
to make replication faster and parallel flushing on primary and
standbys at the cost of a bit of extra crash handling, that's
mentioned here 
https://www.postgresql.org/message-id/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com.
However, this can be a separate discussion.

I also want to reiterate that the patch implemented a TODO item:

 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(XLogReaderState *state,

[1]
PATCHED:
1 1470.329907
2 1437.096329
4 2966.096948
8 5978.441040
16 11405.538255
32 22933.546058
64 43341.870038
128 73623.837068
256 104754.248661
512 115746.359530
768 106106.691455
1024 91900.079086
2048 84134.278589
4096 62580.875507

-[ RECORD 1 ]--+---
application_name   | assb1
sent_lsn   | 0/1B8106A8
write_lsn  | 0/1B8106A8
flush_lsn  | 0/1B8106A8
replay_lsn | 0/1B8106A8
write_lag  |
flush_lag  |
replay_lag |
wal_read   | 104
wal_read_bytes | 10733008
wal_read_time  | 1.845
wal_read_buffers   | 76662
wal_read_bytes_buffers | 383598808

Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-11 Thread Kyotaro Horiguchi
Sorry for the confusion.

At Mon, 12 Dec 2022 12:06:36 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > This patch copies the bleeding edge WAL page without recording the
> > (next) insertion point nor checking whether all in-progress insertion
> > behind the target LSN have finished. Thus the copied page may have
> > holes.  That being said, the sequential-reading nature and the fact
> > that WAL buffers are zero-initialized may make it work for recovery,
> > but I don't think this also works for replication.
> 
> Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
> records. Please forget about recovery.

NO... Logical walsenders do that. So, please forget about this...

> > I remember that the one of the advantage of reading the on-memory WAL
> > records is that that allows walsender to presend the unwritten
> > records. So perhaps we should manage how far the buffer is filled with
> > valid content (or how far we can presend) in this feature.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-11 Thread Kyotaro Horiguchi
At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> This patch copies the bleeding edge WAL page without recording the
> (next) insertion point nor checking whether all in-progress insertion
> behind the target LSN have finished. Thus the copied page may have
> holes.  That being said, the sequential-reading nature and the fact
> that WAL buffers are zero-initialized may make it work for recovery,
> but I don't think this also works for replication.

Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.

> I remember that the one of the advantage of reading the on-memory WAL
> records is that that allows walsender to presend the unwritten
> records. So perhaps we should manage how far the buffer is filled with
> valid content (or how far we can presend) in this feature.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-11 Thread Kyotaro Horiguchi
At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy 
 wrote in 
> The patch introduces concurrent readers for the WAL buffers, so far
> only there are concurrent writers. In the patch, WALRead() takes just
> one lock (WALBufMappingLock) in shared mode to enable concurrent
> readers and does minimal things - checks if the requested WAL page is
> present in WAL buffers, if so, copies the page and releases the lock.
> I think taking just WALBufMappingLock is enough here as the concurrent
> writers depend on it to initialize and replace a page in WAL buffers.
> 
> I'll add this to the next commitfest.
> 
> Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read, fighting all WAL writers by taking WALBufMappingLock on a very
busy page while the copying. I'm a bit doubtful that it results in an
overall improvement. It seems to me almost all pread()s here happens
on file buffer so it is unclear to me that copying a whole WAL page
(then copying the target record again) wins over a pread() call that
copies only the record to read. Do you have an actual number of how
frequent WAL reads go to disk, or the actual number of performance
gain or real I/O reduction this patch offers?

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes.  That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

I remember that the one of the advantage of reading the on-memory WAL
records is that that allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Improve WALRead() to suck data directly from WAL buffers when possible

2022-12-09 Thread Bharath Rupireddy
Hi,

WALRead() currently reads WAL from the WAL file on the disk, which
means, the walsenders serving streaming and logical replication
(callers of WALRead()) will have to hit the disk/OS's page cache for
reading the WAL. This may increase the amount of read IO required for
all the walsenders put together as one typically maintains many
standbys/subscribers on production servers for high availability,
disaster recovery, read-replicas and so on. Also, it may increase
replication lag if all the WAL reads are always hitting the disk.

It may happen that WAL buffers contain the requested WAL, if so, the
WALRead() can attempt to read from the WAL buffers first before
reading from the file. If the read hits the WAL buffers, then reading
from the file on disk is avoided. This mainly reduces the read IO/read
system calls. It also enables us to do other features specified
elsewhere [1].

I'm attaching a patch that implements the idea which is also noted
elsewhere [2]. I've run some tests [3]. The WAL buffers hit ratio with
the patch stood at 95%, in other words, the walsenders avoided 95% of
the time reading from the file. The benefit, if measured in terms of
the amount of data - 79% (13.5GB out of total 17GB) of the requested
WAL is read from the WAL buffers as opposed to 21% from the file. Note
that the WAL buffers hit ratio can be very low for write-heavy
workloads, in which case, file reads are inevitable.

The patch introduces concurrent readers for the WAL buffers, so far
only there are concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.

I'll add this to the next commitfest.

Thoughts?

[1] 
https://www.postgresql.org/message-id/CALj2ACXCSM%2BsTR%3D5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w%40mail.gmail.com

[2]
 * XXX probably this should be improved to suck data directly from the
 * WAL buffers when possible.
 */
bool
WALRead(XLogReaderState *state,

[3]
1 primary, 1 sync standby, 1 async standby
./pgbench --initialize --scale=300 postgres
./pgbench --jobs=16 --progress=300 --client=32 --time=900
--username=ubuntu postgres

PATCHED:
-[ RECORD 1 ]--+
application_name   | assb1
wal_read   | 31005
wal_read_bytes | 3800607104
wal_read_time  | 779.402
wal_read_buffers   | 610611
wal_read_bytes_buffers | 14493226440
wal_read_time_buffers  | 3033.309
sync_state | async
-[ RECORD 2 ]--+
application_name   | ssb1
wal_read   | 31027
wal_read_bytes | 3800932712
wal_read_time  | 696.365
wal_read_buffers   | 610580
wal_read_bytes_buffers | 14492900832
wal_read_time_buffers  | 2989.507
sync_state | sync

HEAD:
-[ RECORD 1 ]+
application_name | assb1
wal_read | 705627
wal_read_bytes   | 18343480640
wal_read_time| 7607.783
sync_state   | async
-[ RECORD 2 ]+
application_name | ssb1
wal_read | 705625
wal_read_bytes   | 18343480640
wal_read_time| 4539.058
sync_state   | sync

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch
Description: Binary data