Re: Handing off SLRU fsyncs to the checkpointer

2021-01-04 Thread Thomas Munro
On Mon, Jan 4, 2021 at 3:35 AM Tomas Vondra
 wrote:
> Seems this commit left behind a couple unnecessary prototypes in a bunch
> of header files. In particular, it removed these functions
>
> - ShutdownCLOG();
> - ShutdownCommitTs();
> - ShutdownSUBTRANS();
> - ShutdownMultiXact();

Thanks.  Fixed.




Re: Handing off SLRU fsyncs to the checkpointer

2021-01-03 Thread Tomas Vondra

On 9/25/20 9:09 AM, Thomas Munro wrote:

On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro  wrote:

Here's a new version.  The final thing I'm contemplating before
pushing this is whether there may be hidden magical dependencies in
the order of operations in CheckPointGuts(), which I've changed
around.  Andres, any comments?


I nagged Andres off-list and he opined that it might be better to
reorder it a bit so that ProcessSyncRequests() comes after almost
everything else, so that if we ever teach more things to offload their
fsync work it'll be in the right order.  I reordered it like that; now
only CheckPointTwoPhase() comes later, based on the comment that
accompanies it.  In any case, we can always reconsider the ordering of
this function in later commits as required.  Pushed like that.



Seems this commit left behind a couple unnecessary prototypes in a bunch 
of header files. In particular, it removed these functions


- ShutdownCLOG();
- ShutdownCommitTs();
- ShutdownSUBTRANS();
- ShutdownMultiXact();

but we still have

$ git grep ShutdownCLOG
src/include/access/clog.h:extern void ShutdownCLOG(void);


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Handing off SLRU fsyncs to the checkpointer

2020-09-25 Thread Thomas Munro
On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro  wrote:
> Here's a new version.  The final thing I'm contemplating before
> pushing this is whether there may be hidden magical dependencies in
> the order of operations in CheckPointGuts(), which I've changed
> around.  Andres, any comments?

I nagged Andres off-list and he opined that it might be better to
reorder it a bit so that ProcessSyncRequests() comes after almost
everything else, so that if we ever teach more things to offload their
fsync work it'll be in the right order.  I reordered it like that; now
only CheckPointTwoPhase() comes later, based on the comment that
accompanies it.  In any case, we can always reconsider the ordering of
this function in later commits as required.  Pushed like that.
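
For reference, the resulting ordering looks roughly like the sketch below (the function names are the existing ones from xlog.c, but the surrounding trace points and CheckpointStats bookkeeping are omitted, so treat this as an illustration of the ordering rather than the committed text):

static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
    CheckPointRelationMap();
    CheckPointReplicationSlots();
    CheckPointSnapBuild();
    CheckPointLogicalRewriteHeap();
    CheckPointReplicationOrigin();

    /* Write out all dirty data in SLRUs and the main buffer pool. */
    CheckPointCLOG();
    CheckPointCommitTs();
    CheckPointSUBTRANS();
    CheckPointMultiXact();
    CheckPointPredicate();
    CheckPointBuffers(flags);

    /*
     * Perform all the fsyncs queued up by the steps above (and by regular
     * backends since the last checkpoint), near the end of the checkpoint.
     */
    ProcessSyncRequests();

    /* We deliberately delay 2PC checkpointing as long as possible. */
    CheckPointTwoPhase(checkPointRedo);
}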




Re: Handing off SLRU fsyncs to the checkpointer

2020-09-24 Thread Thomas Munro
On Fri, Sep 25, 2020 at 12:05 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Tom, do you have any thoughts on ShutdownCLOG() etc?
>
> Hm, if we cannot reach that without first completing a shutdown checkpoint,
> it does seem a little pointless.

Thanks for the sanity check.

> It'd likely be a good idea to add a comment to CheckPointCLOG et al
> explaining that we expect $what-exactly to fsync the data we are writing
> before the checkpoint is considered done.

Good point.  Done like this:

+   /*
+    * Write dirty CLOG pages to disk.  This may result in sync requests queued
+    * for later handling by ProcessSyncRequests(), as part of the checkpoint.
+    */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-   SimpleLruFlush(XactCtl, true);
+   SimpleLruWriteAll(XactCtl, true);
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);

Here's a new version.  The final thing I'm contemplating before
pushing this is whether there may be hidden magical dependencies in
the order of operations in CheckPointGuts(), which I've changed
around.  Andres, any comments?
From a722d71c4296e5fae778a783eb8140be7887eced Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 25 Sep 2020 11:16:03 +1200
Subject: [PATCH v8] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This causes a significant improvement to recovery
performance.

Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.

Reviewed-by: Tom Lane  (ShutdownXXX removal part)
Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  39 +++
 src/backend/access/transam/commit_ts.c |  36 +++---
 src/backend/access/transam/multixact.c |  57 +
 src/backend/access/transam/slru.c  | 154 +
 src/backend/access/transam/subtrans.c  |  25 +---
 src/backend/access/transam/xlog.c  |  18 ++-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/buffer/bufmgr.c|   7 --
 src/backend/storage/lmgr/predicate.c   |   8 +-
 src/backend/storage/sync/sync.c|  28 -
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  14 ++-
 src/include/storage/sync.h |   7 +-
 src/tools/pgindent/typedefs.list   |   1 +
 16 files changed, 244 insertions(+), 165 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..2ec6a924db 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -808,34 +810,18 @@ TrimCLOG(void)
 	LWLockRelease(XactSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
-	/* Flush dirty CLOG pages to disk */
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
-	SimpleLruFlush(XactCtl, false);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
 /*
  * Perform a checkpoint --- either during shutdown, or on-the-fly
  */
 void
 CheckPointCLOG(void)
 {
-	/* Flush dirty CLOG pages to disk */
+	/*
+	 * Write dirty CLOG pages to disk.  This may result in sync requests queued
+	 * for later handling by ProcessSyncRequests(), as part of the checkpoint.
+	 */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-	SimpleLruFlush(XactCtl, true);
+	SimpleLruWriteAll(XactCtl, true);
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
@@ -1026,3 +1012,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ 

Re: Handing off SLRU fsyncs to the checkpointer

2020-09-24 Thread Tom Lane
Thomas Munro  writes:
> Tom, do you have any thoughts on ShutdownCLOG() etc?

Hm, if we cannot reach that without first completing a shutdown checkpoint,
it does seem a little pointless.

It'd likely be a good idea to add a comment to CheckPointCLOG et al
explaining that we expect $what-exactly to fsync the data we are writing
before the checkpoint is considered done.

regards, tom lane




Re: Handing off SLRU fsyncs to the checkpointer

2020-09-24 Thread Thomas Munro
On Wed, Sep 23, 2020 at 1:56 PM Thomas Munro  wrote:
> As for the ShutdownXXX() functions, I haven't yet come up with any
> reason for this code to exist.  Emboldened by a colleague's inability
> to explain to me what that code is doing for us, here is a new version
> that just rips it all out.

Rebased.

Tom, do you have any thoughts on ShutdownCLOG() etc?
From 07196368e6cd36df5910b536e93570c9acff4ff2 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 25 Sep 2020 11:16:03 +1200
Subject: [PATCH v7] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This causes a significant improvement to recovery
performance.

Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.

Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  36 +++---
 src/backend/access/transam/commit_ts.c |  32 +++--
 src/backend/access/transam/multixact.c |  53 +
 src/backend/access/transam/slru.c  | 154 +
 src/backend/access/transam/subtrans.c  |  25 +---
 src/backend/access/transam/xlog.c  |  18 ++-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/buffer/bufmgr.c|   7 --
 src/backend/storage/lmgr/predicate.c   |   8 +-
 src/backend/storage/sync/sync.c|  28 -
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  14 ++-
 src/include/storage/sync.h |   7 +-
 src/tools/pgindent/typedefs.list   |   1 +
 16 files changed, 233 insertions(+), 165 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..c4c20f508b 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -808,34 +810,15 @@ TrimCLOG(void)
 	LWLockRelease(XactSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
-	/* Flush dirty CLOG pages to disk */
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
-	SimpleLruFlush(XactCtl, false);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
 /*
  * Perform a checkpoint --- either during shutdown, or on-the-fly
  */
 void
 CheckPointCLOG(void)
 {
-	/* Flush dirty CLOG pages to disk */
+	/* Write dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-	SimpleLruFlush(XactCtl, true);
+	SimpleLruWriteAll(XactCtl, true);
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
@@ -1026,3 +1009,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return SlruSyncFileTag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index f6a7329ba3..af5d9baae5 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
   CommitTsSLRULock, "pg_commit_ts",
-  LWTRANCHE_COMMITTS_BUFFER);
+  LWTRANCHE_COMMITTS_BUFFER,
+  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
@@ -798,30 +799,14 @@ DeactivateCommitTs(void)
 	LWLockRelease(CommitTsSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or 

Re: Handing off SLRU fsyncs to the checkpointer

2020-09-22 Thread Thomas Munro
On Tue, Sep 22, 2020 at 9:08 AM Thomas Munro  wrote:
> On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro  wrote:
> > While scanning for comments and identifier names that needed updating,
> > I realised that this patch changed the behaviour of the ShutdownXXX()
> > functions, since they currently flush the SLRUs but are not followed
> > by a checkpoint.  I'm not entirely sure I understand the logic of
> > that, but it wasn't my intention to change it.  So here's a version
> > that converts the existing fsync_fname() to fsync_fname_recurse() to
>
> Bleugh, that was probably a bad idea, it's too expensive.  But it
> forces me to ask the question: *why* do we need to call
> Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
> shutdown checkpoint?  I wondered if this might date from before the
> WAL, but I see that the pattern was introduced when the CLOG was moved
> out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
> but even in that commit the preceding CreateCheckPoint() call included
> a call to CheckPointCLOG().

I complained about the apparently missing multixact fsync in a new
thread, because if I'm right about that it requires a back-patchable
fix.

As for the ShutdownXXX() functions, I haven't yet come up with any
reason for this code to exist.  Emboldened by a colleague's inability
to explain to me what that code is doing for us, here is a new version
that just rips it all out.
From 89e5b787ea5efd03ced8da46f95f236e03bf4adf Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 23 Sep 2020 13:02:27 +1200
Subject: [PATCH v6 1/2] Fix missing fsync of multixact directories.

Standardize behavior by moving responsibility for fsyncing directories
down into slru.c.

Back-patch to all supported releases.
---
 src/backend/access/transam/clog.c  | 7 ---
 src/backend/access/transam/commit_ts.c | 6 --
 src/backend/access/transam/slru.c  | 7 +++
 3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..9e352d2658 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -836,13 +836,6 @@ CheckPointCLOG(void)
 	/* Flush dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
 	SimpleLruFlush(XactCtl, true);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..f6a7329ba3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -822,12 +822,6 @@ CheckPointCommitTs(void)
 {
 	/* Flush dirty CommitTs pages to disk */
 	SimpleLruFlush(CommitTsCtl, true);
-
-	/*
-	 * fsync pg_commit_ts to ensure that any files flushed previously are
-	 * durably on disk.
-	 */
-	fsync_fname("pg_commit_ts", true);
 }
 
 /*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..89ce7c43b5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1187,6 +1187,13 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 	}
 	if (!ok)
 		SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+	/*
+	 * Make sure that the directory entries for any newly created files are on
+	 * disk.
+	 */
+	if (ctl->do_fsync)
+		fsync_fname(ctl->Dir, true);
 }
 
 /*
-- 
2.20.1

From ae51604c43e10ba6d24b628c2b907eca251177e9 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 21 Sep 2020 09:43:32 +1200
Subject: [PATCH v6 2/2] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This causes a significant improvement to recovery
performance.

Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.

Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  36 +++---
 src/backend/access/transam/commit_ts.c |  32 +++--
 src/backend/access/transam/multixact.c |  53 +
 src/backend/access/transam/slru.c  | 154 +
 src/backend/access/transam/subtrans.c  |  25 +---
 

Re: Handing off SLRU fsyncs to the checkpointer

2020-09-21 Thread Thomas Munro
On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro  wrote:
> While scanning for comments and identifier names that needed updating,
> I realised that this patch changed the behaviour of the ShutdownXXX()
> functions, since they currently flush the SLRUs but are not followed
> by a checkpoint.  I'm not entirely sure I understand the logic of
> that, but it wasn't my intention to change it.  So here's a version
> that converts the existing fsync_fname() to fsync_fname_recurse() to

Bleugh, that was probably a bad idea, it's too expensive.  But it
forces me to ask the question: *why* do we need to call
Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
shutdown checkpoint?  I wondered if this might date from before the
WAL, but I see that the pattern was introduced when the CLOG was moved
out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
but even in that commit the preceding CreateCheckPoint() call included
a call to CheckPointCLOG().




Re: Handing off SLRU fsyncs to the checkpointer

2020-09-20 Thread Thomas Munro
On Sun, Sep 20, 2020 at 12:40 PM Thomas Munro  wrote:
> On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro  wrote:
> > In the meantime, from the low-hanging-fruit department, here's a new
> > version of the SLRU-fsync-offload patch.  The only changes are a
> > tweaked commit message, and adoption of C99 designated initialisers
> > for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
> > relying on humans to make the array order match the enum values.  If
> > there are no comments or objections, I'm planning to commit this quite
> > soon.
>
> ... and CI told me that Windows didn't like my array syntax with the
> extra trailing comma.  Here's one without.

While scanning for comments and identifier names that needed updating,
I realised that this patch changed the behaviour of the ShutdownXXX()
functions, since they currently flush the SLRUs but are not followed
by a checkpoint.  I'm not entirely sure I understand the logic of
that, but it wasn't my intention to change it.  So here's a version
that converts the existing fsync_fname() to fsync_fname_recurse() to
fix that.

Strangely, the fsync calls that ensure that directory entries are on
disk seemed to be missing from CheckPointMultixact(), so I added them.
Isn't that a live bug?

I decided it was a little too magical that CheckPointCLOG() etc
depended on the later call to CheckPointBuffers() to perform their
fsyncs.  I started writing comments about that, but then I realised
that the right thing to do was to hoist ProcessSyncRequests() out of
there into CheckPointGuts() and make it all more explicit.

I also realised that it would be inconsistent to count SLRU sync
activity as buffer sync time, but not count SLRU write activity as
buffer write time, or count its buffers as written in the reported
statistics.  In other words, SLRU buffers *are* buffers for checkpoint
reporting purposes (or should at least be consistently in or out of
the stats, and with this patch they have to be in).

Does that make sense?   Is there a problem I'm not seeing with
reordering CheckPointGuts() as I have?

One comment change that seems worth highlighting is this code reached
by VACUUM, which seems like a strict improvement (it wasn't flushing
for crash recovery):

    /*
-    * Flush out dirty data, so PhysicalPageExists can work correctly.
-    * SimpleLruFlush() is a pretty big hammer for that.  Alternatively we
-    * could add an in-memory version of page exists, but find_multixact_start
-    * is called infrequently, and it doesn't seem bad to flush buffers to
-    * disk before truncation.
+    * Write out dirty data, so PhysicalPageExists can work correctly.
     */
-   SimpleLruFlush(MultiXactOffsetCtl, true);
-   SimpleLruFlush(MultiXactMemberCtl, true);
+   SimpleLruWriteAll(MultiXactOffsetCtl, true);
+   SimpleLruWriteAll(MultiXactMemberCtl, true);
From c0b2281a983144d39ff7288431ef9c4af51b2e32 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 21 Sep 2020 09:43:32 +1200
Subject: [PATCH v5] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This causes a significant improvement to recovery
performance.

Perform a new recursive fsync() in the ShutdownXXX() functions to
preserve existing behavior, since they won't be followed by a
checkpoint.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.

Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  29 +++--
 src/backend/access/transam/commit_ts.c |  28 +++--
 src/backend/access/transam/multixact.c |  62 +++---
 src/backend/access/transam/slru.c  | 150 +
 src/backend/access/transam/subtrans.c  |  12 +-
 src/backend/access/transam/xlog.c  |  14 ++-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/buffer/bufmgr.c|   7 --
 src/backend/storage/file/fd.c  |  20 
 src/backend/storage/lmgr/predicate.c   |   8 +-
 src/backend/storage/sync/sync.c|  28 -
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  14 ++-
 src/include/storage/fd.h   |   1 +
 src/include/storage/sync.h |   7 +-
 src/tools/pgindent/typedefs.list   |   1 +
 18 files changed, 284 insertions(+), 112 

Re: Handing off SLRU fsyncs to the checkpointer

2020-09-19 Thread Thomas Munro
On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro  wrote:
> In the meantime, from the low-hanging-fruit department, here's a new
> version of the SLRU-fsync-offload patch.  The only changes are a
> tweaked commit message, and adoption of C99 designated initialisers
> for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
> relying on humans to make the array order match the enum values.  If
> there are no comments or objections, I'm planning to commit this quite
> soon.

... and CI told me that Windows didn't like my array syntax with the
extra trailing comma.  Here's one without.
From 89bc77827a73c93914f018fc862439f53ff54a39 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 19 Sep 2020 16:22:06 +1200
Subject: [PATCH v4 1/2] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages due to cache pressure, leading to an I/O stall in user
backends and recovery.  Collapse requests for the same file into a
single system call as part of the next checkpoint, as we already do for
relation files.  This causes a significant improvement to recovery
times.

Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c  | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c|  28 ++-
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  12 ++-
 src/include/storage/sync.h |   7 +-
 13 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..304612c159 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1033,3 +1035,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
   CommitTsSLRULock, "pg_commit_ts",
-  LWTRANCHE_COMMITTS_BUFFER);
+  LWTRANCHE_COMMITTS_BUFFER,
+  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
   MultiXactOffsetSLRULock, "pg_multixact/offsets",
-  LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SimpleLruInit(MultiXactMemberCtl,
   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
   MultiXactMemberSLRULock, "pg_multixact/members",
-  LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+  SYNC_HANDLER_MULTIXACT_MEMBER);
 
 	/* Initialize our shared state struct */
 	MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
 
 	SRF_RETURN_DONE(funccxt);
 }
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int

Re: Handing off SLRU fsyncs to the checkpointer

2020-09-18 Thread Thomas Munro
On Mon, Aug 31, 2020 at 8:50 PM Jakub Wartak  wrote:
> - IO_URING - gives a lot of promise here I think; is it even planned to be
> shown for the PgSQL14 cycle? Or is it more like PgSQL15?

I can't answer that, but I've played around with the prototype quite a
bit, and thought quite a lot about how to port it to systems without
IO_URING, and I'm just as keen to see this happen as you are.

In the meantime, from the low-hanging-fruit department, here's a new
version of the SLRU-fsync-offload patch.  The only changes are a
tweaked commit message, and adoption of C99 designated initialisers
for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
relying on humans to make the array order match the enum values.  If
there are no comments or objections, I'm planning to commit this quite
soon.
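
Spelled out, the pattern is just a callback table indexed by the handler enum, roughly like the self-contained illustration below (the handler IDs and the clog/commit_ts callbacks are the ones visible in the attached patch; the SyncOps struct here is a simplified stand-in rather than the real sync.c definition):

typedef enum SyncRequestHandler
{
    SYNC_HANDLER_CLOG,
    SYNC_HANDLER_COMMIT_TS,
    SYNC_HANDLER_MULTIXACT_OFFSET,
    SYNC_HANDLER_MULTIXACT_MEMBER
} SyncRequestHandler;

typedef struct FileTag FileTag;     /* opaque for this illustration */

typedef struct SyncOps
{
    int         (*sync_syncfiletag) (const FileTag *ftag, char *path);
} SyncOps;

/* Callbacks implemented in clog.c and commit_ts.c (see the patch below). */
extern int  clogsyncfiletag(const FileTag *ftag, char *path);
extern int  committssyncfiletag(const FileTag *ftag, char *path);

/*
 * Each slot is tied to its enum value explicitly, so the table stays correct
 * even if handlers are added to or reordered within the enum; without the
 * designators, a human would have to keep the array order in sync by hand.
 */
static const SyncOps syncsw[] = {
    [SYNC_HANDLER_CLOG] = {.sync_syncfiletag = clogsyncfiletag},
    [SYNC_HANDLER_COMMIT_TS] = {.sync_syncfiletag = committssyncfiletag}
    /* ... multixact handlers elided ... */
};
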
From 0eba29484684a0642c4424741e256b0787051a19 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 19 Sep 2020 16:22:06 +1200
Subject: [PATCH v3] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages due to cache pressure, leading to an I/O stall in user
backends and recovery.  Collapse requests for the same file into a
single system call as part of the next checkpoint, as we already do for
relation files.  This causes a significant improvement to recovery
times.

Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c  | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c|  30 +++-
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  12 ++-
 src/include/storage/sync.h |   7 +-
 13 files changed, 175 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..304612c159 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1033,3 +1035,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
   CommitTsSLRULock, "pg_commit_ts",
-  LWTRANCHE_COMMITTS_BUFFER);
+  LWTRANCHE_COMMITTS_BUFFER,
+  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
   MultiXactOffsetSLRULock, "pg_multixact/offsets",
-  LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SimpleLruInit(MultiXactMemberCtl,
   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
   MultiXactMemberSLRULock, "pg_multixact/members",
-  LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+  SYNC_HANDLER_MULTIXACT_MEMBER);
 
 	/* Initialize our shared state struct */
 	MultiXactState = 

Re: Handing off SLRU fsyncs to the checkpointer

2020-08-31 Thread Jakub Wartak
Hi Thomas, hackers,

>> ... %CPU ... COMMAND
>> ... 97.4 ... postgres: startup recovering 00010089
> So, what else is pushing this thing off CPU, anyway?  For one thing, I
> guess it might be stalling while reading the WAL itself, because (1)
> we only read it 8KB at a time, relying on kernel read-ahead, which
> typically defaults to 128KB I/Os unless you cranked it up, but for
> example we know that's not enough to saturate a sequential scan on
> NVME system, so maybe it hurts here too (2) we keep having to switch
> segment files every 16MB.  Increasing WAL segment size and kernel
> readahead size presumably help with that, if indeed it is a problem,
> but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
> future segment every time we cross a boundary, and also maybe increase
> the size of our reads.

All of the above (1,2) would make sense, and the effects IMHO are partially 
possible to achieve via ./configure compile options, but from previous 
correspondence [1] in this particular workload it looked like it was not WAL 
reading, but reading random DB blocks into shared buffers: in that case I 
suppose it was the price of too many syscalls to the OS/VFS cache itself, as the 
DB was small and fully cached there - so problem (3): 
copy_user_enhanced_fast_string <- 17.79%--copyout (!) <- __pread_nocancel <- 
16.56%--FileRead / mdread / ReadBuffer_common (!). Without some 
micro-optimization or some form of vectorized [batching] I/O in recovery it's a 
dead end when it comes to small changes. Things that come to my mind for 
enhancing recovery:
- preadv() - works only for 1 fd, while the WAL stream might require reading a 
lot of random pages into s_b (many relations/fds; even btree inserts into a 
single relation might put data into many 1GB [default] segments). This would 
only micro-optimize INSERT (pk) SELECT nextval(seq) kinds of processing on the 
recovery side, I suppose. Of course, provided that StartupXLOG would work in a 
more batched way: (a) reading a lot of blocks from WAL at once, (b) then 
issuing preadv() to get all the DB blocks for the same rel/fd into s_b, (c) 
applying WAL. Sounds like a major refactor just to save syscalls :( (a minimal 
standalone sketch of the syscall pattern follows after this list)
- mmap() - even more unrealistic
- IO_URING - gives a lot of promise here I think; is it even planned to be 
shown for the PgSQL14 cycle? Or is it more like PgSQL15?
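
A minimal standalone sketch of the preadv() pattern mentioned above (plain C, not PostgreSQL code; the file path and block numbers are made up, and note that preadv() scatters one *contiguous* byte range into several buffers, so it only batches adjacent blocks of a single file descriptor):

#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>

#define BLCKSZ  8192
#define NBLOCKS 8

int
main(void)
{
    static char bufs[NBLOCKS][BLCKSZ];
    struct iovec iov[NBLOCKS];
    int         fd;
    ssize_t     nread;
    off_t       start_block = 100;      /* hypothetical first block to fetch */

    /* hypothetical 1GB segment of some relation */
    fd = open("base/13593/16384", O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* one private buffer per block; a real server would point at s_b pages */
    for (int i = 0; i < NBLOCKS; i++)
    {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BLCKSZ;
    }

    /* one system call instead of NBLOCKS separate pread() calls */
    nread = preadv(fd, iov, NBLOCKS, start_block * BLCKSZ);
    printf("read %zd bytes\n", nread);
    return 0;
}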

-Jakub Wartak

[1] - 
https://www.postgresql.org/message-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520%40VI1PR0701MB6960.eurprd07.prod.outlook.com
 please see profile after "0001+0002 s_b@128MB: 85.392"



Re: Handing off SLRU fsyncs to the checkpointer

2020-08-28 Thread Thomas Munro
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak  wrote:
> ... %CPU ... COMMAND
> ... 97.4 ... postgres: startup recovering 00010089

So, what else is pushing this thing off CPU, anyway?  For one thing, I
guess it might be stalling while reading the WAL itself, because (1)
we only read it 8KB at a time, relying on kernel read-ahead, which
typically defaults to 128KB I/Os unless you cranked it up, but for
example we know that's not enough to saturate a sequential scan on
NVME system, so maybe it hurts here too (2) we keep having to switch
segment files every 16MB.  Increasing WAL segment size and kernel
readahead size presumably help with that, if indeed it is a problem,
but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
future segment every time we cross a boundary, and also maybe increase
the size of our reads.
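
A minimal standalone sketch of that POSIX_FADV_WILLNEED idea (plain C, not PostgreSQL code; the segment name is made up and error handling is minimal):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Ask the kernel to start reading the given WAL segment into the page cache
 * in the background.  The readahead targets the file's cached pages, so a
 * later read of the same file can benefit even though we close this fd.
 */
static void
prefetch_wal_segment(const char *path)
{
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return;                 /* segment may not exist yet; just skip */

    /* offset = 0, len = 0 means "the whole file" */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    close(fd);
}

int
main(void)
{
    /* hypothetical next segment, issued as recovery crosses a boundary */
    prefetch_wal_segment("pg_wal/000000010000000000000090");
    return 0;
}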




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-28 Thread Thomas Munro
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak  wrote:
> USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> postgres 120935  0.9  0.0 866052  3824 ?Ss   09:47   0:00 postgres: 
> checkpointer
> postgres 120936 61.9  0.0 865796  3824 ?Rs   09:47   0:22 postgres: 
> background writer
> postgres 120937 97.4  0.0 865940  5228 ?Rs   09:47   0:36 postgres: 
> startup recovering 00010089
>
> speedup of 1.647x

Thanks for testing!  That's better than I expected.  I guess it wasn't
quite so good with default bgwriter settings.

> I have only one comment about those 2 WIP patches: bgwriter_lru_maxpages
> should maybe be called standby_bgwriter_lru_maxpages in this scenario, or,
> even better, there shouldn't be a maximum at all during a closed-DB recovery
> scenario.

I wish bgwriter could auto-tune itself better, so we wouldn't need to
contemplate adding more settings.

As for the second patch ("Optionally, don't wait for end-of-recovery
checkpoint."), that also looked quite useful in your test scenario:

> end_of_recovery_checkpoint_wait = off, before DB is open 15s faster

> I suppose a checkpoint for large shared_buffers (hundredths of GB) might take 
> a lot of time and this 0002 patch bypasses that. I would find it quite useful 
> in some scenarios (e.g. testing backups, PITR recoveries, opening DB from 
> storage snapshots / storage replication, maybe with DWH-after-crash too).

I suppose a third option that you might want is no checkpoint at all
(meaning leave it to regular checkpoint scheduling), like fast
promotion.  One thing missing from the patch is that we probably need
to log an end-of-recovery *record*, like fast promotion does.  I'm a
little fuzzy on the timeline stuff.  I wonder if any recovery experts
would like to weigh in on theoretical problems that might be lurking
here...




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-28 Thread Jakub Wartak
Hi Thomas, hackers,

>> > To move these writes out of recovery's way, we should probably just
>> > run the bgwriter process during crash recovery.  I'm going to look
>> > into that.
>>
>> Sounds awesome.
>
>I wrote a quick and dirty experimental patch to try that.  I can't see
>any benefit from it on pgbench with default shared buffers, but maybe
>it would do better with your append test due to locality, especially
>if you can figure out how to tune bgwriter to pace itself optimally.
>https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery

OK, so I've quickly tested those two PoC patches together, under the 
conditions below:
- similar append-only workload by pgbench (to eliminate other already known 
different WAL bottlenecks: e.g. sorting),
- 4.3GB of WAL to be applied (mostly Btree/INSERT_LEAF)
- on same system as last time (ext4 on NVMe, 1s8c16, 4.14 kernel) 
- 14master already with SLRU fsync to checkpointer/pg_qgsort patches applied

TEST bgwriterPOC1:
- in severe dirty memory conditions (artificially simulated via small s_b here) 
--> so for workloads with very high FlushBuffer activity in StartupXLOG
- with fsync=off/fpw=off by default and on NVMe (e.g. scenario: I want to 
perform some PITR as fast as I can to see what the production data looked like 
in the past, before some user deleted some data)

baseline s_b@128MB: 140.404, 0.123 (the 2nd number is small as there is only a 
small region to checkpoint)

    22.49%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--14.72%--copyin
               |          __pwrite_nocancel
               |          FileWrite
               |          mdwrite
               |          FlushBuffer
               |          ReadBuffer_common
               |           --14.52%--btree_xlog_insert
                --7.77%--copyout
                          __pread_nocancel
                           --7.57%--FileRead
                                     mdread
                                     ReadBuffer_common
     6.13%  postgres  [kernel.kallsyms]   [k] do_syscall_64
               |--1.64%--__pwrite_nocancel
                --1.23%--__pread_nocancel
     3.68%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--1.02%--smgropen

After applying:
patch -p1 < ../0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch
patch -p1 < ../0002-Optionally-don-t-wait-for-end-of-recovery-checkpoint.patch

0001+0002 s_b@128MB: similar result to above
0001+0002 s_b@128MB: 108.871, 0.114 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 1000
0001+0002 s_b@128MB: 85.392, 0.103 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 5 #~390MB max?

    18.40%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--17.79%--copyout
               |          __pread_nocancel
               |          |--16.56%--FileRead
               |          |          mdread
               |          |          ReadBuffer_common
                --0.61%--copyin // WOW
                          __pwrite_nocancel
                          FileWrite
                          mdwrite
                          FlushBuffer
                          ReadBuffer_common
     9.20%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--4.70%--smgropen

of course there is another WOW moment during recovery ("61.9%")

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres 120935  0.9  0.0 866052  3824 ?        Ss   09:47   0:00 postgres: 
checkpointer
postgres 120936 61.9  0.0 865796  3824 ?        Rs   09:47   0:22 postgres: 
background writer
postgres 120937 97.4  0.0 865940  5228 ?        Rs   09:47   0:36 postgres: 
startup recovering 00010089

speedup of 1.647x when dirty memory is in the way. When it's not:

baseline  s_b@24000MB: 39.199, 1.448 (2x patches off)
0001+0002 s_b@24000MB: 39.383, 1.442 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 5 #~390MB/s max, yay

there's no regression. I have only one comment about those 2 WIP patches: 
bgwriter_lru_maxpages should maybe be called standby_bgwriter_lru_maxpages in 
this scenario, or, even better, there shouldn't be a maximum at all during a 
closed-DB recovery scenario.

TEST bgwriterPOC2a to showcase the 2nd patch which opens the DB for read-write 
users before the final checkpoint finishes after redo recovery. The DBA may 
make the decision via this parameter end_of_recovery_checkpoint_wait=off.
- on slow storage (xvda, fsync=on) and even with high memory:

s_b@24000MB: 39.043, 15.639 -- even with WAL recovery being 100% CPU 
bound (mostly on hash_search_with_hash_value() for 
Buffers/__memmove_ssse3_back), it took an additional 15s to perform the 
checkpoint before the DB was open for users (it had to write 269462 buffers =~ 2GB =~ 

Re: Handing off SLRU fsyncs to the checkpointer

2020-08-27 Thread Thomas Munro
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak  wrote:
> >> 29.62%  postgres  [kernel.kallsyms]   [k] 
> >> copy_user_enhanced_fast_string
> >> ---copy_user_enhanced_fast_string
> >>|--17.98%--copyin
> >> [..]
> >>|  __pwrite_nocancel
> >>|  FileWrite
> >>|  mdwrite
> >>|  FlushBuffer
> >>|  ReadBuffer_common
> >>|  ReadBufferWithoutRelcache
> >>|  XLogReadBufferExtended
> >>|  XLogReadBufferForRedoExtended
> >>|   --17.57%--btree_xlog_insert
> >
> > To move these writes out of recovery's way, we should probably just
> > run the bgwriter process during crash recovery.  I'm going to look
> > into that.
>
> Sounds awesome.

I wrote a quick and dirty experimental patch to try that.  I can't see
any benefit from it on pgbench with default shared buffers, but maybe
it would do better with your append test due to locality, especially
if you can figure out how to tune bgwriter to pace itself optimally.
https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-27 Thread Thomas Munro
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak  wrote:
> I've tried to get cache miss ratios via PMCs; apparently on EC2 they are
> (even on bigger instances) reported as not supported or zeros.

I heard some of the counters are only allowed on their dedicated instance types.

> However, interestingly the workload has an IPC of 1.40 (instruction bound),
> which to me is strange as I would expect BufTableLookup() to actually be
> heavily memory bound (?). Maybe I'll try on some different hardware one day.

Hmm, OK now you've made me go and read a bunch of Brendan Gregg blogs
and try some experiments of my own to get a feel for this number and
what it might be telling us about the cache miss counters you can't
see.  Since I know how to generate arbitrary cache miss workloads for
quick experiments using hash joins of different sizes, I tried that
and noticed that when LLC misses were at 76% (bad), IPC was at 1.69
which is still higher than what you're seeing.  When the hash table
was much smaller and LLC misses were down to 15% (much better), IPC
was at 2.83.  I know Gregg said[1] "An IPC < 1.0 likely means memory
bound, and an IPC > 1.0 likely means instruction bound", but that's
not what I'm seeing here, and in his comments section that is
disputed.  So I'm not sure your IPC of 1.40 is evidence against the
hypothesis on its own.

[1] http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-27 Thread Jakub Wartak
Hi Alvaro, Thomas, hackers

>> 14.69%  postgres  postgres[.] hash_search_with_hash_value
>> ---hash_search_with_hash_value
>>|--9.80%--BufTableLookup
[..]
>> --4.90%--smgropen
>>   |--2.86%--ReadBufferWithoutRelcache
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Yes, quite frankly my expectation was to see the 
hash_search_with_hash_value() <- smgropen() outcome as the 1st one, but in the 
simplified redo-bench script that's not the case. The original scenario was much 
more complex, with plenty of differences (in no particular order: TB-sized DB vs 
~500GB RAM -> thousands of forks, multiple tables, huge btrees, multiple 
INSERTs with plenty of data in VALUES() thrown as one commit, real 
primary->hot-standby replication [not a closed DB in recovery], sorted not 
random UUIDs) - I'm going to try to nail down these differences and maybe I'll 
manage to produce a more realistic "pgbench reproducer" (this may take some 
time though).

-Jakub Wartak.



Re: Handing off SLRU fsyncs to the checkpointer

2020-08-27 Thread Jakub Wartak
Hi Thomas / hackers,

>> The append-only bottleneck appears to be limited by syscalls/s due to small 
>> block size even with everything in FS cache (but not in shared buffers, 
>> please compare with TEST1 as there was no such bottleneck at all):
>>
>> 29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>> ---copy_user_enhanced_fast_string
>>|--17.98%--copyin
>> [..]
>>|  __pwrite_nocancel
>>|  FileWrite
>>|  mdwrite
>>|  FlushBuffer
>>|  ReadBuffer_common
>>|  ReadBufferWithoutRelcache
>>|  XLogReadBufferExtended
>>|  XLogReadBufferForRedoExtended
>>|   --17.57%--btree_xlog_insert
>
> To move these writes out of recovery's way, we should probably just
> run the bgwriter process during crash recovery.  I'm going to look
> into that.

Sounds awesome. Also, as this thread is starting to derail from the SLRU fsync 
topic - maybe we should change the subject? However, to add some data on the 
separate bgwriter: on 14master (already with lseek() caching, SLRU fsyncs out of 
the way, better sorting), I've measured the same configuration as last time, 
with still the same append-only WAL workload on NVMe, and compared various 
shared_buffers settings (with the buffer descriptors sizing from 
pg_shmem_allocations, which as you stated is wrongly reported (?), included 
only for reference just in case):

shared_buffers=128MB buffers_desc=1024kB 96.778, 0.438 [a]
shared_buffers=256MB buffers_desc=2048kB 62.755, 0.577 [a]
shared_buffers=512MB buffers_desc=4096kB 33.167, 0.62 [a]
shared_buffers=1GB buffers_desc=8192kB 27.303, 0.929
shared_buffers=4GB buffers_desc=32MB 27.185, 1.166
shared_buffers=8GB buffers_desc=64MB 27.649, 1.088 
shared_buffers=16GB buffers_desc=128MB 27.584, 1.201 [b]
shared_buffers=32GB buffers_desc=256MB 32.314, 1.171 [b]
shared_buffers=48GB buffers_desc=384 MB 31.95, 1.217
shared_buffers=64GB buffers_desc=512 MB 31.276, 1.349
shared_buffers=72GB buffers_desc=576 MB 31.925, 1.284
shared_buffers=80GB buffers_desc=640 MB 31.809, 1.413

The amount of WAL to be replayed was ~2.8GB. To me it looks like:
a) flushing dirty buffers from StartupXLOG might be a real problem, but please 
read on.
b) there is very low impact from the L2/L3 hypothesis you mention (?); the 
effect is not that big and it's not degrading linearly as one would expect (??). 
The L1d:L1i:L2:L3 cache sizes on this machine are 32K/32K/256K/46080K. Maybe 
the DB size is too small.

I've double-checked that in condition [a] (shared_buffers=128MB) there were a 
lot of FlushBuffer() invocations per second (perf stat -e 
probe_postgres:FlushBuffer -I 1000), e.g.:
#   time counts unit events
 1.000485217 79,494  probe_postgres:FlushBuffer
 2.000861366 75,305  probe_postgres:FlushBuffer
 3.001263163 79,662  probe_postgres:FlushBuffer
 4.001650795 80,317  probe_postgres:FlushBuffer
 5.002033273 79,951  probe_postgres:FlushBuffer
 6.002418646 79,993  probe_postgres:FlushBuffer
while at 1GB shared_buffers it sits nearly at zero all the time.  So something 
like a 3x speed-up is possible when StartupXLOG doesn't have to care too much 
about dirty buffers in the critical path (the bgwriter would instead, as you 
say?), at least in append-only scenarios.  But ... I've checked some real 
systems (even older versions of PostgreSQL not doing that much replication), 
and indeed e.g. 8-12k/s of FlushBuffer() calls are happening there and shared 
buffers are pretty huge (>100GB, 0.5TB RAM), but those are *system-wide* 
numbers; the work is really performed by the bgwriter, not by 
startup/recovering as in this redo-bench case, since the DB is open for reads 
and replicating -- so it appears that this isn't an optimization for the hot 
standby case, but for the DB-closed startup recovery/restart/disaster/PITR 
scenario.

As for the 24GB shared_buffers scenario, where dirty buffers are not a problem 
at all, the top profile is as follows (output trimmed), again as expected:

13.41%  postgres  postgres   [.] hash_search_with_hash_value
   |--8.31%--BufTableLookup <- ReadBuffer_common <- 
ReadBufferWithoutRelcache
--5.11%--smgropen 
  |--2.77%--XLogReadBufferExtended
   --2.34%--ReadBufferWithoutRelcache
 7.88%  postgres  postgres   [.] MarkBufferDirty

I've tried to get cache miss ratios via PMCs; apparently on EC2 they are (even 
on bigger instances) reported as not supported or zeros. However, interestingly 
the workload has an IPC of 1.40 (instruction bound), which to me is strange as 
I would expect BufTableLookup() to actually be heavily memory bound (?). Maybe 
I'll try on some different hardware one day. 

>>   

Re: Handing off SLRU fsyncs to the checkpointer

2020-08-26 Thread Thomas Munro
On Thu, Aug 27, 2020 at 6:15 AM Alvaro Herrera  wrote:
> > --4.90%--smgropen
> >   |--2.86%--ReadBufferWithoutRelcache
>
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Right, my hypothesis requires an uncacheably large buffer mapping
table, and I think smgropen() needs a different explanation because
it's not expected to be as large or as random, at least not with a
pgbench workload.  I think the reasons for a profile with a smgropen()
showing up so high, and in particular higher than BufTableLookup(),
must be:

1.  We call smgropen() twice for every call to BufTableLookup().  Once
in XLogReadBufferExtended(), and then again in
ReadBufferWithoutRelcache().
2.  We also call it for every block forced out of the buffer pool, and
in recovery that has to be done by the recovery loop.
3.  We also call it for every block in the buffer pool during the
end-of-recovery checkpoint.

Not sure but the last two might perform worse due to proximity to
interleaving pwrite() system calls (just a thought, not investigated).
In any case, I'm going to propose we move those things out of the
recovery loop, in a new thread.




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-26 Thread Alvaro Herrera
On 2020-Aug-25, Jakub Wartak wrote:

> Turning on/off the defer SLRU patch and/or fsync doesn't seem to make
> any difference, so if anyone is curious the next sets of append-only
> bottlenecks is like below:
> 
> 14.69%  postgres  postgres[.] hash_search_with_hash_value
> ---hash_search_with_hash_value
>|--9.80%--BufTableLookup
>|  ReadBuffer_common
>|  ReadBufferWithoutRelcache
>|  XLogReadBufferExtended
>|  XLogReadBufferForRedoExtended
>|  |--7.76%--btree_xlog_insert
>|  |  btree_redo
>|  |  StartupXLOG
>|   --1.63%--heap_xlog_insert
> --4.90%--smgropen
>   |--2.86%--ReadBufferWithoutRelcache

Looking at an earlier report of this problem I was thinking whether it'd
make sense to replace SMgrRelationHash with a simplehash table; I have a
half-written patch for that, but I haven't completed that work.
However, in the older profile things were looking different, as
hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
of it.  BufTableLookup was also there but only 1.51%.  So I'm not so
sure now that that'll pay off as clearly as I had hoped.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-26 Thread Alvaro Herrera
On 2020-Aug-25, Andres Freund wrote:

> Hi,
> 
> On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> > > --12.51%--compactify_tuples
> > >   PageRepairFragmentation
> > >   heap2_redo
> > >   StartupXLOG
> > 
> > I wonder if there is something higher level that could be done to
> > reduce the amount of compaction work required in the first place, but
> > in the meantime I'm very happy if we can improve the situation so much
> > with such a microscopic improvement that might eventually benefit
> > other sorting stuff...
> 
> Another approach could be to not perform any sorting during recovery,
> instead including enough information in the WAL record to avoid doing a
> full blown PageRepairFragmentation during recovery.

Hmm, including the sorted ItemId array in the WAL record might make
sense to alleviate the qsort work ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-25 Thread Andres Freund
Hi,

On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> > --12.51%--compactify_tuples
> >   PageRepairFragmentation
> >   heap2_redo
> >   StartupXLOG
> 
> I wonder if there is something higher level that could be done to
> reduce the amount of compaction work required in the first place, but
> in the meantime I'm very happy if we can improve the situation so much
> with such a microscopic improvement that might eventually benefit
> other sorting stuff...

Another approach could be to not perform any sorting during recovery,
instead including enough information in the WAL record to avoid doing a
full blown PageRepairFragmentation during recovery.

Greetings,

Andres Freund




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-25 Thread Thomas Munro
On Tue, Aug 25, 2020 at 9:16 PM Jakub Wartak  wrote:
> I just wanted to help test this patch (defer SLRU fsyncs during recovery)
> and also the faster compactify_tuples() patch [2], as both are related to WAL
> recovery performance, which I'm interested in. This is my first message to
> this mailing list, so please also let me know if I should adjust my testing
> style or formatting.

Hi Jakub,

Thanks very much for these results!

> - Handing SLRU sync work over to the checkpointer: in my testing it
> accelerates WAL recovery performance on slower / higher-latency storage by ~20%

Wow.  Those fsyncs must have had fairly high latency (presumably due
to queuing behind other write back activity).

> - Faster sort in compactify_tuples(): in my testing it accelerates WAL 
> recovery performance for HOT updates also by ~20%

Nice.

> In the last case the top of the profile is as follows, still related to
> sorting but, as I understand it, in a much more optimal way:
>
> 26.68%  postgres  postgres  [.] qsort_itemoff
> ---qsort_itemoff
>|--14.17%--qsort_itemoff
>|  |--10.96%--compactify_tuples
>|  |  PageRepairFragmentation
>|  |  heap2_redo
>|  |  StartupXLOG
>|   --3.21%--qsort_itemoff
>|  --3.10%--compactify_tuples
>|PageRepairFragmentation
>|heap2_redo
>|StartupXLOG
> --12.51%--compactify_tuples
>   PageRepairFragmentation
>   heap2_redo
>   StartupXLOG

I wonder if there is something higher level that could be done to
reduce the amount of compaction work required in the first place, but
in the meantime I'm very happy if we can improve the situation so much
with such a microscopic improvement that might eventually benefit
other sorting stuff...

>  8.38%  postgres  libc-2.17.so  [.] __memmove_ssse3_back
> ---__memmove_ssse3_back
>compactify_tuples
>PageRepairFragmentation
>heap2_redo

Hmm, I wonder if this bit could go a teensy bit faster by moving as many
adjacent tuples as you can in one go rather than moving them one at a
time...
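
To illustrate the idea, here is a rough, untested sketch of what a
run-batched version could look like.  It borrows the itemIdSortData/itemIdSort
types and the descending-itemoff sort order from the existing bufpage.c code,
but the function itself is only my illustration, not a proposed patch:

/*
 * Assumes itemidbase[] is sorted by itemoff in descending order, as in the
 * existing compactify_tuples().  Instead of one memmove() per tuple, detect
 * runs of tuples that are already physically adjacent and move each run with
 * a single (possibly overlapping) memmove().
 */
static void
compactify_tuples_batched(itemIdSort itemidbase, int nitems, Page page)
{
	PageHeader	phdr = (PageHeader) page;
	Offset		upper = phdr->pd_special;
	int			i = 0;

	while (i < nitems)
	{
		int			j = i;
		int			runlen = itemidbase[i].alignedlen;
		int			k;

		/*
		 * Extend the run while the next (lower-addressed) tuple ends exactly
		 * where the current one begins.
		 */
		while (j + 1 < nitems &&
			   itemidbase[j + 1].itemoff + itemidbase[j + 1].alignedlen ==
			   itemidbase[j].itemoff)
		{
			j++;
			runlen += itemidbase[j].alignedlen;
		}

		/* Move the whole run in one go. */
		upper -= runlen;
		memmove((char *) page + upper,
				(char *) page + itemidbase[j].itemoff,
				runlen);

		/* Point each line pointer in the run at its new location. */
		for (k = i; k <= j; k++)
		{
			ItemId		lp = PageGetItemId(page, itemidbase[k].offsetindex + 1);

			lp->lp_off = upper + (itemidbase[k].itemoff - itemidbase[j].itemoff);
		}

		i = j + 1;
	}

	phdr->pd_upper = upper;
}

The final layout is identical to what the per-tuple loop produces; the only
difference is that contiguous source tuples are copied with one call.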

> The append-only bottleneck appears to be limited by syscalls/s due to small 
> block size even with everything in FS cache (but not in shared buffers, 
> please compare with TEST1 as there was no such bottleneck at all):
>
> 29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
> ---copy_user_enhanced_fast_string
>|--17.98%--copyin
> [..]
>|  __pwrite_nocancel
>|  FileWrite
>|  mdwrite
>|  FlushBuffer
>|  ReadBuffer_common
>|  ReadBufferWithoutRelcache
>|  XLogReadBufferExtended
>|  XLogReadBufferForRedoExtended
>|   --17.57%--btree_xlog_insert

To move these writes out of recovery's way, we should probably just
run the bgwriter process during crash recovery.  I'm going to look
into that.

The other thing is of course the checkpointer process, and our
end-of-recovery checkpoint.  I was going to suggest it should be
optional and not done by the recovery process itself, which is why
some earlier numbers I shared didn't include the end-of-recovery
checkpoint, but then I realised it complicated the numbers for this
little patch and, anyway, it'd be a good idea to open that can of
worms separately...

>| btree_redo
>| StartupXLOG
>|
> --11.64%--copyout
> [..]
>   __pread_nocancel
>--11.44%--FileRead
>  mdread
>  ReadBuffer_common
>  ReadBufferWithoutRelcache
>  XLogReadBufferExtended
>  XLogReadBufferForRedoExtended

For these reads, the solution should be WAL prefetching, but the patch
I shared for that (and will be updating soon) is just one piece of the
puzzle, and as it stands it actually *increases* the number of
syscalls by adding some posix_fadvise() calls, so ... erm, for an
all-in-kernel-cache-already workload like what you profiled there it
can only make things worse on that front.  But... when combined with
Andres's work-in-progress AIO stuff, a whole bunch of reads can be
submitted with a single system call ahead of time and then the results
are delivered directly into 

Re: Handing off SLRU fsyncs to the checkpointer

2020-08-25 Thread Jakub Wartak
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro  wrote:
> [patch]

Hi Thomas / hackers,

I just wanted to help test this patch (defer SLRU fsyncs during recovery)
and also the faster compactify_tuples() patch [2], as both are related to WAL
recovery performance, which I'm interested in. This is my first message to
this mailing list, so please also let me know if I should adjust my testing
style or formatting.

With both of those patches applied:
make check -> Passes
make check-world -> Passes
make standbycheck (section "Testing Hot Standby" from docs) -> Passes
There wasn't a single segfault or postmaster crash during the tests.
Review of the patches themselves: I'm not qualified to review the PostgreSQL
internals.

I've used Thomas's redo-bench scripts [1] to measure the performance effect
(this approach simplifies testing and excludes network jitter): the 1st column
is the redo start -> end time, the 2nd is the time from redo end to the end of
checkpointing, before the DB is opened for reads. I've conducted 2-3 separate
tests that show the benefits of those patches depending on the workload:
- Handing SLRU sync work over to the checkpointer: in my testing it accelerates
WAL recovery performance on slower / higher-latency storage by ~20%
- Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery 
performance for HOT updates also by ~20%

TEST1: workload profile test as per standard TPC-B pgbench -c8 -j8, with the
majority of records in the WAL stream being Heap/HOT_UPDATE:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032

xvda, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch
72.271, 0.857
72.717, 0.748
72.705, 0.81

xvda, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
72.49, 0.103
74.069, 0.102
73.368, 0.102

NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch
70.312, 0.22
70.615, 0.201
69.739, 0.194

NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
69.686, 0.101
70.601, 0.102
70.042, 0.101

As Thomas stated, in the standard pgbench workload profile there is visible
compactify_tuples()->pg_qsort() overhead on the recovery side. This is where
the 2nd patch helps:

NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch and 
compactify_tuples()->pg_qsort() performance improvement
58.85, 0.296
58.605, 0.269
58.603, 0.277

NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch and 
compactify_tuples()->pg_qsort() performance improvement, fsync=off
58.618, 0.103
57.693, 0.101
58.779, 0.111

In the last case the top of the profile is as follows, still related to
sorting but, as I understand it, in a much more optimal way:

26.68%  postgres  postgres  [.] qsort_itemoff
---qsort_itemoff
   |--14.17%--qsort_itemoff
   |  |--10.96%--compactify_tuples
   |  |  PageRepairFragmentation
   |  |  heap2_redo
   |  |  StartupXLOG
   |   --3.21%--qsort_itemoff
   |  --3.10%--compactify_tuples
   |PageRepairFragmentation
   |heap2_redo
   |StartupXLOG
--12.51%--compactify_tuples
  PageRepairFragmentation
  heap2_redo
  StartupXLOG

 8.38%  postgres  libc-2.17.so  [.] __memmove_ssse3_back
---__memmove_ssse3_back
   compactify_tuples
   PageRepairFragmentation
   heap2_redo
   StartupXLOG

 6.51%  postgres  postgres  [.] hash_search_with_hash_value
---hash_search_with_hash_value
   |--3.62%--smgropen
   |  |--2.17%--XLogReadBufferExtended
   |  |   --1.76%--XLogReadBufferForRedoExtended
   |  |  --0.93%--heap_xlog_update
   |   --1.45%--ReadBufferWithoutRelcache
   | XLogReadBufferExtended
   |  --1.34%--XLogReadBufferForRedoExtended
   | --0.72%--heap_xlog_update
--2.69%--BufTableLookup
  ReadBuffer_common
  ReadBufferWithoutRelcache
  XLogReadBufferExtended
   --2.48%--XLogReadBufferForRedoExtended
 |--1.34%--heap2_redo
 |  StartupXLOG
  --0.83%--heap_xlog_update


So to sum up, the HOT update-like workload profile tends to be CPU-bound on
the single-process recovery side. Even slow storage (like xvda) was not the

Re: Handing off SLRU fsyncs to the checkpointer

2020-08-12 Thread Thomas Munro
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro  wrote:
> [patch]

Bitrot, rebased, no changes.

> Yeah, the combined effect of these two patches is better than I
> expected.  To be clear though, I was only measuring the time between
> the "redo starts at ..." and "redo done at ..." messages, since I've
> been staring at the main recovery code, but there are also some more
> fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
> that are unaffected.  I think it's probably possible to do something
> about those too, but that's another topic.

... and of course the end-of-recovery checkpoint; in my tests this
wasn't materially changed since there isn't actually very much CLOG,
it's just that we avoided syncing it a block at a time and getting
rescheduled.  FWIW I put a very simple test here:
https://github.com/macdice/redo-bench, YMMV.
From 32c1c16d2800467d1d179678b66d1042d07c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH v2] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Collapse requests for the same file into a single system call as part of
the next checkpoint, as we do for relation files.

Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c  | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c|  28 ++-
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  12 ++-
 src/include/storage/sync.h |   7 +-
 13 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index dd2f4d5bc7..3eb33aea01 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
   CommitTsSLRULock, "pg_commit_ts",
-  LWTRANCHE_COMMITTS_BUFFER);
+  LWTRANCHE_COMMITTS_BUFFER,
+  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
   MultiXactOffsetSLRULock, "pg_multixact/offsets",
-  LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SimpleLruInit(MultiXactMemberCtl,
   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
   MultiXactMemberSLRULock, "pg_multixact/members",
-  LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+  SYNC_HANDLER_MULTIXACT_MEMBER);
 
 	/* Initialize our shared state struct */
 	MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@ 

Re: Handing off SLRU fsyncs to the checkpointer

2020-08-12 Thread Thomas Munro
On Sat, Aug 8, 2020 at 2:44 AM Robert Haas  wrote:
> On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro  wrote:
> > * Master is around 11% faster than last week before commit c5315f4f
> > "Cache smgrnblocks() results in recovery."
> > * This patch gives a similar speedup, bringing the total to around 25%
> > faster than last week (the time is ~20% less, the WAL processing speed
> > is ~1.25x).
>
> Dang, that's pretty nice, especially for the relatively small amount
> of code that it seems to require.

Yeah, the combined effect of these two patches is better than I
expected.  To be clear though, I was only measuring the time between
the "redo starts at ..." and "redo done at ..." messages, since I've
been staring at the main recovery code, but there are also some more
fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
that are unaffected.  I think it's probably possible to do something
about those too, but that's another topic.

I spotted a small problem: if the transaction IDs wrap all the way
around between checkpoints, then you might have cancelled requests for
a removed SLRU segment from the previous epoch, so we'd better
uncancel them if we see that.  That's a one-line fix, done in the
attached.  I also adjusted the commit message to be a little clearer
(this work deferment/collapsing scheme works in crash recovery too,
not just when there is a checkpointer process to hand the work to).
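
To spell out the shape of that fix (paraphrasing from memory rather than
quoting the patch, so the details may differ slightly): sync.c keeps a
"canceled" flag on each pending entry, and a fresh SYNC_REQUEST for the same
FileTag just needs to clear it again, something like this in
RememberSyncRequest():

	/* SYNC_REQUEST case: sketch only, not the exact hunk from the patch */
	entry = (PendingFsyncEntry *) hash_search(pendingOps,
											  (void *) ftag,
											  HASH_ENTER,
											  &found);
	if (!found)
		entry->cycle_ctr = sync_cycle_ctr;

	/*
	 * An SLRU segment file name can be recycled after the xids wrap all the
	 * way around, so a new request must revive an entry that an earlier
	 * forget request had canceled.
	 */
	entry->canceled = false;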
From 63235aca6b2525716f8f29caf07e0a0e6a26965e Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Collapse requests for the same file into a single system call as part of
the next checkpoint, as we do for relation files.

Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c  | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c|  28 ++-
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  12 ++-
 src/include/storage/sync.h |   7 +-
 13 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..0c5b7a525e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 903280ae92..b4edbdb4e3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
   CommitTsSLRULock, "pg_commit_ts",
-  LWTRANCHE_COMMITTS_BUFFER);
+  LWTRANCHE_COMMITTS_BUFFER,
+  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 475f5ed861..27ae2edbdc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
   

Re: Handing off SLRU fsyncs to the checkpointer

2020-08-07 Thread Robert Haas
On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro  wrote:
> * Master is around 11% faster than last week before commit c5315f4f
> "Cache smgrnblocks() results in recovery."
> * This patch gives a similar speedup, bringing the total to around 25%
> faster than last week (the time is ~20% less, the WAL processing speed
> is ~1.25x).

Dang, that's pretty nice, especially for the relatively small amount
of code that it seems to require.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-05 Thread Thomas Munro
On Tue, Aug 4, 2020 at 6:02 PM Thomas Munro  wrote:
> ... speedup of around 6% ...

I did some better testing.  OS: Linux, storage: consumer SSD.  I
repeatedly ran crash recovery on 3.3GB worth of WAL generated with 8M
pgbench transactions.  I tested 3 different builds 7 times each and
used "ministat" to compare the recovery times.  It told me that:

* Master is around 11% faster than last week before commit c5315f4f
"Cache smgrnblocks() results in recovery."
* This patch gives a similar speedup, bringing the total to around 25%
faster than last week (the time is ~20% less, the WAL processing speed
is ~1.25x).

My test fit in RAM and was all cached.  With the patch, the recovery
process used 100% of a single core the whole time and stayed on that
core and the variance was low, but in the other builds it hovered
around 90% and hopped around as it kept getting rescheduled and the
variance was higher.

Of course, SLRU fsyncs aren't the only I/O stalls in a real system;
among others, there are also reads from faulting in referenced pages
that don't have full page images in the WAL.  I'm working on that
separately, but that's a tad more complicated than this stuff.

Added to commit fest.

=== ministat output showing recovery times in seconds ===

x patched.dat
+ master.dat
* lastweek.dat
+--+
|  *   |
|x +   *   |
|x x xx  + +  ++ ++  *     |
|  |AM||_AM|   |_A_M__||
+--+
    N           Min           Max        Median           Avg        Stddev
x   7        38.655        39.406        39.218     39.134857    0.25188849
+   7        42.128        45.068        43.958     43.815286    0.91387758
Difference at 95.0% confidence
        4.68043 +/- 0.780722
        11.9597% +/- 1.99495%
        (Student's t, pooled s = 0.670306)
*   7        47.187        49.404        49.203     48.904286    0.76793483
Difference at 95.0% confidence
        9.76943 +/- 0.665613
        24.9635% +/- 1.70082%
        (Student's t, pooled s = 0.571477)




Re: Handing off SLRU fsyncs to the checkpointer

2020-08-04 Thread Thomas Munro
On Wed, Feb 12, 2020 at 9:54 PM Thomas Munro  wrote:
> In commit 3eb77eba we made it possible for any subsystem that wants a
> file to be flushed as part of the next checkpoint to ask the
> checkpointer to do that, as previously only md.c could do.

Hello,

While working on recovery performance, I found my way back to this
idea and rebased the patch.

Problem statement:

Every time we have to write out a page of pg_commit_ts, pg_multixact
or pg_xact due to cache pressure, we immediately call fsync().  This
runs serially in the recovery process, and it's quite bad for
pg_commit_ts, because we need to dump out a page for every ~800
transactions (track_commit_timestamps is not enabled by default).  If
we ask the checkpointer to do it, it collapses the 2048 fsync calls
for each SLRU segment into one, and the kernel can write out the data
with larger I/Os, maybe even ahead of time, and update the inode only
once.
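
Concretely, the mechanism boils down to something like the sketch below
(simplified: the real patch threads a per-SLRU sync handler through
SimpleLruInit() and sync.c, and the helper name and sync_handler field here
are made up for illustration):

/*
 * Sketch of the tail end of an SLRU segment write.  Instead of fsync()ing
 * the file immediately, describe it to sync.c so that the checkpointer can
 * fsync each dirty segment just once, at the next checkpoint.
 */
static bool
slru_register_dirty_segment(SlruCtl ctl, int segno, int fd)
{
	FileTag		tag;

	memset(&tag, 0, sizeof(tag));
	tag.handler = ctl->sync_handler;	/* e.g. SYNC_HANDLER_CLOG */
	tag.segno = segno;

	/*
	 * Queue the request; if the checkpointer's request queue is full and
	 * can't be compacted, fall back to doing the fsync ourselves, as md.c
	 * does for relation segments.
	 */
	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
		return pg_fsync(fd) == 0;

	return true;
}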

Experiment:

Run crash recovery for 1 million pgbench transactions:

  postgres -D pgdata \
-c synchronous_commit=off \
-c track_commit_timestamps=on \
-c max_wal_size=10GB \
-c checkpoint_timeout=60min

  # in another shell
  pgbench -i -s10 postgres
  psql postgres -c checkpoint
  pgbench -t100 -Mprepared postgres
  killall -9 postgres

  # save the crashed pgdata dir for repeated experiments
  mv pgdata pgdata-save

  # now run experiments like this and see how long recovery takes
  rm -fr pgdata
  cp -r pgdata-save pgdata
  postgres -D pgdata

What I see on a system that has around 2.5ms latency for fsync:

  master: 16.83 seconds
  patched: 4.00 seconds

It's harder to see it without commit timestamps enabled since we only
need to flush a pg_xact page every 32k transactions (and multixacts
are more complicated to test), but you can still see the effect.  With
8x more transactions to make it clearer what's going on, I could
measure a speedup of around 6% from this patch, which I suppose scales
up fairly obviously with storage latency (every million transactions =
at least 30 fsync stalls, so you can multiply that by your fsync
latency and work out how much time your recovery process will be
asleep at the wheel instead of applying your records).
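
(Back-of-the-envelope check of those numbers, assuming the default 8kB
BLCKSZ: pg_xact stores 2 bits per transaction, so one page covers
8192 * 4 = 32768 transactions, and 1,000,000 / 32768 is roughly 30 page
writes, hence "at least 30" fsync stalls per million transactions.  With
track_commit_timestamps on, each entry is 10 bytes, so a page only covers
8192 / 10 = ~819 transactions, and the same workload produces roughly 1,200
page writes, which is why the effect is so much easier to see there.)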

From a syscall overhead point of view, it's a bit unfortunate that we
open and close SLRU segments every time we write, but it's probably
not really enough to complain about... except for the (small) risk of
an inode dropping out of kernel caches in the time between closing it
and the checkpointer opening it.  Hmm.
From 0fe8767316d5a973f43553080c3973759d83038d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH v2] Ask the checkpointer to flush SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Ask the checkpointer to collapse requests for the same file into a
single system call as part of the next checkpoint, as we do for relation
files.

Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c  | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c|  24 +-
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  12 ++-
 src/include/storage/sync.h |   7 +-
 13 files changed, 172 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..0c5b7a525e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 903280ae92..b4edbdb4e3 100644
--- 

Handing off SLRU fsyncs to the checkpointer

2020-02-12 Thread Thomas Munro
Hi,

In commit 3eb77eba we made it possible for any subsystem that wants a
file to be flushed as part of the next checkpoint to ask the
checkpointer to do that, as previously only md.c could do.

In the past, foreground CLOG flush stalls were a problem, but then
commit 33aaa139 cranked up the number of buffers, and then commit
5364b357 cranked it right up until the flushes mostly disappeared from
some benchmark workload but not so high that the resulting linear
searches through the buffer array destroyed the gains.  I know there
is interest in moving that stuff into regular shared buffers, so it
can be found via the buffer mapping system (and improve as that
improves), written back by the background writer (and improve as that
improves), managed with a proper replacement algorithm (and improve as
that improves), etc etc.  That sounds like a great idea to me, but
it's a big project.

In the meantime, one thing we could do is hand off the fsyncs, but I'm
not sure if it's still considered a real problem in the field given
the new parameters.

Anyway, I had a patch for that, which I used while testing commit
3eb77eba.  While reading old threads about SLRU today I found that
several people had wished for a thing exactly like that, so I dusted
it off and rebased it.


0001-Use-the-checkpointer-to-fsync-SLRU-files.patch
Description: Binary data