Re: Handing off SLRU fsyncs to the checkpointer
On Mon, Jan 4, 2021 at 3:35 AM Tomas Vondra wrote:
> Seems this commit left behind a couple unnecessary prototypes in a bunch
> of header files. In particular, it removed these functions
>
> - ShutdownCLOG();
> - ShutdownCommitTs();
> - ShutdownSUBTRANS();
> - ShutdownMultiXact();

Thanks. Fixed.
Re: Handing off SLRU fsyncs to the checkpointer
On 9/25/20 9:09 AM, Thomas Munro wrote:
> On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro wrote:
>> Here's a new version. The final thing I'm contemplating before
>> pushing this is whether there may be hidden magical dependencies in
>> the order of operations in CheckPointGuts(), which I've changed
>> around. Andres, any comments?
>
> I nagged Andres off-list and he opined that it might be better to
> reorder it a bit so that ProcessSyncRequests() comes after almost
> everything else, so that if we ever teach more things to offload their
> fsync work it'll be in the right order. I reordered it like that; now
> only CheckPointTwoPhase() comes later, based on the comment that
> accompanies it. In any case, we can always reconsider the ordering of
> this function in later commits as required.
>
> Pushed like that.

Seems this commit left behind a couple unnecessary prototypes in a bunch
of header files. In particular, it removed these functions

- ShutdownCLOG();
- ShutdownCommitTs();
- ShutdownSUBTRANS();
- ShutdownMultiXact();

but we still have

$ git grep ShutdownCLOG
src/include/access/clog.h:extern void ShutdownCLOG(void);

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Handing off SLRU fsyncs to the checkpointer
On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro wrote:
> Here's a new version. The final thing I'm contemplating before
> pushing this is whether there may be hidden magical dependencies in
> the order of operations in CheckPointGuts(), which I've changed
> around. Andres, any comments?

I nagged Andres off-list and he opined that it might be better to
reorder it a bit so that ProcessSyncRequests() comes after almost
everything else, so that if we ever teach more things to offload their
fsync work it'll be in the right order. I reordered it like that; now
only CheckPointTwoPhase() comes later, based on the comment that
accompanies it. In any case, we can always reconsider the ordering of
this function in later commits as required.

Pushed like that.
Re: Handing off SLRU fsyncs to the checkpointer
On Fri, Sep 25, 2020 at 12:05 PM Tom Lane wrote:
> Thomas Munro writes:
>> Tom, do you have any thoughts on ShutdownCLOG() etc?
>
> Hm, if we cannot reach that without first completing a shutdown checkpoint,
> it does seem a little pointless.

Thanks for the sanity check.

> It'd likely be a good idea to add a comment to CheckPointCLOG et al
> explaining that we expect $what-exactly to fsync the data we are writing
> before the checkpoint is considered done.

Good point. Done like this:

+	/*
+	 * Write dirty CLOG pages to disk. This may result in sync requests queued
+	 * for later handling by ProcessSyncRequests(), as part of the checkpoint.
+	 */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-	SimpleLruFlush(XactCtl, true);
+	SimpleLruWriteAll(XactCtl, true);
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);

Here's a new version. The final thing I'm contemplating before
pushing this is whether there may be hidden magical dependencies in
the order of operations in CheckPointGuts(), which I've changed
around. Andres, any comments?

From a722d71c4296e5fae778a783eb8140be7887eced Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 25 Sep 2020 11:16:03 +1200
Subject: [PATCH v8] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This causes a significant improvement to recovery
performance.

Remove the Shutdown{CLOG,CommitTs,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.

Also rearrange things so that data collected in CheckpointStats includes
SLRU activity.

Reviewed-by: Tom Lane (ShutdownXXX removal part)
Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  39 +++
 src/backend/access/transam/commit_ts.c |  36 +++---
 src/backend/access/transam/multixact.c |  57 +
 src/backend/access/transam/slru.c      | 154 +
 src/backend/access/transam/subtrans.c  |  25 +---
 src/backend/access/transam/xlog.c      |  18 ++-
 src/backend/commands/async.c           |   5 +-
 src/backend/storage/buffer/bufmgr.c    |   7 --
 src/backend/storage/lmgr/predicate.c   |   8 +-
 src/backend/storage/sync/sync.c        |  28 -
 src/include/access/clog.h              |   3 +
 src/include/access/commit_ts.h         |   3 +
 src/include/access/multixact.h         |   4 +
 src/include/access/slru.h              |  14 ++-
 src/include/storage/sync.h             |   7 +-
 src/tools/pgindent/typedefs.list       |   1 +
 16 files changed, 244 insertions(+), 165 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..2ec6a924db 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+				  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -808,34 +810,18 @@ TrimCLOG(void)
 	LWLockRelease(XactSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
-	/* Flush dirty CLOG pages to disk */
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
-	SimpleLruFlush(XactCtl, false);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
 /*
  * Perform a checkpoint --- either during shutdown, or on-the-fly
  */
 void
 CheckPointCLOG(void)
 {
-	/* Flush dirty CLOG pages to disk */
+	/*
+	 * Write dirty CLOG pages to disk. This may result in sync requests queued
+	 * for later handling by ProcessSyncRequests(), as part of the checkpoint.
+	 */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-	SimpleLruFlush(XactCtl, true);
+	SimpleLruWriteAll(XactCtl, true);
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
@@ -1026,3 +1012,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
Re: Handing off SLRU fsyncs to the checkpointer
Thomas Munro writes:
> Tom, do you have any thoughts on ShutdownCLOG() etc?

Hm, if we cannot reach that without first completing a shutdown checkpoint,
it does seem a little pointless.

It'd likely be a good idea to add a comment to CheckPointCLOG et al
explaining that we expect $what-exactly to fsync the data we are writing
before the checkpoint is considered done.

			regards, tom lane
Re: Handing off SLRU fsyncs to the checkpointer
On Wed, Sep 23, 2020 at 1:56 PM Thomas Munro wrote:
> As for the ShutdownXXX() functions, I haven't yet come up with any
> reason for this code to exist. Emboldened by a colleague's inability
> to explain to me what that code is doing for us, here is a new version
> that just rips it all out.

Rebased. Tom, do you have any thoughts on ShutdownCLOG() etc?

From 07196368e6cd36df5910b536e93570c9acff4ff2 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Fri, 25 Sep 2020 11:16:03 +1200
Subject: [PATCH v7] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This causes a significant improvement to recovery
performance.

Remove the Shutdown{CLOG,CommitTs,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.

Also rearrange things so that data collected in CheckpointStats includes
SLRU activity.

Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  36 +++---
 src/backend/access/transam/commit_ts.c |  32 +++--
 src/backend/access/transam/multixact.c |  53 +
 src/backend/access/transam/slru.c      | 154 +
 src/backend/access/transam/subtrans.c  |  25 +---
 src/backend/access/transam/xlog.c      |  18 ++-
 src/backend/commands/async.c           |   5 +-
 src/backend/storage/buffer/bufmgr.c    |   7 --
 src/backend/storage/lmgr/predicate.c   |   8 +-
 src/backend/storage/sync/sync.c        |  28 -
 src/include/access/clog.h              |   3 +
 src/include/access/commit_ts.h         |   3 +
 src/include/access/multixact.h         |   4 +
 src/include/access/slru.h              |  14 ++-
 src/include/storage/sync.h             |   7 +-
 src/tools/pgindent/typedefs.list       |   1 +
 16 files changed, 233 insertions(+), 165 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..c4c20f508b 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+				  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -808,34 +810,15 @@ TrimCLOG(void)
 	LWLockRelease(XactSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
-	/* Flush dirty CLOG pages to disk */
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
-	SimpleLruFlush(XactCtl, false);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
 /*
  * Perform a checkpoint --- either during shutdown, or on-the-fly
 */
 void
 CheckPointCLOG(void)
 {
-	/* Flush dirty CLOG pages to disk */
+	/* Write dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-	SimpleLruFlush(XactCtl, true);
+	SimpleLruWriteAll(XactCtl, true);
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
@@ -1026,3 +1009,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return SlruSyncFileTag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index f6a7329ba3..af5d9baae5 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
 				  CommitTsSLRULock, "pg_commit_ts",
-				  LWTRANCHE_COMMITTS_BUFFER);
+				  LWTRANCHE_COMMITTS_BUFFER,
+				  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 									 sizeof(CommitTimestampShared),
@@ -798,30 +799,14 @@ DeactivateCommitTs(void)
 	LWLockRelease(CommitTsSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or
Re: Handing off SLRU fsyncs to the checkpointer
On Tue, Sep 22, 2020 at 9:08 AM Thomas Munro wrote:
> On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro wrote:
>> While scanning for comments and identifier names that needed updating,
>> I realised that this patch changed the behaviour of the ShutdownXXX()
>> functions, since they currently flush the SLRUs but are not followed
>> by a checkpoint. I'm not entirely sure I understand the logic of
>> that, but it wasn't my intention to change it. So here's a version
>> that converts the existing fsync_fname() to fsync_fname_recurse() to
>
> Bleugh, that was probably a bad idea, it's too expensive. But it
> forces me to ask the question: *why* do we need to call
> Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
> shutdown checkpoint? I wondered if this might date from before the
> WAL, but I see that the pattern was introduced when the CLOG was moved
> out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
> but even in that commit the preceding CreateCheckPoint() call included
> a call to CheckPointCLOG().

I complained about the apparently missing multixact fsync in a new
thread, because if I'm right about that it requires a back-patchable
fix.

As for the ShutdownXXX() functions, I haven't yet come up with any
reason for this code to exist. Emboldened by a colleague's inability
to explain to me what that code is doing for us, here is a new version
that just rips it all out.

From 89e5b787ea5efd03ced8da46f95f236e03bf4adf Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 23 Sep 2020 13:02:27 +1200
Subject: [PATCH v6 1/2] Fix missing fsync of multixact directories.

Standardize behavior by moving responsibility for fsyncing directories
down into slru.c.

Back-patch to all supported releases.

---
 src/backend/access/transam/clog.c      | 7 ---
 src/backend/access/transam/commit_ts.c | 6 --
 src/backend/access/transam/slru.c      | 7 +++
 3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..9e352d2658 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -836,13 +836,6 @@ CheckPointCLOG(void)
 	/* Flush dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
 	SimpleLruFlush(XactCtl, true);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..f6a7329ba3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -822,12 +822,6 @@ CheckPointCommitTs(void)
 {
 	/* Flush dirty CommitTs pages to disk */
 	SimpleLruFlush(CommitTsCtl, true);
-
-	/*
-	 * fsync pg_commit_ts to ensure that any files flushed previously are
-	 * durably on disk.
-	 */
-	fsync_fname("pg_commit_ts", true);
 }
 
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..89ce7c43b5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1187,6 +1187,13 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 	}
 	if (!ok)
 		SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+	/*
+	 * Make sure that the directory entries for any newly created files are on
+	 * disk.
+	 */
+	if (ctl->do_fsync)
+		fsync_fname(ctl->Dir, true);
 }
 
 /*
-- 
2.20.1

From ae51604c43e10ba6d24b628c2b907eca251177e9 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 21 Sep 2020 09:43:32 +1200
Subject: [PATCH v6 2/2] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This causes a significant improvement to recovery
performance.

Remove the Shutdown{CLOG,CommitTs,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.

Also rearrange things so that data collected in CheckpointStats includes
SLRU activity.

Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  36 +++---
 src/backend/access/transam/commit_ts.c |  32 +++--
 src/backend/access/transam/multixact.c |  53 +
 src/backend/access/transam/slru.c      | 154 +
 src/backend/access/transam/subtrans.c  |  25 +---
Re: Handing off SLRU fsyncs to the checkpointer
On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro wrote:
> While scanning for comments and identifier names that needed updating,
> I realised that this patch changed the behaviour of the ShutdownXXX()
> functions, since they currently flush the SLRUs but are not followed
> by a checkpoint. I'm not entirely sure I understand the logic of
> that, but it wasn't my intention to change it. So here's a version
> that converts the existing fsync_fname() to fsync_fname_recurse() to

Bleugh, that was probably a bad idea, it's too expensive. But it
forces me to ask the question: *why* do we need to call
Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
shutdown checkpoint? I wondered if this might date from before the
WAL, but I see that the pattern was introduced when the CLOG was moved
out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
but even in that commit the preceding CreateCheckPoint() call included
a call to CheckPointCLOG().
Re: Handing off SLRU fsyncs to the checkpointer
On Sun, Sep 20, 2020 at 12:40 PM Thomas Munro wrote:
> On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro wrote:
>> In the meantime, from the low-hanging-fruit department, here's a new
>> version of the SLRU-fsync-offload patch. The only changes are a
>> tweaked commit message, and adoption of C99 designated initialisers
>> for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
>> relying on humans to make the array order match the enum values. If
>> there are no comments or objections, I'm planning to commit this quite
>> soon.
>
> ... and CI told me that Windows didn't like my array syntax with the
> extra trailing comma. Here's one without.

While scanning for comments and identifier names that needed updating,
I realised that this patch changed the behaviour of the ShutdownXXX()
functions, since they currently flush the SLRUs but are not followed by
a checkpoint. I'm not entirely sure I understand the logic of that, but
it wasn't my intention to change it. So here's a version that converts
the existing fsync_fname() to fsync_fname_recurse() to fix that.

Strangely, the fsync calls that ensure that directory entries are on
disk seemed to be missing from CheckPointMultixact(), so I added them.
Isn't that a live bug?

I decided it was a little too magical that CheckPointCLOG() etc depended
on the later call to CheckPointBuffers() to perform their fsyncs. I
started writing comments about that, but then I realised that the right
thing to do was to hoist ProcessSyncRequests() out of there into
CheckPointGuts() and make it all more explicit.

I also realised that it would be inconsistent to count SLRU sync
activity as buffer sync time, but not count SLRU write activity as
buffer write time, or count its buffers as written in the reported
statistics. In other words, SLRU buffers *are* buffers for checkpoint
reporting purposes (or should at least be consistently in or out of the
stats, and with this patch they have to be in). Does that make sense?

Is there a problem I'm not seeing with reordering CheckPointGuts() as I
have?

One comment change that seems worth highlighting is this code reached by
VACUUM, which seems like a strict improvement (it wasn't flushing for
crash recovery):

 	/*
-	 * Flush out dirty data, so PhysicalPageExists can work correctly.
-	 * SimpleLruFlush() is a pretty big hammer for that.  Alternatively we
-	 * could add an in-memory version of page exists, but find_multixact_start
-	 * is called infrequently, and it doesn't seem bad to flush buffers to
-	 * disk before truncation.
+	 * Write out dirty data, so PhysicalPageExists can work correctly.
 	 */
-	SimpleLruFlush(MultiXactOffsetCtl, true);
-	SimpleLruFlush(MultiXactMemberCtl, true);
+	SimpleLruWriteAll(MultiXactOffsetCtl, true);
+	SimpleLruWriteAll(MultiXactMemberCtl, true);

From c0b2281a983144d39ff7288431ef9c4af51b2e32 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 21 Sep 2020 09:43:32 +1200
Subject: [PATCH v5] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This causes a significant improvement to recovery
performance.

Perform a new recursive fsync() in the ShutdownXXX() functions to
preserve existing behavior, since they won't be followed by a
checkpoint.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.

Also rearrange things so that data collected in CheckpointStats includes
SLRU activity.

Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  29 +++--
 src/backend/access/transam/commit_ts.c |  28 +++--
 src/backend/access/transam/multixact.c |  62 +++---
 src/backend/access/transam/slru.c      | 150 +
 src/backend/access/transam/subtrans.c  |  12 +-
 src/backend/access/transam/xlog.c      |  14 ++-
 src/backend/commands/async.c           |   5 +-
 src/backend/storage/buffer/bufmgr.c    |   7 --
 src/backend/storage/file/fd.c          |  20 
 src/backend/storage/lmgr/predicate.c   |   8 +-
 src/backend/storage/sync/sync.c        |  28 -
 src/include/access/clog.h              |   3 +
 src/include/access/commit_ts.h         |   3 +
 src/include/access/multixact.h         |   4 +
 src/include/access/slru.h              |  14 ++-
 src/include/storage/fd.h               |   1 +
 src/include/storage/sync.h             |   7 +-
 src/tools/pgindent/typedefs.list       |   1 +
 18 files changed, 284 insertions(+), 112
Re: Handing off SLRU fsyncs to the checkpointer
On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro wrote:
> In the meantime, from the low-hanging-fruit department, here's a new
> version of the SLRU-fsync-offload patch. The only changes are a
> tweaked commit message, and adoption of C99 designated initialisers
> for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
> relying on humans to make the array order match the enum values. If
> there are no comments or objections, I'm planning to commit this quite
> soon.

... and CI told me that Windows didn't like my array syntax with the
extra trailing comma. Here's one without.

From 89bc77827a73c93914f018fc862439f53ff54a39 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 19 Sep 2020 16:22:06 +1200
Subject: [PATCH v4 1/2] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages due to cache pressure, leading to an I/O stall in user
backends and recovery. Collapse requests for the same file into a
single system call as part of the next checkpoint, as we already do for
relation files. This causes a significant improvement to recovery
times.

Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c      | 101 +++++++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c           |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c        |  28 ++-
 src/include/access/clog.h              |   3 +
 src/include/access/commit_ts.h         |   3 +
 src/include/access/multixact.h         |   4 +
 src/include/access/slru.h              |  12 ++-
 src/include/storage/sync.h             |   7 +-
 13 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..304612c159 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+				  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1033,3 +1035,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
 				  CommitTsSLRULock, "pg_commit_ts",
-				  LWTRANCHE_COMMITTS_BUFFER);
+				  LWTRANCHE_COMMITTS_BUFFER,
+				  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 									 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
 				  "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
 				  MultiXactOffsetSLRULock, "pg_multixact/offsets",
-				  LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+				  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+				  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
 				  MultiXactMemberSLRULock, "pg_multixact/members",
-				  LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+				  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+				  SYNC_HANDLER_MULTIXACT_MEMBER);
 
 	/* Initialize our shared state struct */
 	MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
 
 	SRF_RETURN_DONE(funccxt);
 }
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
Re: Handing off SLRU fsyncs to the checkpointer
On Mon, Aug 31, 2020 at 8:50 PM Jakub Wartak wrote:
> - IO_URING - gives a lot of promise here I think, is it even planned to be
> shown for PgSQL14 cycle ? Or it's more like PgSQL15?

I can't answer that, but I've played around with the prototype quite a
bit, and thought quite a lot about how to port it to systems without
IO_URING, and I'm just as keen to see this happen as you are.

In the meantime, from the low-hanging-fruit department, here's a new
version of the SLRU-fsync-offload patch. The only changes are a
tweaked commit message, and adoption of C99 designated initialisers
for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
relying on humans to make the array order match the enum values. If
there are no comments or objections, I'm planning to commit this quite
soon.

From 0eba29484684a0642c4424741e256b0787051a19 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Sat, 19 Sep 2020 16:22:06 +1200
Subject: [PATCH v3] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages due to cache pressure, leading to an I/O stall in user
backends and recovery. Collapse requests for the same file into a
single system call as part of the next checkpoint, as we already do for
relation files. This causes a significant improvement to recovery
times.

Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c      | 101 +++++++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c           |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c        |  30 +++-
 src/include/access/clog.h              |   3 +
 src/include/access/commit_ts.h         |   3 +
 src/include/access/multixact.h         |   4 +
 src/include/access/slru.h              |  12 ++-
 src/include/storage/sync.h             |   7 +-
 13 files changed, 175 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..304612c159 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+				  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1033,3 +1035,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
 				  CommitTsSLRULock, "pg_commit_ts",
-				  LWTRANCHE_COMMITTS_BUFFER);
+				  LWTRANCHE_COMMITTS_BUFFER,
+				  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 									 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
 				  "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
 				  MultiXactOffsetSLRULock, "pg_multixact/offsets",
-				  LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+				  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+				  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
 				  MultiXactMemberSLRULock, "pg_multixact/members",
-				  LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+				  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+				  SYNC_HANDLER_MULTIXACT_MEMBER);
 
 	/* Initialize our shared state struct */
 	MultiXactState =
Re: Handing off SLRU fsyncs to the checkpointer
Hi Thomas, hackers,

>> ... %CPU ... COMMAND
>> ... 97.4 ... postgres: startup recovering 00010089

> So, what else is pushing this thing off CPU, anyway? For one thing, I
> guess it might be stalling while reading the WAL itself, because (1)
> we only read it 8KB at a time, relying on kernel read-ahead, which
> typically defaults to 128KB I/Os unless you cranked it up, but for
> example we know that's not enough to saturate a sequential scan on
> NVME system, so maybe it hurts here too (2) we keep having to switch
> segment files every 16MB. Increasing WAL segment size and kernel
> readahead size presumably help with that, if indeed it is a problem,
> but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
> future segment every time we cross a boundary, and also maybe increase
> the size of our reads.

All of the above (1, 2) would make sense, and the effects IMHO are partially
possible to achieve via ./configure compile options, but from previous
correspondence [1] in this particular workload it looked like it was not WAL
reading, but reading random DB blocks into shared buffers: in that case I
suppose it was the price of too many syscalls to the OS/VFS cache itself, as
the DB was small and fully cached there - so problem (3):
copy_user_enhanced_fast_string <- 17.79%--copyout (!) <- __pread_nocancel <-
16.56%--FileRead / mdread / ReadBuffer_common (!). Without some
micro-optimization or some form of vectorized [batching] I/O in recovery,
it's a dead end when it comes to small changes. Things that come to mind for
enhancing recovery:

- preadv() - works only on a single fd, while the WAL stream might require
reading a lot of random pages into s_b (many relations/fds; even btree
inserts into a single relation might put data into many 1GB [default] forks).
This would only micro-optimize INSERT (pk) SELECT nextval(seq) kinds of
processing on the recovery side, I suppose. Of course, that's provided
StartupXLOG worked in a more batched way: (a) reading a lot of blocks from
WAL at once, (b) then issuing preadv() to get all the DB blocks into s_b
coming from the same rel/fd, (c) applying WAL. Sounds like a major refactor
just to save syscalls :(

- mmap() - even more unrealistic

- io_uring - gives a lot of promise here I think; is it even planned to be
shown for the PgSQL 14 cycle, or is it more like PgSQL 15?

-Jakub Wartak

[1] - https://www.postgresql.org/message-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520%40VI1PR0701MB6960.eurprd07.prod.outlook.com
please see the profile after "0001+0002 s_b(at)128MB: 85.392"
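For reference, the POSIX_FADV_WILLNEED idea quoted above could be sketched like this (illustrative C only, not PostgreSQL code; the function name and the fixed 16MB default segment size are assumptions for illustration):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hint the kernel to start reading the next WAL segment into its page
 * cache as soon as recovery crosses a segment boundary, so replay of
 * the current segment overlaps with I/O for the next one.
 */
#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)

int
prefetch_wal_segment(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;				/* segment doesn't exist (yet) */
#ifdef POSIX_FADV_WILLNEED
	/* Ask the kernel to page the whole segment in asynchronously. */
	(void) posix_fadvise(fd, 0, WAL_SEGMENT_SIZE, POSIX_FADV_WILLNEED);
#endif
	close(fd);
	return 0;
}
```

The hint is advisory: the kernel may ignore it, and on an already-cached workload (problem (3) above) it would change nothing, which is why it only addresses the WAL-reading side of the stall.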
Re: Handing off SLRU fsyncs to the checkpointer
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak wrote: > ... %CPU ... COMMAND > ... 97.4 ... postgres: startup recovering 00010089 So, what else is pushing this thing off CPU, anyway? For one thing, I guess it might be stalling while reading the WAL itself, because (1) we only read it 8KB at a time, relying on kernel read-ahead, which typically defaults to 128KB I/Os unless you cranked it up, but for example we know that's not enough to saturate a sequential scan on NVME system, so maybe it hurts here too (2) we keep having to switch segment files every 16MB. Increasing WAL segment size and kernel readahead size presumably help with that, if indeed it is a problem, but we could also experiment with a big POSIX_FADV_WILLNEED hint for a future segment every time we cross a boundary, and also maybe increase the size of our reads.
Re: Handing off SLRU fsyncs to the checkpointer
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak wrote:
> USER       PID %CPU %MEM    VSZ  RSS TTY STAT START TIME COMMAND
> postgres 120935  0.9  0.0 866052 3824 ?   Ss   09:47 0:00 postgres: checkpointer
> postgres 120936 61.9  0.0 865796 3824 ?   Rs   09:47 0:22 postgres: background writer
> postgres 120937 97.4  0.0 865940 5228 ?   Rs   09:47 0:36 postgres: startup recovering 00010089
>
> speedup of 1.647x

Thanks for testing! That's better than I expected. I guess it wasn't
quite so good with default bgwriter settings.

> I have only one comment about those 2 WIP patches: bgwriter_lru_maxpages
> should maybe be called standby_bgwriter_lru_maxpages in this scenario, or,
> even more preferred, there shouldn't be a maximum set during the closed-DB
> recovery scenario.

I wish bgwriter could auto-tune itself better, so we wouldn't need to
contemplate adding more settings.

As for the second patch ("Optionally, don't wait for end-of-recovery
checkpoint."), that also looked quite useful in your test scenario:

> end_of_recovery_checkpoint_wait = off, before DB is open 15s faster
> I suppose a checkpoint for large shared_buffers (hundreds of GB) might take
> a lot of time, and this 0002 patch bypasses that. I would find it quite useful
> in some scenarios (e.g. testing backups, PITR recoveries, opening a DB from
> storage snapshots / storage replication, maybe with DWH-after-crash too).

I suppose a third option that you might want is no checkpoint at all
(meaning leave it to regular checkpoint scheduling), like fast
promotion. One thing missing from the patch is that we probably need
to log an end-of-recovery *record*, like fast promotion does. I'm a
little fuzzy on the timeline stuff. I wonder if any recovery experts
would like to weigh in on theoretical problems that might be lurking
here...
Re: Handing off SLRU fsyncs to the checkpointer
Hi Thomas, hackers, >> > To move these writes out of recovery's way, we should probably just >> > run the bgwriter process during crash recovery. I'm going to look >> > into that. >> >> Sounds awesome. > >I wrote a quick and dirty experimental patch to try that. I can't see >any benefit from it on pgbench with default shared buffers, but maybe >it would do better with your append test due to locality, especially >if you can figure out how to tune bgwriter to pace itself optimally. >https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery OK, so I've quickly tested those two PoCs patches together, in the conditions like below: - similar append-only workload by pgbench (to eliminate other already known different WAL bottlenecks: e.g. sorting), - 4.3GB of WAL to be applied (mostly Btree/INSERT_LEAF) - on same system as last time (ext4 on NVMe, 1s8c16, 4.14 kernel) - 14master already with SLRU fsync to checkpointer/pg_qgsort patches applied TEST bgwriterPOC1: - in severe dirty memory conditions (artificially simulated via small s_b here) --> so for workloads with very high FlushBuffer activity in StartupXLOG - with fsync=off/fpw=off by default and on NVMe (e.g. scenario: I want to perform some PITR as fast as I can to see how production data looked like in the past, before some user deleted some data) baseline s_b@128MB: 140.404, 0.123 (2nd small as there is small region to checkpoint) 22.49% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string ---copy_user_enhanced_fast_string |--14.72%--copyin | __pwrite_nocancel | FileWrite | mdwrite | FlushBuffer | ReadBuffer_common | --14.52%--btree_xlog_insert --7.77%--copyout __pread_nocancel --7.57%--FileRead mdread ReadBuffer_common 6.13% postgres [kernel.kallsyms] [k] do_syscall_64 |--1.64%--__pwrite_nocancel --1.23%--__pread_nocancel 3.68% postgres postgres [.] 
hash_search_with_hash_value ---hash_search_with_hash_value |--1.02%--smgropen After applying: patch -p1 < ../0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch patch -p1 < ../0002-Optionally-don-t-wait-for-end-of-recovery-checkpoint.patch 0001+0002 s_b@128MB: similar result to above 0001+0002 s_b@128MB: 108.871, 0.114 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 1000 0001+0002 s_b@128MB: 85.392, 0.103 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 5 #~390MB max? 18.40% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string ---copy_user_enhanced_fast_string |--17.79%--copyout | __pread_nocancel | |--16.56%--FileRead | | mdread | | ReadBuffer_common --0.61%--copyin // WOW __pwrite_nocancel FileWrite mdwrite FlushBuffer ReadBuffer_common 9.20% postgres postgres [.] hash_search_with_hash_value ---hash_search_with_hash_value |--4.70%--smgropen of course there is another WOW moment during recovery ("61.9%") USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND postgres 120935 0.9 0.0 866052 3824 ? Ss 09:47 0:00 postgres: checkpointer postgres 120936 61.9 0.0 865796 3824 ? Rs 09:47 0:22 postgres: background writer postgres 120937 97.4 0.0 865940 5228 ? Rs 09:47 0:36 postgres: startup recovering 00010089 speedup of 1.647x when dirty memory is in way. When it's not: baseline s_b@24000MB: 39.199, 1.448 (2x patches off) 0001+0002 s_b@24000MB: 39.383, 1.442 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 5 #~390MB/s max, yay there's no regression. I have only one comment about those 2 WIP patches, bgwriter_lru_maxpages should be maybe called standby_bgwriter_lru_maxpages in this scenario or even more preferred there shouldn't be a maximum set during closed DB recovery scenario. TEST bgwriterPOC2a to showcase the 2nd patch which opens the DB for read-write users before the final checkpoint finishes after redo recovery. The DBA may make the decision via this parameter end_of_recovery_checkpoint_wait=off. 
- on slow storage (xvda, fsync=on) and even with high memory: s_b@24000MB: 39.043, 15.639 -- even with WAL recovery being 100% CPU bound (mostly on hash_search_with_hash_value() for Buffers/__memmove_ssse3_back), it took an additional 15s to perform the checkpoint before the DB was open for users (it had to write 269462 buffers =~ 2GB =~
Re: Handing off SLRU fsyncs to the checkpointer
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak wrote: > >> 29.62% postgres [kernel.kallsyms] [k] > >> copy_user_enhanced_fast_string > >> ---copy_user_enhanced_fast_string > >>|--17.98%--copyin > >> [..] > >>| __pwrite_nocancel > >>| FileWrite > >>| mdwrite > >>| FlushBuffer > >>| ReadBuffer_common > >>| ReadBufferWithoutRelcache > >>| XLogReadBufferExtended > >>| XLogReadBufferForRedoExtended > >>| --17.57%--btree_xlog_insert > > > > To move these writes out of recovery's way, we should probably just > > run the bgwriter process during crash recovery. I'm going to look > > into that. > > Sounds awesome. I wrote a quick and dirty experimental patch to try that. I can't see any benefit from it on pgbench with default shared buffers, but maybe it would do better with your append test due to locality, especially if you can figure out how to tune bgwriter to pace itself optimally. https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery
Re: Handing off SLRU fsyncs to the checkpointer
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak wrote:
> I've tried to get cache misses ratio via PMCs, apparently on EC2 they are
> (even on bigger) reporting as not-supported or zeros.

I heard some of the counters are only allowed on their dedicated
instance types.

> However interestingly the workload has IPC of 1.40 (instruction bound) which
> to me is strange as I would expect BufTableLookup() to be actually heavy
> memory bound (?) Maybe I'll try on some different hardware one day.

Hmm, OK now you've made me go and read a bunch of Brendan Gregg blogs
and try some experiments of my own to get a feel for this number and
what it might be telling us about the cache miss counters you can't
see. Since I know how to generate arbitrary cache miss workloads for
quick experiments using hash joins of different sizes, I tried that and
noticed that when LLC misses were at 76% (bad), IPC was at 1.69 which
is still higher than what you're seeing. When the hash table was much
smaller and LLC misses were down to 15% (much better), IPC was at 2.83.
I know Gregg said[1] "An IPC < 1.0 likely means memory bound, and an
IPC > 1.0 likely means instruction bound", but that's not what I'm
seeing here, and in his comments section that is disputed. So I'm not
sure your IPC of 1.40 is evidence against the hypothesis on its own.

[1] http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
Re: Handing off SLRU fsyncs to the checkpointer
Hi Alvaro, Thomas, hackers

>> 14.69% postgres postgres [.] hash_search_with_hash_value
>> ---hash_search_with_hash_value
>>    |--9.80%--BufTableLookup
[..]
>>    --4.90%--smgropen
>>              |--2.86%--ReadBufferWithoutRelcache

> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it. BufTableLookup was also there but only 1.51%. So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Yes, quite frankly my expectation was to see the
hash_search_with_hash_value() <- smgropen() outcome as the 1st one, but
in the simplified redo-bench script it's not the case. The original
scenario was much more complex, with plenty of differences (in no
particular order: TB-sized DB vs ~500GB RAM -> thousands of forks,
multiple tables, huge btrees, multiple INSERTs with plenty of data in
VALUES() thrown as one commit, real primary->hot-standby replication
[not a closed DB in recovery], sorted not random UUIDs) - I'm going to
try to nail down these differences and maybe I'll manage to produce a
more realistic "pgbench reproducer" (this may take some time though).

-Jakub Wartak.
Re: Handing off SLRU fsyncs to the checkpointer
Hi Thomas / hackers,

>> The append-only bottleneck appears to be limited by syscalls/s due to small
>> block size even with everything in FS cache (but not in shared buffers,
>> please compare with TEST1 as there was no such bottleneck at all):
>>
>> 29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
>> ---copy_user_enhanced_fast_string
>>    |--17.98%--copyin
[..]
>>    | __pwrite_nocancel
>>    | FileWrite
>>    | mdwrite
>>    | FlushBuffer
>>    | ReadBuffer_common
>>    | ReadBufferWithoutRelcache
>>    | XLogReadBufferExtended
>>    | XLogReadBufferForRedoExtended
>>    | --17.57%--btree_xlog_insert

> To move these writes out of recovery's way, we should probably just
> run the bgwriter process during crash recovery. I'm going to look
> into that.

Sounds awesome. Also, as this thread is starting to derail from the SLRU
fsync topic - maybe we should change the subject?

However, to add some data on the separate bgwriter: on 14master (already
with lseek() caching, SLRU fsyncs out of the way, better sorting), I've
measured the same configuration as last time, with still the same
append-only WAL workload on NVMe, and compared various shared_buffers
settings (with buffer descriptor sizing from pg_shmem_allocations, which
as you stated is wrongly reported(?) and which I'm stating only for
reference just in case):

shared_buffers=128MB buffers_desc=1024kB 96.778, 0.438 [a]
shared_buffers=256MB buffers_desc=2048kB 62.755, 0.577 [a]
shared_buffers=512MB buffers_desc=4096kB 33.167, 0.62  [a]
shared_buffers=1GB   buffers_desc=8192kB 27.303, 0.929
shared_buffers=4GB   buffers_desc=32MB   27.185, 1.166
shared_buffers=8GB   buffers_desc=64MB   27.649, 1.088
shared_buffers=16GB  buffers_desc=128MB  27.584, 1.201 [b]
shared_buffers=32GB  buffers_desc=256MB  32.314, 1.171 [b]
shared_buffers=48GB  buffers_desc=384MB  31.95,  1.217
shared_buffers=64GB  buffers_desc=512MB  31.276, 1.349
shared_buffers=72GB  buffers_desc=576MB  31.925, 1.284
shared_buffers=80GB  buffers_desc=640MB  31.809, 1.413

The amount of WAL to be replayed was ~2.8GB. To me it looks like:
a) flushing dirty buffers from StartupXLOG might be a real problem, but
please read on;
b) there is very low impact from this L2/L3 hypothesis you mention (?) -
it's not that big and it's not degrading linearly as one would expect (??)
The L1d/L1i/L2/L3 cache sizes on this machine are: 32K/32K/256K/46080K.
Maybe the DB size is too small.

I've double-checked that in condition [a] (shared_buffers=128MB) there
were a lot of FlushBuffer() invocations per second
(perf stat -e probe_postgres:FlushBuffer -I 1000), e.g.:

#        time counts unit events
  1.000485217 79,494 probe_postgres:FlushBuffer
  2.000861366 75,305 probe_postgres:FlushBuffer
  3.001263163 79,662 probe_postgres:FlushBuffer
  4.001650795 80,317 probe_postgres:FlushBuffer
  5.002033273 79,951 probe_postgres:FlushBuffer
  6.002418646 79,993 probe_postgres:FlushBuffer

while at 1GB shared_buffers it sits nearly at zero all the time. So
there is something like a 3x speed-up possible if StartupXLOG didn't
have to care too much about dirty buffers in the critical path (bgwriter
would, as you say?), at least in append-only scenarios. But ...
I've checked some real systems (even older versions of PostgreSQL not doing that much of replication, and indeed it's e.g. happening 8-12k/s FlushBuffer's() and shared buffers are pretty huge (>100GB, 0.5TB RAM) but they are *system-wide* numbers, work is really performed by bgwriter not by startup/recovering as in this redo-bench case when DB is open for reads and replicating-- so it appears that this isn't optimization for hot standby case , but for the DB-closed startup recovery/restart/disaster/PITR scenario). As for the 24GB shared_buffers scenario where dirty buffers are not at all a problem with given top profile (output trimmed), again as expected: 13.41% postgres postgres [.] hash_search_with_hash_value |--8.31%--BufTableLookup <- ReadBuffer_common <- ReadBufferWithoutRelcache --5.11%--smgropen |--2.77%--XLogReadBufferExtended --2.34%--ReadBufferWithoutRelcache 7.88% postgres postgres [.] MarkBufferDirty I've tried to get cache misses ratio via PMCs, apparently on EC2 they are (even on bigger) reporting as not-supported or zeros. However interestingly the workload has IPC of 1.40 (instruction bound) which to me is strange as I would expect BufTableLookup() to be actually heavy memory bound (?) Maybe I'll try on some different hardware one day. >>
Re: Handing off SLRU fsyncs to the checkpointer
On Thu, Aug 27, 2020 at 6:15 AM Alvaro Herrera wrote:
> >    --4.90%--smgropen
> >              |--2.86%--ReadBufferWithoutRelcache
>
> Looking at an earlier report of this problem I was thinking whether it'd
> make sense to replace SMgrRelationHash with a simplehash table; I have a
> half-written patch for that, but I haven't completed that work.
> However, in the older profile things were looking different, as
> hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
> of it. BufTableLookup was also there but only 1.51%. So I'm not so
> sure now that that'll pay off as clearly as I had hoped.

Right, my hypothesis requires an uncacheably large buffer mapping
table, and I think smgropen() needs a different explanation because
it's not expected to be as large or as random, at least not with a
pgbench workload. I think the reasons for a profile with a smgropen()
showing up so high, and in particular higher than BufTableLookup(),
must be:

1. We call smgropen() twice for every call to BufTableLookup(). Once
in XLogReadBufferExtended(), and then again in
ReadBufferWithoutRelcache().

2. We also call it for every block forced out of the buffer pool, and
in recovery that has to be done by the recovery loop.

3. We also call it for every block in the buffer pool during the
end-of-recovery checkpoint.

Not sure, but the last two might perform worse due to proximity to
interleaving pwrite() system calls (just a thought, not investigated).
In any case, I'm going to propose we move those things out of the
recovery loop, in a new thread.
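The double lookup in point 1 is the kind of cost a trivial memo could short-circuit. A one-entry cache sketch, purely illustrative (the types and lookup function are stand-ins, not PostgreSQL's smgr API):

```c
#include <assert.h>

/*
 * A trivial one-entry memo for a repeated lookup: the redo loop tends
 * to ask for the same relation over and over, and each smgropen()-like
 * call otherwise costs a shared hash probe.
 */
typedef struct RelKey
{
	unsigned	spc, db, rel;
} RelKey;

static RelKey cached_key;
static int	cached_handle = -1;
int			hash_probes;		/* counts simulated shared-hash probes */

static int
expensive_lookup(RelKey key)
{
	hash_probes++;				/* stands in for a hash table probe */
	return (int) (key.spc ^ key.db ^ key.rel);
}

int
rel_open_cached(RelKey key)
{
	if (cached_handle != -1 &&
		cached_key.spc == key.spc &&
		cached_key.db == key.db &&
		cached_key.rel == key.rel)
		return cached_handle;	/* hit: same relation as last time */
	cached_key = key;
	cached_handle = expensive_lookup(key);
	return cached_handle;
}
```

With a pgbench-like redo stream that touches the same relation repeatedly, almost every second call becomes a cache hit, which is roughly the saving available from collapsing the two smgropen() calls per block into one.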
Re: Handing off SLRU fsyncs to the checkpointer
On 2020-Aug-25, Jakub Wartak wrote:

> Turning on/off the defer SLRU patch and/or fsync doesn't seem to make
> any difference, so if anyone is curious the next sets of append-only
> bottlenecks is like below:
>
> 14.69% postgres postgres [.] hash_search_with_hash_value
> ---hash_search_with_hash_value
>    |--9.80%--BufTableLookup
>    |          ReadBuffer_common
>    |          ReadBufferWithoutRelcache
>    |          XLogReadBufferExtended
>    |          XLogReadBufferForRedoExtended
>    |          |--7.76%--btree_xlog_insert
>    |          |          btree_redo
>    |          |          StartupXLOG
>    |          --1.63%--heap_xlog_insert
>    --4.90%--smgropen
>              |--2.86%--ReadBufferWithoutRelcache

Looking at an earlier report of this problem I was thinking whether it'd
make sense to replace SMgrRelationHash with a simplehash table; I have a
half-written patch for that, but I haven't completed that work.
However, in the older profile things were looking different, as
hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
of it. BufTableLookup was also there but only 1.51%. So I'm not so
sure now that that'll pay off as clearly as I had hoped.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Handing off SLRU fsyncs to the checkpointer
On 2020-Aug-25, Andres Freund wrote:

> Hi,
>
> On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> > > --12.51%--compactify_tuples
> > >           PageRepairFragmentation
> > >           heap2_redo
> > >           StartupXLOG
> >
> > I wonder if there is something higher level that could be done to
> > reduce the amount of compaction work required in the first place, but
> > in the meantime I'm very happy if we can improve the situation so much
> > with such a microscopic improvement that might eventually benefit
> > other sorting stuff...
>
> Another approach could be to not perform any sorting during recovery,
> instead including enough information in the WAL record to avoid doing a
> full blown PageRepairFragmentation during recovery.

Hmm, including the sorted ItemId array in the WAL record might make
sense to alleviate the qsort work ...

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Handing off SLRU fsyncs to the checkpointer
Hi,

On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
> > --12.51%--compactify_tuples
> >           PageRepairFragmentation
> >           heap2_redo
> >           StartupXLOG
>
> I wonder if there is something higher level that could be done to
> reduce the amount of compaction work required in the first place, but
> in the meantime I'm very happy if we can improve the situation so much
> with such a microscopic improvement that might eventually benefit
> other sorting stuff...

Another approach could be to not perform any sorting during recovery,
instead including enough information in the WAL record to avoid doing a
full blown PageRepairFragmentation during recovery.

Greetings,

Andres Freund
Re: Handing off SLRU fsyncs to the checkpointer
On Tue, Aug 25, 2020 at 9:16 PM Jakub Wartak wrote: > I just wanted to help testing this patch (defer SLRU fsyncs during recovery) > and also faster compactify_tuples() patch [2] as both are related to the WAL > recovery performance in which I'm interested in. This is my first message to > this mailing group so please let me know also if I should adjust testing > style or formatting. Hi Jakub, Thanks very much for these results! > - Handing SLRU sync work over to the checkpointer: in my testing it > accelerates WAL recovery performance on slower / higer latency storage by ~20% Wow. Those fsyncs must have had fairly high latency (presumably due to queuing behind other write back activity). > - Faster sort in compactify_tuples(): in my testing it accelerates WAL > recovery performance for HOT updates also by ~20% Nice. > In the last final case the top profile is as follows related still to the > sorting but as I understand in much optimal way: > > 26.68% postgres postgres[.] qsort_itemoff > ---qsort_itemoff >|--14.17%--qsort_itemoff >| |--10.96%--compactify_tuples >| | PageRepairFragmentation >| | heap2_redo >| | StartupXLOG >| --3.21%--qsort_itemoff >| --3.10%--compactify_tuples >|PageRepairFragmentation >|heap2_redo >|StartupXLOG > --12.51%--compactify_tuples > PageRepairFragmentation > heap2_redo > StartupXLOG I wonder if there is something higher level that could be done to reduce the amount of compaction work required in the first place, but in the meantime I'm very happy if we can improve the situation so much with such a microscopic improvement that might eventually benefit other sorting stuff... > 8.38% postgres libc-2.17.so[.] __memmove_ssse3_back > ---__memmove_ssse3_back >compactify_tuples >PageRepairFragmentation >heap2_redo Hmm, I wonder if this bit could go teensy bit faster by moving as many adjacent tuples as you can in one go rather than moving them one at a time... 
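The "move adjacent tuples in one go" idea above can be sketched as follows (illustrative C only; plain (src, dst, len) moves over a byte buffer stand in for page item compaction, not PostgreSQL's page format, and the sketch assumes moves are ordered so no move overwrites the source of a later one, as in one-directional compaction):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One per-tuple move: copy len bytes from buf+src to buf+dst. */
typedef struct Move
{
	size_t		src, dst, len;
} Move;

/*
 * Instead of one memmove() per tuple, detect runs of moves that are
 * contiguous in both source and destination, and issue a single
 * memmove() per run.
 */
void
move_coalesced(char *buf, const Move *moves, int n)
{
	int			i = 0;

	while (i < n)
	{
		size_t		src = moves[i].src;
		size_t		dst = moves[i].dst;
		size_t		len = moves[i].len;
		int			j = i + 1;

		/* Grow the run while the next move abuts this one on both ends. */
		while (j < n &&
			   moves[j].src == src + len &&
			   moves[j].dst == dst + len)
			len += moves[j++].len;

		memmove(buf + dst, buf + src, len); /* one call for the whole run */
		i = j;
	}
}
```

Whether this wins depends on how often neighbouring tuples really end up adjacent after the sort; when they do, the per-call overhead and the memmove startup cost are paid once per run instead of once per tuple.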
> The append-only bottleneck appears to be limited by syscalls/s due to small > block size even with everything in FS cache (but not in shared buffers, > please compare with TEST1 as there was no such bottleneck at all): > > 29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string > ---copy_user_enhanced_fast_string >|--17.98%--copyin > [..] >| __pwrite_nocancel >| FileWrite >| mdwrite >| FlushBuffer >| ReadBuffer_common >| ReadBufferWithoutRelcache >| XLogReadBufferExtended >| XLogReadBufferForRedoExtended >| --17.57%--btree_xlog_insert To move these writes out of recovery's way, we should probably just run the bgwriter process during crash recovery. I'm going to look into that. The other thing is of course the checkpointer process, and our end-of-recovery checkpoint. I was going to suggest it should be optional and not done by the recovery process itself, which is why some earlier numbers I shared didn't include the end-of-recovery checkpoint, but then I realised it complicated the numbers for this little patch and, anyway, it'd be a good idea to open that can of worms separately... >| btree_redo >| StartupXLOG >| > --11.64%--copyout > [..] > __pread_nocancel >--11.44%--FileRead > mdread > ReadBuffer_common > ReadBufferWithoutRelcache > XLogReadBufferExtended > XLogReadBufferForRedoExtended For these reads, the solution should be WAL prefetching, but the patch I shared for that (and will be updating soon) is just one piece of the puzzle, and as it stands it actually *increases* the number of syscalls by adding some posix_fadvise() calls, so ... erm, for an all-in-kernel-cache-already workload like what you profiled there it can only make things worse on that front. But... when combined with Andres's work-in-progress AIO stuff, a whole bunch of reads can be submitted with a single system call ahead of time and then the results are delivered directly into
Re: Handing off SLRU fsyncs to the checkpointer
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro wrote:
> [patch]

Hi Thomas / hackers,

I just wanted to help test this patch (defer SLRU fsyncs during
recovery) and also the faster compactify_tuples() patch [2], as both are
related to WAL recovery performance, in which I'm interested. This is my
first message to this mailing group, so please let me know also if I
should adjust testing style or formatting.

With both of those patches applied:
make check -> Passes
make check-world -> Passes
make standbycheck (section "Testing Hot Standby" from docs) -> Passes
There wasn't a single segfault or postmaster crash during the tests.

Review of the patches themselves: I'm not qualified to review the
PostgreSQL internals. I've used the redo-bench scripts [1] by Thomas to
measure the performance effect (this approach simplifies testing and
excludes network jitter effects): the 1st column is redo start->end
timing, the 2nd is redo end -> end of checkpointing timing before
opening the DB for reads. I've conducted 2-3 separate tests that show
the benefits of those patches depending on the workload:

- Handing SLRU sync work over to the checkpointer: in my testing it
accelerates WAL recovery performance on slower / higher latency storage
by ~20%
- Faster sort in compactify_tuples(): in my testing it accelerates WAL
recovery performance for HOT updates, also by ~20%

TEST1: workload profile test as per standard TPC-B pgbench -c8 -j8, with
the majority of records in the WAL stream being Heap/HOT_UPDATE:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032

xvda, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch
72.271, 0.857
72.717, 0.748
72.705, 0.81

xvda, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
72.49, 0.103
74.069, 0.102
73.368, 0.102

NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch
70.312, 0.22
70.615, 0.201
69.739, 0.194

NVMe, same master with,
v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off 69.686, 0.101 70.601, 0.102 70.042, 0.101 As Thomas stated in the standard pgbench workload profile on recovery side there is compactify_tuples()->pg_qsort() overhead visible. So this is where the 2nd patch helps: NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch and compactify_tuples()->pg_qsort() performance improvement 58.85, 0.296 58.605, 0.269 58.603, 0.277 NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch and compactify_tuples()->pg_qsort() performance improvement, fsync=off 58.618, 0.103 57.693, 0.101 58.779, 0.111 In the last final case the top profile is as follows related still to the sorting but as I understand in much optimal way: 26.68% postgres postgres[.] qsort_itemoff ---qsort_itemoff |--14.17%--qsort_itemoff | |--10.96%--compactify_tuples | | PageRepairFragmentation | | heap2_redo | | StartupXLOG | --3.21%--qsort_itemoff | --3.10%--compactify_tuples |PageRepairFragmentation |heap2_redo |StartupXLOG --12.51%--compactify_tuples PageRepairFragmentation heap2_redo StartupXLOG 8.38% postgres libc-2.17.so[.] __memmove_ssse3_back ---__memmove_ssse3_back compactify_tuples PageRepairFragmentation heap2_redo StartupXLOG 6.51% postgres postgres[.] hash_search_with_hash_value ---hash_search_with_hash_value |--3.62%--smgropen | |--2.17%--XLogReadBufferExtended | | --1.76%--XLogReadBufferForRedoExtended | | --0.93%--heap_xlog_update | --1.45%--ReadBufferWithoutRelcache | XLogReadBufferExtended | --1.34%--XLogReadBufferForRedoExtended | --0.72%--heap_xlog_update --2.69%--BufTableLookup ReadBuffer_common ReadBufferWithoutRelcache XLogReadBufferExtended --2.48%--XLogReadBufferForRedoExtended |--1.34%--heap2_redo | StartupXLOG --0.83%--heap_xlog_update So to sum, HOT update-like workload profile tends to be CPU bound on single process recovery side. Even slow storage (like xvda) was not the
Re: Handing off SLRU fsyncs to the checkpointer
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro wrote:
> [patch] Bitrot, rebased, no changes.
>
> Yeah, the combined effect of these two patches is better than I
> expected. To be clear though, I was only measuring the time between
> the "redo starts at ..." and "redo done at ..." messages, since I've
> been staring at the main recovery code, but there are also some more
> fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
> that are unaffected. I think it's probably possible to do something
> about those too, but that's another topic.

... and of course the end-of-recovery checkpoint; in my tests this
wasn't materially changed since there isn't actually very much CLOG,
it's just that we avoided syncing it block at a time and getting
rescheduled.

FWIW I put a very simple test here:
https://github.com/macdice/redo-bench, YMMV.

From 32c1c16d2800467d1d179678b66d1042d07c Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH v2] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Collapse requests for the same file into a single system call as part
of the next checkpoint, as we do for relation files.

Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c      | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c           |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c        |  28 ++-
 src/include/access/clog.h              |   3 +
 src/include/access/commit_ts.h         |   3 +
 src/include/access/multixact.h         |   4 +
 src/include/access/slru.h              |  12 ++-
 src/include/storage/sync.h             |   7 +-
 13 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index dd2f4d5bc7..3eb33aea01 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+				  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
 				  CommitTsSLRULock, "pg_commit_ts",
-				  LWTRANCHE_COMMITTS_BUFFER);
+				  LWTRANCHE_COMMITTS_BUFFER,
+				  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 									 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
 				  "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
 				  MultiXactOffsetSLRULock, "pg_multixact/offsets",
-				  LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+				  LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+				  SYNC_HANDLER_MULTIXACT_OFFSET);
 	SimpleLruInit(MultiXactMemberCtl,
 				  "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
 				  MultiXactMemberSLRULock, "pg_multixact/members",
-				  LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+				  LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+				  SYNC_HANDLER_MULTIXACT_MEMBER);
 
 	/* Initialize our shared state struct */
 	MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@
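The moving parts of the patch are visible in the hunks above: each SLRU registers a sync-handler ID with SimpleLruInit(), and sync.c later routes each queued file tag back to the owning module's *syncfiletag() callback. A rough Python model of that dispatch follows — the names mirror the patch, but this is only an illustrative sketch, not PostgreSQL code:

```python
# Illustrative model only: each SLRU registers a handler ID, and the sync
# machinery dispatches queued file tags back to the owning module, the way
# sync.c routes requests to clogsyncfiletag(), committssyncfiletag(), etc.
SYNC_HANDLER_CLOG = 0
SYNC_HANDLER_COMMIT_TS = 1

def clog_sync(segno):
    # stand-in for clogsyncfiletag(): resolve the tag to a pg_xact segment
    return f"fsync pg_xact/{segno:04X}"

def commit_ts_sync(segno):
    # stand-in for committssyncfiletag()
    return f"fsync pg_commit_ts/{segno:04X}"

# analogous to a table keyed by the sync handler ID stored in each request
SYNC_HANDLERS = {
    SYNC_HANDLER_CLOG: clog_sync,
    SYNC_HANDLER_COMMIT_TS: commit_ts_sync,
}

def process_sync_request(handler_id, segno):
    return SYNC_HANDLERS[handler_id](segno)

print(process_sync_request(SYNC_HANDLER_CLOG, 0))  # fsync pg_xact/0000
```

The point of the indirection is that sync.c never needs to know about SLRU internals; it only forwards a FileTag to whichever handler enqueued it.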
Re: Handing off SLRU fsyncs to the checkpointer
On Sat, Aug 8, 2020 at 2:44 AM Robert Haas wrote:
> On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro wrote:
> > * Master is around 11% faster than last week before commit c5315f4f
> > "Cache smgrnblocks() results in recovery."
> > * This patch gives a similar speedup, bringing the total to around 25%
> > faster than last week (the time is ~20% less, the WAL processing speed
> > is ~1.25x).
>
> Dang, that's pretty nice, especially for the relatively small amount
> of code that it seems to require.

Yeah, the combined effect of these two patches is better than I
expected. To be clear though, I was only measuring the time between
the "redo starts at ..." and "redo done at ..." messages, since I've
been staring at the main recovery code, but there are also some more
fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
that are unaffected. I think it's probably possible to do something
about those too, but that's another topic.

I spotted a small problem: if the transaction IDs wrap all the way
around between checkpoints, then you might have cancelled requests for
a removed SLRU segment from the previous epoch, so we'd better
uncancel them if we see that. That's a one-line fix, done in the
attached. I also adjusted the commit message to be a little clearer
(this work deferment/collapsing scheme works in crash recovery too,
not just when there is a checkpointer process to hand the work to).

From 63235aca6b2525716f8f29caf07e0a0e6a26965e Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH] Defer flushing of SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Collapse requests for the same file into a single system call as part
of the next checkpoint, as we do for relation files.

Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c      | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c           |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c        |  28 ++-
 src/include/access/clog.h              |   3 +
 src/include/access/commit_ts.h         |   3 +
 src/include/access/multixact.h         |   4 +
 src/include/access/slru.h              |  12 ++-
 src/include/storage/sync.h             |   7 +-
 13 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..0c5b7a525e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+				  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 903280ae92..b4edbdb4e3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
 				  CommitTsSLRULock, "pg_commit_ts",
-				  LWTRANCHE_COMMITTS_BUFFER);
+				  LWTRANCHE_COMMITTS_BUFFER,
+				  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 									 sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 475f5ed861..27ae2edbdc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
 	SimpleLruInit(MultiXactOffsetCtl,
 				  "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
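The wraparound fix described in the message above can be modeled simply: truncation cancels any pending sync request for a segment file it removes, but if the xid space wraps all the way around and a new segment reuses the same file name, a fresh sync request has to revive (uncancel) the entry. A hypothetical Python sketch of that bookkeeping — not the actual sync.c data structures:

```python
# Hypothetical sketch of the cancel/uncancel bookkeeping, not sync.c itself.
class PendingSyncRequests:
    def __init__(self):
        self.entries = {}               # file name -> canceled flag

    def remember(self, path):
        # A new request for a previously canceled file must uncancel it:
        # after a full xid wraparound the same segment name is reused.
        self.entries[path] = False

    def forget(self, path):
        # Called when a segment is truncated away; mark the entry canceled
        # so an already-queued request is not fsync'd at the checkpoint.
        if path in self.entries:
            self.entries[path] = True

    def files_to_sync(self):
        return sorted(p for p, canceled in self.entries.items() if not canceled)

reqs = PendingSyncRequests()
reqs.remember("pg_xact/0000")
reqs.forget("pg_xact/0000")     # segment removed by truncation
reqs.remember("pg_xact/0000")   # same file name reappears after wraparound
print(reqs.files_to_sync())     # ['pg_xact/0000']
```

Without the uncancel step in remember(), the second request for "pg_xact/0000" would stay canceled and the reused segment would never be fsync'd — the one-line bug the message describes.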
Re: Handing off SLRU fsyncs to the checkpointer
On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro wrote:
> * Master is around 11% faster than last week before commit c5315f4f
> "Cache smgrnblocks() results in recovery."
> * This patch gives a similar speedup, bringing the total to around 25%
> faster than last week (the time is ~20% less, the WAL processing speed
> is ~1.25x).

Dang, that's pretty nice, especially for the relatively small amount
of code that it seems to require.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Handing off SLRU fsyncs to the checkpointer
On Tue, Aug 4, 2020 at 6:02 PM Thomas Munro wrote:
> ... speedup of around 6% ...

I did some better testing. OS: Linux, storage: consumer SSD. I
repeatedly ran crash recovery on 3.3GB worth of WAL generated with 8M
pgbench transactions. I tested 3 different builds 7 times each and
used "ministat" to compare the recovery times. It told me that:

* Master is around 11% faster than last week before commit c5315f4f
"Cache smgrnblocks() results in recovery."
* This patch gives a similar speedup, bringing the total to around 25%
faster than last week (the time is ~20% less, the WAL processing speed
is ~1.25x).

My test fit in RAM and was all cached. With the patch, the recovery
process used 100% of a single core the whole time and stayed on that
core and the variance is low, but in the other builds it hovered
around 90% and hopped around as it kept getting rescheduled and the
variance was higher.

Of course, SLRU fsyncs aren't the only I/O stalls in a real system;
among others, there are also reads from faulting in referenced pages
that don't have full page images in the WAL. I'm working on that
separately, but that's a tad more complicated than this stuff.

Added to commit fest.

=== ministat output showing recovery times in seconds ===

x patched.dat
+ master.dat
* lastweek.dat
+----------------------------------------------------------------------+
|                 *                                                     |
|x            +                                  *                      |
|x x xx    +  +  ++ ++                        *  *                      |
||AM|         |_AM_|                          |_A_M__|                  |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   7        38.655        39.406        39.218     39.134857    0.25188849
+   7        42.128        45.068        43.958     43.815286    0.91387758
Difference at 95.0% confidence
	4.68043 +/- 0.780722
	11.9597% +/- 1.99495%
	(Student's t, pooled s = 0.670306)
*   7        47.187        49.404        49.203     48.904286    0.76793483
Difference at 95.0% confidence
	9.76943 +/- 0.665613
	24.9635% +/- 1.70082%
	(Student's t, pooled s = 0.571477)
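For anyone wanting to sanity-check the ministat numbers: the "Difference at 95.0% confidence" lines are pooled two-sample Student's t intervals. A quick sketch reproducing the patched-vs-master row from the summary statistics above (the 2.1788 critical value for 12 degrees of freedom is hardcoded here rather than looked up from a table):

```python
import math

# Summary statistics from the ministat runs above: (n, mean, stddev).
n1, m1, s1 = 7, 39.134857, 0.25188849  # patched
n2, m2, s2 = 7, 43.815286, 0.91387758  # master

# Pooled standard deviation across the two samples.
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# 95% two-sided Student's t critical value for n1 + n2 - 2 = 12 dof.
t_crit = 2.1788

diff = m2 - m1
halfwidth = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
print(f"Difference {diff:.5f} +/- {halfwidth:.6f} "
      f"({100 * diff / m1:.4f}%), pooled s = {sp:.6f}")
```

This reproduces the reported 4.68043 +/- 0.780722 and pooled s = 0.670306 to within rounding of the critical value.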
Re: Handing off SLRU fsyncs to the checkpointer
On Wed, Feb 12, 2020 at 9:54 PM Thomas Munro wrote:
> In commit 3eb77eba we made it possible for any subsystem that wants a
> file to be flushed as part of the next checkpoint to ask the
> checkpointer to do that, as previously only md.c could do.

Hello,

While working on recovery performance, I found my way back to this
idea and rebased the patch.

Problem statement: Every time we have to write out a page of
pg_commit_ts, pg_multixact or pg_xact due to cache pressure, we
immediately call fsync(). This runs serially in the recovery process,
and it's quite bad for pg_commit_ts, because we need to dump out a
page for every ~800 transactions (track_commit_timestamp is not
enabled by default). If we ask the checkpointer to do it, it collapses
the 2048 fsync calls for each SLRU segment into one, and the kernel
can write out the data with larger I/Os, maybe even ahead of time, and
update the inode only once.

Experiment: Run crash recovery for 1 million pgbench transactions:

  postgres -D pgdata \
    -c synchronous_commit=off \
    -c track_commit_timestamp=on \
    -c max_wal_size=10GB \
    -c checkpoint_timeout=60min

  # in another shell
  pgbench -i -s10 postgres
  psql postgres -c checkpoint
  pgbench -t100 -Mprepared postgres
  killall -9 postgres

  # save the crashed pgdata dir for repeated experiments
  mv pgdata pgdata-save

  # now run experiments like this and see how long recovery takes
  rm -fr pgdata
  cp -r pgdata-save pgdata
  postgres -D pgdata

What I see on a system that has around 2.5ms latency for fsync:

  master:  16.83 seconds
  patched:  4.00 seconds

It's harder to see it without commit timestamps enabled since we only
need to flush a pg_xact page every 32k transactions (and multixacts
are more complicated to test), but you can still see the effect.

With 8x more transactions to make it clearer what's going on, I could
measure a speedup of around 6% from this patch, which I suppose scales
up fairly obviously with storage latency (every million transactions =
at least 30 fsync stalls, so you can multiply that by your fsync
latency and work out how much time your recovery process will be
asleep at the wheel instead of applying your records).

From a syscall overhead point of view, it's a bit unfortunate that we
open and close SLRU segments every time we write, but it's probably
not really enough to complain about... except for the (small) risk of
an inode dropping out of kernel caches in the time between closing it
and the checkpointer opening it. Hmm.

From 0fe8767316d5a973f43553080c3973759d83038d Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH v2] Ask the checkpointer to flush SLRU files.

Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Ask the checkpointer to collapse requests for the same file into a
single system call as part of the next checkpoint, as we do for
relation files.

Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  13 +++-
 src/backend/access/transam/commit_ts.c |  12 ++-
 src/backend/access/transam/multixact.c |  24 +-
 src/backend/access/transam/slru.c      | 101 +++--
 src/backend/access/transam/subtrans.c  |   4 +-
 src/backend/commands/async.c           |   5 +-
 src/backend/storage/lmgr/predicate.c   |   4 +-
 src/backend/storage/sync/sync.c        |  24 +-
 src/include/access/clog.h              |   3 +
 src/include/access/commit_ts.h         |   3 +
 src/include/access/multixact.h         |   4 +
 src/include/access/slru.h              |  12 ++-
 src/include/storage/sync.h             |   7 +-
 13 files changed, 172 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..0c5b7a525e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+				  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+				  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 903280ae92..b4edbdb4e3 100644
---
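The back-of-envelope in the message above can be made concrete: a pg_xact page packs 2 status bits per transaction, so one BLCKSZ page covers 8192 * 4 = 32768 transactions, and a million transactions force at least ~30 page write-outs — each a synchronous fsync stall without the patch. A small sketch of that arithmetic:

```python
BLCKSZ = 8192
CLOG_XACTS_PER_PAGE = BLCKSZ * 4      # 2 status bits per transaction

def min_fsync_stalls(n_xacts):
    # at least one page write-out (and, unpatched, one fsync) per full page
    return n_xacts // CLOG_XACTS_PER_PAGE

def time_asleep(n_xacts, fsync_latency_s):
    # lower bound on recovery time spent waiting on pg_xact fsyncs
    return min_fsync_stalls(n_xacts) * fsync_latency_s

print(min_fsync_stalls(1_000_000))    # 30 stalls per million transactions
print(time_asleep(1_000_000, 0.0025)) # seconds lost at ~2.5 ms per fsync
```

This is only a lower bound for pg_xact; pg_commit_ts pages fill ~40x faster (one page per ~800 transactions), which is why the effect is so much larger with commit timestamps enabled.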
Handing off SLRU fsyncs to the checkpointer
Hi,

In commit 3eb77eba we made it possible for any subsystem that wants a
file to be flushed as part of the next checkpoint to ask the
checkpointer to do that, as previously only md.c could do.

In the past, foreground CLOG flush stalls were a problem, but then
commit 33aaa139 cranked up the number of buffers, and then commit
5364b357 cranked it right up until the flushes mostly disappeared from
some benchmark workload, but not so high that the resulting linear
searches through the buffer array destroyed the gains.

I know there is interest in moving that stuff into regular shared
buffers, so it can be found via the buffer mapping system (and improve
as that improves), written back by the background writer (and improve
as that improves), managed with a proper replacement algorithm (and
improve as that improves), etc etc. That sounds like a great idea to
me, but it's a big project. In the meantime, one thing we could do is
hand off the fsyncs, but I'm not sure if it's still considered a real
problem in the field given the new parameters.

Anyway, I had a patch for that, which I used while testing commit
3eb77eba. While reading old threads about SLRU today I found that
several people had wished for a thing exactly like that, so I dusted
it off and rebased it.

0001-Use-the-checkpointer-to-fsync-SLRU-files.patch
Description: Binary data
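The handoff described above boils down to: instead of calling fsync() itself after each page write, the writer registers the dirty segment file with the checkpointer, duplicate registrations collapse, and the next checkpoint issues one fsync per distinct file. A minimal Python sketch of that idea — it assumes nothing about the real sync.c request queue beyond what is described here:

```python
import os

class CheckpointerSyncQueue:
    """Collapse per-page fsync requests into one fsync per file at checkpoint."""

    def __init__(self):
        self.pending = set()              # duplicate requests collapse here

    def register_sync_request(self, path):
        # Called where the writer used to call fsync() directly; cheap, and
        # registering the same segment file many times adds just one entry.
        self.pending.add(path)

    def checkpoint(self):
        # One open/fsync/close per distinct file, instead of one per write.
        synced = 0
        for path in sorted(self.pending):
            fd = os.open(path, os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            synced += 1
        self.pending.clear()
        return synced
```

The win is exactly the collapsing: thousands of page write-outs to the same SLRU segment between two checkpoints cost one fsync at the checkpoint rather than one apiece, and the kernel is free to write the dirty data back in larger I/Os in the meantime.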