Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Apr-03, Dilip Kumar wrote:

> Yeah, we missed acquiring the bank lock w.r.t. intervening pages,
> thanks for reporting.  Your fix looks correct to me.

Thanks for the quick review!  And thanks to Alexander for the report.
Pushed the fix.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera wrote:
>
> Hello,
>
> On 2024-Apr-03, Alexander Lakhin wrote:
>
> > I've managed to trigger an assert added by 53c2a97a9.
> > Please try the following script against a server compiled with
> > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> > define, it just simplifies reproducing...):
>
> Ah yes, absolutely, we're missing to trade the correct SLRU bank lock
> there.  This rewrite of that small piece should fix it.  Thanks for
> reporting this.

Yeah, we missed acquiring the bank lock w.r.t. intervening pages,
thanks for reporting.  Your fix looks correct to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Hello,

On 2024-Apr-03, Alexander Lakhin wrote:

> I've managed to trigger an assert added by 53c2a97a9.
> Please try the following script against a server compiled with
> -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> define, it just simplifies reproducing...):

Ah yes, absolutely, we're missing to trade the correct SLRU bank lock
there.  This rewrite of that small piece should fix it.  Thanks for
reporting this.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)

From 44c39cf4bf258fb0b65aa1acc5f84e5d7f729eb1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 3 Apr 2024 16:00:24 +0200
Subject: [PATCH] Fix zeroing of pg_serial page without SLRU bank lock

---
 src/backend/storage/lmgr/predicate.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 3f378c0099..d5bbfbd4c6 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -137,7 +137,7 @@
  *	SerialControlLock
  *		- Protects SerialControlData members
  *
- *	SerialSLRULock
+ *	SLRU per-bank locks
  *		- Protects SerialSlruCtl
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
@@ -908,20 +908,25 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
 	if (isNewPage)
 		serialControl->headPage = targetPage;
 
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
 	if (isNewPage)
 	{
-		/* Initialize intervening pages. */
-		while (firstZeroPage != targetPage)
+		/* Initialize intervening pages; might involve trading locks */
+		for (;;)
 		{
-			(void) SimpleLruZeroPage(SerialSlruCtl, firstZeroPage);
+			lock = SimpleLruGetBankLock(SerialSlruCtl, firstZeroPage);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			slotno = SimpleLruZeroPage(SerialSlruCtl, firstZeroPage);
+			if (firstZeroPage == targetPage)
+				break;
 			firstZeroPage = SerialNextPage(firstZeroPage);
+			LWLockRelease(lock);
 		}
-		slotno = SimpleLruZeroPage(SerialSlruCtl, targetPage);
 	}
 	else
+	{
+		LWLockAcquire(lock, LW_EXCLUSIVE);
 		slotno = SimpleLruReadPage(SerialSlruCtl, targetPage, true, xid);
+	}
 
 	SerialValue(slotno, xid) = minConflictCommitSeqNo;
 	SerialSlruCtl->shared->page_dirty[slotno] = true;
-- 
2.39.2
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Hello Alvaro,

27.02.2024 20:33, Alvaro Herrera wrote:
> Here's the complete set, with these two names using the singular.

I've managed to trigger an assert added by 53c2a97a9.
Please try the following script against a server compiled with
-DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
define, it just simplifies reproducing...):

# initdb & start ...
createdb test
echo "
SELECT pg_current_xact_id() AS tx \gset
SELECT format('CREATE TABLE t%s(i int)', g) FROM generate_series(1, 1022 - :tx) g \gexec
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT pg_current_xact_id();
SELECT pg_sleep(5);
" | psql test &

echo "
SELECT pg_sleep(1);
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 1 INTO a;
COMMIT;
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 2 INTO b;
" | psql test

It fails for me with the following stack trace:
TRAP: failed Assert("LWLockHeldByMeInMode(SimpleLruGetBankLock(ctl, pageno), LW_EXCLUSIVE)"), File: "slru.c", Line: 366, PID: 21711
ExceptionalCondition at assert.c:52:13
SimpleLruZeroPage at slru.c:369:11
SerialAdd at predicate.c:921:20
SummarizeOldestCommittedSxact at predicate.c:1521:2
GetSerializableTransactionSnapshotInt at predicate.c:1787:16
GetSerializableTransactionSnapshot at predicate.c:1691:1
GetTransactionSnapshot at snapmgr.c:264:21
exec_simple_query at postgres.c:1162:4
...

Best regards,
Alexander
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Mar-04, Tom Lane wrote:

> In hopes of noticing whether there are other similar thinkos,
> I permuted the order of the SlruPageStatus enum values, and
> now I get the expected warnings from gcc:

Thanks for checking!  I pushed the fixes.

Maybe we should assign a nonzero value (= 1) to the first element of
enums, to avoid this kind of mistake.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
I wrote:
> It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately)
> the numeric value of zero, so I guess the majority of our BF
> animals are understanding this as "address != NULL".  But that
> doesn't look like a useful test to be making.

In hopes of noticing whether there are other similar thinkos,
I permuted the order of the SlruPageStatus enum values, and
now I get the expected warnings from gcc:

In file included from ../../../../src/include/postgres.h:45,
                 from slru.c:59:
slru.c: In function ‘SimpleLruWaitIO’:
slru.c:436:38: warning: comparison between pointer and integer
  Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
                                      ^~
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
  if (!(condition)) \
        ^
slru.c: In function ‘SimpleLruWritePage’:
slru.c:717:43: warning: comparison between pointer and integer
  Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);
                                           ^~
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
  if (!(condition)) \
        ^

So it looks like it's just these two places.

regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Alvaro Herrera writes:
> Pushed that way, but we can discuss further wording improvements/changes
> if someone wants to propose any.

I just noticed that drongo is complaining about two lines added by
commit 53c2a97a9:

drongo | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(436): warning C4047: '!=': 'SlruPageStatus *' differs in levels of indirection from 'int'
drongo | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(717): warning C4047: '!=': 'SlruPageStatus *' differs in levels of indirection from 'int'

These lines are

  Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
  Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);

These are comparing the address of something with an enum value, which
surely cannot be sane.  It looks like SLRU_PAGE_EMPTY has (by chance,
or deliberately) the numeric value of zero, so I guess the majority of
our BF animals are understanding this as "address != NULL".  But that
doesn't look like a useful test to be making.

regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-29, Alvaro Herrera wrote:

> On 2024-Feb-29, Kyotaro Horiguchi wrote:
>
> > Some of them, commit_timestamp_buffers, transaction_buffers,
> > subtransaction_buffers use 0 to mean auto-tuning based on
> > shared-buffer size.  I think it's worth adding an extra_desc such as "0
> > to automatically determine this value based on the shared buffer
> > size".
>
> How about this?

Pushed that way, but we can discuss further wording improvements/changes
if someone wants to propose any.

-- 
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-29, Kyotaro Horiguchi wrote:

> At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera wrote in
> > Here's the complete set, with these two names using the singular.
>
> The commit by the second patch added several GUC descriptions:
>
> > Sets the size of the dedicated buffer pool used for the commit timestamp
> > cache.
>
> Some of them, commit_timestamp_buffers, transaction_buffers,
> subtransaction_buffers use 0 to mean auto-tuning based on
> shared-buffer size.  I think it's worth adding an extra_desc such as "0
> to automatically determine this value based on the shared buffer
> size".

How about this?

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"

From d0d7216eb4e2e2e9e71aa849cf90c218bbe2b164 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 29 Feb 2024 11:45:31 +0100
Subject: [PATCH] extra_desc

---
 src/backend/utils/misc/guc_tables.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 93ded31ed9..543a87c659 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2287,7 +2287,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"commit_timestamp_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
 		&commit_timestamp_buffers,
@@ -2342,7 +2342,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"subtransaction_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the sub-transaction cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
 		&subtransaction_buffers,
@@ -2353,7 +2353,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"transaction_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the transaction status cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
 		&transaction_buffers,
@@ -2868,7 +2868,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
-			NULL,
+			gettext_noop("Specify -1 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_XBLOCKS
 		},
 		&XLOGbuffers,
-- 
2.39.2
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera wrote in
> Here's the complete set, with these two names using the singular.

The commit by the second patch added several GUC descriptions:

> Sets the size of the dedicated buffer pool used for the commit timestamp
> cache.

Some of them, commit_timestamp_buffers, transaction_buffers,
subtransaction_buffers use 0 to mean auto-tuning based on
shared-buffer size.  I think it's worth adding an extra_desc such as "0
to automatically determine this value based on the shared buffer
size".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 27, 2024 at 11:41 PM Alvaro Herrera wrote:
> On 2024-Feb-27, Alvaro Herrera wrote:
>
> > Here's the complete set, with these two names using the singular.
>
> BTW one thing I had not noticed is that before this patch we have
> minimum shmem size that's lower than the lowest you can go with the new
> code.
>
> This means Postgres may no longer start under extremely tight memory
> restrictions (and of course it uses more memory even when idle or with
> small databases).  I wonder to what extent we should make an effort to
> relax that.  For small, largely inactive servers, this is just memory
> we use for no good reason.  However, anything we do here will impact
> performance on the high end, because as Andrey says this will add
> calculations and jumps where there are none today.

I was just comparing the minimum memory required for SLRU when the
system is minimally configured; correct me if I am wrong.

SLRU                        unpatched   patched
commit_timestamp_buffers        4         16
subtransaction_buffers         32         16
transaction_buffers             4         16
multixact_offset_buffers        8         16
multixact_member_buffers       16         16
notify_buffers                  8         16
serializable_buffers           16         16
---------------------------------------------
total buffers                  88        112

So that is < 200kB of extra memory on a minimally configured system;
IMHO this should not matter.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-27, Alvaro Herrera wrote:

> Here's the complete set, with these two names using the singular.

BTW one thing I had not noticed is that before this patch we have
minimum shmem size that's lower than the lowest you can go with the new
code.

This means Postgres may no longer start under extremely tight memory
restrictions (and of course it uses more memory even when idle or with
small databases).  I wonder to what extent we should make an effort to
relax that.  For small, largely inactive servers, this is just memory
we use for no good reason.  However, anything we do here will impact
performance on the high end, because as Andrey says this will add
calculations and jumps where there are none today.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"We've narrowed the problem down to the customer's pants being in a
situation of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 27 Feb 2024, at 22:33, Alvaro Herrera wrote:
>

These patches look amazing!

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Here's the complete set, with these two names using the singular. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso" >From 225b2403f7bb9990656d18c8339c452fcd6822c5 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 27 Feb 2024 16:56:00 +0100 Subject: [PATCH v21 1/2] Rename SLRU elements in pg_stat_slru The new names are intended to match an upcoming patch that adds a few GUCs to configure the SLRU buffer sizes. Discussion: https://postgr.es/m/202402261616.dlriae7b6emv@alvherre.pgsql --- doc/src/sgml/monitoring.sgml| 14 src/backend/access/transam/clog.c | 2 +- src/backend/access/transam/commit_ts.c | 2 +- src/backend/access/transam/multixact.c | 4 +-- src/backend/access/transam/subtrans.c | 2 +- src/backend/commands/async.c| 2 +- src/backend/storage/lmgr/predicate.c| 2 +- src/include/utils/pgstat_internal.h | 14 src/test/isolation/expected/stats.out | 44 - src/test/isolation/expected/stats_1.out | 44 - src/test/isolation/specs/stats.spec | 4 +-- src/test/regress/expected/stats.out | 14 src/test/regress/sql/stats.sql | 14 13 files changed, 81 insertions(+), 81 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 5cf9363ac8..9d73d8c1bb 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4853,13 +4853,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage NULL or is not specified, all the counters shown in the pg_stat_slru view for all SLRU caches are reset. The argument can be one of -CommitTs, -MultiXactMember, -MultiXactOffset, -Notify, -Serial, -Subtrans, or -Xact +commit_timestamp, +multixact_member, +multixact_offset, +notify, +serializable, +subtransaction, or +transaction to reset the counters for only that entry. 
If the argument is other (or indeed, any unrecognized name), then the counters for all other SLRU caches, such diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 97f7434da3..34f079cbb1 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -706,7 +706,7 @@ void CLOGShmemInit(void) { XactCtl->PagePrecedes = CLOGPagePrecedes; - SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, + SimpleLruInit(XactCtl, "transaction", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER, SYNC_HANDLER_CLOG, false); SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE); diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 6bfe60343e..d965db89c7 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -529,7 +529,7 @@ CommitTsShmemInit(void) bool found; CommitTsCtl->PagePrecedes = CommitTsPagePrecedes; - SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0, + SimpleLruInit(CommitTsCtl, "commit_timestamp", CommitTsShmemBuffers(), 0, CommitTsSLRULock, "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER, SYNC_HANDLER_COMMIT_TS, diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index febc429f72..64040d330e 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1851,14 +1851,14 @@ MultiXactShmemInit(void) MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes; SimpleLruInit(MultiXactOffsetCtl, - "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0, + "multixact_offset", NUM_MULTIXACTOFFSET_BUFFERS, 0, MultiXactOffsetSLRULock, "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER, SYNC_HANDLER_MULTIXACT_OFFSET, false); SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE); SimpleLruInit(MultiXactMemberCtl, - "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0, + 
"multixact_member", NUM_MULTIXACTMEMBER_BUFFERS, 0, MultiXactMemberSLRULock, "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER, SYNC_HANDLER_MULTIXACT_MEMBER, diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c index b2ed82ac56..6aa47af43e 100644 --- a/src/backend/access/transam/subtrans.c +++ b/src/backend/access/transam/subtrans.c @@ -200,7 +200,7 @@ void SUBTRANSShmemInit(void) { SubTransCtl->PagePrecedes = SubTransPagePrecedes; - SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0, + SimpleLruInit(SubTransCtl, "subtransaction", NUM_SUBTRANS_BUFFERS, 0, SubtransSLRULock, "pg_subtrans",
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-27, Andrey M. Borodin wrote:

> Sorry for the late reply, I have one nit.  Are you sure that
> multixact_members and multixact_offsets are plural, while transaction
> and commit_timestamp are singular?
> Maybe multixact_members and multixact_offset?  Because there are many
> members and one offset for a given multixact?  Users certainly do not
> care, though...

I asked myself the same question actually, and thought about putting
them both in the singular.  I only backed off because I noticed that the
directories themselves are in plural (an old mistake of mine,
evidently).  Maybe we should follow that instinct and use the singular
for these.

If we do that, we can rename the directories to also appear in singular
when/if the patch to add standard page headers to the SLRUs lands --
which is going to need code to rewrite the files during pg_upgrade
anyway, so the rename is not going to be a big deal.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 27 Feb 2024, at 21:03, Alvaro Herrera wrote:
>
> On 2024-Feb-27, Dilip Kumar wrote:
>
>>> static const char *const slru_names[] = {
>>> 	"commit_timestamp",
>>> 	"multixact_members",
>>> 	"multixact_offsets",
>>> 	"notify",
>>> 	"serializable",
>>> 	"subtransaction",
>>> 	"transaction",
>>> 	"other"					/* has to be last */
>>> };
>
> Here's a patch for the renaming part.

Sorry for the late reply, I have one nit.  Are you sure that
multixact_members and multixact_offsets are plural, while transaction
and commit_timestamp are singular?
Maybe multixact_members and multixact_offset?  Because there are many
members and one offset for a given multixact?  Users certainly do not
care, though...

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-27, Dilip Kumar wrote: > > static const char *const slru_names[] = { > > "commit_timestamp", > > "multixact_members", > > "multixact_offsets", > > "notify", > > "serializable", > > "subtransaction", > > "transaction", > > "other" /* has to be last > > */ > > }; Here's a patch for the renaming part. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No nos atrevemos a muchas cosas porque son difíciles, pero son difíciles porque no nos atrevemos a hacerlas" (Séneca) >From 91741984cbd77f88e39b6fac8e8c7dc622d2ccf4 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 27 Feb 2024 16:56:00 +0100 Subject: [PATCH] Rename SLRU elements in pg_stat_slru The new names are intended to match an upcoming patch that adds a few GUCs to configure the SLRU buffer sizes. Discussion: https://postgr.es/m/202402261616.dlriae7b6emv@alvherre.pgsql --- doc/src/sgml/monitoring.sgml| 14 src/backend/access/transam/clog.c | 2 +- src/backend/access/transam/commit_ts.c | 2 +- src/backend/access/transam/multixact.c | 4 +-- src/backend/access/transam/subtrans.c | 2 +- src/backend/commands/async.c| 2 +- src/backend/storage/lmgr/predicate.c| 2 +- src/include/utils/pgstat_internal.h | 14 src/test/isolation/expected/stats.out | 44 - src/test/isolation/expected/stats_1.out | 44 - src/test/isolation/specs/stats.spec | 4 +-- src/test/regress/expected/stats.out | 14 src/test/regress/sql/stats.sql | 14 13 files changed, 81 insertions(+), 81 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 5cf9363ac8..7d92e68572 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4853,13 +4853,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage NULL or is not specified, all the counters shown in the pg_stat_slru view for all SLRU caches are reset. 
The argument can be one of -CommitTs, -MultiXactMember, -MultiXactOffset, -Notify, -Serial, -Subtrans, or -Xact +commit_timestamp, +multixact_members, +multixact_offsets, +notify, +serializable, +subtransaction, or +transaction to reset the counters for only that entry. If the argument is other (or indeed, any unrecognized name), then the counters for all other SLRU caches, such diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 97f7434da3..34f079cbb1 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -706,7 +706,7 @@ void CLOGShmemInit(void) { XactCtl->PagePrecedes = CLOGPagePrecedes; - SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, + SimpleLruInit(XactCtl, "transaction", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER, SYNC_HANDLER_CLOG, false); SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE); diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 6bfe60343e..d965db89c7 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -529,7 +529,7 @@ CommitTsShmemInit(void) bool found; CommitTsCtl->PagePrecedes = CommitTsPagePrecedes; - SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0, + SimpleLruInit(CommitTsCtl, "commit_timestamp", CommitTsShmemBuffers(), 0, CommitTsSLRULock, "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER, SYNC_HANDLER_COMMIT_TS, diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index febc429f72..f8bb83927c 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1851,14 +1851,14 @@ MultiXactShmemInit(void) MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes; SimpleLruInit(MultiXactOffsetCtl, - "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0, + "multixact_offsets", NUM_MULTIXACTOFFSET_BUFFERS, 0, 
MultiXactOffsetSLRULock, "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER, SYNC_HANDLER_MULTIXACT_OFFSET, false); SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE); SimpleLruInit(MultiXactMemberCtl, - "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0, + "multixact_members", NUM_MULTIXACTMEMBER_BUFFERS, 0, MultiXactMemberSLRULock, "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER, SYNC_HANDLER_MULTIXACT_MEMBER, diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c index b2ed82ac56..6aa47af43e
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Feb 26, 2024 at 9:46 PM Alvaro Herrera wrote:
> On 2024-Feb-23, Dilip Kumar wrote:
>
> > +
> > + For each SLRU area that's part of the core server,
> > + there is a configuration parameter that controls its size, with the suffix
> > + _buffers appended.  For historical
> > + reasons, the names are not exact matches, but Xact
> > + corresponds to transaction_buffers and the rest should
> > + be obvious.
> > +
> > +
>
> I think I would like to suggest renaming the GUCs to have the _slru_ bit
> in the middle:
>
> +# - SLRU Buffers (change requires restart) -
> +
> +#commit_timestamp_slru_buffers = 0      # memory for pg_commit_ts (0 = auto)
> +#multixact_offsets_slru_buffers = 16    # memory for pg_multixact/offsets
> +#multixact_members_slru_buffers = 32    # memory for pg_multixact/members
> +#notify_slru_buffers = 16               # memory for pg_notify
> +#serializable_slru_buffers = 32         # memory for pg_serial
> +#subtransaction_slru_buffers = 0        # memory for pg_subtrans (0 = auto)
> +#transaction_slru_buffers = 0           # memory for pg_xact (0 = auto)
>
> and the pgstat_internal.h table:
>
> static const char *const slru_names[] = {
> 	"commit_timestamp",
> 	"multixact_members",
> 	"multixact_offsets",
> 	"notify",
> 	"serializable",
> 	"subtransaction",
> 	"transaction",
> 	"other"					/* has to be last */
> };
>
> This way they match perfectly.

Yeah, I think this looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-23, Dilip Kumar wrote: > 1. > + * If no process is already in the list, we're the leader; our first step > + * is to "close out the group" by resetting the list pointer from > + * ProcGlobal->clogGroupFirst (this lets other processes set up other > + * groups later); then we lock the SLRU bank corresponding to our group's > + * page, do the SLRU updates, release the SLRU bank lock, and wake up the > + * sleeping processes. > > I think here we are saying that we "close out the group" before > acquiring the SLRU lock but that's not true. We keep the group open > until we gets the lock so that we can get maximum members in while we > are anyway waiting for the lock. Absolutely right. Reworded that. > 2. > static void > TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts, > RepOriginId nodeid, int slotno) > { > - Assert(TransactionIdIsNormal(xid)); > + if (!TransactionIdIsNormal(xid)) > + return; > + > + entryno = TransactionIdToCTsEntry(xid); > > I do not understand why we need this change. Ah yeah, I was bothered by the fact that if you pass Xid values earlier than NormalXid to this function, we'd reply with some nonsensical values instead of throwing an error. But you're right that it doesn't belong in this patch, so I removed that. Here's a version with these fixes, where I also added some text to the pg_stat_slru documentation: + + For each SLRU area that's part of the core server, + there is a configuration parameter that controls its size, with the suffix + _buffers appended. For historical + reasons, the names are not exact matches, but Xact + corresponds to transaction_buffers and the rest should + be obvious. 
+ + I think I would like to suggest renaming the GUCs to have the _slru_ bit in the middle: +# - SLRU Buffers (change requires restart) - + +#commit_timestamp_slru_buffers = 0 # memory for pg_commit_ts (0 = auto) +#multixact_offsets_slru_buffers = 16# memory for pg_multixact/offsets +#multixact_members_slru_buffers = 32# memory for pg_multixact/members +#notify_slru_buffers = 16 # memory for pg_notify +#serializable_slru_buffers = 32 # memory for pg_serial +#subtransaction_slru_buffers = 0# memory for pg_subtrans (0 = auto) +#transaction_slru_buffers = 0 # memory for pg_xact (0 = auto) and the pgstat_internal.h table: static const char *const slru_names[] = { "commit_timestamp", "multixact_members", "multixact_offsets", "notify", "serializable", "subtransaction", "transaction", "other" /* has to be last */ }; This way they match perfectly. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings) >From 4dc139e70feb5e43bbe2689cfb044ef0957761b3 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 22 Feb 2024 18:42:56 +0100 Subject: [PATCH v20] Make SLRU buffer sizes configurable Also, divide the slot array in banks, so that the LRU algorithm can be made more scalable. Also remove the centralized control lock for even better scalability. 
Authors: Dilip Kumar, Andrey Borodin --- doc/src/sgml/config.sgml | 139 +++ doc/src/sgml/monitoring.sgml | 14 +- src/backend/access/transam/clog.c | 236 src/backend/access/transam/commit_ts.c| 81 ++-- src/backend/access/transam/multixact.c| 190 +++--- src/backend/access/transam/slru.c | 357 +- src/backend/access/transam/subtrans.c | 103 - src/backend/commands/async.c | 61 ++- src/backend/storage/lmgr/lwlock.c | 9 +- src/backend/storage/lmgr/lwlocknames.txt | 14 +- src/backend/storage/lmgr/predicate.c | 34 +- .../utils/activity/wait_event_names.txt | 15 +- src/backend/utils/init/globals.c | 9 + src/backend/utils/misc/guc_tables.c | 78 src/backend/utils/misc/postgresql.conf.sample | 9 + src/include/access/clog.h | 1 - src/include/access/commit_ts.h| 1 - src/include/access/multixact.h| 4 - src/include/access/slru.h | 86 +++-- src/include/access/subtrans.h | 3 - src/include/commands/async.h | 5 - src/include/miscadmin.h | 8 + src/include/storage/lwlock.h | 7 + src/include/storage/predicate.h | 4 - src/include/utils/guc_hooks.h | 11 + src/test/modules/test_slru/test_slru.c| 35 +- 26 files changed, 1161 insertions(+), 353 deletions(-) diff --git a/doc/src/sgml/config.sgml
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-23, Andrey M. Borodin wrote:

> I'm sure anyone with multiple CPUs should increase, not decrease
> previous default of 128 buffers (with 512MB shared buffers).  Having
> more CPUs (the only way to benefit from more locks) implies bigger
> transaction buffers.

Sure.

> IMO making bank size variable adds unneeded computation overhead, bank
> search loops should be unrollable by compiler etc.

Makes sense.

> Originally there was a patch set step, that packed bank's page
> addresses together in one array.  It was done to make bank search a
> SIMD instruction.

Ants Aasma had proposed a rework of the LRU code for better
performance.  He told me it depended on bank size being 16, so you're
right that it's probably not a good idea to make it variable.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar wrote: > > On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera > wrote: > > > > On 2024-Feb-07, Dilip Kumar wrote: > > > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > > wrote: > > > > > > Sure, but is that really what we want? > > > > > > So your question is do we want these buffers to be in multiple of > > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > > don't think it should create any problem logically. I mean we can > > > look again in the patch to see if we have made any such assumptions > > > but that should be fairly easy to fix, then maybe if we are going in > > > this way we should get rid of the check_slru_buffers() function as > > > well. > > > > Not really, I just don't think the macro should be in slru.h. > > Okay > > > Another thing I've been thinking is that perhaps it would be useful to > > make the banks smaller, when the total number of buffers is small. For > > example, if you have 16 or 32 buffers, it's not really clear to me that > > it makes sense to have just 1 bank or 2 banks. It might be more > > sensible to have 4 banks with 4 or 8 buffers instead. That should make > > the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. 
> > > > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones. I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. > > Thanks, I will review those changes. A few other things I noticed while reading through the patch; I haven't read it completely yet, but this is what I have so far. 1. + * If no process is already in the list, we're the leader; our first step + * is to "close out the group" by resetting the list pointer from + * ProcGlobal->clogGroupFirst (this lets other processes set up other + * groups later); then we lock the SLRU bank corresponding to our group's + * page, do the SLRU updates, release the SLRU bank lock, and wake up the + * sleeping processes. I think here we are saying that we "close out the group" before acquiring the SLRU lock, but that's not true. We keep the group open until we get the lock, so that we can gather as many members as possible while we are anyway waiting for the lock. 2. static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts, RepOriginId nodeid, int slotno) { - Assert(TransactionIdIsNormal(xid)); + if (!TransactionIdIsNormal(xid)) + return; + + entryno = TransactionIdToCTsEntry(xid); I do not understand why we need this change. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 23 Feb 2024, at 12:36, Dilip Kumar wrote: > >> Another thing I've been thinking is that perhaps it would be useful to >> make the banks smaller, when the total number of buffers is small. For >> example, if you have 16 or 32 buffers, it's not really clear to me that >> it makes sense to have just 1 bank or 2 banks. It might be more >> sensible to have 4 banks with 4 or 8 buffers instead. That should make >> the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. I'm sure anyone with multiple CPUs should increase, not decrease, the previous default of 128 buffers (with 512MB shared buffers). Having more CPUs (the only way to benefit from more locks) implies bigger transaction buffers. IMO making the bank size variable adds unneeded computation overhead; bank search loops should be unrollable by the compiler, etc. Originally there was a patch set step that packed a bank's page addresses together in one array. It was done to make the bank search a SIMD instruction. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > wrote: > > > > Sure, but is that really what we want? > > > > So your question is do we want these buffers to be in multiple of > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > don't think it should create any problem logically. I mean we can > > look again in the patch to see if we have made any such assumptions > > but that should be fairly easy to fix, then maybe if we are going in > > this way we should get rid of the check_slru_buffers() function as > > well. > > Not really, I just don't think the macro should be in slru.h. Okay > Another thing I've been thinking is that perhaps it would be useful to > make the banks smaller, when the total number of buffers is small. For > example, if you have 16 or 32 buffers, it's not really clear to me that > it makes sense to have just 1 bank or 2 banks. It might be more > sensible to have 4 banks with 4 or 8 buffers instead. That should make > the algorithm scale down as well as up ... It might be helpful to have small-size banks when SLRU buffers are set to a very low value and we are only accessing a couple of pages at a time (i.e. no buffer replacement), because in such cases most of the contention will be on the SLRU bank lock. Although I am not sure how practical such a use case would be; I mean, if someone is using multi-xact very heavily or creating frequent subtransaction overflow, then wouldn't they set this buffer limit to some big enough value? By doing this we would lose some simplicity of the patch; I mean, instead of using the simple macro SLRU_BANK_SIZE we would need to compute this and store it in SlruShared. Maybe that's not that bad. > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones.
I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. Thanks, I will review those changes. > I did add a bunch of Assert()s to make sure the locks that are supposed > to be held are actually held. This led me to testing the page status to > be not EMPTY during SimpleLruWriteAll() before calling > SlruInternalWritePage(), because the assert was firing. The previous > code is not really *buggy*, but to me it's weird to call WritePage() on > a slot with no contents. Okay, I mean internally SlruInternalWritePage() will flush only if the status is SLRU_PAGE_VALID, but it is better the way you have done it. > Another change was in TransactionGroupUpdateXidStatus: the original code > had the leader doing pg_atomic_read_u32(&ProcGlobal->clogGroupFirst) to > know which bank to lock. I changed it to simply be the page used by the > leader process; this doesn't need an atomic read, and should be the same > page anyway. (If it isn't, it's no big deal). But what's more: even if > we do read ->clogGroupFirst at that point, there's no guarantee that > this is going to be exactly for the same process that ends up being the > first in the list, because since we have not set it to INVALID by the > time we grab the bank lock, it is quite possible for more processes to > add themselves to the list. Yeah, this looks better. > I realized all this while rewriting the comments in a way that would let > me understand what was going on ... so IMO the effort was worthwhile. +1 I will review and do some more testing early next week and share my feedback. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-07, Dilip Kumar wrote: > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera wrote: > > Sure, but is that really what we want? > > So your question is do we want these buffers to be in multiple of > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > don't think it should create any problem logically. I mean we can > look again in the patch to see if we have made any such assumptions > but that should be fairly easy to fix, then maybe if we are going in > this way we should get rid of the check_slru_buffers() function as > well. Not really, I just don't think the macro should be in slru.h. Another thing I've been thinking is that perhaps it would be useful to make the banks smaller, when the total number of buffers is small. For example, if you have 16 or 32 buffers, it's not really clear to me that it makes sense to have just 1 bank or 2 banks. It might be more sensible to have 4 banks with 4 or 8 buffers instead. That should make the algorithm scale down as well as up ... I haven't done either of those things in the attached v19 version. I did go over the comments once again and rewrote the parts I was unhappy with, including some existing ones. I think it's OK now from that point of view ... at some point I thought about creating a separate README, but in the end I thought it not necessary. I did add a bunch of Assert()s to make sure the locks that are supposed to be held are actually held. This led me to testing the page status to be not EMPTY during SimpleLruWriteAll() before calling SlruInternalWritePage(), because the assert was firing. The previous code is not really *buggy*, but to me it's weird to call WritePage() on a slot with no contents. Another change was in TransactionGroupUpdateXidStatus: the original code had the leader doing pg_atomic_read_u32(&ProcGlobal->clogGroupFirst) to know which bank to lock. I changed it to simply be the page used by the leader process; this doesn't need an atomic read, and should be the same page anyway.
(If it isn't, it's no big deal). But what's more: even if we do read ->clogGroupFirst at that point, there's no guarantee that this is going to be exactly for the same process that ends up being the first in the list, because since we have not set it to INVALID by the time we grab the bank lock, it is quite possible for more processes to add themselves to the list. I realized all this while rewriting the comments in a way that would let me understand what was going on ... so IMO the effort was worthwhile. Anyway, what I send now should be pretty much final, modulo the change to the check_slru_buffers routines and documentation additions to match pg_stat_slru to the new GUC names. > > Another painful point is that pg_stat_slru uses internal names in the > > data it outputs, which obviously do not match the new GUCs. > > Yeah, that's true, but I think this could be explained somewhere not > sure what is the right place for this. Or we can change those names in the view. > FYI, I have also repeated all the performance tests I performed in my > first email[1], and I am seeing a similar gain. Excellent, thanks for doing that. > Some comments on v18 in my first pass of the review. > > 1. > @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) > lsnindex = GetLSNIndex(slotno, xid); > *lsn = XactCtl->shared->group_lsn[lsnindex]; > > - LWLockRelease(XactSLRULock); > + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); > > Maybe here we can add an assert before releasing the lock for a safety check > > Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); Hmm, I think this would just throw a warning or error "you don't hold such lwlock", so it doesn't seem necessary. > 2. > + * > + * XXX could we make the LSNs to be bank-based? > */ > XLogRecPtr *group_lsn; > > IMHO, the flush still happens at the page level so up to which LSN > should be flush before flushing the particular clog page should also > be maintained at the page level. 
Yeah, this was a misguided thought, never mind. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "It does it in a really, really complicated way / why does it need to be complicated? / Because it's MakeMaker."
>From e1aabcbf5ce1417decfe24f513e5cfe8b6de77f2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 22 Feb 2024 18:42:56 +0100
Subject: [PATCH v19] Make SLRU buffer sizes configurable

Also, divide the slot array in banks, so that the LRU algorithm can be
made more scalable. Also remove the centralized control lock for even
better scalability.

Authors: Dilip Kumar, Andrey Borodin
---
 doc/src/sgml/config.sgml               | 139 +++
 src/backend/access/transam/clog.c      | 235
 src/backend/access/transam/commit_ts.c |  89 +++--
 src/backend/access/transam/multixact.c | 190 +++---
 src/backend/access/transam/slru.c      | 345 +-
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera > > wrote: > > > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > > because we don't have that macro at that point, so I just used constant > > > 16. Obviously need a better solution for this. > > > > If we define SLRU_BANK_SIZE in slur.h then we can use it here right, > > because these files are anyway include slur.h so. > > Sure, but is that really what we want? So your question is: do we want these buffers to be in multiples of SLRU_BANK_SIZE? Maybe we can have the last bank be partial; I don't think it should create any problem logically. I mean, we can look again in the patch to see if we have made any such assumptions, but that should be fairly easy to fix; then, if we are going this way, maybe we should get rid of the check_slru_buffers() function as well. > > > I've been wondering whether we should add a "slru" to the name of the > > > GUCs: > > > > > > commit_timestamp_slru_buffers > > > transaction_slru_buffers > > > etc > > > > I am not sure we are exposing anything related to SLRU to the user, > > We do -- we have pg_stat_slru already. > > > I mean transaction_buffers should make sense for the user that it > > stores transaction-related data in some buffers pool but whether that > > buffer pool is called SLRU or not doesn't matter much to the user > > IMHO. > > Yeah, that's exactly what my initial argument was for naming these this > way. But since the term slru already escaped into the wild via the > pg_stat_slru view, perhaps it helps users make the connection between > these things. Alternatively, we can cross-reference each term from the > other's documentation and call it a day.
Yeah, that's true; I forgot this point about pg_stat_slru. From this point of view, if the configuration parameters have "slru" in their names, users would be able to make a better connection between the configured value and the stats in this view, and based on that they can decide whether they need to increase the size of these SLRU buffers. > Another painful point is that pg_stat_slru uses internal names in the > data it outputs, which obviously do not match the new GUCs. Yeah, that's true, but I think this could be explained somewhere; I'm not sure what the right place for this would be. FYI, I have also repeated all the performance tests I performed in my first email[1], and I am seeing a similar gain. Some comments on v18 in my first pass of the review. 1. @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) lsnindex = GetLSNIndex(slotno, xid); *lsn = XactCtl->shared->group_lsn[lsnindex]; - LWLockRelease(XactSLRULock); + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); Maybe here we can add an assert before releasing the lock for a safety check: Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); 2. + * + * XXX could we make the LSNs to be bank-based? */ XLogRecPtr *group_lsn; IMHO, the flush still happens at the page level, so the LSN up to which we must flush before flushing a particular clog page should also be maintained at the page level. [1] https://www.postgresql.org/message-id/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-07, Dilip Kumar wrote: > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera wrote: > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > because we don't have that macro at that point, so I just used constant > > 16. Obviously need a better solution for this. > > If we define SLRU_BANK_SIZE in slur.h then we can use it here right, > because these files are anyway include slur.h so. Sure, but is that really what we want? > > I've been wondering whether we should add a "slru" to the name of the > > GUCs: > > > > commit_timestamp_slru_buffers > > transaction_slru_buffers > > etc > > I am not sure we are exposing anything related to SLRU to the user, We do -- we have pg_stat_slru already. > I mean transaction_buffers should make sense for the user that it > stores transaction-related data in some buffers pool but whether that > buffer pool is called SLRU or not doesn't matter much to the user > IMHO. Yeah, that's exactly what my initial argument was for naming these this way. But since the term slru already escaped into the wild via the pg_stat_slru view, perhaps it helps users make the connection between these things. Alternatively, we can cross-reference each term from the other's documentation and call it a day. Another painful point is that pg_stat_slru uses internal names in the data it outputs, which obviously do not match the new GUCs. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se está indefenso"
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 7 Feb 2024, at 10:58, Dilip Kumar wrote: > >> commit_timestamp_slru_buffers >> transaction_slru_buffers >> etc > > I am not sure we are exposing anything related to SLRU to the user, I think we already tell the user something about SLRU. I'd rather consider whether "transaction_slru_buffers" is easier to understand than "transaction_buffers" ... IMO transaction_buffers is clearer, but I do not have a strong opinion. > I > mean transaction_buffers should make sense for the user that it stores > transaction-related data in some buffers pool but whether that buffer > pool is called SLRU or not doesn't matter much to the user IMHO. +1 Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera wrote: > > Here's the rest of it rebased on top of current master. I think it > makes sense to have this as one individual commit. > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > because we don't have that macro at that point, so I just used constant > 16. Obviously need a better solution for this. If we define SLRU_BANK_SIZE in slru.h then we can use it here, right? These files already include slru.h anyway. > > I also changed the location of bank_mask in SlruCtlData for better > packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() > to SlotGetBankNumber(). Okay. > Some very critical comments still need to be updated to the new design, > particularly anything that mentions "control lock"; but also the overall > model needs to be explained in some central location, rather than > incongruently some pieces here and other pieces there. I'll see about > this later. But at least this is code you should be able to play with. Okay, I will review and test this. > I've been wondering whether we should add a "slru" to the name of the > GUCs: > > commit_timestamp_slru_buffers > transaction_slru_buffers > etc I am not sure we are exposing anything related to SLRU to the user; I mean, transaction_buffers should make sense for the user in that it stores transaction-related data in some buffer pool, but whether that buffer pool is called SLRU or not doesn't matter much to the user, IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Here's the rest of it rebased on top of current master. I think it makes sense to have this as one individual commit. I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, because we don't have that macro at that point, so I just used constant 16. Obviously need a better solution for this. I also changed the location of bank_mask in SlruCtlData for better packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() to SlotGetBankNumber(). Some very critical comments still need to be updated to the new design, particularly anything that mentions "control lock"; but also the overall model needs to be explained in some central location, rather than incongruently some pieces here and other pieces there. I'll see about this later. But at least this is code you should be able to play with. I've been wondering whether we should add a "slru" to the name of the GUCs: commit_timestamp_slru_buffers transaction_slru_buffers etc -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "Aprender sin pensar es inútil; pensar sin aprender, peligroso" (Confucio) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 61038472c5..3e3119865a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2006,6 +2006,145 @@ include_dir 'conf.d' + + commit_timestamp_buffers (integer) + + commit_timestamp_buffers configuration parameter + + + + +Specifies the amount of memory to use to cache the contents of +pg_commit_ts (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 0, which requests +shared_buffers/512 up to 1024 blocks, +but not fewer than 16 blocks. +This parameter can only be set at server start. 
+ + + + + + multixact_members_buffers (integer) + + multixact_members_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_multixact/members (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 32. +This parameter can only be set at server start. + + + + + + multixact_offsets_buffers (integer) + + multixact_offsets_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_multixact/offsets (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 16. +This parameter can only be set at server start. + + + + + + notify_buffers (integer) + + notify_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_notify (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 16. +This parameter can only be set at server start. + + + + + + serializable_buffers (integer) + + serializable_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_serial (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 32. +This parameter can only be set at server start. + + + + + + subtransaction_buffers (integer) + + subtransaction_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_subtrans (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 0, which requests +shared_buffers/512 up to 1024 blocks, +but not fewer than 16 blocks. +This parameter can only be set at server start. 
+ + + + + + transaction_buffers (integer) + + transaction_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_xact (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 6, 2024 at 4:23 PM Alvaro Herrera wrote: > > > > (We also have SimpleLruTruncate, but I think it's not as critical to > > > have a barrier there anyhow: accessing a slightly outdated page number > > > could only be a problem if a bug elsewhere causes us to try to truncate > > > in the current page. I think we only have this code there because we > > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > > > +1, I agree with this theory in general. But the below comment in > > SimpleLruTrucate in your v3 patch doesn't seem correct, because here > > we are checking if the latest_page_number is smaller than the cutoff > > if so we log it as wraparound and skip the whole thing and that is > > fine even if we are reading with atomic variable and slightly outdated > > value should not be a problem but the comment claim that this safe > > because we have the same bank lock as SimpleLruZeroPage(), but that's > > not true here we will be acquiring different bank locks one by one > > based on which slotno we are checking. Am I missing something? > > I think you're correct. I reworded this comment, so now it says this: > > /* > * An important safety check: the current endpoint page must not be > * eligible for removal. This check is just a backstop against wraparound > * bugs elsewhere in SLRU handling, so we don't care if we read a slightly > * outdated value; therefore we don't add a memory barrier. > */ > > Pushed with those changes. Thank you! Yeah, this looks perfect, thanks. > Now I'll go rebase the rest of the patch on top. Okay, I will review and test after that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-04, Andrey M. Borodin wrote: > This patch uses wording "banks" in comments before banks start to > exist. But as far as I understand, it is expected to be committed > before "banks" patch. True -- changed that to use ControlLock. > Besides this patch looks good to me. Many thanks for reviewing. On 2024-Feb-05, Dilip Kumar wrote: > > (We also have SimpleLruTruncate, but I think it's not as critical to > > have a barrier there anyhow: accessing a slightly outdated page number > > could only be a problem if a bug elsewhere causes us to try to truncate > > in the current page. I think we only have this code there because we > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > +1, I agree with this theory in general. But the below comment in > SimpleLruTrucate in your v3 patch doesn't seem correct, because here > we are checking if the latest_page_number is smaller than the cutoff > if so we log it as wraparound and skip the whole thing and that is > fine even if we are reading with atomic variable and slightly outdated > value should not be a problem but the comment claim that this safe > because we have the same bank lock as SimpleLruZeroPage(), but that's > not true here we will be acquiring different bank locks one by one > based on which slotno we are checking. Am I missing something? I think you're correct. I reworded this comment, so now it says this: /* * An important safety check: the current endpoint page must not be * eligible for removal. This check is just a backstop against wraparound * bugs elsewhere in SLRU handling, so we don't care if we read a slightly * outdated value; therefore we don't add a memory barrier. */ Pushed with those changes. Thank you! Now I'll go rebase the rest of the patch on top. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sun, Feb 4, 2024 at 7:10 PM Alvaro Herrera wrote: > > On 2024-Feb-02, Dilip Kumar wrote: > > > I have checked the patch and it looks fine to me other than the above > > question related to memory barrier usage one more question about the > > same, basically below to instances 1 and 2 look similar but in 1 you > > are not using the memory write_barrier whereas in 2 you are using the > > write_barrier, why is it so? I mean why the reordering can not happen > > in 1 and it may happen in 2? > > What I was thinking is that there's a lwlock operation just below, which > acts as a barrier. But I realized something more important: there are > only two places that matter, which are SlruSelectLRUPage and > SimpleLruZeroPage. The others are all initialization code that run at a > point where there's no going to be any concurrency in SLRU access, so we > don't need barriers anyway. In SlruSelectLRUPage we definitely don't > want to evict the page that SimpleLruZeroPage has initialized, starting > from the point where it returns that new page to its caller. > But if you consider the code of those two routines, you realize that the > only time an equality between latest_page_number and "this_page_number" > is going to occur, is when both pages are in the same bank ... and both > routines are required to be holding the bank lock while they run, so in > practice this is never a problem. Right. In fact, when I first converted this 'latest_page_number' to an atomic, the thinking was to protect it from being concurrently set in SimpleLruZeroPage() while being concurrently read in SlruSelectLRUPage(), so that the reader would not see a corrupted value. All other usages were during the initialization phase, where we do not need any protection. > > We need the atomic write and atomic read so that multiple processes > processing pages in different banks can update latest_page_number > simultaneously. But the equality condition that we're looking for? it can never happen concurrently.
Yeah, that's right; after you explained it, I also realized that this case is protected by the bank lock. Earlier I didn't think about this case. > In other words, these barriers are fully useless. > > (We also have SimpleLruTruncate, but I think it's not as critical to > have a barrier there anyhow: accessing a slightly outdated page number > could only be a problem if a bug elsewhere causes us to try to truncate > in the current page. I think we only have this code there because we > did have such bugs in the past, but IIUC this shouldn't happen anymore.) +1, I agree with this theory in general. But the below comment in SimpleLruTruncate in your v3 patch doesn't seem correct, because here we are checking whether the latest_page_number is smaller than the cutoff; if so, we log it as wraparound and skip the whole thing. That is fine even when reading the atomic variable, since a slightly outdated value should not be a problem. But the comment claims this is safe because we have the same bank lock as SimpleLruZeroPage(), and that's not true: here we will be acquiring different bank locks one by one based on which slotno we are checking. Am I missing something? + * An important safety check: the current endpoint page must not be + * eligible for removal. Like SlruSelectLRUPage, we don't need a + * memory barrier here because for the affected page to be relevant, + * we'd have to have the same bank lock as SimpleLruZeroPage. */ - if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage)) + if (ctl->PagePrecedes(pg_atomic_read_u64(&shared->latest_page_number), + cutoffPage)) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 4 Feb 2024, at 18:38, Alvaro Herrera wrote: > > In other words, these barriers are fully useless. +1. I've tried to understand the ideas behind the barriers, but latest_page_number is a heuristic that does not need any guarantees at all. It's also used in a safety check which can fire only when everything is already broken beyond repair. (Though using atomic access seems a good idea anyway.) This patch uses the wording "banks" in comments before banks start to exist, but as far as I understand it is expected to be committed before the "banks" patch. Besides that, this patch looks good to me. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Sorry, brown paper bag bug there. Here's the correct one. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth) >From 99cadfdf7475146953e9846c20c4a708a3527937 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 31 Jan 2024 12:27:51 +0100 Subject: [PATCH v3] Use atomics for SlruSharedData->latest_page_number --- src/backend/access/transam/clog.c | 6 +--- src/backend/access/transam/commit_ts.c | 7 ++--- src/backend/access/transam/multixact.c | 28 +++--- src/backend/access/transam/slru.c | 40 +++--- src/include/access/slru.h | 5 +++- 5 files changed, 54 insertions(+), 32 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index f6e7da7ffc..06fc2989ba 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -766,14 +766,10 @@ StartupCLOG(void) TransactionId xid = XidFromFullTransactionId(TransamVariables->nextXid); int64 pageno = TransactionIdToPage(xid); - LWLockAcquire(XactSLRULock, LW_EXCLUSIVE); - /* * Initialize our idea of the latest page number. */ - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_write_u64(&XactCtl->shared->latest_page_number, pageno); } /* diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 61b82385f3..6bfe60343e 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -689,9 +689,7 @@ ActivateCommitTs(void) /* * Re-Initialize our idea of the latest page number. 
*/ - LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE); - CommitTsCtl->shared->latest_page_number = pageno; - LWLockRelease(CommitTsSLRULock); + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, pageno); /* * If CommitTs is enabled, but it wasn't in the previous server run, we @@ -1006,7 +1004,8 @@ commit_ts_redo(XLogReaderState *record) * During XLOG replay, latest_page_number isn't set up yet; insert a * suitable value to bypass the sanity test in SimpleLruTruncate. */ - CommitTsCtl->shared->latest_page_number = trunc->pageno; + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); } diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 59523be901..febc429f72 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2017,13 +2017,15 @@ StartupMultiXact(void) * Initialize offset's idea of the latest page number. */ pageno = MultiXactIdToOffsetPage(multi); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); } /* @@ -2047,14 +2049,15 @@ TrimMultiXact(void) oldestMXactDB = MultiXactState->oldestMultiXactDB; LWLockRelease(MultiXactGenLock); - /* Clean up offsets state */ - LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); - /* * (Re-)Initialize our idea of the latest page number for offsets. */ pageno = MultiXactIdToOffsetPage(nextMXact); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); + + /* Clean up offsets state */ + LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current offsets page. 
See notes in @@ -2081,14 +2084,16 @@ TrimMultiXact(void) LWLockRelease(MultiXactOffsetSLRULock); - /* And the same for members */ - LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE); - /* + * And the same for members. + * * (Re-)Initialize our idea of the latest page number for members. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current members page. See notes in @@ -,7 +3338,8 @@ multixact_redo(XLogReaderState *record) * SimpleLruTruncate. */ pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff); LWLockRelease(MultiXactTruncationLock); diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 9ac4790f16..c1d0dfc73b 100644 ---
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
In short, I propose the attached. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ >From b4ba8135f8044e0077a27fcf6ad18451380cbcb3 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 31 Jan 2024 12:27:51 +0100 Subject: [PATCH v2] Use atomics for SlruSharedData->latest_page_number --- src/backend/access/transam/clog.c | 6 +--- src/backend/access/transam/commit_ts.c | 7 ++--- src/backend/access/transam/multixact.c | 28 +++--- src/backend/access/transam/slru.c | 40 +++--- src/include/access/slru.h | 5 +++- 5 files changed, 54 insertions(+), 32 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index f6e7da7ffc..f8aa91eb0a 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -766,14 +766,10 @@ StartupCLOG(void) TransactionId xid = XidFromFullTransactionId(TransamVariables->nextXid); int64 pageno = TransactionIdToPage(xid); - LWLockAcquire(XactSLRULock, LW_EXCLUSIVE); - /* * Initialize our idea of the latest page number. */ - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_write_u64(XactCtl->shared->latest_page_number, pageno); } /* diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 61b82385f3..6bfe60343e 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -689,9 +689,7 @@ ActivateCommitTs(void) /* * Re-Initialize our idea of the latest page number. */ - LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE); - CommitTsCtl->shared->latest_page_number = pageno; - LWLockRelease(CommitTsSLRULock); + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, pageno); /* * If CommitTs is enabled, but it wasn't in the previous server run, we @@ -1006,7 +1004,8 @@ commit_ts_redo(XLogReaderState *record) * During XLOG replay, latest_page_number isn't set up yet; insert a * suitable value to bypass the sanity test in SimpleLruTruncate. 
*/ - CommitTsCtl->shared->latest_page_number = trunc->pageno; + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); } diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 59523be901..febc429f72 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2017,13 +2017,15 @@ StartupMultiXact(void) * Initialize offset's idea of the latest page number. */ pageno = MultiXactIdToOffsetPage(multi); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); } /* @@ -2047,14 +2049,15 @@ TrimMultiXact(void) oldestMXactDB = MultiXactState->oldestMultiXactDB; LWLockRelease(MultiXactGenLock); - /* Clean up offsets state */ - LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); - /* * (Re-)Initialize our idea of the latest page number for offsets. */ pageno = MultiXactIdToOffsetPage(nextMXact); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); + + /* Clean up offsets state */ + LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current offsets page. See notes in @@ -2081,14 +2084,16 @@ TrimMultiXact(void) LWLockRelease(MultiXactOffsetSLRULock); - /* And the same for members */ - LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE); - /* + * And the same for members. + * * (Re-)Initialize our idea of the latest page number for members. 
*/ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current members page. See notes in @@ -,7 +3338,8 @@ multixact_redo(XLogReaderState *record) * SimpleLruTruncate. */ pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff); LWLockRelease(MultiXactTruncationLock); diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 9ac4790f16..c1d0dfc73b 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -17,7 +17,8 @@ * per-buffer LWLocks that synchronize I/O for each buffer. The control lock * must be held to
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-02, Dilip Kumar wrote: > I have checked the patch and it looks fine to me other than the above > question related to memory barrier usage one more question about the > same, basically below two instances 1 and 2 look similar but in 1 you > are not using the memory write_barrier whereas in 2 you are using the > write_barrier, why is it so? I mean why can the reordering not happen > in 1 while it may happen in 2? What I was thinking is that there's a lwlock operation just below, which acts as a barrier. But I realized something more important: there are only two places that matter, which are SlruSelectLRUPage and SimpleLruZeroPage. The others are all initialization code that runs at a point where there's not going to be any concurrency in SLRU access, so we don't need barriers anyway. In SlruSelectLRUPage we definitely don't want to evict the page that SimpleLruZeroPage has initialized, starting from the point where it returns that new page to its caller. But if you consider the code of those two routines, you realize that the only time an equality between latest_page_number and "this_page_number" is going to occur is when both pages are in the same bank ... and both routines are required to be holding the bank lock while they run, so in practice this is never a problem. We need the atomic write and atomic read so that multiple processes processing pages in different banks can update latest_page_number simultaneously. But the equality condition that we're looking for? It can never happen concurrently. In other words, these barriers are fully useless. (We also have SimpleLruTruncate, but I think it's not as critical to have a barrier there anyhow: accessing a slightly outdated page number could only be a problem if a bug elsewhere causes us to try to truncate in the current page. I think we only have this code there because we did have such bugs in the past, but IIUC this shouldn't happen anymore.) 
-- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
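For readers of the archive, the "equal page numbers imply the same bank, hence the same bank lock" argument above can be sketched in a toy model. The bank count and the modulo mapping below are illustrative assumptions only; the actual patch derives the bank from the page number with a per-SLRU bank mask, and `bank_for_page`/`same_bank_lock` are hypothetical names, not functions in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: each SLRU page lives in exactly one bank, and the bank is a
 * pure function of the page number.  (Assumed mapping; the real code uses
 * a mask computed from the configured number of buffers.) */
static inline uint64_t bank_for_page(uint64_t pageno, uint64_t nbanks)
{
    return pageno % nbanks;
}

/* SimpleLruZeroPage and SlruSelectLRUPage each hold the bank lock of the
 * page they operate on, so latest_page_number == this_page_number can only
 * ever be observed while both hold the very same lock. */
static inline int same_bank_lock(uint64_t latest, uint64_t candidate,
                                 uint64_t nbanks)
{
    return bank_for_page(latest, nbanks) == bank_for_page(candidate, nbanks);
}
```

Since equality of two page numbers trivially puts them in the same bank, the two routines can never race on that comparison, which is the heart of the "no barrier needed" argument.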
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar wrote: > > On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar wrote: > > > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera > > wrote: > > > Okay. > > > > > > While I have your attention -- if you could give a look to the 0001 > > > patch I posted, I would appreciate it. > > > > > > > I will look into it. Thanks. > > Some quick observations, > > Do we need the below two write barriers at the end of the function? > because the next instruction is separated by the function boundary > > @@ -766,14 +766,11 @@ StartupCLOG(void) > .. > - XactCtl->shared->latest_page_number = pageno; > - > - LWLockRelease(XactSLRULock); > + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); > + pg_write_barrier(); > } > > /* > * Initialize member's idea of the latest page number. > */ > pageno = MXOffsetToMemberPage(offset); > - MultiXactMemberCtl->shared->latest_page_number = pageno; > + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, > + pageno); > + > + pg_write_barrier(); > } > I have checked the patch and it looks fine to me. Other than the above question related to memory barrier usage, one more question about the same: basically, the below two instances 1 and 2 look similar, but in 1 you are not using the memory write_barrier whereas in 2 you are using the write_barrier; why is it so? I mean, why can the reordering not happen in 1 while it may happen in 2? 1. + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); vs 2. - shared->latest_page_number = pageno; + pg_atomic_write_u64(&shared->latest_page_number, pageno); + pg_write_barrier(); /* update the stats counter of zeroed pages */ pgstat_count_slru_page_zeroed(shared->slru_stats_idx); -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar wrote: > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera wrote: > Okay. > > > > While I have your attention -- if you could give a look to the 0001 > > patch I posted, I would appreciate it. > > > > I will look into it. Thanks. Some quick observations, Do we need the below two write barriers at the end of the function? because the next instruction is separated by the function boundary @@ -766,14 +766,11 @@ StartupCLOG(void) .. - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); + pg_write_barrier(); } /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + pg_write_barrier(); } I am looking more into this from the concurrency point of view and will update you soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera wrote: > > On 2024-Feb-01, Dilip Kumar wrote: > > > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera > > wrote: > > > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > > > "transaction_buffers": 17 > > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must > > > be a multiple of 16 > > > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > > instead of giving an error? > > Since this is the auto-tuning feature, I think it should use the > previous multiple rather than the next, but yeah, something like that. Okay. > > While I have your attention -- if you could give a look to the 0001 > patch I posted, I would appreciate it. > I will look into it. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-01, Dilip Kumar wrote: > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera wrote: > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > > "transaction_buffers": 17 > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must > > be a multiple of 16 > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > instead of giving an error? Since this is the auto-tuning feature, I think it should use the previous multiple rather than the next, but yeah, something like that. While I have your attention -- if you could give a look to the 0001 patch I posted, I would appreciate it. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "Los trabajadores menos efectivos son sistematicamente llevados al lugar donde pueden hacer el menor daño posible: gerencia." (El principio Dilbert)
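The auto-tuning behavior being discussed (scale with shared_buffers, but round down to the previous multiple of the bank size rather than erroring out) can be sketched as below. This is a hypothetical illustration assembled from the thread: the /512 scaling divisor and the 16/1024 bounds are mentioned in nearby messages, but the function name and exact shape are assumptions, not the actual GUC check hook.

```c
#include <assert.h>

/* Assumed constants from the discussion: 16 buffers per bank, and an
 * arbitrary auto-tuning ceiling of 1024 buffers (8 MB). */
#define SLRU_BANK_SIZE     16
#define SLRU_AUTOTUNE_MAX  1024

/* Round the scaled value down to the previous multiple of the bank size,
 * then clamp into [SLRU_BANK_SIZE, SLRU_AUTOTUNE_MAX].  shared_buffers is
 * expressed in blocks here. */
static int autotune_slru_buffers(int shared_buffers)
{
    int n = shared_buffers / 512;

    n -= n % SLRU_BANK_SIZE;        /* previous multiple, not the next */
    if (n < SLRU_BANK_SIZE)
        n = SLRU_BANK_SIZE;
    if (n > SLRU_AUTOTUNE_MAX)
        n = SLRU_AUTOTUNE_MAX;
    return n;
}
```

With this rule the `shared_buffers=$((512*17))` case from the error report would silently tune down to 16 buffers instead of failing at startup, and a 30 GB shared_buffers setting would be capped at the ceiling.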
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera wrote: > > Hah: > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > "transaction_buffers": 17 > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be > a multiple of 16 Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE instead of giving an error? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Hah: postgres -c lc_messages=C -c shared_buffers=$((512*17)) 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ It does it in a really, really complicated way why does it need to be complicated? Because it's MakeMaker.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Jan-29, Dilip Kumar wrote: > Thank you for working on this. There is one thing that I feel is > problematic. We have kept the allowed values for these GUCs to be in > multiples of SLRU_BANK_SIZE i.e. 16, and that's the reason the min > values were changed to 16; but in this refactoring patch for some of > the buffers you have changed that to 8, so I think that's not good. Oh, absolutely, you're right. Restored the minimum to 16. So, here's the patchset as two pieces. 0001 converts SlruSharedData->latest_page_number to use atomics. I don't see any reason to mix this in with the rest of the patch, and though it likely won't have any performance advantage by itself (since the lock acquisition is pretty much the same), it seems better to get it in ahead of the rest -- I think that simplifies matters for the second patch, which is large enough. So, 0002 introduces the rest of the feature. I removed the use of a different number of bank locks than banks, and I made commit_ts use a longer lwlock acquisition at truncation time, rather than forcing all-lwlock acquisition. The more I look at 0002, the more I notice that some comments badly need updating, so please don't read too much into it yet. But I wanted to post it anyway for archives and cfbot purposes. 
-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ >From 464a996b85c333ffc781086263c2e491758b248f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 31 Jan 2024 12:27:51 +0100 Subject: [PATCH 1/2] Use atomics for SlruSharedData->latest_page_number --- src/backend/access/transam/clog.c | 7 ++ src/backend/access/transam/commit_ts.c | 7 +++--- src/backend/access/transam/multixact.c | 30 -- src/backend/access/transam/slru.c | 19 ++-- src/include/access/slru.h | 5 - 5 files changed, 41 insertions(+), 27 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index f6e7da7ffc..245fd21c8d 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -766,14 +766,11 @@ StartupCLOG(void) TransactionId xid = XidFromFullTransactionId(TransamVariables->nextXid); int64 pageno = TransactionIdToPage(xid); - LWLockAcquire(XactSLRULock, LW_EXCLUSIVE); - /* * Initialize our idea of the latest page number. */ - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); + pg_write_barrier(); } /* diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 61b82385f3..f68705989d 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -689,9 +689,7 @@ ActivateCommitTs(void) /* * Re-Initialize our idea of the latest page number. */ - LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE); - CommitTsCtl->shared->latest_page_number = pageno; - LWLockRelease(CommitTsSLRULock); + pg_atomic_init_u64(&CommitTsCtl->shared->latest_page_number, pageno); /* * If CommitTs is enabled, but it wasn't in the previous server run, we @@ -1006,7 +1004,8 @@ commit_ts_redo(XLogReaderState *record) * During XLOG replay, latest_page_number isn't set up yet; insert a * suitable value to bypass the sanity test in SimpleLruTruncate. 
*/ - CommitTsCtl->shared->latest_page_number = trunc->pageno; + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); } diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 59523be901..a886c29892 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2017,13 +2017,17 @@ StartupMultiXact(void) * Initialize offset's idea of the latest page number. */ pageno = MultiXactIdToOffsetPage(multi); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + pg_write_barrier(); } /* @@ -2047,14 +2051,15 @@ TrimMultiXact(void) oldestMXactDB = MultiXactState->oldestMultiXactDB; LWLockRelease(MultiXactGenLock); - /* Clean up offsets state */ - LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); - /* * (Re-)Initialize our idea of the latest page number for offsets. */ pageno = MultiXactIdToOffsetPage(nextMXact); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); + + /* Clean up offsets state */ + LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current offsets page. See notes in @@ -2081,14 +2086,16 @@
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Jan 26, 2024 at 11:08 PM Alvaro Herrera wrote: > > I've continued to review this and decided that I don't like the mess > this patch proposes in order to support pg_commit_ts's deletion of all > files. (Yes, I know that I was the one that proposed this idea. It's > still a bad one). I'd like to change that code by removing the limit > that we can only have 128 bank locks in a SLRU. To recap, the reason we > did this is that commit_ts sometimes wants to delete all files while > running (DeactivateCommitTs), and for this it needs to acquire all bank > locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we > added the SLRU limit making multiple banks share lwlocks. > > I propose two alternative solutions: > > 1. The easiest is to have DeactivateCommitTs continue to hold > CommitTsLock until the end, including while SlruScanDirectory does its > thing. This sounds terrible, but considering that this code only runs > when the module is being deactivated, I don't think it's really all that > bad in practice. I mean, if you deactivate the commit_ts module and > then try to read commit timestamp data, you deserve to wait for a few > seconds just as a punishment for your stupidity. I think this idea looks reasonable. I agree that if we are trying to read commit_ts after deactivating it then it's fine to make it wait. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Jan 25, 2024 at 10:03 PM Alvaro Herrera wrote: > > On 2024-Jan-25, Alvaro Herrera wrote: > > > Here's a touched-up version of this patch. > > > diff --git a/src/backend/storage/lmgr/lwlock.c > > b/src/backend/storage/lmgr/lwlock.c > > index 98fa6035cc..4a5e05d5e4 100644 > > --- a/src/backend/storage/lmgr/lwlock.c > > +++ b/src/backend/storage/lmgr/lwlock.c > > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > > }; > > Eeek. Last minute changes ... Fixed here. Thank you for working on this. There is one thing that I feel is problematic. We have kept the allowed values for these GUCs to be in multiples of SLRU_BANK_SIZE i.e. 16, and that's the reason the min values were changed to 16; but in this refactoring patch for some of the buffers you have changed that to 8, so I think that's not good. + { + {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Sets the size of the dedicated buffer pool used for the MultiXact offset cache."), + NULL, + GUC_UNIT_BLOCKS + }, + &multixact_offsets_buffers, + 16, 8, SLRU_MAX_ALLOWED_BUFFERS, + check_multixact_offsets_buffers, NULL, NULL + }, Other than this, the patch looks good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 26 Jan 2024, at 22:38, Alvaro Herrera wrote: > > This is OK because in the > default compilation each file only has 32 segments, so that requires > only 32 lwlocks held at once while the file is being deleted. Do we somehow account for different subsystems not accumulating MAX_SIMUL_LWLOCKS together? E.g. GiST during a split can combine 75 locks, and somehow commit_ts will be deactivated by this backend at the same moment and add 32 locks more :) I understand that this sounds fantastic, and these subsystems do not interfere. But it is fantastic only until something like that actually happens. If possible, I'd prefer one lock at a time, and maybe sometimes two or three with some guarantees that this is safe. So, from my POV the first solution that you proposed seems much better to me. Thanks for working on this! Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
I've continued to review this and decided that I don't like the mess this patch proposes in order to support pg_commit_ts's deletion of all files. (Yes, I know that I was the one that proposed this idea. It's still a bad one). I'd like to change that code by removing the limit that we can only have 128 bank locks in a SLRU. To recap, the reason we did this is that commit_ts sometimes wants to delete all files while running (DeactivateCommitTs), and for this it needs to acquire all bank locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we added the SLRU limit making multiple banks share lwlocks. I propose two alternative solutions: 1. The easiest is to have DeactivateCommitTs continue to hold CommitTsLock until the end, including while SlruScanDirectory does its thing. This sounds terrible, but considering that this code only runs when the module is being deactivated, I don't think it's really all that bad in practice. I mean, if you deactivate the commit_ts module and then try to read commit timestamp data, you deserve to wait for a few seconds just as a punishment for your stupidity. AFAICT the cases where anything is done in the module mostly check without locking that commitTsActive is set, so we're not slowing down any critical operations. Also, if you don't like to be delayed for a couple of seconds, just don't deactivate the module. 2. If we want some refinement, the other idea is to change SlruScanDirCbDeleteAll (the callback that SlruScanDirectory uses in this case) so that it acquires the bank lock appropriate for all the slots used by the file that's going to be deleted. This is OK because in the default compilation each file only has 32 segments, so that requires only 32 lwlocks held at once while the file is being deleted. I think we don't need to bother with this, but it's an option if we see the above as unworkable for whatever reason. 
The only other user of SlruScanDirCbDeleteAll is async.c (the LISTEN/ NOTIFY code), and what that does is delete all the files at server start, where nothing is running concurrently anyway. So whatever we do for commit_ts, it won't affect async.c. So, if we do away with the SLRU_MAX_BANKLOCKS idea, then we're assured one LWLock per bank (instead of sharing some lwlocks to multiple banks), and that makes the code simpler to reason about. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "In fact, the basic problem with Perl 5's subroutines is that they're not crufty enough, so the cruft leaks out into user-defined code instead, by the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6)
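To see why option 2 stays comfortably within MAX_SIMUL_LWLOCKS, one can count the distinct bank locks a single segment file can touch. The sketch below is a hypothetical model, not the actual SlruScanDirCbDeleteAll code: the modulo bank mapping and the constants are assumptions for illustration, with 32 pages per segment file standing in for the "default compilation" figure from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 32 pages per segment file, and at most 128 banks. */
#define SLRU_PAGES_PER_SEGMENT 32
#define MAX_BANKS              128

/* Count how many distinct bank locks would have to be held to cover every
 * page of one segment file.  The answer can never exceed the number of
 * pages per segment, so at most 32 locks are held at once, well below the
 * MAX_SIMUL_LWLOCKS limit of 200. */
static int banks_for_segment(uint64_t segno, uint64_t nbanks)
{
    bool needed[MAX_BANKS] = {false};
    int count = 0;

    for (int p = 0; p < SLRU_PAGES_PER_SEGMENT; p++)
    {
        uint64_t bank = (segno * SLRU_PAGES_PER_SEGMENT + p) % nbanks;

        if (!needed[bank])
        {
            needed[bank] = true;
            count++;
        }
    }
    return count;
}
```

With few banks the whole set of bank locks is needed; with many banks the count tops out at the pages-per-segment bound, which is what makes the per-file locking in option 2 safe.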
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Jan 25, 2024 at 11:22 AM Alvaro Herrera wrote: > Still with these auto-tuning GUCs, I noticed that the auto-tuning code > would continue to grow the buffer sizes with shared_buffers to > arbitrarily large values. I added an arbitrary maximum of 1024 (8 MB), > which is much higher than the current value of 128; but if you have > (say) 30 GB of shared_buffers (not uncommon these days), do you really > need 30MB of pg_clog cache? It seems mostly unnecessary ... and you can > still set it manually that way if you need it. So, largely I just > rewrote those small functions completely. Yeah, I think that if we're going to scale with shared_buffers, it should be capped. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Jan-25, Alvaro Herrera wrote: > Here's a touched-up version of this patch. > diff --git a/src/backend/storage/lmgr/lwlock.c > b/src/backend/storage/lmgr/lwlock.c > index 98fa6035cc..4a5e05d5e4 100644 > --- a/src/backend/storage/lmgr/lwlock.c > +++ b/src/backend/storage/lmgr/lwlock.c > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > }; Eeek. Last minute changes ... Fixed here. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "La primera ley de las demostraciones en vivo es: no trate de usar el sistema. Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 61038472c5..3e3119865a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2006,6 +2006,145 @@ include_dir 'conf.d' + + commit_timestamp_buffers (integer) + + commit_timestamp_buffers configuration parameter + + + + +Specifies the amount of memory to use to cache the contents of +pg_commit_ts (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 0, which requests +shared_buffers/512 up to 1024 blocks, +but not fewer than 16 blocks. +This parameter can only be set at server start. 
+
+      multixact_members_buffers (integer)
+        multixact_members_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_multixact/members (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 32.
+        This parameter can only be set at server start.
+
+      multixact_offsets_buffers (integer)
+        multixact_offsets_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_multixact/offsets (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 16.
+        This parameter can only be set at server start.
+
+      notify_buffers (integer)
+        notify_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_notify (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 16.
+        This parameter can only be set at server start.
+
+      serializable_buffers (integer)
+        serializable_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_serial (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 32.
+        This parameter can only be set at server start.
+
+      subtransaction_buffers (integer)
+        subtransaction_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_subtrans (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 0, which requests
+        shared_buffers/512 up to 1024 blocks,
+        but not fewer than 16 blocks.
+        This parameter can only be set at server start.
+
+      transaction_buffers (integer)
+        transaction_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_xact (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Here's a touched-up version of this patch.

First, PROC_GLOBAL->clogGroupFirst and SlruCtl->latest_page_number change from being protected by locks to being atomics, but there's no mention of considering the memory barriers they should have. Looking at the code, I think the former doesn't need any additional barriers, but latest_page_number is missing some, which I have added. This deserves another careful look.

Second, and most user-visibly, the GUC names seem to have been chosen based on the source-code variables, which were never meant to be user-visible. So I renamed a few:

  xact_buffers -> transaction_buffers
  subtrans_buffers -> subtransaction_buffers
  serial_buffers -> serializable_buffers
  commit_ts_buffers -> commit_timestamp_buffers
  (unchanged: multixact_offsets_buffers, multixact_members_buffers, notify_buffers)

I did this explicitly trying to avoid using the term SLRU in a user-visible manner, because what do users care? But immediately after doing this I realized that we already have pg_stat_slru, so maybe the cat is already out of the bag, and so perhaps we should name these GUCs as, say, slru_transaction_buffers? That may make the connection between these things a little more explicit. (I do think we need to add cross-links in the documentation of those GUCs to the pg_stat_slru docs and vice versa.)

Another thing that bothered me a bit is that we have auto-tuning for transaction_buffers and commit_timestamp_buffers, but not for subtransaction_buffers. (Autotuning means you set the GUC to 0 and it scales with shared_buffers.) I don't quite understand the reason for the omission, so I added it for subtrans too. I think it may make sense to do likewise for the multixact ones too, not sure. It doesn't seem worth having that for pg_serial and notify.
While messing about with these GUCs I realized that the use of the show_hook to print what the current number is, when autotuning is in effect, was bogus: SHOW would print the number of blocks for (say) transaction_buffers, but if you asked it to print (say) multixact_offsets_buffers, it would give a size in MB or kB. I'm sure such an inconsistency would bite us. So, digging around, I found that a good way to handle this is to remove the show_hook and instead call SetConfigOption() at the time the ShmemInit function is called, with the correct number of buffers determined. This is pretty much what is used for XLOGbuffers, and it works correctly as far as my testing shows.

Still with these auto-tuning GUCs, I noticed that the auto-tuning code would continue to grow the buffer sizes with shared_buffers to arbitrarily large values. I added an arbitrary maximum of 1024 (8 MB), which is much higher than the current value of 128; but if you have (say) 30 GB of shared_buffers (not uncommon these days), do you really need 30 MB of pg_clog cache? It seems mostly unnecessary ... and you can still set it manually that way if you need it. So, largely I just rewrote those small functions completely.

I also made the SGML documentation and postgresql.conf.sample match what the code actually does. The whole thing wasn't particularly consistent. I rewrote a bunch of code comments and moved stuff around to appear in alphabetical order, etc. More comment rewriting still pending.

-- Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 61038472c5..3e3119865a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,6 +2006,145 @@ include_dir 'conf.d'
+
+      commit_timestamp_buffers (integer)
+        commit_timestamp_buffers configuration parameter
+
+        Specifies the amount of memory to use to cache the contents of
+        pg_commit_ts (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 0, which requests
+        shared_buffers/512 up to 1024 blocks,
+        but not fewer than 16 blocks.
+        This parameter can only be set at server start.
+
+      multixact_members_buffers (integer)
+        multixact_members_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_multixact/members (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 32.
+        This parameter can only be set at server start.
+
+      multixact_offsets_buffers (integer)
+        multixact_offsets_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_multixact/offsets (see
+
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Jan-08, Dilip Kumar wrote:

> On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera wrote:
> >
> > The more I look at TransactionGroupUpdateXidStatus, the more I think
> > it's broken, and while we do have some tests, I don't have confidence
> > that they cover all possible cases.
> >
> > Or, at least, if this code is good, then it hasn't been sufficiently
> > explained.
>
> Any thought about a case in which you think it might be broken, I mean
> any vague thought might also help where you think it might not work as
> expected so that I can also think in that direction. It might be
> possible that I might not be thinking of some perspective that you are
> thinking and comments might be lacking from that point of view.

Eh, apologies. This email was an unfinished draft that I had lying around before the holidays, which I intended to discard but somehow kept, and just now I happened to press the wrong key combination and it ended up being sent instead.

We had some further discussion, after which I no longer think that there is a problem here, so please ignore this email. I'll come back to this patch later this week.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera wrote:
>
> The more I look at TransactionGroupUpdateXidStatus, the more I think
> it's broken, and while we do have some tests, I don't have confidence
> that they cover all possible cases.
>
> Or, at least, if this code is good, then it hasn't been sufficiently
> explained.

Any thought about a case in which you think it might be broken, or even any vague thought about where you think it might not work as expected, might help, so that I can also think in that direction. It might be that I am not thinking of some perspective that you are, and the comments might be lacking from that point of view.

> If we have multiple processes trying to write bits to clog, and they are
> using different banks, then the LWLockConditionalAcquire will be able to
> acquire the bank lock

Do you think there is a problem with multiple processes getting the lock? I mean, they are modifying different CLOG pages, so that can be done concurrently, right?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
The more I look at TransactionGroupUpdateXidStatus, the more I think it's broken, and while we do have some tests, I don't have confidence that they cover all possible cases.

Or, at least, if this code is good, then it hasn't been sufficiently explained.

If we have multiple processes trying to write bits to clog, and they are using different banks, then the LWLockConditionalAcquire will be able to acquire the bank lock

-- Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom." (Isaac Asimov)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Jan 3, 2024 at 12:08 AM Dilip Kumar wrote: > Yeah, this is indeed an interesting idea. So I think if we are > interested in working in this direction maybe this can be submitted in > a different thread, IMHO. Yeah, that's something quite different from the patch before us. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Jan 2, 2024 at 7:53 PM Robert Haas wrote: > > On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin > wrote: > > Just a side node. > > It seems like commit log is kind of antipattern of data contention. Even > > when we build a super-optimized SLRU. Nearby **bits** are written by > > different CPUs. > > I think that banks and locks are good thing. But also we could reorganize > > log so that > > status of transaction 0 is on a page 0 at bit offset 0 > > status of transaction 1 is on a page 1 at bit offset 0 > > status of transaction 2 is on a page 2 at bit offset 0 > > status of transaction 3 is on a page 3 at bit offset 0 > > status of transaction 4 is on a page 0 at bit offset 2 > > status of transaction 5 is on a page 1 at bit offset 2 > > status of transaction 6 is on a page 2 at bit offset 2 > > status of transaction 7 is on a page 3 at bit offset 2 > > etc... > > This is an interesting idea. A variant would be to stripe across > cachelines within the same page rather than across pages. If we do > stripe across pages as proposed here, we'd probably want to rethink > the way the SLRU is extended -- page at a time wouldn't really make > sense, but preallocating an entire file might. Yeah, this is indeed an interesting idea. So I think if we are interested in working in this direction maybe this can be submitted in a different thread, IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Jan 2, 2024 at 1:10 PM Andrey M. Borodin wrote: > > On 2 Jan 2024, at 19:23, Robert Haas wrote: > >> And it would be even better if page for transaction statuses would be > >> determined by backend id somehow. Or at least cache line. Can we allocate > >> a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each > >> backend? > > > > I don't understand how this could work. We need to be able to look up > > transaction status by XID, not backend ID. > > When GetNewTransactionId() is called we can reserve 256 xids in backend local > memory. This values will be reused by transactions or subtransactions of this > backend. Here 256 == sizeof(CacheLine). > This would ensure that different backends touch different cache lines. > > But this approach would dramatically increase xid consumption speed on > patterns where client reconnects after several transactions. So we can keep > unused xids in procarray for future reuse. > > I doubt we can find significant performance improvement here, because false > cache line sharing cannot be _that_ bad. Yeah, this seems way too complicated for what we'd potentially gain from it. An additional problem is that the xmin horizon computation assumes that XIDs are assigned in monotonically increasing fashion; breaking that would be messy. And even an occasional leak of XIDs could precipitate enough additional vacuuming to completely outweigh any gains we could hope to achieve here. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 2 Jan 2024, at 19:23, Robert Haas wrote:
>
> > And it would be even better if page for transaction statuses would be
> > determined by backend id somehow. Or at least cache line. Can we allocate a
> > range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each
> > backend?
>
> I don't understand how this could work. We need to be able to look up
> transaction status by XID, not backend ID.

When GetNewTransactionId() is called we can reserve 256 xids in backend-local memory. These values will be reused by transactions or subtransactions of this backend. Here 256 == sizeof(CacheLine). This would ensure that different backends touch different cache lines.

But this approach would dramatically increase xid consumption speed on patterns where the client reconnects after several transactions. So we could keep unused xids in the procarray for future reuse.

I doubt we can find a significant performance improvement here, because false cache line sharing cannot be _that_ bad.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin wrote: > Just a side node. > It seems like commit log is kind of antipattern of data contention. Even when > we build a super-optimized SLRU. Nearby **bits** are written by different > CPUs. > I think that banks and locks are good thing. But also we could reorganize log > so that > status of transaction 0 is on a page 0 at bit offset 0 > status of transaction 1 is on a page 1 at bit offset 0 > status of transaction 2 is on a page 2 at bit offset 0 > status of transaction 3 is on a page 3 at bit offset 0 > status of transaction 4 is on a page 0 at bit offset 2 > status of transaction 5 is on a page 1 at bit offset 2 > status of transaction 6 is on a page 2 at bit offset 2 > status of transaction 7 is on a page 3 at bit offset 2 > etc... This is an interesting idea. A variant would be to stripe across cachelines within the same page rather than across pages. If we do stripe across pages as proposed here, we'd probably want to rethink the way the SLRU is extended -- page at a time wouldn't really make sense, but preallocating an entire file might. > And it would be even better if page for transaction statuses would be > determined by backend id somehow. Or at least cache line. Can we allocate a > range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each > backend? I don't understand how this could work. We need to be able to look up transaction status by XID, not backend ID. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 19 Dec 2023, at 10:34, Dilip Kumar wrote:

Just a side note. It seems like the commit log is kind of an antipattern of data contention, even when we build a super-optimized SLRU: nearby **bits** are written by different CPUs. I think that banks and locks are a good thing. But we could also reorganize the log so that

status of transaction 0 is on page 0 at bit offset 0
status of transaction 1 is on page 1 at bit offset 0
status of transaction 2 is on page 2 at bit offset 0
status of transaction 3 is on page 3 at bit offset 0
status of transaction 4 is on page 0 at bit offset 2
status of transaction 5 is on page 1 at bit offset 2
status of transaction 6 is on page 2 at bit offset 2
status of transaction 7 is on page 3 at bit offset 2
etc...

And it would be even better if the page for transaction statuses were determined by backend id somehow, or at least the cache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend?

This does not matter much because
0. The patch set in the current thread produces a robust SLRU anyway
1. One day we are going to throw away SLRU anyway

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 11:00 PM Robert Haas wrote:

> On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote:
> > certain sense they are competing for the same job. However, they do
> > aim to alleviate different TYPES of contention: the group XID update
> > stuff should be most valuable when lots of processes are trying to
> > update the same page, and the banks should be most valuable when there
> > is simultaneous access to a bunch of different pages. So I'm not
> > convinced that this patch is a reason to remove the group XID update
> > mechanism, but someone might argue otherwise.
>
> Hmm, but, on the other hand:
>
> Currently all readers and writers are competing for the same LWLock.
> But with this change, the readers will (mostly) no longer be competing
> with the writers. So, in theory, that might reduce lock contention
> enough to make the group update mechanism pointless.

Thanks for your feedback. I agree that with bank-wise locks we might not need group updates for some of the use cases you describe, where readers and writers are contending on the centralized lock, because in most cases the readers will be distributed across different banks. OTOH there are use cases where the writer commit is the bottleneck (on the SLRU lock), like pgbench simple-update or TPC-B, and there we will still benefit from the group update. During group update testing we have seen benefits in such a scenario [1] with high client counts. So, as per my understanding, by distributing the SLRU locks there are scenarios where we will not need the group update anymore, but OTOH there are also scenarios where we will still benefit from it.

[1] https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 12:53 PM Andrey M. Borodin wrote: > One page still accommodates 32K transaction statuses under one lock. It feels > like a lot. About 1 second of transactions on a typical installation. > > When the group commit was committed did we have a benchmark to estimate > efficiency of this technology? Can we repeat that test again? I think we did, but it might take some research to find it in the archives. If we can, I agree that repeating it feels like a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 18 Dec 2023, at 22:30, Robert Haas wrote: > > On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote: >> certain sense they are competing for the same job. However, they do >> aim to alleviate different TYPES of contention: the group XID update >> stuff should be most valuable when lots of processes are trying to >> update the same page, and the banks should be most valuable when there >> is simultaneous access to a bunch of different pages. So I'm not >> convinced that this patch is a reason to remove the group XID update >> mechanism, but someone might argue otherwise. > > Hmm, but, on the other hand: > > Currently all readers and writers are competing for the same LWLock. > But with this change, the readers will (mostly) no longer be competing > with the writers. So, in theory, that might reduce lock contention > enough to make the group update mechanism pointless. One page still accommodates 32K transaction statuses under one lock. It feels like a lot. About 1 second of transactions on a typical installation. When the group commit was committed did we have a benchmark to estimate efficiency of this technology? Can we repeat that test again? Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote: > certain sense they are competing for the same job. However, they do > aim to alleviate different TYPES of contention: the group XID update > stuff should be most valuable when lots of processes are trying to > update the same page, and the banks should be most valuable when there > is simultaneous access to a bunch of different pages. So I'm not > convinced that this patch is a reason to remove the group XID update > mechanism, but someone might argue otherwise. Hmm, but, on the other hand: Currently all readers and writers are competing for the same LWLock. But with this change, the readers will (mostly) no longer be competing with the writers. So, in theory, that might reduce lock contention enough to make the group update mechanism pointless. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Dec 12, 2023 at 8:29 AM Alvaro Herrera wrote: > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how can this work correctly. > > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. I don't want to be dismissive of this concern, but I took a look at this part of the patch set and I don't see a correctness problem. I think the idea of the mechanism is that we have a single linked list in shared memory that can accumulate those waiters. At some point a process pops the entire list of waiters, which allows a new group of waiters to begin accumulating. The process that pops the list must perform the updates for every process in the just-popped list without failing, else updates would be lost. In theory, there can be any number of lists that some process has popped and is currently working its way through at the same time, although in practice I bet it's quite rare for there to be more than one. 
The only correctness problem is if it's possible for a process that popped the list to error out before it finishes doing the updates that it "promised" to do by popping the list. Having individual bank locks doesn't really change anything here. That's just a matter of what lock has to be held in order to perform the update that was promised, and the algorithm described in the previous paragraph doesn't really care about that. Nor is there a deadlock hazard here as long as processes only take one lock at a time, which I believe is the case. The only potential issue that I see here is with performance. I've heard some questions about whether this machinery performs well even as it stands, but certainly if we divide up the lock into a bunch of bankwise locks then that should tend in the direction of making a mechanism like this less valuable, because both mechanisms are trying to alleviate contention, and so in a certain sense they are competing for the same job. However, they do aim to alleviate different TYPES of contention: the group XID update stuff should be most valuable when lots of processes are trying to update the same page, and the banks should be most valuable when there is simultaneous access to a bunch of different pages. So I'm not convinced that this patch is a reason to remove the group XID update mechanism, but someone might argue otherwise. A related concern is that, if by chance we do end up with multiple updaters from different pages in the same group, it will now be more expensive to sort that out because we'll have to potentially keep switching locks. So that could make the group XID update mechanism less performant than it is currently. I think that's probably not an issue because I think it should be a rare occurrence, as the comments say. A bit more cost in a code path that is almost never taken won't matter. But if that path is more commonly taken than I think, then maybe making it more expensive could hurt. 
It might be worth adding some debugging to see how often we actually go down that path in a highly stressed system. BTW: +* as well. The main reason for now allowing requesters of different pages now -> not -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Dec 14, 2023 at 4:36 PM Dilip Kumar wrote:

> On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin wrote:
> >
> > On 12 Dec 2023, at 18:28, Alvaro Herrera wrote:
> > >
> > > Andrey, do you have any stress tests or anything else that you used to
> > > gain confidence in this code?

I have done some more testing of the clog group update. The attached test file executes two concurrent scripts with pgbench: the first is a slow script which runs 10-second-long transactions, and the second is a very fast transaction at ~1 transaction per second. Along with that, I have also changed the bank size such that each bank contains just 1 page, i.e. 32k transactions per bank. I did it this way so that we do not need to keep long-running transactions open for very long in order to get transactions from different banks committed at the same time.

With this test I got that behavior, and the logs below show that multiple transaction ranges in different slru banks (considering 32k transactions per bank) are doing group updates at the same time. E.g., we can see that the xid ranges around 70600, 70548, and 70558, and the xid ranges around 755 and 752, are getting group updates by different leaders at nearly the same time. It runs fine over a long duration, but I am not sure how to validate the sanity of this kind of test.
2023-12-14 14:43:31.813 GMT [3306] LOG: group leader procno 606 updated status of procno 606 xid 70600
2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 for xid 70548 added for group update
2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 is group leader and got the lock
2023-12-14 14:43:31.816 GMT [3326] LOG: group leader procno 586 updated status of procno 586 xid 70548
2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 for xid 70558 added for group update
2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 is group leader and got the lock
2023-12-14 14:43:31.818 GMT [3327] LOG: group leader procno 585 updated status of procno 585 xid 70558
2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 for xid 752 added for group update
2023-12-14 14:43:31.829 GMT [3207] LOG: procno 669 for xid 755 added for group update
2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 is group leader and got the lock
2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 669 xid 755
2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 687 xid 752

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

# Goal of this script is to generate a scenario where some old long-running slow
# transactions get committed together with new transactions, such that they fall
# in different slru banks
rm -rf pgdata
./initdb -D pgdata
echo "max_wal_size=20GB" >> pgdata/postgresql.conf
echo "shared_buffers=20GB" >> pgdata/postgresql.conf
echo "checkpoint_timeout=40min" >> pgdata/postgresql.conf
echo "max_connections=700" >> pgdata/postgresql.conf
echo "maintenance_work_mem=1GB" >> pgdata/postgresql.conf
echo "subtrans_buffers=64" >> pgdata/postgresql.conf
echo "multixact_members_buffers=128" >> pgdata/postgresql.conf

#create slow_txn.sql script
cat > slow_txn.sql << EOF
BEGIN;
INSERT INTO test VALUES(1);
DELETE FROM test WHERE a=1;
select pg_sleep(10);
COMMIT;
EOF

#create fast_txn.sql script
cat > fast_txn.sql << EOF
BEGIN;
INSERT INTO test1 VALUES(1);
DELETE FROM test1 WHERE a=1;
COMMIT;
EOF

./pg_ctl -D pgdata -l logfile -c start
./psql -d postgres -c "create table test(a int)"
./psql -d postgres -c "create table test1(a int)"
./pgbench -i -s 1 postgres
./pgbench -f slow_txn.sql -c 28 -j 28 -P 1 -T 60 postgres &
./pgbench -f fast_txn.sql -c 100 -j 100 -P 1 -T 60 postgres
sleep(10);
./pg_ctl -D pgdata -l logfile -c stop
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 16:32, tender wang wrote:
>
> enable -O2, only one instruction:
> xor eax, eax

This is not fast code. This is how a friendly C compiler hints that the mask must be 127, not 128.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Andrey M. Borodin wrote on Thu, 14 Dec 2023, at 17:35:

> > On 14 Dec 2023, at 14:28, tender wang wrote:
> >
> > Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS'
> > operation in SimpleLruGetBankLock() with '& 127'?
>
> unsigned int GetBankno1(unsigned int pageno) {
>     return pageno & 127;
> }
>
> unsigned int GetBankno2(unsigned int pageno) {
>     return pageno % 128;
> }
>
> Generates with -O2:
>
> GetBankno1(unsigned int):
>     mov eax, edi
>     and eax, 127
>     ret
> GetBankno2(unsigned int):
>     mov eax, edi
>     and eax, 127
>     ret
>
> Compiler is smart enough with constants.

Yeah, that's true.

int GetBankno(long pageno)
{
    unsigned short bank_mask = 128;
    int bankno = (pageno & bank_mask) % 128;
    return bankno;
}

With -O2 enabled, only one instruction:

    xor eax, eax

But if we use '%' in both places, things change as below:

int GetBankno(long pageno)
{
    unsigned short bank_mask = 128;
    int bankno = (pageno % bank_mask) % 128;
    return bankno;
}

    mov rdx, rdi
    sar rdx, 63
    shr rdx, 57
    lea rax, [rdi+rdx]
    and eax, 127
    sub eax, edx

> Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 16:06, Dilip Kumar wrote:
>
> I have noticed a very good improvement with the addition of patch 0003.

Indeed, very impressive results! It’s almost a 2x performance gain in the high-contention scenario, on top of the previous improvements.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin wrote:
>
> > On 12 Dec 2023, at 18:28, Alvaro Herrera wrote:
> >
> > Andrey, do you have any stress tests or anything else that you used to
> > gain confidence in this code?
>
> We are using only the first two steps of the patchset; these steps do not
> touch the locking stuff.
>
> We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks!

I have run this test [1]; instead of comparing against master I have compared the effect of (patch-1 = (0001+0002) slru buffer bank) vs (patch-2 = (0001+0002+0003) slru buffer bank + bank-wise lock), and here are the results of benchmark-1 and benchmark-2. I have noticed a very good improvement with the addition of patch 0003.

Machine information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127

Configuration:
max_wal_size=20GB
shared_buffers=20GB
checkpoint_timeout=40min
max_connections=700
maintenance_work_mem=1GB
subtrans_buffers=$variable
multixact_offsets_buffers=$variable
multixact_members_buffers=$variable

benchmark-1

version | subtrans buffers | multixact offs/memb buffers | tps (func + ballast)
--------+------------------+-----------------------------+---------------------
patch-1 | 64               | 64/128                      | 87 + 58
patch-2 | 64               | 64/128                      | 128 + 83
patch-1 | 1024             | 512/1024                    | 96 + 64
patch-2 | 1024             | 512/1024                    | 163 + 108

benchmark-2

version | subtrans buffers | multixact offs/memb buffers | tps (func)
--------+------------------+-----------------------------+-----------
patch-1 | 64               | 64/128                      | 10
patch-2 | 64               | 64/128                      | 12
patch-1 | 1024             | 512/1024                    | 44
patch-2 | 1024             | 512/1024                    | 72

[1] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 14:28, tender wang wrote:
>
> Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS'
> operation in SimpleLruGetBankLock() with '& 127'?

unsigned int GetBankno1(unsigned int pageno) {
    return pageno & 127;
}

unsigned int GetBankno2(unsigned int pageno) {
    return pageno % 128;
}

Generates with -O2:

GetBankno1(unsigned int):
    mov eax, edi
    and eax, 127
    ret
GetBankno2(unsigned int):
    mov eax, edi
    and eax, 127
    ret

Compiler is smart enough with constants.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Andrey M. Borodin wrote on Thursday, 14 December 2023 at 17:02:
>
> > On 14 Dec 2023, at 08:12, Amul Sul wrote:
> >
> > + int bankno = pageno & ctl->bank_mask;
> >
> > I am a bit uncomfortable seeing it as a mask, why can't it be simply a
> > number of banks (num_banks) and get the bank number through modulus op
> > (pageno % num_banks) instead of bitwise & operation (pageno & ctl->bank_mask)
> > which is a bit difficult to read compared to modulus op which is quite
> > simple, straightforward and much common practice in hashing.
> >
> > Are there any advantages of using & over % ?

Using the Compiler Explorer [1] tool, '%' produces more assembly instructions than '&':

int GetBankno1(int pageno) {
    return pageno & 127;
}

int GetBankno2(int pageno) {
    return pageno % 127;
}

Under clang 13.0:

GetBankno1:                 # @GetBankno1
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov eax, dword ptr [rbp - 4]
    and eax, 127
    pop rbp
    ret
GetBankno2:                 # @GetBankno2
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov eax, dword ptr [rbp - 4]
    mov ecx, 127
    cdq
    idiv ecx
    mov eax, edx
    pop rbp
    ret

Under gcc 13.2:

GetBankno1:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    and eax, 127
    pop rbp
    ret
GetBankno2:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    movsx rdx, eax
    imul rdx, rdx, -2130574327
    shr rdx, 32
    add edx, eax
    mov ecx, edx
    sar ecx, 6
    cdq
    sub ecx, edx
    mov edx, ecx
    sal edx, 7
    sub edx, ecx
    sub eax, edx
    mov ecx, eax
    mov eax, ecx
    pop rbp
    ret

[1] https://godbolt.org/

> The instruction AND is ~20 times faster than IDIV [0]. This is a relatively
> hot function, worth sacrificing some readability to save ~ten nanoseconds
> on each check of the status of a transaction.

Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'?

SimpleLruGetBankLock()
{
    int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;  /* use '&' */

    return &(ctl->shared->bank_locks[banklockno].lock);
}

Thoughts?
> [0] https://www.agner.org/optimize/instruction_tables.pdf
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 08:12, Amul Sul wrote:
>
> > + int bankno = pageno & ctl->bank_mask;
>
> I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
> of banks (num_banks) and get the bank number through modulus op (pageno %
> num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
> bit difficult to read compared to modulus op which is quite simple,
> straightforward and much common practice in hashing.
>
> Are there any advantages of using & over % ?

The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of the status of a transaction.

[0] https://www.agner.org/optimize/instruction_tables.pdf
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar wrote:
> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar wrote:
> >
> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar wrote:
>
> Here is the updated patch based on some comments by tender wang (those
> comments were sent only to me)

A few nitpicks:

+
+	/*
+	 * Mask for slotno banks, considering 1GB SLRU buffer pool size and the
+	 * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask.
+	 */
+	bits16		bank_mask;
 } SlruCtlData;
...
...
+	int			bankno = pageno & ctl->bank_mask;

I am a bit uncomfortable seeing it as a mask, why can't it be simply a number of banks (num_banks) and get the bank number through modulus op (pageno % num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a bit difficult to read compared to modulus op which is quite simple, straightforward and much common practice in hashing.

Are there any advantages of using & over % ?

Also, a few places in the 0002 and 0003 patches need the bank number; it is better to have a macro for that.
---
 extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage, void *data);
-
+extern bool check_slru_buffers(const char *name, int *newval);
 #endif							/* SLRU_H */

Add an empty line after the declaration, in the 0002 patch.
---
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+						  int slotno)

Unrelated change for the 0003 patch.
---

Regards, Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 12 Dec 2023, at 18:28, Alvaro Herrera wrote: > > Andrey, do you have any stress tests or anything else that you used to > gain confidence in this code? We are using only first two steps of the patchset, these steps do not touch locking stuff. We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Dec 12, 2023 at 6:58 PM Alvaro Herrera wrote:
>
> [Added Andrey again in CC, because as I understand they are using this
> code or something like it in production. Please don't randomly remove
> people from CC lists.]

Oh, glad to know that. Yeah, I generally do not remove people, but I have noticed that in the mail chain some of the reviewers replied only to me and the hackers' list, and from that point onwards I lost track of the CC list.

> I've been looking at this some more, and I'm not confident in that the
> group clog update stuff is correct. I think the injection points test
> case was good enough to discover a problem, but it's hard to get peace
> of mind that there aren't other, more subtle problems.

Yeah, I agree.

> The problem I see is that the group update mechanism is designed around
> contention of the global xact-SLRU control lock; it uses atomics to
> coordinate a single queue when the lock is contended. So if we split up
> the global SLRU control lock using banks, then multiple processes using
> different bank locks might not contend. OK, this is fine, but what
> happens if two separate groups of processes encounter contention on two
> different bank locks? I think they will both try to update the same
> queue, and coordinate access to that *using different bank locks*. I
> don't see how can this work correctly.

Let's back up a bit and start from the current design with the centralized lock. With that, if one process is holding the lock, the other processes will try to perform a group update; but if there is already a group that still hasn't got the lock and is trying to update a different CLOG page than the one this process wants to update, then the process will not add itself to that group update and will instead fall back to the normal lock wait.
Now, in another situation, it may so happen that the group leader of the other group already got the control lock, and in such a case it would have cleared 'procglobal->clogGroupFirst', which means we will now start forming a different group. So logically, if we talk only about the optimization part, the assumption is that when we are trying to commit a lot of concurrent xids at a time, those xids are mostly in the same range and will fall on the same SLRU page, so the group update will help them. But if we get some out-of-range xid of some long-running transaction, it might not even go for the group update, as the page number will be different. The situation might actually be better here with a bank-wise lock, because if those xids fall in an altogether different bank they might not even contend.

Now, let's talk about correctness. I think even though we are getting processes that might be contending on different bank locks, we still ensure that in a group all the processes are trying to update the same SLRU page (i.e. the same bank also; we will talk about the exception later [1]). One of the processes becomes the leader, and as soon as the leader gets the lock it detaches the queue from 'procglobal->clogGroupFirst' by setting it to INVALID_PGPROCNO, so that other group-update requesters can form another parallel group. So here I do not see a problem with correctness.

I agree someone might say that since there is now a possibility that different groups get formed for different bank locks, we do not let another group get formed until we get the lock for our bank, because we do not clear 'procglobal->clogGroupFirst' before we get the lock. Other requesters might want to update pages in different banks, so why block them? But the thing is, the group-update design is optimized for the cases where all requesters are trying to update the status of xids generated near the same range.
> I suspect the first part of that algorithm, where atomics are used to
> create the list without a lock, might work fine. But will each "leader"
> process, each of which is potentially using a different bank lock,
> coordinate correctly? Maybe this works correctly because only one
> process will find the queue head not empty? If this is what happens,
> then there needs to be comments about it.

Yes, you got it right; I will try to comment on it better.

> Without any explanation,
> this seems broken and potentially dangerous, as some transaction commit
> bits might become lost given high enough concurrency and bad luck.
>
> Maybe this can be made to work by having one more lwlock that we use
> solely to coordinate this task.

Do you mean a different lock for adding/removing entries in the list instead of the atomic operation? I think then we would lose the benefit we got in the group update, by having contention on another lock.

[1] I think we already know about the exception case, and I have explained in the comments as well that in some cases we might add different clog page update requests in the same group, and for handling that exceptional case we are checking the respective bank lock for each page and if that
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
[Added Andrey again in CC, because as I understand they are using this code or something like it in production. Please don't randomly remove people from CC lists.] I've been looking at this some more, and I'm not confident in that the group clog update stuff is correct. I think the injection points test case was good enough to discover a problem, but it's hard to get peace of mind that there aren't other, more subtle problems. The problem I see is that the group update mechanism is designed around contention of the global xact-SLRU control lock; it uses atomics to coordinate a single queue when the lock is contended. So if we split up the global SLRU control lock using banks, then multiple processes using different bank locks might not contend. OK, this is fine, but what happens if two separate groups of processes encounter contention on two different bank locks? I think they will both try to update the same queue, and coordinate access to that *using different bank locks*. I don't see how can this work correctly. I suspect the first part of that algorithm, where atomics are used to create the list without a lock, might work fine. But will each "leader" process, each of which is potentially using a different bank lock, coordinate correctly? Maybe this works correctly because only one process will find the queue head not empty? If this is what happens, then there needs to be comments about it. Without any explanation, this seems broken and potentially dangerous, as some transaction commit bits might become lost given high enough concurrency and bad luck. Maybe this can be made to work by having one more lwlock that we use solely to coordinate this task. Though we would have to demonstrate that coordinating this task with a different lock works correctly in conjunction with the per-bank lwlock usage in the regular slru.c paths. Andrey, do you have any stress tests or anything else that you used to gain confidence in this code? 
-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera wrote: > > On 2023-Nov-29, tender wang wrote: > > > The v8-0001 patch failed to apply in my local repo as below: > > > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > > error: patch failed: src/backend/access/transam/multixact.c:1851 > > error: src/backend/access/transam/multixact.c: patch does not apply > > error: patch failed: src/backend/access/transam/subtrans.c:184 > > error: src/backend/access/transam/subtrans.c: patch does not apply > > error: patch failed: src/backend/commands/async.c:117 > > error: src/backend/commands/async.c: patch does not apply > > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > > error: src/backend/storage/lmgr/predicate.c: patch does not apply > > error: patch failed: src/include/commands/async.h:15 > > error: src/include/commands/async.h: patch does not apply > > Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. I will send a rebased version by tomorrow. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2023-Nov-29, tender wang wrote: > The v8-0001 patch failed to apply in my local repo as below: > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > error: patch failed: src/backend/access/transam/multixact.c:1851 > error: src/backend/access/transam/multixact.c: patch does not apply > error: patch failed: src/backend/access/transam/subtrans.c:184 > error: src/backend/access/transam/subtrans.c: patch does not apply > error: patch failed: src/backend/commands/async.c:117 > error: src/backend/commands/async.c: patch does not apply > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > error: src/backend/storage/lmgr/predicate.c: patch does not apply > error: patch failed: src/include/commands/async.h:15 > error: src/include/commands/async.h: patch does not apply Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ Syntax error: function hell() needs an argument. Please choose what hell you want to involve.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
The v8-0001 patch failed to apply in my local repo as below:

git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
error: patch failed: src/backend/access/transam/multixact.c:1851
error: src/backend/access/transam/multixact.c: patch does not apply
error: patch failed: src/backend/access/transam/subtrans.c:184
error: src/backend/access/transam/subtrans.c: patch does not apply
error: patch failed: src/backend/commands/async.c:117
error: src/backend/commands/async.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/predicate.c:808
error: src/backend/storage/lmgr/predicate.c: patch does not apply
error: patch failed: src/include/commands/async.h:15
error: src/include/commands/async.h: patch does not apply

My local head commit is 15c9ac36299. Is there something I missed?

Dilip Kumar wrote on Friday, 24 November 2023 at 17:08:
> On Fri, Nov 24, 2023 at 10:17 AM Dilip Kumar wrote:
> >
> > On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar wrote:
> > >
> > > Note: With this testing, we have found a bug in the bank-wise
> > > approach, basically we are clearing a procglobal->clogGroupFirst, even
> > > before acquiring the bank lock that means in most of the cases there
> > > will be a single process in each group as a group leader
> >
> > I realized that the bug fix I have done is not proper, so will send
> > the updated patch set with the proper fix soon.
>
> PFA, updated patch set fixes the bug found during the testing of the
> group update using the injection point. Also attached a patch to test
> the injection point, but for that we need to apply the injection point
> patches [1]
>
> [1] https://www.postgresql.org/message-id/ZWACtHPetBFIvP61%40paquier.xyz
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar wrote: > > Note: With this testing, we have found a bug in the bank-wise > approach, basically we are clearing a procglobal->clogGroupFirst, even > before acquiring the bank lock that means in most of the cases there > will be a single process in each group as a group leader I realized that the bug fix I have done is not proper, so will send the updated patch set with the proper fix soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Nov 21, 2023 at 2:03 PM Dilip Kumar wrote: > > On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar wrote: > > > > On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin > > wrote: > > > > > > On 20 Nov 2023, at 13:51, Dilip Kumar wrote: > > > > > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > > > > > IMHO if we use the same lwlock tranche then the wait event will show > > > > the same wait event name, right? And that would be confusing for the > > > > user, whether we are waiting for Subtransaction or Multixact or > > > > anything else. Is my understanding no correct here? > > > > > > If we give to a user multiple GUCs to tweak, I think we should give a way > > > to understand which GUC to tweak when they observe wait times. > > PFA, updated patch set, I have worked on review comments by Alvaro and > Andrey. So the only open comments are about clog group commit > testing, for that my question was as I sent in the previous email > exactly what part we are worried about in the coverage report? > > The second point is, if we want to generate a group update we will > have to create the injection point after we hold the control lock so > that other processes go for group update and then for waking up the > waiting process who is holding the SLRU control lock in the exclusive > mode we would need to call a function ('test_injection_points_wake()') > to wake that up and for calling the function we would need to again > acquire the SLRU lock in read mode for visibility check in the catalog > for fetching the procedure row and now this wake up session will block > on control lock for the session which is waiting on injection point so > now it will create a deadlock. 
> Maybe with the bank-wise lock we can
> create a lot of transactions so that these 2 fall in different banks
> and then we can somehow test this, but then we will have to generate
> 16 * 4096 = 64k transactions so that the SLRU banks are different for
> the transaction which inserted the procedure row in the system table and
> the transaction in which we are trying to do the group commit.

I have attached a POC patch for testing the group update using the injection point framework. This is just for testing the group update part and is not yet a committable test. I have added a bunch of logs in the code so that we can see what's going on with the group update. From the below logs, we can see that multiple processes are getting accumulated for the group update and the leader is updating their xid status.

Note: With this testing, we have found a bug in the bank-wise approach: basically, we are clearing procglobal->clogGroupFirst even before acquiring the bank lock, which means in most cases there will be a single process in each group as a group leader (I think this is what Alvaro was pointing at in his coverage report). I have added this fix in this POC just for testing purposes, but in my next version I will add the fix to my proper patch version after a proper review and a bit more testing.
here is the output after running the test == 2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl LOG: procno 6 got the lock 2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl STATEMENT: SELECT txid_current(); 2023-11-23 05:55:29.406 UTC [93369] 003_clog_group_commit.pl LOG: statement: SELECT test_injection_points_attach('ClogGroupCommit', 'wait'); 2023-11-23 05:55:29.415 UTC [93371] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(1); 2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl LOG: procno 4 got the lock 2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(1); 2023-11-23 05:55:29.424 UTC [93373] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(2); 2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl LOG: procno 3 for xid 128742 added for group update 2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(2); 2023-11-23 05:55:29.431 UTC [93376] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.438 UTC [93378] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG: procno 2 for xid 128743 added for group update 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG: procno 2 is follower and wait for group leader to update commit status of xid 128743 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG: procno 1 for xid 128744 added for group update 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG: procno 1 is
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin wrote: > > On 20 Nov 2023, at 13:51, Dilip Kumar wrote: > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > IMHO if we use the same lwlock tranche then the wait event will show > > the same wait event name, right? And that would be confusing for the > > user, whether we are waiting for Subtransaction or Multixact or > > anything else. Is my understanding no correct here? > > If we give to a user multiple GUCs to tweak, I think we should give a way to > understand which GUC to tweak when they observe wait times. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 20 Nov 2023, at 13:51, Dilip Kumar wrote: > > 2) Do we really need one separate lwlock tranche for each SLRU? > > IMHO if we use the same lwlock tranche then the wait event will show > the same wait event name, right? And that would be confusing for the > user, whether we are waiting for Subtransaction or Multixact or > anything else. Is my understanding no correct here? If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when they observe wait times. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Nov 17, 2023 at 7:28 PM Alvaro Herrera wrote:
>
> On 2023-Nov-17, Dilip Kumar wrote:

I think I need some more clarification for some of the review comments.

> > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera wrote:
> > >
> > > I just noticed that 0003 does some changes to
> > > TransactionGroupUpdateXidStatus() that haven't been adequately
> > > explained AFAICS. How do you know that these changes are safe?
> >
> > IMHO this is safe as well as logical to do w.r.t. performance. It's
> > safe because whenever we are updating any page in a group we are
> > acquiring the respective bank lock in exclusive mode and in extreme
> > cases if there are pages from different banks then we do switch the
> > lock as well before updating the pages from different groups.
>
> Looking at the coverage for this code,
> https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> it seems in our test suites we never hit the case where there is
> anything in the "nextidx" field for commit groups.

1) I was looking into your coverage report and I have attached a screenshot from the same; it seems we do hit the block where nextidx is not INVALID_PGPROCNO, which means there is some other process besides the group leader. Although I have already started exploring the injection points, I just wanted to be sure first what your main concern about the coverage is.

 470           :     /*
 471           :      * If the list was not empty, the leader will update the status of our
 472           :      * XID. It is impossible to have followers without a leader because the
 473           :      * first process that has added itself to the list will always have
 474           :      * nextidx as INVALID_PGPROCNO.
 475           :      */
 476        98 :     if (nextidx != INVALID_PGPROCNO)
 477           :     {
 478         2 :         int extraWaits = 0;
 479           :
 480           :         /* Sleep until the leader updates our XID status. */
 481         2 :         pgstat_report_wait_start(WAIT_EVENT_XACT_GROUP_UPDATE);
 482           :         for (;;)
 483           :         {
 484           :             /* acts as a read barrier */
 485         2 :             PGSemaphoreLock(proc->sem);
 486         2 :             if (!proc->clogGroupMember)
 487         2 :                 break;
 488         0 :             extraWaits++;
 489           :         }

2) Do we really need one separate lwlock tranche for each SLRU?

IMHO if we use the same lwlock tranche then the wait event will show the same wait event name, right? And that would be confusing for the user, whether we are waiting for Subtransaction or Multixact or anything else. Is my understanding not correct here?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sun, Nov 19, 2023 at 12:39 PM Andrey M. Borodin wrote:
>
> I’ve skimmed through the patch set. Here are some minor notes.

Thanks for the review.

> 1. Cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in
> SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical
> comments. I think a little copy-paste is OK.
> But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while
> SlruSelectLRUPage() does not. This is not related to the patch set, just
> code nearby.

Do you mean to say we need to modify the comments, or are you saying pgstat_count_slru_page_hit() is missing in SlruSelectLRUPage()? If it is the latter, then I can see the caller of SlruSelectLRUPage() is calling pgstat_count_slru_page_hit() and SlruRecentlyUsed().

> 2. Do we really want these functions doing all the same?
> extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
> extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
> extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
> extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
> extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
> extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
> extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);

I tried to de-duplicate these by doing all the work inside the check_slru_buffers() function. But I think it is hard to make them a single function, because there is no option to pass an SLRU name in the GUC check hook, and IMHO in the check hook we need to print the GUC name. Any suggestions on how we can avoid having so many functions?

> 3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice.
> I’d suggest truncating the prefix or the infix.
>
> I do not have a hard opinion on any of these items.
I prefer SimpleLruGetBankLock() so that it is consistent with other external functions starting with "SimpleLruGet"; are you fine with this name?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 17 Nov 2023, at 16:11, Dilip Kumar wrote:
>
> On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar wrote:
> >
> > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera wrote:
>
> PFA, updated patch version, this fixes the comment given by Alvaro and
> also improves some of the comments.

I’ve skimmed through the patch set. Here are some minor notes.

1. Cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical comments. I think a little copy-paste is OK. But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not related to the patch set, just code nearby.

2. Do we really want these functions doing all the same?

extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);

3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice. I’d suggest truncating the prefix or the infix.

I do not have a hard opinion on any of these items.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2023-Nov-18, Dilip Kumar wrote: > On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera > wrote: > > I wonder what's the deal with false sharing in the new > > bank_cur_lru_count array. Maybe instead of using LWLockPadded for > > bank_locks, we should have a new struct, with both the LWLock and the > > LRU counter; then pad *that* to the cacheline size. This way, both the > > lwlock and the counter come to the CPU running this code together. > > Actually, the array lengths of both LWLock and the LRU counter are > different so I don't think we can move them to a common structure. > The length of the *buffer_locks array is equal to the number of slots, > the length of the *bank_locks array is Min (number_of_banks, 128), and > the length of the *bank_cur_lru_count array is number_of_banks. Oh. > > Looking at the coverage for this code, > > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > > it seems in our test suites we never hit the case where there is > > anything in the "nextidx" field for commit groups. To be honest, I > > don't understand this group stuff, and so I'm doubly hesitant to go > > without any testing here. Maybe it'd be possible to use Michael > > Paquier's injection points somehow? > > Sorry, but I am not aware of "Michael Paquier's injection" Is it > something already in the repo? Can you redirect me to some of the > example test cases if we already have them? Then I will try it out. https://postgr.es/zvwufo_ykztjh...@paquier.xyz -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera wrote: Thanks for the review; all comments look fine to me. I am replying to those that need some clarification. > I wonder what's the deal with false sharing in the new > bank_cur_lru_count array. Maybe instead of using LWLockPadded for > bank_locks, we should have a new struct, with both the LWLock and the > LRU counter; then pad *that* to the cacheline size. This way, both the > lwlock and the counter come to the CPU running this code together. Actually, the array lengths of the LWLocks and the LRU counters are different, so I don't think we can move them to a common structure. The length of the *buffer_locks array is equal to the number of slots, the length of the *bank_locks array is Min(number_of_banks, 128), and the length of the *bank_cur_lru_count array is number_of_banks. > Looking at the coverage for this code, > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > it seems in our test suites we never hit the case where there is > anything in the "nextidx" field for commit groups. To be honest, I > don't understand this group stuff, and so I'm doubly hesitant to go > without any testing here. Maybe it'd be possible to use Michael > Paquier's injection points somehow? Sorry, but I am not aware of "Michael Paquier's injection points." Are they something already in the repo? Can you redirect me to some example test cases if we already have them? Then I will try them out. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2023-Nov-17, Dilip Kumar wrote: > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera > wrote: > > > > I just noticed that 0003 does some changes to > > TransactionGroupUpdateXidStatus() that haven't been adequately > > explained AFAICS. How do you know that these changes are safe? > > IMHO this is safe as well as logical to do w.r.t. performance. It's > safe because whenever we are updating any page in a group we are > acquiring the respective bank lock in exclusive mode and in extreme > cases if there are pages from different banks then we do switch the > lock as well before updating the pages from different groups. Looking at the coverage for this code, https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 it seems in our test suites we never hit the case where there is anything in the "nextidx" field for commit groups. To be honest, I don't understand this group stuff, and so I'm doubly hesitant to go without any testing here. Maybe it'd be possible to use Michael Paquier's injection points somehow? I think in the code comments where you use "w.r.t.", that acronym can be replaced with "for", which improves readability. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
In SlruSharedData, a new comment is added that starts: "Instead of global counter we maintain a bank-wise lru counter because ..." You don't need to explain what we don't do. Just explain what we do do. So remove the words "Instead of a global counter" from there, because they offer no wisdom. Same with the phrase "so there is no point to ...". I think "The oldest page is therefore" should say "The oldest page *in the bank* is therefore", for extra clarity. I wonder what's the deal with false sharing in the new bank_cur_lru_count array. Maybe instead of using LWLockPadded for bank_locks, we should have a new struct, with both the LWLock and the LRU counter; then pad *that* to the cacheline size. This way, both the lwlock and the counter come to the CPU running this code together. Looking at SlruRecentlyUsed, which was a macro and is now a function. The comment about "multiple evaluation of arguments" no longer applies, so it needs to be removed; and it shouldn't talk about itself being a macro. Using "Size" as type for bank_mask looks odd. For a bitmask, maybe it'd be more appropriate to use bits64 if we do need a 64-bit mask (we don't have bits64, but it's easy to add a typedef). I bet we don't really need a 64-bit mask, and a 32-bit or even a 16-bit one is sufficient, given the other limitations on the number of buffers. I think SimpleLruReadPage should have this assert at start: + Assert(LWLockHeldByMe(SimpleLruGetSLRUBankLock(ctl, pageno))); Do we really need one separate lwlock tranche for each SLRU? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera wrote: > > I just noticed that 0003 does some changes to > TransactionGroupUpdateXidStatus() that haven't been adequately > explained AFAICS. How do you know that these changes are safe? IMHO this is safe, and logical from a performance point of view. It's safe because whenever we update any page in a group we acquire the respective bank lock in exclusive mode, and in the extreme case where the group contains pages from different banks, we switch the lock before updating pages from a different bank. And we do not wake any process in a group until we have done the status update for all of them, so there cannot be any race condition either. It should not adversely affect performance, and it does not remove the need for group updates. The main use case of the group update is to optimize the situation where most processes are contending for a status update on the same page; processes waiting for status updates on different pages will go into different groups. In short, on a best-effort basis we try to have each group contain the processes waiting to update the same clog page, which means logically all the processes in the group will be waiting on the same bank lock. In the extreme situation where processes in a group are trying to update different pages, or even pages from different banks, we handle it by switching the lock. Someone may raise the concern that, when there are processes waiting for different bank locks, we could wake those processes up after releasing one lock; I think that is not required, because that is exactly the situation we are trying to avoid (processes updating different pages ending up in the same group), so there is no point in adding complexity to optimize that case. > 0001 contains one typo in the docs, "cotents". 
> > I'm not a fan of the fact that some CLOG sizing macros moved to clog.h, > leaving others in clog.c. Maybe add commentary cross-linking both. > Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to > the slru.h-defined limit of 131072 is not that bad, even if it's more > than could possibly be needed for xact_buffers; nobody is going to use > 64k buffers, since useful values are below a couple thousand anyhow. I agree that allowing xact_buffers to grow beyond 65536 up to the slru.h-defined limit of 131072 is not that bad, so I will change that in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
I just noticed that 0003 does some changes to TransactionGroupUpdateXidStatus() that haven't been adequately explained AFAICS. How do you know that these changes are safe? 0001 contains one typo in the docs, "cotents". I'm not a fan of the fact that some CLOG sizing macros moved to clog.h, leaving others in clog.c. Maybe add commentary cross-linking both. Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to the slru.h-defined limit of 131072 is not that bad, even if it's more than could possibly be needed for xact_buffers; nobody is going to use 64k buffers, since useful values are below a couple thousand anyhow. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ Tom: There seems to be something broken here. Teodor: I'm in sackcloth and ashes... Fixed. http://postgr.es/m/482d1632.8010...@sigaev.ru
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Nov 10, 2023 at 10:17:49AM +0530, Dilip Kumar wrote: > On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera wrote: >> The only point on which we do not have full consensus yet is the need to >> have one GUC per SLRU, and a lot of effort seems focused on trying to >> fix the problem without adding so many GUCs (for example, using shared >> buffers instead, or use a single "scaling" GUC). I think that hinders >> progress. Let's just add multiple GUCs, and users can leave most of >> them alone and only adjust the one with which they have a performance >> problems; it's not going to be the same one for everybody. > > +1 +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera wrote: > > IMO the whole area of SLRU buffering is in horrible shape and many users > are struggling with overall PG performance because of it. An > improvement doesn't have to be perfect -- it just has to be much better > than the current situation, which should be easy enough. We can > continue to improve later, using more scalable algorithms or ones that > allow us to raise the limits higher. I agree with this. > The only point on which we do not have full consensus yet is the need to > have one GUC per SLRU, and a lot of effort seems focused on trying to > fix the problem without adding so many GUCs (for example, using shared > buffers instead, or use a single "scaling" GUC). I think that hinders > progress. Let's just add multiple GUCs, and users can leave most of > them alone and only adjust the one with which they have a performance > problems; it's not going to be the same one for everybody. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Nov 9, 2023 at 9:39 PM Robert Haas wrote: > > On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar wrote: > > Here is the updated version of the patch, here I have taken the > > approach suggested by Andrey and I discussed the same with Alvaro > > offlist and he also agrees with it. So the idea is that we will keep > > the bank size fixed which is 16 buffers per bank and the allowed GUC > > value for each slru buffer must be in multiple of the bank size. We > > have removed the centralized lock but instead of one lock per bank, we > > have kept the maximum limit on the number of bank locks which is 128. > > We kept the max limit as 128 because, in one of the operations (i.e. > > ActivateCommitTs), we need to acquire all the bank locks (but this is > > not a performance path at all) and at a time we can acquire a max of > > 200 LWlocks, so we think this limit of 128 is good. So now if the > > number of banks is <= 128 then we will be using one lock per bank > > otherwise the one lock may protect access of buffer in multiple banks. > > Just so I understand, I guess this means that an SLRU is limited to > 16*128 = 2k buffers = 16MB? Not really, because 128 is the maximum limit on the number of bank locks, not on the number of banks. So for example, if you have 16*128 = 2k buffers, then each lock will protect one bank; likewise, when you have 16*512 = 8k buffers, each lock will protect 4 banks. So in short, we can get the lock for each bank by a simple computation (banklockno = bankno % 128). > When we were talking about this earlier, I suggested fixing the number > of banks and allowing the number of buffers per bank to scale > depending on the setting. That seemed simpler than allowing both the > number of banks and the number of buffers to vary, and it might allow > the compiler to optimize some code better, by converting a calculation > like page_no%number_of_banks into a masking operation like page_no&0xf > or whatever. 
However, because it allows an individual bank to become > arbitrarily large, it more or less requires us to use a buffer mapping > table. Some of the performance problems mentioned could be alleviated > by omitting the hash table when the number of buffers per bank is > small, and we could also create the dynahash with a custom hash > function that just does modular arithmetic on the page number rather > than a real hashing operation. However, maybe we don't really need to > do any of that. I agree that dynahash is clunky on a good day. I > hadn't realized the impact would be so noticeable. Yes, one idea is to keep the number of banks fixed; with that, as you correctly pointed out, with a large number of buffers the bank size can get quite big and we would need a hash table for it. OTOH, what I am doing here is keeping the bank size fixed and small (16 buffers per bank), so we can have a large number of banks when the buffer pool is quite big. But I feel having more banks is not really a problem, as long as we do not grow the number of locks beyond a certain limit, since in some corner cases we need to acquire all the locks together and there is a cap on how many we can hold at once. So I like this idea of sharing locks across the banks: with that, 1) we can have enough locks that lock contention or cache invalidation due to a common lock should no longer be a problem, 2) we can keep a small bank size so that the sequential search within a bank is quite fast, making reads fast, and 3) with a small bank size the victim-buffer search, which has to be sequential, is also quite fast. > This proposal takes the opposite approach of fixing the number of > buffers per bank, letting the number of banks vary. I think that's > probably fine, although it does reduce the effective associativity of > the cache. If there are more hot buffers in a bank than the bank size, > the bank will be contended, even if other banks are cold. 
However, > given the way SLRUs are accessed, it seems hard to imagine this being > a real problem in practice. There aren't likely to be say 20 hot > buffers that just so happen to all be separated from one another by a > number of pages that is a multiple of the configured number of banks. > And in the seemingly very unlikely event that you have a workload that > behaves like that, you could always adjust the number of banks up or > down by one, and the problem would go away. So this seems OK to me. I agree with this > I also agree with a couple of points that Alvaro made, specifically > that (1) this doesn't have to be perfect, just better than now and (2) > separate GUCs for each SLRU is fine. On the latter point, it's worth > keeping in mind that the cost of a GUC that most people don't need to > tune is fairly low. GUCs like work_mem and shared_buffers are > "expensive" because everybody more or less needs to understand what > they are and how to set them and getting the right value can tricky -- > but a GUC like autovacuum_naptime is a lot cheaper, because almost > nobody needs to change
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar wrote: > Here is the updated version of the patch, here I have taken the > approach suggested by Andrey and I discussed the same with Alvaro > offlist and he also agrees with it. So the idea is that we will keep > the bank size fixed which is 16 buffers per bank and the allowed GUC > value for each slru buffer must be in multiple of the bank size. We > have removed the centralized lock but instead of one lock per bank, we > have kept the maximum limit on the number of bank locks which is 128. > We kept the max limit as 128 because, in one of the operations (i.e. > ActivateCommitTs), we need to acquire all the bank locks (but this is > not a performance path at all) and at a time we can acquire a max of > 200 LWlocks, so we think this limit of 128 is good. So now if the > number of banks is <= 128 then we will be using one lock per bank > otherwise the one lock may protect access of buffer in multiple banks. Just so I understand, I guess this means that an SLRU is limited to 16*128 = 2k buffers = 16MB? When we were talking about this earlier, I suggested fixing the number of banks and allowing the number of buffers per bank to scale depending on the setting. That seemed simpler than allowing both the number of banks and the number of buffers to vary, and it might allow the compiler to optimize some code better, by converting a calculation like page_no%number_of_banks into a masking operation like page_no&0xf or whatever. However, because it allows an individual bank to become arbitrarily large, it more or less requires us to use a buffer mapping table. Some of the performance problems mentioned could be alleviated by omitting the hash table when the number of buffers per bank is small, and we could also create the dynahash with a custom hash function that just does modular arithmetic on the page number rather than a real hashing operation. However, maybe we don't really need to do any of that. 
I agree that dynahash is clunky on a good day. I hadn't realized the impact would be so noticeable. This proposal takes the opposite approach of fixing the number of buffers per bank, letting the number of banks vary. I think that's probably fine, although it does reduce the effective associativity of the cache. If there are more hot buffers in a bank than the bank size, the bank will be contended, even if other banks are cold. However, given the way SLRUs are accessed, it seems hard to imagine this being a real problem in practice. There aren't likely to be say 20 hot buffers that just so happen to all be separated from one another by a number of pages that is a multiple of the configured number of banks. And in the seemingly very unlikely event that you have a workload that behaves like that, you could always adjust the number of banks up or down by one, and the problem would go away. So this seems OK to me. I also agree with a couple of points that Alvaro made, specifically that (1) this doesn't have to be perfect, just better than now and (2) separate GUCs for each SLRU is fine. On the latter point, it's worth keeping in mind that the cost of a GUC that most people don't need to tune is fairly low. GUCs like work_mem and shared_buffers are "expensive" because everybody more or less needs to understand what they are and how to set them and getting the right value can be tricky -- but a GUC like autovacuum_naptime is a lot cheaper, because almost nobody needs to change it. It seems to me that these GUCs will fall into the latter category. Users can hopefully just ignore them except if they see contention on the SLRU bank locks -- and then they can consider increasing the number of banks for that particular SLRU. That seems simple enough. 
As with autovacuum_naptime, there is a danger that people will configure a ridiculous value of the parameter for no good reason and get bad results, so it would be nice if someday we had a magical system that just got all of this right without the user needing to configure anything. But in the meantime, it's better to have a somewhat manual system to relieve pressure on these locks than no system at all. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
IMO the whole area of SLRU buffering is in horrible shape and many users are struggling with overall PG performance because of it. An improvement doesn't have to be perfect -- it just has to be much better than the current situation, which should be easy enough. We can continue to improve later, using more scalable algorithms or ones that allow us to raise the limits higher. The only point on which we do not have full consensus yet is the need to have one GUC per SLRU, and a lot of effort seems focused on trying to fix the problem without adding so many GUCs (for example, using shared buffers instead, or using a single "scaling" GUC). I think that hinders progress. Let's just add multiple GUCs, and users can leave most of them alone and only adjust the one with which they have a performance problem; it's not going to be the same one for everybody. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 8 Nov 2023, at 14:17, Ants Aasma wrote: > > Is there a particular reason why lock partitions need to be bigger? We have > one lock per buffer anyway, bankwise locks will increase the number of locks > < 10%. The problem had not been attracting much attention for some years, so my reasoning was that the solution should not have any cost at all. The initial patchset with banks did not add any memory footprint. > On 8 Nov 2023, at 14:17, Ants Aasma wrote: > > I am working on trying out a SIMD based LRU mechanism that uses a 16 entry > bank. FWIW I tried packing struct parts together to minimize the cache lines touched; see step 3 in [0]. So far I could not prove any performance benefit of this approach, but maybe your implementation will be more efficient. Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/93236d36-b91c-4dfa-af03-99c083840...@yandex-team.ru
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sat, 4 Nov 2023 at 22:08, Andrey M. Borodin wrote: > On 30 Oct 2023, at 09:20, Dilip Kumar wrote: > > changed the logic of SlruAdjustNSlots() in 0002, such that now it > starts with the next power of 2 value of the configured slots and > keeps doubling the number of banks until we reach the number of banks > to the max SLRU_MAX_BANKS(128) and bank size is bigger than > SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too > many banks > > There was nothing wrong with having too many banks. Until bank-wise locks > and counters were added in later patchsets. > Having hashtable to find SLRU page in the buffer IMV is too slow. Some > comments on this approach can be found here [0]. > I'm OK with having HTAB for that if we are sure performance does not > degrade significantly, but I really doubt this is the case. > I even think SLRU buffers used HTAB in some ancient times, but I could not > find commit when it was changed to linear search. > > Maybe we could decouple locks and counters from SLRU banks? Banks were > meant to be small to exploit performance of local linear search. Lock > partitions have to be bigger for sure. > Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway; bankwise locks will increase the number of locks by less than 10%. I am working on trying out a SIMD-based LRU mechanism that uses a 16-entry bank. The data layout is: struct CacheBank { int page_numbers[16]; char access_age[16]; } The first part uses up one cache line, and the second line has 48 bytes of space left over that could fit an lwlock and the page_status and page_dirty arrays. Lookup + LRU maintenance takes 20 instructions with 14 cycles of latency, and the only branch is for found/not found. Hoping to have a working prototype of SLRU on top in the next couple of days. Regards, Ants Aasma
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul wrote: Thanks for the review, Amul. > Here are some minor comments: > > + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the > + * maximum value that slru.c will allow, but always at least 16 buffers. > */ > Size > CommitTsShmemBuffers(void) > { > - return Min(256, Max(4, NBuffers / 256)); > + /* Use configured value if provided. */ > + if (commit_ts_buffers > 0) > + return Max(16, commit_ts_buffers); > + return Min(256, Max(16, NBuffers / 256)); > > Do you mean "4MB of for every 1GB" in the comment? You are right. > -- > > diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h > index 5087cdce51..78d017ad85 100644 > --- a/src/include/access/commit_ts.h > +++ b/src/include/access/commit_ts.h > @@ -16,7 +16,6 @@ > #include "replication/origin.h" > #include "storage/sync.h" > > - > extern PGDLLIMPORT bool track_commit_timestamp; > > A spurious change. Will fix. > -- > > @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns) > > if (nlsns > 0) > sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));/* > group_lsn[] */ > - > return BUFFERALIGN(sz) + BLCKSZ * nslots; > } > > Another spurious change in 0002 patch. Will fix. > -- > > +/* > + * The slru buffer mapping table is partitioned to reduce contention. To > + * determine which partition lock a given pageno requires, compute the > pageno's > + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock(). > + */ > > I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere > in > your patches, is that outdated comment? Yes, I will fix it; actually, there are some major design changes in this area. 
> -- > > - sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */ > - sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* > part_locks[] */ > + sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); > /* locks[] */ > > I am a bit uncomfortable with these changes, merging parts and buffer locks > making it hard to understand the code. Not sure what we were getting out of > this? Yes, even I do not like this much because it is confusing. But the advantage is that we are using a single pointer for the locks, which means the next variable for the LRU counter will land in the same cache line, so frequent updates of the LRU counter should benefit from this. Although I don't have any numbers proving this. Currently, I want to focus on the base patches and keep this patch as an add-on; later, if we find it useful and want to pursue it, we will see how to make it more readable. > > Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of numbe of > partitions > > I think the 0005 patch can be merged to 0001. Yeah, in the next version it is done that way. Planning to post by end of day. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com