Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Apr-03, Dilip Kumar wrote:

> Yeah, we missed acquiring the bank lock w.r.t. intervening pages,
> thanks for reporting.  Your fix looks correct to me.

Thanks for the quick review!  And thanks to Alexander for the report.
Pushed the fix.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"No hay hombre que no aspire a la plenitud, es decir,
la suma de experiencias de que un hombre es capaz"
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Apr 3, 2024 at 7:40 PM Alvaro Herrera wrote:
>
> Hello,
>
> On 2024-Apr-03, Alexander Lakhin wrote:
>
> > I've managed to trigger an assert added by 53c2a97a9.
> > Please try the following script against a server compiled with
> > -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> > define, it just simplifies reproducing...):
>
> Ah yes, absolutely, we're missing to trade the correct SLRU bank lock
> there.  This rewrite of that small piece should fix it.  Thanks for
> reporting this.

Yeah, we missed acquiring the bank lock w.r.t. intervening pages,
thanks for reporting.  Your fix looks correct to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Hello,

On 2024-Apr-03, Alexander Lakhin wrote:

> I've managed to trigger an assert added by 53c2a97a9.
> Please try the following script against a server compiled with
> -DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
> define, it just simplifies reproducing...):

Ah yes, absolutely, we're missing to trade the correct SLRU bank lock
there.  This rewrite of that small piece should fix it.  Thanks for
reporting this.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)

From 44c39cf4bf258fb0b65aa1acc5f84e5d7f729eb1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Wed, 3 Apr 2024 16:00:24 +0200
Subject: [PATCH] Fix zeroing of pg_serial page without SLRU bank lock

---
 src/backend/storage/lmgr/predicate.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 3f378c0099..d5bbfbd4c6 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -137,7 +137,7 @@
  *	SerialControlLock
  *		- Protects SerialControlData members
  *
- *	SerialSLRULock
+ *	SLRU per-bank locks
  *		- Protects SerialSlruCtl
  *
  * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
@@ -908,20 +908,25 @@ SerialAdd(TransactionId xid, SerCommitSeqNo minConflictCommitSeqNo)
 	if (isNewPage)
 		serialControl->headPage = targetPage;
 
-	LWLockAcquire(lock, LW_EXCLUSIVE);
-
 	if (isNewPage)
 	{
-		/* Initialize intervening pages. */
-		while (firstZeroPage != targetPage)
+		/* Initialize intervening pages; might involve trading locks */
+		for (;;)
 		{
-			(void) SimpleLruZeroPage(SerialSlruCtl, firstZeroPage);
+			lock = SimpleLruGetBankLock(SerialSlruCtl, firstZeroPage);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
+			slotno = SimpleLruZeroPage(SerialSlruCtl, firstZeroPage);
+			if (firstZeroPage == targetPage)
+				break;
 			firstZeroPage = SerialNextPage(firstZeroPage);
+			LWLockRelease(lock);
 		}
-		slotno = SimpleLruZeroPage(SerialSlruCtl, targetPage);
 	}
 	else
+	{
+		LWLockAcquire(lock, LW_EXCLUSIVE);
 		slotno = SimpleLruReadPage(SerialSlruCtl, targetPage, true, xid);
+	}
 
 	SerialValue(slotno, xid) = minConflictCommitSeqNo;
 	SerialSlruCtl->shared->page_dirty[slotno] = true;
-- 
2.39.2
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Hello Alvaro,

27.02.2024 20:33, Alvaro Herrera wrote:
> Here's the complete set, with these two names using the singular.

I've managed to trigger an assert added by 53c2a97a9.
Please try the following script against a server compiled with
-DTEST_SUMMARIZE_SERIAL (initially I observed this failure without the
define, it just simplifies reproducing...):

# initdb & start ...
createdb test
echo "
SELECT pg_current_xact_id() AS tx \gset
SELECT format('CREATE TABLE t%s(i int)', g) FROM generate_series(1, 1022 - :tx) g \gexec
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT pg_current_xact_id();
SELECT pg_sleep(5);
" | psql test &

echo "
SELECT pg_sleep(1);
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 1 INTO a;
COMMIT;
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT 2 INTO b;
" | psql test

It fails for me with the following stack trace:
TRAP: failed Assert("LWLockHeldByMeInMode(SimpleLruGetBankLock(ctl, pageno), LW_EXCLUSIVE)"), File: "slru.c", Line: 366, PID: 21711
ExceptionalCondition at assert.c:52:13
SimpleLruZeroPage at slru.c:369:11
SerialAdd at predicate.c:921:20
SummarizeOldestCommittedSxact at predicate.c:1521:2
GetSerializableTransactionSnapshotInt at predicate.c:1787:16
GetSerializableTransactionSnapshot at predicate.c:1691:1
GetTransactionSnapshot at snapmgr.c:264:21
exec_simple_query at postgres.c:1162:4
...

Best regards,
Alexander
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Mar-04, Tom Lane wrote:

> In hopes of noticing whether there are other similar thinkos,
> I permuted the order of the SlruPageStatus enum values, and
> now I get the expected warnings from gcc:

Thanks for checking!  I pushed the fixes.

Maybe we should assign a nonzero value (= 1) to the first element of
enums, to avoid this kind of mistake.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
I wrote:
> It looks like SLRU_PAGE_EMPTY has (by chance, or deliberately)
> the numeric value of zero, so I guess the majority of our BF
> animals are understanding this as "address != NULL".  But that
> doesn't look like a useful test to be making.

In hopes of noticing whether there are other similar thinkos,
I permuted the order of the SlruPageStatus enum values, and
now I get the expected warnings from gcc:

In file included from ../../../../src/include/postgres.h:45,
                 from slru.c:59:
slru.c: In function ‘SimpleLruWaitIO’:
slru.c:436:38: warning: comparison between pointer and integer
  Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
                                      ^~
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
  if (!(condition)) \
        ^
slru.c: In function ‘SimpleLruWritePage’:
slru.c:717:43: warning: comparison between pointer and integer
  Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);
                                           ^~
../../../../src/include/c.h:862:9: note: in definition of macro ‘Assert’
  if (!(condition)) \
        ^

So it looks like it's just these two places.

regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Alvaro Herrera writes:
> Pushed that way, but we can discuss further wording improvements/changes
> if someone wants to propose any.

I just noticed that drongo is complaining about two lines added by
commit 53c2a97a9:

drongo | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(436): warning C4047: '!=': 'SlruPageStatus *' differs in levels of indirection from 'int'
drongo | 2024-03-04 14:34:52 | ../pgsql/src/backend/access/transam/slru.c(717): warning C4047: '!=': 'SlruPageStatus *' differs in levels of indirection from 'int'

These lines are

  Assert(&shared->page_status[slotno] != SLRU_PAGE_EMPTY);
  Assert(&ctl->shared->page_status[slotno] != SLRU_PAGE_EMPTY);

These are comparing the address of something with an enum value, which
surely cannot be sane.  It looks like SLRU_PAGE_EMPTY has (by chance,
or deliberately) the numeric value of zero, so I guess the majority of
our BF animals are understanding this as "address != NULL".  But that
doesn't look like a useful test to be making.

regards, tom lane
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-29, Alvaro Herrera wrote:

> On 2024-Feb-29, Kyotaro Horiguchi wrote:
>
> > Some of them, commit_timestamp_buffers, transaction_buffers,
> > subtransaction_buffers use 0 to mean auto-tuning based on
> > shared-buffer size.  I think it's worth adding an extra_desc such as "0
> > to automatically determine this value based on the shared buffer
> > size".
>
> How about this?

Pushed that way, but we can discuss further wording improvements/changes
if someone wants to propose any.

-- 
Álvaro Herrera         48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-29, Kyotaro Horiguchi wrote:

> At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera wrote in
> > Here's the complete set, with these two names using the singular.
>
> The commit by the second patch added several GUC descriptions:
>
> > Sets the size of the dedicated buffer pool used for the commit timestamp
> > cache.
>
> Some of them, commit_timestamp_buffers, transaction_buffers,
> subtransaction_buffers use 0 to mean auto-tuning based on
> shared-buffer size.  I think it's worth adding an extra_desc such as "0
> to automatically determine this value based on the shared buffer
> size".

How about this?

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"

From d0d7216eb4e2e2e9e71aa849cf90c218bbe2b164 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 29 Feb 2024 11:45:31 +0100
Subject: [PATCH] extra_desc

---
 src/backend/utils/misc/guc_tables.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 93ded31ed9..543a87c659 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2287,7 +2287,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"commit_timestamp_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
 		&commit_timestamp_buffers,
@@ -2342,7 +2342,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"subtransaction_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the sub-transaction cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
 		&subtransaction_buffers,
@@ -2353,7 +2353,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"transaction_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the transaction status cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
 		&transaction_buffers,
@@ -2868,7 +2868,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
-			NULL,
+			gettext_noop("Specify -1 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_XBLOCKS
 		},
 		&XLOGbuffers,
-- 
2.39.2
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera wrote in
> Here's the complete set, with these two names using the singular.

The commit by the second patch added several GUC descriptions:

> Sets the size of the dedicated buffer pool used for the commit timestamp
> cache.

Some of them, commit_timestamp_buffers, transaction_buffers,
subtransaction_buffers use 0 to mean auto-tuning based on
shared-buffer size.  I think it's worth adding an extra_desc such as "0
to automatically determine this value based on the shared buffer
size".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 27, 2024 at 11:41 PM Alvaro Herrera wrote:
> On 2024-Feb-27, Alvaro Herrera wrote:
>
> > Here's the complete set, with these two names using the singular.
>
> BTW one thing I had not noticed is that before this patch we have
> minimum shmem size that's lower than the lowest you can go with the new
> code.
>
> This means Postgres may no longer start under extremely tight memory
> restrictions (and of course it uses more memory even when idle or with
> small databases).  I wonder to what extent we should make an effort to
> relax that.  For small, largely inactive servers, this is just memory
> we use for no good reason.  However, anything we do here will impact
> performance on the high end, because as Andrey says this will add
> calculations and jumps where there are none today.

I was just comparing the minimum memory required for SLRU when the
system is minimally configured; correct me if I am wrong.

SLRU                        unpatched   patched
commit_timestamp_buffers        4         16
subtransaction_buffers         32         16
transaction_buffers             4         16
multixact_offset_buffers        8         16
multixact_member_buffers       16         16
notify_buffers                  8         16
serializable_buffers           16         16
---------------------------------------------
total buffers                  88        112

So that is < 200kB of extra memory on a minimally configured system;
IMHO this should not matter.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-27, Alvaro Herrera wrote:

> Here's the complete set, with these two names using the singular.

BTW one thing I had not noticed is that before this patch we have
minimum shmem size that's lower than the lowest you can go with the new
code.

This means Postgres may no longer start under extremely tight memory
restrictions (and of course it uses more memory even when idle or with
small databases).  I wonder to what extent we should make an effort to
relax that.  For small, largely inactive servers, this is just memory
we use for no good reason.  However, anything we do here will impact
performance on the high end, because as Andrey says this will add
calculations and jumps where there are none today.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"We've narrowed the problem down to the customer's pants being in a
situation of vigorous combustion" (Robert Haas, Postgres expert extraordinaire)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 27 Feb 2024, at 22:33, Alvaro Herrera wrote:
>

These patches look amazing!

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Here's the complete set, with these two names using the singular. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso" >From 225b2403f7bb9990656d18c8339c452fcd6822c5 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 27 Feb 2024 16:56:00 +0100 Subject: [PATCH v21 1/2] Rename SLRU elements in pg_stat_slru The new names are intended to match an upcoming patch that adds a few GUCs to configure the SLRU buffer sizes. Discussion: https://postgr.es/m/202402261616.dlriae7b6emv@alvherre.pgsql --- doc/src/sgml/monitoring.sgml| 14 src/backend/access/transam/clog.c | 2 +- src/backend/access/transam/commit_ts.c | 2 +- src/backend/access/transam/multixact.c | 4 +-- src/backend/access/transam/subtrans.c | 2 +- src/backend/commands/async.c| 2 +- src/backend/storage/lmgr/predicate.c| 2 +- src/include/utils/pgstat_internal.h | 14 src/test/isolation/expected/stats.out | 44 - src/test/isolation/expected/stats_1.out | 44 - src/test/isolation/specs/stats.spec | 4 +-- src/test/regress/expected/stats.out | 14 src/test/regress/sql/stats.sql | 14 13 files changed, 81 insertions(+), 81 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 5cf9363ac8..9d73d8c1bb 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4853,13 +4853,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage NULL or is not specified, all the counters shown in the pg_stat_slru view for all SLRU caches are reset. The argument can be one of -CommitTs, -MultiXactMember, -MultiXactOffset, -Notify, -Serial, -Subtrans, or -Xact +commit_timestamp, +multixact_member, +multixact_offset, +notify, +serializable, +subtransaction, or +transaction to reset the counters for only that entry. 
If the argument is other (or indeed, any unrecognized name), then the counters for all other SLRU caches, such diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 97f7434da3..34f079cbb1 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -706,7 +706,7 @@ void CLOGShmemInit(void) { XactCtl->PagePrecedes = CLOGPagePrecedes; - SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, + SimpleLruInit(XactCtl, "transaction", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER, SYNC_HANDLER_CLOG, false); SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE); diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 6bfe60343e..d965db89c7 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -529,7 +529,7 @@ CommitTsShmemInit(void) bool found; CommitTsCtl->PagePrecedes = CommitTsPagePrecedes; - SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0, + SimpleLruInit(CommitTsCtl, "commit_timestamp", CommitTsShmemBuffers(), 0, CommitTsSLRULock, "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER, SYNC_HANDLER_COMMIT_TS, diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index febc429f72..64040d330e 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1851,14 +1851,14 @@ MultiXactShmemInit(void) MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes; SimpleLruInit(MultiXactOffsetCtl, - "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0, + "multixact_offset", NUM_MULTIXACTOFFSET_BUFFERS, 0, MultiXactOffsetSLRULock, "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER, SYNC_HANDLER_MULTIXACT_OFFSET, false); SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE); SimpleLruInit(MultiXactMemberCtl, - "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0, + 
"multixact_member", NUM_MULTIXACTMEMBER_BUFFERS, 0, MultiXactMemberSLRULock, "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER, SYNC_HANDLER_MULTIXACT_MEMBER, diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c index b2ed82ac56..6aa47af43e 100644 --- a/src/backend/access/transam/subtrans.c +++ b/src/backend/access/transam/subtrans.c @@ -200,7 +200,7 @@ void SUBTRANSShmemInit(void) { SubTransCtl->PagePrecedes = SubTransPagePrecedes; - SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0, + SimpleLruInit(SubTransCtl, "subtransaction", NUM_SUBTRANS_BUFFERS, 0, SubtransSLRULock, "pg_subtrans",
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-27, Andrey M. Borodin wrote:

> Sorry for the late reply, I have one nit.  Are you sure that
> multixact_members and multixact_offsets are plural, while transaction
> and commit_timestamp are singular?
> Maybe multixact_members and multixact_offset?  Because there are many
> members and one offset for a given multixact?  Users certainly do not
> care, though...

I asked myself the same question actually, and thought about putting
them both in the singular.  I only backed off because I noticed that the
directories themselves are in plural (an old mistake of mine,
evidently).  Maybe we should follow that instinct and use the singular
for these.

If we do that, we can rename the directories to also appear in singular
when/if the patch to add standard page headers to the SLRUs lands --
which is going to need code to rewrite the files during pg_upgrade
anyway, so the rename is not going to be a big deal.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 27 Feb 2024, at 21:03, Alvaro Herrera wrote:
>
> On 2024-Feb-27, Dilip Kumar wrote:
>
>>> static const char *const slru_names[] = {
>>> 	"commit_timestamp",
>>> 	"multixact_members",
>>> 	"multixact_offsets",
>>> 	"notify",
>>> 	"serializable",
>>> 	"subtransaction",
>>> 	"transaction",
>>> 	"other"					/* has to be last */
>>> };
>
> Here's a patch for the renaming part.

Sorry for the late reply, I have one nit.  Are you sure that
multixact_members and multixact_offsets are plural, while transaction
and commit_timestamp are singular?
Maybe multixact_members and multixact_offset?  Because there are many
members and one offset for a given multixact?  Users certainly do not
care, though...

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-27, Dilip Kumar wrote: > > static const char *const slru_names[] = { > > "commit_timestamp", > > "multixact_members", > > "multixact_offsets", > > "notify", > > "serializable", > > "subtransaction", > > "transaction", > > "other" /* has to be last > > */ > > }; Here's a patch for the renaming part. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No nos atrevemos a muchas cosas porque son difíciles, pero son difíciles porque no nos atrevemos a hacerlas" (Séneca) >From 91741984cbd77f88e39b6fac8e8c7dc622d2ccf4 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 27 Feb 2024 16:56:00 +0100 Subject: [PATCH] Rename SLRU elements in pg_stat_slru The new names are intended to match an upcoming patch that adds a few GUCs to configure the SLRU buffer sizes. Discussion: https://postgr.es/m/202402261616.dlriae7b6emv@alvherre.pgsql --- doc/src/sgml/monitoring.sgml| 14 src/backend/access/transam/clog.c | 2 +- src/backend/access/transam/commit_ts.c | 2 +- src/backend/access/transam/multixact.c | 4 +-- src/backend/access/transam/subtrans.c | 2 +- src/backend/commands/async.c| 2 +- src/backend/storage/lmgr/predicate.c| 2 +- src/include/utils/pgstat_internal.h | 14 src/test/isolation/expected/stats.out | 44 - src/test/isolation/expected/stats_1.out | 44 - src/test/isolation/specs/stats.spec | 4 +-- src/test/regress/expected/stats.out | 14 src/test/regress/sql/stats.sql | 14 13 files changed, 81 insertions(+), 81 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 5cf9363ac8..7d92e68572 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4853,13 +4853,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage NULL or is not specified, all the counters shown in the pg_stat_slru view for all SLRU caches are reset. 
The argument can be one of -CommitTs, -MultiXactMember, -MultiXactOffset, -Notify, -Serial, -Subtrans, or -Xact +commit_timestamp, +multixact_members, +multixact_offsets, +notify, +serializable, +subtransaction, or +transaction to reset the counters for only that entry. If the argument is other (or indeed, any unrecognized name), then the counters for all other SLRU caches, such diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 97f7434da3..34f079cbb1 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -706,7 +706,7 @@ void CLOGShmemInit(void) { XactCtl->PagePrecedes = CLOGPagePrecedes; - SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, + SimpleLruInit(XactCtl, "transaction", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE, XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER, SYNC_HANDLER_CLOG, false); SlruPagePrecedesUnitTests(XactCtl, CLOG_XACTS_PER_PAGE); diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 6bfe60343e..d965db89c7 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -529,7 +529,7 @@ CommitTsShmemInit(void) bool found; CommitTsCtl->PagePrecedes = CommitTsPagePrecedes; - SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0, + SimpleLruInit(CommitTsCtl, "commit_timestamp", CommitTsShmemBuffers(), 0, CommitTsSLRULock, "pg_commit_ts", LWTRANCHE_COMMITTS_BUFFER, SYNC_HANDLER_COMMIT_TS, diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index febc429f72..f8bb83927c 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -1851,14 +1851,14 @@ MultiXactShmemInit(void) MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes; SimpleLruInit(MultiXactOffsetCtl, - "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0, + "multixact_offsets", NUM_MULTIXACTOFFSET_BUFFERS, 0, 
MultiXactOffsetSLRULock, "pg_multixact/offsets", LWTRANCHE_MULTIXACTOFFSET_BUFFER, SYNC_HANDLER_MULTIXACT_OFFSET, false); SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE); SimpleLruInit(MultiXactMemberCtl, - "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0, + "multixact_members", NUM_MULTIXACTMEMBER_BUFFERS, 0, MultiXactMemberSLRULock, "pg_multixact/members", LWTRANCHE_MULTIXACTMEMBER_BUFFER, SYNC_HANDLER_MULTIXACT_MEMBER, diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c index b2ed82ac56..6aa47af43e
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Feb 26, 2024 at 9:46 PM Alvaro Herrera wrote:
> On 2024-Feb-23, Dilip Kumar wrote:
>
> > +
> > + For each SLRU area that's part of the core server,
> > + there is a configuration parameter that controls its size, with the suffix
> > + _buffers appended.  For historical
> > + reasons, the names are not exact matches, but Xact
> > + corresponds to transaction_buffers and the rest should
> > + be obvious.
> > +
> > +
>
> I think I would like to suggest renaming the GUCs to have the _slru_ bit
> in the middle:
>
> +# - SLRU Buffers (change requires restart) -
> +
> +#commit_timestamp_slru_buffers = 0      # memory for pg_commit_ts (0 = auto)
> +#multixact_offsets_slru_buffers = 16    # memory for pg_multixact/offsets
> +#multixact_members_slru_buffers = 32    # memory for pg_multixact/members
> +#notify_slru_buffers = 16               # memory for pg_notify
> +#serializable_slru_buffers = 32         # memory for pg_serial
> +#subtransaction_slru_buffers = 0        # memory for pg_subtrans (0 = auto)
> +#transaction_slru_buffers = 0           # memory for pg_xact (0 = auto)
>
> and the pgstat_internal.h table:
>
> static const char *const slru_names[] = {
> 	"commit_timestamp",
> 	"multixact_members",
> 	"multixact_offsets",
> 	"notify",
> 	"serializable",
> 	"subtransaction",
> 	"transaction",
> 	"other"					/* has to be last */
> };
>
> This way they match perfectly.

Yeah, I think this looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-23, Dilip Kumar wrote: > 1. > + * If no process is already in the list, we're the leader; our first step > + * is to "close out the group" by resetting the list pointer from > + * ProcGlobal->clogGroupFirst (this lets other processes set up other > + * groups later); then we lock the SLRU bank corresponding to our group's > + * page, do the SLRU updates, release the SLRU bank lock, and wake up the > + * sleeping processes. > > I think here we are saying that we "close out the group" before > acquiring the SLRU lock but that's not true. We keep the group open > until we gets the lock so that we can get maximum members in while we > are anyway waiting for the lock. Absolutely right. Reworded that. > 2. > static void > TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts, > RepOriginId nodeid, int slotno) > { > - Assert(TransactionIdIsNormal(xid)); > + if (!TransactionIdIsNormal(xid)) > + return; > + > + entryno = TransactionIdToCTsEntry(xid); > > I do not understand why we need this change. Ah yeah, I was bothered by the fact that if you pass Xid values earlier than NormalXid to this function, we'd reply with some nonsensical values instead of throwing an error. But you're right that it doesn't belong in this patch, so I removed that. Here's a version with these fixes, where I also added some text to the pg_stat_slru documentation: + + For each SLRU area that's part of the core server, + there is a configuration parameter that controls its size, with the suffix + _buffers appended. For historical + reasons, the names are not exact matches, but Xact + corresponds to transaction_buffers and the rest should + be obvious. 
+ + I think I would like to suggest renaming the GUCs to have the _slru_ bit in the middle: +# - SLRU Buffers (change requires restart) - + +#commit_timestamp_slru_buffers = 0 # memory for pg_commit_ts (0 = auto) +#multixact_offsets_slru_buffers = 16# memory for pg_multixact/offsets +#multixact_members_slru_buffers = 32# memory for pg_multixact/members +#notify_slru_buffers = 16 # memory for pg_notify +#serializable_slru_buffers = 32 # memory for pg_serial +#subtransaction_slru_buffers = 0# memory for pg_subtrans (0 = auto) +#transaction_slru_buffers = 0 # memory for pg_xact (0 = auto) and the pgstat_internal.h table: static const char *const slru_names[] = { "commit_timestamp", "multixact_members", "multixact_offsets", "notify", "serializable", "subtransaction", "transaction", "other" /* has to be last */ }; This way they match perfectly. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings) >From 4dc139e70feb5e43bbe2689cfb044ef0957761b3 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 22 Feb 2024 18:42:56 +0100 Subject: [PATCH v20] Make SLRU buffer sizes configurable Also, divide the slot array in banks, so that the LRU algorithm can be made more scalable. Also remove the centralized control lock for even better scalability. 
Authors: Dilip Kumar, Andrey Borodin --- doc/src/sgml/config.sgml | 139 +++ doc/src/sgml/monitoring.sgml | 14 +- src/backend/access/transam/clog.c | 236 src/backend/access/transam/commit_ts.c| 81 ++-- src/backend/access/transam/multixact.c| 190 +++--- src/backend/access/transam/slru.c | 357 +- src/backend/access/transam/subtrans.c | 103 - src/backend/commands/async.c | 61 ++- src/backend/storage/lmgr/lwlock.c | 9 +- src/backend/storage/lmgr/lwlocknames.txt | 14 +- src/backend/storage/lmgr/predicate.c | 34 +- .../utils/activity/wait_event_names.txt | 15 +- src/backend/utils/init/globals.c | 9 + src/backend/utils/misc/guc_tables.c | 78 src/backend/utils/misc/postgresql.conf.sample | 9 + src/include/access/clog.h | 1 - src/include/access/commit_ts.h| 1 - src/include/access/multixact.h| 4 - src/include/access/slru.h | 86 +++-- src/include/access/subtrans.h | 3 - src/include/commands/async.h | 5 - src/include/miscadmin.h | 8 + src/include/storage/lwlock.h | 7 + src/include/storage/predicate.h | 4 - src/include/utils/guc_hooks.h | 11 + src/test/modules/test_slru/test_slru.c| 35 +- 26 files changed, 1161 insertions(+), 353 deletions(-) diff --git a/doc/src/sgml/config.sgml
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-23, Andrey M. Borodin wrote:

> I'm sure anyone with multiple CPUs should increase, not decrease
> previous default of 128 buffers (with 512MB shared buffers).  Having
> more CPUs (the only way to benefit from more locks) implies bigger
> transaction buffers.

Sure.

> IMO making bank size variable adds unneeded computation overhead, bank
> search loops should be unrollable by compiler etc.

Makes sense.

> Originally there was a patch set step, that packed bank's page
> addresses together in one array.  It was done to make bank search a
> SIMD instruction.

Ants Aasma had proposed a rework of the LRU code for better
performance.  He told me it depended on bank size being 16, so you're
right that it's probably not a good idea to make it variable.

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:06 PM Dilip Kumar wrote: > > On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera > wrote: > > > > On 2024-Feb-07, Dilip Kumar wrote: > > > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > > wrote: > > > > > > Sure, but is that really what we want? > > > > > > So your question is do we want these buffers to be in multiple of > > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > > don't think it should create any problem logically. I mean we can > > > look again in the patch to see if we have made any such assumptions > > > but that should be fairly easy to fix, then maybe if we are going in > > > this way we should get rid of the check_slru_buffers() function as > > > well. > > > > Not really, I just don't think the macro should be in slru.h. > > Okay > > > Another thing I've been thinking is that perhaps it would be useful to > > make the banks smaller, when the total number of buffers is small. For > > example, if you have 16 or 32 buffers, it's not really clear to me that > > it makes sense to have just 1 bank or 2 banks. It might be more > > sensible to have 4 banks with 4 or 8 buffers instead. That should make > > the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. 
> > > > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones. I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. > > Thanks, I will review those changes. A few other things I noticed while reading through the patch; I haven't read it completely yet, but this is what I have so far. 1. + * If no process is already in the list, we're the leader; our first step + * is to "close out the group" by resetting the list pointer from + * ProcGlobal->clogGroupFirst (this lets other processes set up other + * groups later); then we lock the SLRU bank corresponding to our group's + * page, do the SLRU updates, release the SLRU bank lock, and wake up the + * sleeping processes. I think here we are saying that we "close out the group" before acquiring the SLRU lock, but that's not true. We keep the group open until we get the lock, so that we can gather as many members as possible while we are anyway waiting for the lock. 2. static void TransactionIdSetCommitTs(TransactionId xid, TimestampTz ts, RepOriginId nodeid, int slotno) { - Assert(TransactionIdIsNormal(xid)); + if (!TransactionIdIsNormal(xid)) + return; + + entryno = TransactionIdToCTsEntry(xid); I do not understand why we need this change. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 23 Feb 2024, at 12:36, Dilip Kumar wrote: > >> Another thing I've been thinking is that perhaps it would be useful to >> make the banks smaller, when the total number of buffers is small. For >> example, if you have 16 or 32 buffers, it's not really clear to me that >> it makes sense to have just 1 bank or 2 banks. It might be more >> sensible to have 4 banks with 4 or 8 buffers instead. That should make >> the algorithm scale down as well as up ... > > It might be helpful to have small-size banks when SLRU buffers are set > to a very low value and we are only accessing a couple of pages at a > time (i.e. no buffer replacement) because in such cases most of the > contention will be on SLRU Bank lock. Although I am not sure how > practical such a use case would be, I mean if someone is using > multi-xact very heavily or creating frequent subtransaction overflow > then wouldn't they should set this buffer limit to some big enough > value? By doing this we would lose some simplicity of the patch I > mean instead of using the simple macro i.e. SLRU_BANK_SIZE we would > need to compute this and store it in SlruShared. Maybe that's not that > bad. I'm sure anyone with multiple CPUs should increase, not decrease, the previous default of 128 buffers (with 512MB shared buffers). Having more CPUs (the only way to benefit from more locks) implies bigger transaction buffers. IMO making the bank size variable adds unneeded computation overhead; bank search loops should be unrollable by the compiler, etc. Originally there was a patch set step that packed a bank's page addresses together in one array. It was done to make the bank search a SIMD instruction. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Feb 23, 2024 at 1:48 AM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera > > wrote: > > > > Sure, but is that really what we want? > > > > So your question is do we want these buffers to be in multiple of > > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > > don't think it should create any problem logically. I mean we can > > look again in the patch to see if we have made any such assumptions > > but that should be fairly easy to fix, then maybe if we are going in > > this way we should get rid of the check_slru_buffers() function as > > well. > > Not really, I just don't think the macro should be in slru.h. Okay > Another thing I've been thinking is that perhaps it would be useful to > make the banks smaller, when the total number of buffers is small. For > example, if you have 16 or 32 buffers, it's not really clear to me that > it makes sense to have just 1 bank or 2 banks. It might be more > sensible to have 4 banks with 4 or 8 buffers instead. That should make > the algorithm scale down as well as up ... It might be helpful to have small-size banks when SLRU buffers are set to a very low value and we are only accessing a couple of pages at a time (i.e. no buffer replacement), because in such cases most of the contention will be on the SLRU bank lock. Although I am not sure how practical such a use case would be; I mean, if someone is using multi-xact very heavily or creating frequent subtransaction overflow, then wouldn't they set this buffer limit to some big enough value? By doing this we would lose some simplicity of the patch; I mean, instead of using the simple macro SLRU_BANK_SIZE we would need to compute this and store it in SlruShared. Maybe that's not that bad. > > I haven't done either of those things in the attached v19 version. I > did go over the comments once again and rewrote the parts I was unhappy > with, including some existing ones.
I think it's OK now from that point > of view ... at some point I thought about creating a separate README, > but in the end I thought it not necessary. Thanks, I will review those changes. > I did add a bunch of Assert()s to make sure the locks that are supposed > to be held are actually held. This led me to testing the page status to > be not EMPTY during SimpleLruWriteAll() before calling > SlruInternalWritePage(), because the assert was firing. The previous > code is not really *buggy*, but to me it's weird to call WritePage() on > a slot with no contents. Okay, I mean internally SlruInternalWritePage() will flush only if the status is SLRU_PAGE_VALID, but it is better the way you have done it. > Another change was in TransactionGroupUpdateXidStatus: the original code > had the leader doing pg_atomic_read_u32(&ProcGlobal->clogGroupFirst) to > know which bank to lock. I changed it to simply be the page used by the > leader process; this doesn't need an atomic read, and should be the same > page anyway. (If it isn't, it's no big deal). But what's more: even if > we do read ->clogGroupFirst at that point, there's no guarantee that > this is going to be exactly for the same process that ends up being the > first in the list, because since we have not set it to INVALID by the > time we grab the bank lock, it is quite possible for more processes to > add themselves to the list. Yeah, this looks better. > I realized all this while rewriting the comments in a way that would let > me understand what was going on ... so IMO the effort was worthwhile. +1 I will review and do some more testing early next week and share my feedback. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-07, Dilip Kumar wrote: > On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera wrote: > > Sure, but is that really what we want? > > So your question is do we want these buffers to be in multiple of > SLRU_BANK_SIZE? Maybe we can have the last bank to be partial, I > don't think it should create any problem logically. I mean we can > look again in the patch to see if we have made any such assumptions > but that should be fairly easy to fix, then maybe if we are going in > this way we should get rid of the check_slru_buffers() function as > well. Not really, I just don't think the macro should be in slru.h. Another thing I've been thinking is that perhaps it would be useful to make the banks smaller, when the total number of buffers is small. For example, if you have 16 or 32 buffers, it's not really clear to me that it makes sense to have just 1 bank or 2 banks. It might be more sensible to have 4 banks with 4 or 8 buffers instead. That should make the algorithm scale down as well as up ... I haven't done either of those things in the attached v19 version. I did go over the comments once again and rewrote the parts I was unhappy with, including some existing ones. I think it's OK now from that point of view ... at some point I thought about creating a separate README, but in the end I thought it not necessary. I did add a bunch of Assert()s to make sure the locks that are supposed to be held are actually held. This led me to testing the page status to be not EMPTY during SimpleLruWriteAll() before calling SlruInternalWritePage(), because the assert was firing. The previous code is not really *buggy*, but to me it's weird to call WritePage() on a slot with no contents. Another change was in TransactionGroupUpdateXidStatus: the original code had the leader doing pg_atomic_read_u32(&ProcGlobal->clogGroupFirst) to know which bank to lock. I changed it to simply be the page used by the leader process; this doesn't need an atomic read, and should be the same page anyway.
(If it isn't, it's no big deal). But what's more: even if we do read ->clogGroupFirst at that point, there's no guarantee that this is going to be exactly for the same process that ends up being the first in the list, because since we have not set it to INVALID by the time we grab the bank lock, it is quite possible for more processes to add themselves to the list. I realized all this while rewriting the comments in a way that would let me understand what was going on ... so IMO the effort was worthwhile. Anyway, what I send now should be pretty much final, modulo the change to the check_slru_buffers routines and documentation additions to match pg_stat_slru to the new GUC names. > > Another painful point is that pg_stat_slru uses internal names in the > > data it outputs, which obviously do not match the new GUCs. > > Yeah, that's true, but I think this could be explained somewhere not > sure what is the right place for this. Or we can change those names in the view. > FYI, I have also repeated all the performance tests I performed in my > first email[1], and I am seeing a similar gain. Excellent, thanks for doing that. > Some comments on v18 in my first pass of the review. > > 1. > @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) > lsnindex = GetLSNIndex(slotno, xid); > *lsn = XactCtl->shared->group_lsn[lsnindex]; > > - LWLockRelease(XactSLRULock); > + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); > > Maybe here we can add an assert before releasing the lock for a safety check > > Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); Hmm, I think this would just throw a warning or error "you don't hold such lwlock", so it doesn't seem necessary. > 2. > + * > + * XXX could we make the LSNs to be bank-based? > */ > XLogRecPtr *group_lsn; > > IMHO, the flush still happens at the page level so up to which LSN > should be flush before flushing the particular clog page should also > be maintained at the page level. 
Yeah, this was a misguided thought, never mind. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "It does it in a really, really complicated way / why does it need to be complicated? / Because it's MakeMaker."
>From e1aabcbf5ce1417decfe24f513e5cfe8b6de77f2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Thu, 22 Feb 2024 18:42:56 +0100
Subject: [PATCH v19] Make SLRU buffer sizes configurable

Also, divide the slot array in banks, so that the LRU algorithm can be
made more scalable. Also remove the centralized control lock for even
better scalability.

Authors: Dilip Kumar, Andrey Borodin
---
 doc/src/sgml/config.sgml               | 139 +++
 src/backend/access/transam/clog.c      | 235
 src/backend/access/transam/commit_ts.c |  89 +++--
 src/backend/access/transam/multixact.c | 190 +++---
 src/backend/access/transam/slru.c      | 345 +-
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Feb 7, 2024 at 3:49 PM Alvaro Herrera wrote: > > On 2024-Feb-07, Dilip Kumar wrote: > > > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera > > wrote: > > > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > > because we don't have that macro at that point, so I just used constant > > > 16. Obviously need a better solution for this. > > > > If we define SLRU_BANK_SIZE in slur.h then we can use it here right, > > because these files are anyway include slur.h so. > > Sure, but is that really what we want? So your question is: do we want these buffers to be in multiples of SLRU_BANK_SIZE? Maybe we can have the last bank be partial; I don't think it should create any problem logically. I mean, we can look again in the patch to see if we have made any such assumptions, but that should be fairly easy to fix; then, if we are going this way, maybe we should get rid of the check_slru_buffers() function as well. > > > I've been wondering whether we should add a "slru" to the name of the > > > GUCs: > > > > > > commit_timestamp_slru_buffers > > > transaction_slru_buffers > > > etc > > > > I am not sure we are exposing anything related to SLRU to the user, > > We do -- we have pg_stat_slru already. > > > I mean transaction_buffers should make sense for the user that it > > stores transaction-related data in some buffers pool but whether that > > buffer pool is called SLRU or not doesn't matter much to the user > > IMHO. > > Yeah, that's exactly what my initial argument was for naming these this > way. But since the term slru already escaped into the wild via the > pg_stat_slru view, perhaps it helps users make the connection between > these things. Alternatively, we can cross-reference each term from the > other's documentation and call it a day.
Yeah, that's true; I forgot this point about pg_stat_slru. From this point of view, if the configuration parameters have "slru" in their names, users would be able to make a better connection between the configured value and the stats in this view, and based on that they can decide whether they need to increase the size of these SLRU buffers. > Another painful point is that pg_stat_slru uses internal names in the > data it outputs, which obviously do not match the new GUCs. Yeah, that's true, but I think this could be explained somewhere; I'm not sure what the right place for this would be. FYI, I have also repeated all the performance tests I performed in my first email[1], and I am seeing a similar gain. Some comments on v18 in my first pass of the review. 1. @@ -665,7 +765,7 @@ TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn) lsnindex = GetLSNIndex(slotno, xid); *lsn = XactCtl->shared->group_lsn[lsnindex]; - LWLockRelease(XactSLRULock); + LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno)); Maybe here we can add an assert before releasing the lock for a safety check: Assert(LWLockHeldByMe(SimpleLruGetBankLock(XactCtl, pageno))); 2. + * + * XXX could we make the LSNs to be bank-based? */ XLogRecPtr *group_lsn; IMHO, the flush still happens at the page level, so the LSN up to which we must flush before flushing a particular clog page should also be maintained at the page level. [1] https://www.postgresql.org/message-id/CAFiTN-vzDvNz%3DExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-07, Dilip Kumar wrote: > On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera wrote: > > > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > > because we don't have that macro at that point, so I just used constant > > 16. Obviously need a better solution for this. > > If we define SLRU_BANK_SIZE in slur.h then we can use it here right, > because these files are anyway include slur.h so. Sure, but is that really what we want? > > I've been wondering whether we should add a "slru" to the name of the > > GUCs: > > > > commit_timestamp_slru_buffers > > transaction_slru_buffers > > etc > > I am not sure we are exposing anything related to SLRU to the user, We do -- we have pg_stat_slru already. > I mean transaction_buffers should make sense for the user that it > stores transaction-related data in some buffers pool but whether that > buffer pool is called SLRU or not doesn't matter much to the user > IMHO. Yeah, that's exactly what my initial argument was for naming these this way. But since the term slru already escaped into the wild via the pg_stat_slru view, perhaps it helps users make the connection between these things. Alternatively, we can cross-reference each term from the other's documentation and call it a day. Another painful point is that pg_stat_slru uses internal names in the data it outputs, which obviously do not match the new GUCs. -- Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se está indefenso"
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 7 Feb 2024, at 10:58, Dilip Kumar wrote: > >> commit_timestamp_slru_buffers >> transaction_slru_buffers >> etc > > I am not sure we are exposing anything related to SLRU to the user, I think we already tell the user something about SLRU. I'd rather consider whether "transaction_slru_buffers" is easier to understand than "transaction_buffers" ... IMO transaction_buffers is clearer, but I do not have a strong opinion. > I > mean transaction_buffers should make sense for the user that it stores > transaction-related data in some buffers pool but whether that buffer > pool is called SLRU or not doesn't matter much to the user IMHO. +1 Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 6, 2024 at 8:55 PM Alvaro Herrera wrote: > > Here's the rest of it rebased on top of current master. I think it > makes sense to have this as one individual commit. > > I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers > compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, > because we don't have that macro at that point, so I just used constant > 16. Obviously need a better solution for this. If we define SLRU_BANK_SIZE in slru.h then we can use it here, right? These files already include slru.h anyway. > > I also changed the location of bank_mask in SlruCtlData for better > packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() > to SlotGetBankNumber(). Okay. > Some very critical comments still need to be updated to the new design, > particularly anything that mentions "control lock"; but also the overall > model needs to be explained in some central location, rather than > incongruently some pieces here and other pieces there. I'll see about > this later. But at least this is code you should be able to play with. Okay, I will review and test this. > I've been wondering whether we should add a "slru" to the name of the > GUCs: > > commit_timestamp_slru_buffers > transaction_slru_buffers > etc I am not sure we are exposing anything related to SLRU to the user; I mean, transaction_buffers should make sense for the user in that it stores transaction-related data in some buffer pool, but whether that buffer pool is called SLRU or not doesn't matter much to the user, IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Here's the rest of it rebased on top of current master. I think it makes sense to have this as one individual commit. I made CLOGShmemBuffers, CommitTsShmemBuffers and SUBTRANSShmemBuffers compute a number that's multiple of SLRU_BANK_SIZE. But it's a crock, because we don't have that macro at that point, so I just used constant 16. Obviously need a better solution for this. I also changed the location of bank_mask in SlruCtlData for better packing, as advised by pahole; and renamed SLRU_SLOTNO_GET_BANKLOCKNO() to SlotGetBankNumber(). Some very critical comments still need to be updated to the new design, particularly anything that mentions "control lock"; but also the overall model needs to be explained in some central location, rather than incongruently some pieces here and other pieces there. I'll see about this later. But at least this is code you should be able to play with. I've been wondering whether we should add a "slru" to the name of the GUCs: commit_timestamp_slru_buffers transaction_slru_buffers etc -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "Aprender sin pensar es inútil; pensar sin aprender, peligroso" (Confucio) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 61038472c5..3e3119865a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2006,6 +2006,145 @@ include_dir 'conf.d' + + commit_timestamp_buffers (integer) + + commit_timestamp_buffers configuration parameter + + + + +Specifies the amount of memory to use to cache the contents of +pg_commit_ts (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 0, which requests +shared_buffers/512 up to 1024 blocks, +but not fewer than 16 blocks. +This parameter can only be set at server start. 
+ + + + + + multixact_members_buffers (integer) + + multixact_members_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_multixact/members (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 32. +This parameter can only be set at server start. + + + + + + multixact_offsets_buffers (integer) + + multixact_offsets_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_multixact/offsets (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 16. +This parameter can only be set at server start. + + + + + + notify_buffers (integer) + + notify_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_notify (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 16. +This parameter can only be set at server start. + + + + + + serializable_buffers (integer) + + serializable_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_serial (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 32. +This parameter can only be set at server start. + + + + + + subtransaction_buffers (integer) + + subtransaction_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_subtrans (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 0, which requests +shared_buffers/512 up to 1024 blocks, +but not fewer than 16 blocks. +This parameter can only be set at server start. 
+ + + + + + transaction_buffers (integer) + + transaction_buffers configuration parameter + + + + +Specifies the amount of shared memory to use to cache the contents +of pg_xact (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Feb 6, 2024 at 4:23 PM Alvaro Herrera wrote: > > > > (We also have SimpleLruTruncate, but I think it's not as critical to > > > have a barrier there anyhow: accessing a slightly outdated page number > > > could only be a problem if a bug elsewhere causes us to try to truncate > > > in the current page. I think we only have this code there because we > > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > > > +1, I agree with this theory in general. But the below comment in > > SimpleLruTrucate in your v3 patch doesn't seem correct, because here > > we are checking if the latest_page_number is smaller than the cutoff > > if so we log it as wraparound and skip the whole thing and that is > > fine even if we are reading with atomic variable and slightly outdated > > value should not be a problem but the comment claim that this safe > > because we have the same bank lock as SimpleLruZeroPage(), but that's > > not true here we will be acquiring different bank locks one by one > > based on which slotno we are checking. Am I missing something? > > I think you're correct. I reworded this comment, so now it says this: > > /* > * An important safety check: the current endpoint page must not be > * eligible for removal. This check is just a backstop against wraparound > * bugs elsewhere in SLRU handling, so we don't care if we read a slightly > * outdated value; therefore we don't add a memory barrier. > */ > > Pushed with those changes. Thank you! Yeah, this looks perfect, thanks. > Now I'll go rebase the rest of the patch on top. Okay, I will review and test after that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-04, Andrey M. Borodin wrote: > This patch uses wording "banks" in comments before banks start to > exist. But as far as I understand, it is expected to be committed > before "banks" patch. True -- changed that to use ControlLock. > Besides this patch looks good to me. Many thanks for reviewing. On 2024-Feb-05, Dilip Kumar wrote: > > (We also have SimpleLruTruncate, but I think it's not as critical to > > have a barrier there anyhow: accessing a slightly outdated page number > > could only be a problem if a bug elsewhere causes us to try to truncate > > in the current page. I think we only have this code there because we > > did have such bugs in the past, but IIUC this shouldn't happen anymore.) > > +1, I agree with this theory in general. But the below comment in > SimpleLruTrucate in your v3 patch doesn't seem correct, because here > we are checking if the latest_page_number is smaller than the cutoff > if so we log it as wraparound and skip the whole thing and that is > fine even if we are reading with atomic variable and slightly outdated > value should not be a problem but the comment claim that this safe > because we have the same bank lock as SimpleLruZeroPage(), but that's > not true here we will be acquiring different bank locks one by one > based on which slotno we are checking. Am I missing something? I think you're correct. I reworded this comment, so now it says this: /* * An important safety check: the current endpoint page must not be * eligible for removal. This check is just a backstop against wraparound * bugs elsewhere in SLRU handling, so we don't care if we read a slightly * outdated value; therefore we don't add a memory barrier. */ Pushed with those changes. Thank you! Now I'll go rebase the rest of the patch on top. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sun, Feb 4, 2024 at 7:10 PM Alvaro Herrera wrote: > > On 2024-Feb-02, Dilip Kumar wrote: > > > I have checked the patch and it looks fine to me other than the above > > question related to memory barrier usage one more question about the > > same, basically below to instances 1 and 2 look similar but in 1 you > > are not using the memory write_barrier whereas in 2 you are using the > > write_barrier, why is it so? I mean why the reordering can not happen > > in 1 and it may happen in 2? > > What I was thinking is that there's a lwlock operation just below, which > acts as a barrier. But I realized something more important: there are > only two places that matter, which are SlruSelectLRUPage and > SimpleLruZeroPage. The others are all initialization code that run at a > point where there's no going to be any concurrency in SLRU access, so we > don't need barriers anyway. In SlruSelectLRUPage we definitely don't > want to evict the page that SimpleLruZeroPage has initialized, starting > from the point where it returns that new page to its caller. > But if you consider the code of those two routines, you realize that the > only time an equality between latest_page_number and "this_page_number" > is going to occur, is when both pages are in the same bank ... and both > routines are required to be holding the bank lock while they run, so in > practice this is never a problem. Right. In fact, when I first converted this 'latest_page_number' to an atomic, the thinking was to protect it from being concurrently set in SimpleLruZeroPage() while being concurrently read in SlruSelectLRUPage(), so that the reader would not see a corrupted value. All other usages were during the initialization phase, where we do not need any protection. > > We need the atomic write and atomic read so that multiple processes > processing pages in different banks can update latest_page_number > simultaneously. But the equality condition that we're looking for? it can never happen concurrently.
Yeah, that's right; after you explained it, I also realized that this case is protected by the bank lock. Earlier I didn't think about this case. > In other words, these barriers are fully useless. > > (We also have SimpleLruTruncate, but I think it's not as critical to > have a barrier there anyhow: accessing a slightly outdated page number > could only be a problem if a bug elsewhere causes us to try to truncate > in the current page. I think we only have this code there because we > did have such bugs in the past, but IIUC this shouldn't happen anymore.) +1, I agree with this theory in general. But the below comment in SimpleLruTruncate in your v3 patch doesn't seem correct, because here we are checking whether the latest_page_number is smaller than the cutoff; if so, we log it as wraparound and skip the whole thing. That is fine even when reading the atomic variable, since a slightly outdated value should not be a problem. But the comment claims this is safe because we have the same bank lock as SimpleLruZeroPage(), and that's not true: here we will be acquiring different bank locks one by one based on which slotno we are checking. Am I missing something? + * An important safety check: the current endpoint page must not be + * eligible for removal. Like SlruSelectLRUPage, we don't need a + * memory barrier here because for the affected page to be relevant, + * we'd have to have the same bank lock as SimpleLruZeroPage. */ - if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage)) + if (ctl->PagePrecedes(pg_atomic_read_u64(&shared->latest_page_number), + cutoffPage)) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 4 Feb 2024, at 18:38, Alvaro Herrera wrote: > > In other words, these barriers are fully useless. +1. I've tried to understand the ideas behind the barriers, but latest_page_number is a heuristic that does not need any guarantees at all. It's also used in a safety check which can fire only when everything is already broken beyond repair. (Though using atomic access seems a good idea anyway.) This patch uses the wording "banks" in comments before banks start to exist, but as far as I understand it is expected to be committed before the "banks" patch. Besides that, this patch looks good to me. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Sorry, brown paper bag bug there. Here's the correct one. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth) >From 99cadfdf7475146953e9846c20c4a708a3527937 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 31 Jan 2024 12:27:51 +0100 Subject: [PATCH v3] Use atomics for SlruSharedData->latest_page_number --- src/backend/access/transam/clog.c | 6 +--- src/backend/access/transam/commit_ts.c | 7 ++--- src/backend/access/transam/multixact.c | 28 +++--- src/backend/access/transam/slru.c | 40 +++--- src/include/access/slru.h | 5 +++- 5 files changed, 54 insertions(+), 32 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index f6e7da7ffc..06fc2989ba 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -766,14 +766,10 @@ StartupCLOG(void) TransactionId xid = XidFromFullTransactionId(TransamVariables->nextXid); int64 pageno = TransactionIdToPage(xid); - LWLockAcquire(XactSLRULock, LW_EXCLUSIVE); - /* * Initialize our idea of the latest page number. */ - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_write_u64(&XactCtl->shared->latest_page_number, pageno); } /* diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 61b82385f3..6bfe60343e 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -689,9 +689,7 @@ ActivateCommitTs(void) /* * Re-Initialize our idea of the latest page number. 
*/ - LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE); - CommitTsCtl->shared->latest_page_number = pageno; - LWLockRelease(CommitTsSLRULock); + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, pageno); /* * If CommitTs is enabled, but it wasn't in the previous server run, we @@ -1006,7 +1004,8 @@ commit_ts_redo(XLogReaderState *record) * During XLOG replay, latest_page_number isn't set up yet; insert a * suitable value to bypass the sanity test in SimpleLruTruncate. */ - CommitTsCtl->shared->latest_page_number = trunc->pageno; + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); } diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 59523be901..febc429f72 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2017,13 +2017,15 @@ StartupMultiXact(void) * Initialize offset's idea of the latest page number. */ pageno = MultiXactIdToOffsetPage(multi); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); } /* @@ -2047,14 +2049,15 @@ TrimMultiXact(void) oldestMXactDB = MultiXactState->oldestMultiXactDB; LWLockRelease(MultiXactGenLock); - /* Clean up offsets state */ - LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); - /* * (Re-)Initialize our idea of the latest page number for offsets. */ pageno = MultiXactIdToOffsetPage(nextMXact); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); + + /* Clean up offsets state */ + LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current offsets page. 
See notes in @@ -2081,14 +2084,16 @@ TrimMultiXact(void) LWLockRelease(MultiXactOffsetSLRULock); - /* And the same for members */ - LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE); - /* + * And the same for members. + * * (Re-)Initialize our idea of the latest page number for members. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current members page. See notes in @@ -,7 +3338,8 @@ multixact_redo(XLogReaderState *record) * SimpleLruTruncate. */ pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff); LWLockRelease(MultiXactTruncationLock); diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 9ac4790f16..c1d0dfc73b 100644 ---
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
In short, I propose the attached. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ >From b4ba8135f8044e0077a27fcf6ad18451380cbcb3 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 31 Jan 2024 12:27:51 +0100 Subject: [PATCH v2] Use atomics for SlruSharedData->latest_page_number --- src/backend/access/transam/clog.c | 6 +--- src/backend/access/transam/commit_ts.c | 7 ++--- src/backend/access/transam/multixact.c | 28 +++--- src/backend/access/transam/slru.c | 40 +++--- src/include/access/slru.h | 5 +++- 5 files changed, 54 insertions(+), 32 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index f6e7da7ffc..f8aa91eb0a 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -766,14 +766,10 @@ StartupCLOG(void) TransactionId xid = XidFromFullTransactionId(TransamVariables->nextXid); int64 pageno = TransactionIdToPage(xid); - LWLockAcquire(XactSLRULock, LW_EXCLUSIVE); - /* * Initialize our idea of the latest page number. */ - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_write_u64(XactCtl->shared->latest_page_number, pageno); } /* diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 61b82385f3..6bfe60343e 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -689,9 +689,7 @@ ActivateCommitTs(void) /* * Re-Initialize our idea of the latest page number. */ - LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE); - CommitTsCtl->shared->latest_page_number = pageno; - LWLockRelease(CommitTsSLRULock); + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, pageno); /* * If CommitTs is enabled, but it wasn't in the previous server run, we @@ -1006,7 +1004,8 @@ commit_ts_redo(XLogReaderState *record) * During XLOG replay, latest_page_number isn't set up yet; insert a * suitable value to bypass the sanity test in SimpleLruTruncate. 
*/ - CommitTsCtl->shared->latest_page_number = trunc->pageno; + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); } diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 59523be901..febc429f72 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2017,13 +2017,15 @@ StartupMultiXact(void) * Initialize offset's idea of the latest page number. */ pageno = MultiXactIdToOffsetPage(multi); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); } /* @@ -2047,14 +2049,15 @@ TrimMultiXact(void) oldestMXactDB = MultiXactState->oldestMultiXactDB; LWLockRelease(MultiXactGenLock); - /* Clean up offsets state */ - LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); - /* * (Re-)Initialize our idea of the latest page number for offsets. */ pageno = MultiXactIdToOffsetPage(nextMXact); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); + + /* Clean up offsets state */ + LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current offsets page. See notes in @@ -2081,14 +2084,16 @@ TrimMultiXact(void) LWLockRelease(MultiXactOffsetSLRULock); - /* And the same for members */ - LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE); - /* + * And the same for members. + * * (Re-)Initialize our idea of the latest page number for members. 
*/ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + LWLockAcquire(MultiXactMemberSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current members page. See notes in @@ -,7 +3338,8 @@ multixact_redo(XLogReaderState *record) * SimpleLruTruncate. */ pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff); LWLockRelease(MultiXactTruncationLock); diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c index 9ac4790f16..c1d0dfc73b 100644 --- a/src/backend/access/transam/slru.c +++ b/src/backend/access/transam/slru.c @@ -17,7 +17,8 @@ * per-buffer LWLocks that synchronize I/O for each buffer. The control lock * must be held to
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-02, Dilip Kumar wrote: > I have checked the patch and it looks fine to me other than the above > question related to memory barrier usage one more question about the > same, basically below two instances 1 and 2 look similar but in 1 you > are not using the memory write_barrier whereas in 2 you are using the > write_barrier, why is it so? I mean why can the reordering not happen > in 1 while it may happen in 2? What I was thinking is that there's a lwlock operation just below, which acts as a barrier. But I realized something more important: there are only two places that matter, which are SlruSelectLRUPage and SimpleLruZeroPage. The others are all initialization code that runs at a point where there's not going to be any concurrency in SLRU access, so we don't need barriers anyway. In SlruSelectLRUPage we definitely don't want to evict the page that SimpleLruZeroPage has initialized, starting from the point where it returns that new page to its caller. But if you consider the code of those two routines, you realize that the only time an equality between latest_page_number and "this_page_number" is going to occur is when both pages are in the same bank ... and both routines are required to be holding the bank lock while they run, so in practice this is never a problem. We need the atomic write and atomic read so that multiple processes processing pages in different banks can update latest_page_number simultaneously. But the equality condition that we're looking for? It can never happen concurrently. In other words, these barriers are fully useless. (We also have SimpleLruTruncate, but I think it's not as critical to have a barrier there anyhow: accessing a slightly outdated page number could only be a problem if a bug elsewhere causes us to try to truncate in the current page. I think we only have this code there because we did have such bugs in the past, but IIUC this shouldn't happen anymore.) 
-- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
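For readers of the archive, the "equal page numbers imply the same bank, hence the same bank lock" argument above can be sketched in a toy model. The bank count and the modulo mapping below are illustrative assumptions only; the actual patch derives the bank from the page number with a per-SLRU bank mask, and `bank_for_page`/`same_bank_lock` are hypothetical names, not functions in the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: each SLRU page lives in exactly one bank, and the bank is a
 * pure function of the page number.  (Assumed mapping; the real code uses
 * a mask computed from the configured number of buffers.) */
static inline uint64_t bank_for_page(uint64_t pageno, uint64_t nbanks)
{
    return pageno % nbanks;
}

/* SimpleLruZeroPage and SlruSelectLRUPage each hold the bank lock of the
 * page they operate on, so latest_page_number == this_page_number can only
 * ever be observed while both hold the very same lock. */
static inline int same_bank_lock(uint64_t latest, uint64_t candidate,
                                 uint64_t nbanks)
{
    return bank_for_page(latest, nbanks) == bank_for_page(candidate, nbanks);
}
```

Since equality of two page numbers trivially puts them in the same bank, the two routines can never race on that comparison, which is the heart of the "no barrier needed" argument.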
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 4:34 PM Dilip Kumar wrote: > > On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar wrote: > > > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera > > wrote: > > > Okay. > > > > > > While I have your attention -- if you could give a look to the 0001 > > > patch I posted, I would appreciate it. > > > > > > > I will look into it. Thanks. > > Some quick observations, > > Do we need the below two write barriers at the end of the function? > because the next instruction is separated by the function boundary > > @@ -766,14 +766,11 @@ StartupCLOG(void) > .. > - XactCtl->shared->latest_page_number = pageno; > - > - LWLockRelease(XactSLRULock); > + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); > + pg_write_barrier(); > } > > /* > * Initialize member's idea of the latest page number. > */ > pageno = MXOffsetToMemberPage(offset); > - MultiXactMemberCtl->shared->latest_page_number = pageno; > + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, > + pageno); > + > + pg_write_barrier(); > } > I have checked the patch and it looks fine to me. Other than the above question related to memory barrier usage, one more question about the same: basically, the below two instances 1 and 2 look similar, but in 1 you are not using the memory write_barrier whereas in 2 you are using the write_barrier; why is it so? I mean, why can the reordering not happen in 1 while it may happen in 2? 1. + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); vs 2. - shared->latest_page_number = pageno; + pg_atomic_write_u64(&shared->latest_page_number, pageno); + pg_write_barrier(); /* update the stats counter of zeroed pages */ pgstat_count_slru_page_zeroed(shared->slru_stats_idx); -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 4:12 PM Dilip Kumar wrote: > > On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera wrote: > Okay. > > > > While I have your attention -- if you could give a look to the 0001 > > patch I posted, I would appreciate it. > > > > I will look into it. Thanks. Some quick observations, Do we need the below two write barriers at the end of the function? because the next instruction is separated by the function boundary @@ -766,14 +766,11 @@ StartupCLOG(void) .. - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); + pg_write_barrier(); } /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + pg_write_barrier(); } I am looking more into this from the concurrency point of view and will update you soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 3:44 PM Alvaro Herrera wrote: > > On 2024-Feb-01, Dilip Kumar wrote: > > > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera > > wrote: > > > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > > > "transaction_buffers": 17 > > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must > > > be a multiple of 16 > > > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > > instead of giving an error? > > Since this is the auto-tuning feature, I think it should use the > previous multiple rather than the next, but yeah, something like that. Okay. > > While I have your attention -- if you could give a look to the 0001 > patch I posted, I would appreciate it. > I will look into it. Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Feb-01, Dilip Kumar wrote: > On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera wrote: > > > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > > "transaction_buffers": 17 > > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must > > be a multiple of 16 > > Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE > instead of giving an error? Since this is the auto-tuning feature, I think it should use the previous multiple rather than the next, but yeah, something like that. While I have your attention -- if you could give a look to the 0001 patch I posted, I would appreciate it. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "Los trabajadores menos efectivos son sistematicamente llevados al lugar donde pueden hacer el menor daño posible: gerencia." (El principio Dilbert)
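The auto-tuning behavior being discussed (scale with shared_buffers, but round down to the previous multiple of the bank size rather than erroring out) can be sketched as below. This is a hypothetical illustration assembled from the thread: the /512 scaling divisor and the 16/1024 bounds are mentioned in nearby messages, but the function name and exact shape are assumptions, not the actual GUC check hook.

```c
#include <assert.h>

/* Assumed constants from the discussion: 16 buffers per bank, and an
 * arbitrary auto-tuning ceiling of 1024 buffers (8 MB). */
#define SLRU_BANK_SIZE     16
#define SLRU_AUTOTUNE_MAX  1024

/* Round the scaled value down to the previous multiple of the bank size,
 * then clamp into [SLRU_BANK_SIZE, SLRU_AUTOTUNE_MAX].  shared_buffers is
 * expressed in blocks here. */
static int autotune_slru_buffers(int shared_buffers)
{
    int n = shared_buffers / 512;

    n -= n % SLRU_BANK_SIZE;        /* previous multiple, not the next */
    if (n < SLRU_BANK_SIZE)
        n = SLRU_BANK_SIZE;
    if (n > SLRU_AUTOTUNE_MAX)
        n = SLRU_AUTOTUNE_MAX;
    return n;
}
```

With this rule the `shared_buffers=$((512*17))` case from the error report would silently tune down to 16 buffers instead of failing at startup, and a 30 GB shared_buffers setting would be capped at the ceiling.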
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Feb 1, 2024 at 3:19 PM Alvaro Herrera wrote: > > Hah: > > postgres -c lc_messages=C -c shared_buffers=$((512*17)) > > 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter > "transaction_buffers": 17 > 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be > a multiple of 16 Maybe we should resize it to the next multiple of the SLRU_BANK_SIZE instead of giving an error? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Hah: postgres -c lc_messages=C -c shared_buffers=$((512*17)) 2024-02-01 10:48:13.548 CET [1535379] FATAL: invalid value for parameter "transaction_buffers": 17 2024-02-01 10:48:13.548 CET [1535379] DETAIL: "transaction_buffers" must be a multiple of 16 -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ It does it in a really, really complicated way why does it need to be complicated? Because it's MakeMaker.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Jan-29, Dilip Kumar wrote: > Thank you for working on this. There is one thing that I feel is > problematic. We have kept the allowed values for these GUCs to be in > multiples of SLRU_BANK_SIZE i.e. 16, and that's the reason the min > values were changed to 16; but in this refactoring patch for some of > the buffers you have changed that to 8, so I think that's not good. Oh, absolutely, you're right. Restored the minimum to 16. So, here's the patchset as two pieces. 0001 converts SlruSharedData->latest_page_number to use atomics. I don't see any reason to mix this in with the rest of the patch, and though it likely won't have any performance advantage by itself (since the lock acquisition is pretty much the same), it seems better to get it in ahead of the rest -- I think that simplifies matters for the second patch, which is large enough. So, 0002 introduces the rest of the feature. I removed the use of a different number of bank locks than banks, and I made commit_ts use a longer lwlock acquisition at truncation time, rather than forcing all-lwlock acquisition. The more I look at 0002, the more I notice that some comments badly need updating, so please don't read too much into it yet. But I wanted to post it anyway for archives and cfbot purposes. 
-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ >From 464a996b85c333ffc781086263c2e491758b248f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 31 Jan 2024 12:27:51 +0100 Subject: [PATCH 1/2] Use atomics for SlruSharedData->latest_page_number --- src/backend/access/transam/clog.c | 7 ++ src/backend/access/transam/commit_ts.c | 7 +++--- src/backend/access/transam/multixact.c | 30 -- src/backend/access/transam/slru.c | 19 ++-- src/include/access/slru.h | 5 - 5 files changed, 41 insertions(+), 27 deletions(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index f6e7da7ffc..245fd21c8d 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -766,14 +766,11 @@ StartupCLOG(void) TransactionId xid = XidFromFullTransactionId(TransamVariables->nextXid); int64 pageno = TransactionIdToPage(xid); - LWLockAcquire(XactSLRULock, LW_EXCLUSIVE); - /* * Initialize our idea of the latest page number. */ - XactCtl->shared->latest_page_number = pageno; - - LWLockRelease(XactSLRULock); + pg_atomic_init_u64(&XactCtl->shared->latest_page_number, pageno); + pg_write_barrier(); } /* diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c index 61b82385f3..f68705989d 100644 --- a/src/backend/access/transam/commit_ts.c +++ b/src/backend/access/transam/commit_ts.c @@ -689,9 +689,7 @@ ActivateCommitTs(void) /* * Re-Initialize our idea of the latest page number. */ - LWLockAcquire(CommitTsSLRULock, LW_EXCLUSIVE); - CommitTsCtl->shared->latest_page_number = pageno; - LWLockRelease(CommitTsSLRULock); + pg_atomic_init_u64(&CommitTsCtl->shared->latest_page_number, pageno); /* * If CommitTs is enabled, but it wasn't in the previous server run, we @@ -1006,7 +1004,8 @@ commit_ts_redo(XLogReaderState *record) * During XLOG replay, latest_page_number isn't set up yet; insert a * suitable value to bypass the sanity test in SimpleLruTruncate. 
*/ - CommitTsCtl->shared->latest_page_number = trunc->pageno; + pg_atomic_write_u64(&CommitTsCtl->shared->latest_page_number, + trunc->pageno); SimpleLruTruncate(CommitTsCtl, trunc->pageno); } diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 59523be901..a886c29892 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2017,13 +2017,17 @@ StartupMultiXact(void) * Initialize offset's idea of the latest page number. */ pageno = MultiXactIdToOffsetPage(multi); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); /* * Initialize member's idea of the latest page number. */ pageno = MXOffsetToMemberPage(offset); - MultiXactMemberCtl->shared->latest_page_number = pageno; + pg_atomic_init_u64(&MultiXactMemberCtl->shared->latest_page_number, + pageno); + + pg_write_barrier(); } /* @@ -2047,14 +2051,15 @@ TrimMultiXact(void) oldestMXactDB = MultiXactState->oldestMultiXactDB; LWLockRelease(MultiXactGenLock); - /* Clean up offsets state */ - LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); - /* * (Re-)Initialize our idea of the latest page number for offsets. */ pageno = MultiXactIdToOffsetPage(nextMXact); - MultiXactOffsetCtl->shared->latest_page_number = pageno; + pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number, + pageno); + + /* Clean up offsets state */ + LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); /* * Zero out the remainder of the current offsets page. See notes in @@ -2081,14 +2086,16 @@
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Jan 26, 2024 at 11:08 PM Alvaro Herrera wrote: > > I've continued to review this and decided that I don't like the mess > this patch proposes in order to support pg_commit_ts's deletion of all > files. (Yes, I know that I was the one that proposed this idea. It's > still a bad one). I'd like to change that code by removing the limit > that we can only have 128 bank locks in a SLRU. To recap, the reason we > did this is that commit_ts sometimes wants to delete all files while > running (DeactivateCommitTs), and for this it needs to acquire all bank > locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we > added the SLRU limit making multiple banks share lwlocks. > > I propose two alternative solutions: > > 1. The easiest is to have DeactivateCommitTs continue to hold > CommitTsLock until the end, including while SlruScanDirectory does its > thing. This sounds terrible, but considering that this code only runs > when the module is being deactivated, I don't think it's really all that > bad in practice. I mean, if you deactivate the commit_ts module and > then try to read commit timestamp data, you deserve to wait for a few > seconds just as a punishment for your stupidity. I think this idea looks reasonable. I agree that if we are trying to read commit_ts after deactivating it then it's fine to make it wait. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Jan 25, 2024 at 10:03 PM Alvaro Herrera wrote: > > On 2024-Jan-25, Alvaro Herrera wrote: > > > Here's a touched-up version of this patch. > > > diff --git a/src/backend/storage/lmgr/lwlock.c > > b/src/backend/storage/lmgr/lwlock.c > > index 98fa6035cc..4a5e05d5e4 100644 > > --- a/src/backend/storage/lmgr/lwlock.c > > +++ b/src/backend/storage/lmgr/lwlock.c > > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > > }; > > Eeek. Last minute changes ... Fixed here. Thank you for working on this. There is one thing that I feel is problematic. We have kept the allowed values for these GUCs to be in multiples of SLRU_BANK_SIZE i.e. 16, and that's the reason the min values were changed to 16; but in this refactoring patch for some of the buffers you have changed that to 8, so I think that's not good. + { + {"multixact_offsets_buffers", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Sets the size of the dedicated buffer pool used for the MultiXact offset cache."), + NULL, + GUC_UNIT_BLOCKS + }, + &multixact_offsets_buffers, + 16, 8, SLRU_MAX_ALLOWED_BUFFERS, + check_multixact_offsets_buffers, NULL, NULL + }, Other than this, the patch looks good to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 26 Jan 2024, at 22:38, Alvaro Herrera wrote: > > This is OK because in the > default compilation each file only has 32 segments, so that requires > only 32 lwlocks held at once while the file is being deleted. Do we somehow account for different subsystems not accumulating MAX_SIMUL_LWLOCKS together? E.g. GiST during a split can combine 75 locks, and somehow commit_ts will be deactivated by this backend at the same moment and add 32 locks more :) I understand that this sounds fantastic, and these subsystems do not interfere. But it is fantastic only until something like that actually happens. If possible, I'd prefer one lock at a time, and maybe sometimes two or three with some guarantees that this is safe. So, from my POV the first solution that you proposed seems much better to me. Thanks for working on this! Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
I've continued to review this and decided that I don't like the mess this patch proposes in order to support pg_commit_ts's deletion of all files. (Yes, I know that I was the one that proposed this idea. It's still a bad one). I'd like to change that code by removing the limit that we can only have 128 bank locks in a SLRU. To recap, the reason we did this is that commit_ts sometimes wants to delete all files while running (DeactivateCommitTs), and for this it needs to acquire all bank locks. Since going above the MAX_SIMUL_LWLOCKS limit is disallowed, we added the SLRU limit making multiple banks share lwlocks. I propose two alternative solutions: 1. The easiest is to have DeactivateCommitTs continue to hold CommitTsLock until the end, including while SlruScanDirectory does its thing. This sounds terrible, but considering that this code only runs when the module is being deactivated, I don't think it's really all that bad in practice. I mean, if you deactivate the commit_ts module and then try to read commit timestamp data, you deserve to wait for a few seconds just as a punishment for your stupidity. AFAICT the cases where anything is done in the module mostly check without locking that commitTsActive is set, so we're not slowing down any critical operations. Also, if you don't like to be delayed for a couple of seconds, just don't deactivate the module. 2. If we want some refinement, the other idea is to change SlruScanDirCbDeleteAll (the callback that SlruScanDirectory uses in this case) so that it acquires the bank lock appropriate for all the slots used by the file that's going to be deleted. This is OK because in the default compilation each file only has 32 segments, so that requires only 32 lwlocks held at once while the file is being deleted. I think we don't need to bother with this, but it's an option if we see the above as unworkable for whatever reason. 
The only other user of SlruScanDirCbDeleteAll is async.c (the LISTEN/ NOTIFY code), and what that does is delete all the files at server start, where nothing is running concurrently anyway. So whatever we do for commit_ts, it won't affect async.c. So, if we do away with the SLRU_MAX_BANKLOCKS idea, then we're assured one LWLock per bank (instead of sharing some lwlocks to multiple banks), and that makes the code simpler to reason about. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "In fact, the basic problem with Perl 5's subroutines is that they're not crufty enough, so the cruft leaks out into user-defined code instead, by the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6)
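To see why option 2 stays comfortably within MAX_SIMUL_LWLOCKS, one can count the distinct bank locks a single segment file can touch. The sketch below is a hypothetical model, not the actual SlruScanDirCbDeleteAll code: the modulo bank mapping and the constants are assumptions for illustration, with 32 pages per segment file standing in for the "default compilation" figure from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 32 pages per segment file, and at most 128 banks. */
#define SLRU_PAGES_PER_SEGMENT 32
#define MAX_BANKS              128

/* Count how many distinct bank locks would have to be held to cover every
 * page of one segment file.  The answer can never exceed the number of
 * pages per segment, so at most 32 locks are held at once, well below the
 * MAX_SIMUL_LWLOCKS limit of 200. */
static int banks_for_segment(uint64_t segno, uint64_t nbanks)
{
    bool needed[MAX_BANKS] = {false};
    int count = 0;

    for (int p = 0; p < SLRU_PAGES_PER_SEGMENT; p++)
    {
        uint64_t bank = (segno * SLRU_PAGES_PER_SEGMENT + p) % nbanks;

        if (!needed[bank])
        {
            needed[bank] = true;
            count++;
        }
    }
    return count;
}
```

With few banks the whole set of bank locks is needed; with many banks the count tops out at the pages-per-segment bound, which is what makes the per-file locking in option 2 safe.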
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Jan 25, 2024 at 11:22 AM Alvaro Herrera wrote: > Still with these auto-tuning GUCs, I noticed that the auto-tuning code > would continue to grow the buffer sizes with shared_buffers to > arbitrarily large values. I added an arbitrary maximum of 1024 (8 MB), > which is much higher than the current value of 128; but if you have > (say) 30 GB of shared_buffers (not uncommon these days), do you really > need 30MB of pg_clog cache? It seems mostly unnecessary ... and you can > still set it manually that way if you need it. So, largely I just > rewrote those small functions completely. Yeah, I think that if we're going to scale with shared_buffers, it should be capped. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Jan-25, Alvaro Herrera wrote: > Here's a touched-up version of this patch. > diff --git a/src/backend/storage/lmgr/lwlock.c > b/src/backend/storage/lmgr/lwlock.c > index 98fa6035cc..4a5e05d5e4 100644 > --- a/src/backend/storage/lmgr/lwlock.c > +++ b/src/backend/storage/lmgr/lwlock.c > @@ -163,6 +163,13 @@ static const char *const BuiltinTrancheNames[] = { > [LWTRANCHE_LAUNCHER_HASH] = "LogicalRepLauncherHash", > [LWTRANCHE_DSM_REGISTRY_DSA] = "DSMRegistryDSA", > [LWTRANCHE_DSM_REGISTRY_HASH] = "DSMRegistryHash", > + [LWTRANCHE_COMMITTS_SLRU] = "CommitTSSLRU", > + [LWTRANCHE_MULTIXACTOFFSET_SLRU] = "MultixactOffsetSLRU", > + [LWTRANCHE_MULTIXACTMEMBER_SLRU] = "MultixactMemberSLRU", > + [LWTRANCHE_NOTIFY_SLRU] = "NotifySLRU", > + [LWTRANCHE_SERIAL_SLRU] = "SerialSLRU" > + [LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU", > + [LWTRANCHE_XACT_SLRU] = "XactSLRU", > }; Eeek. Last minute changes ... Fixed here. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ "La primera ley de las demostraciones en vivo es: no trate de usar el sistema. Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 61038472c5..3e3119865a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2006,6 +2006,145 @@ include_dir 'conf.d' + + commit_timestamp_buffers (integer) + + commit_timestamp_buffers configuration parameter + + + + +Specifies the amount of memory to use to cache the contents of +pg_commit_ts (see +). +If this value is specified without units, it is taken as blocks, +that is BLCKSZ bytes, typically 8kB. +The default value is 0, which requests +shared_buffers/512 up to 1024 blocks, +but not fewer than 16 blocks. +This parameter can only be set at server start. 
+
+      multixact_members_buffers (integer)
+        multixact_members_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_multixact/members (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 32.
+        This parameter can only be set at server start.
+
+      multixact_offsets_buffers (integer)
+        multixact_offsets_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_multixact/offsets (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 16.
+        This parameter can only be set at server start.
+
+      notify_buffers (integer)
+        notify_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_notify (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 16.
+        This parameter can only be set at server start.
+
+      serializable_buffers (integer)
+        serializable_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_serial (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 32.
+        This parameter can only be set at server start.
+
+      subtransaction_buffers (integer)
+        subtransaction_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_subtrans (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 0, which requests
+        shared_buffers/512 up to 1024 blocks,
+        but not fewer than 16 blocks.
+        This parameter can only be set at server start.
+
+      transaction_buffers (integer)
+        transaction_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_xact (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Here's a touched-up version of this patch.

First, PROC_GLOBAL->clogGroupFirst and SlruCtl->latest_page_number change from being protected by locks to being atomics, but there's no mention of considering the memory barriers they should have. Looking at the code, I think the former doesn't need any additional barriers, but latest_page_number is missing some, which I have added. This deserves another careful look.

Second, and most user-visibly, the GUC names seem to have been chosen based on the source-code variables, which were never meant to be user-visible. So I renamed a few:

  xact_buffers -> transaction_buffers
  subtrans_buffers -> subtransaction_buffers
  serial_buffers -> serializable_buffers
  commit_ts_buffers -> commit_timestamp_buffers
  (unchanged: multixact_offsets_buffers, multixact_members_buffers, notify_buffers)

I did this explicitly trying to avoid using the term SLRU in a user-visible manner, because what do users care? But immediately after doing this I realized that we already have pg_stat_slru, so maybe the cat is already out of the bag, and so perhaps we should name these GUCs as, say, slru_transaction_buffers? That may make the connection between these things a little more explicit. (I do think we need to add cross-links in the documentation of those GUCs to the pg_stat_slru docs and vice versa.)

Another thing that bothered me a bit is that we have auto-tuning for transaction_buffers and commit_timestamp_buffers, but not for subtransaction_buffers. (Autotuning means you set the GUC to 0 and it scales with shared_buffers.) I don't quite understand the reason for the omission, so I added it for subtrans too. I think it may make sense to do likewise for the multixact ones too, not sure. It doesn't seem worth having that for pg_serial and notify.
While messing about with these GUCs I realized that the use of the show_hook to print what the current number is, when autotuning is in effect, was bogus: SHOW would print the number of blocks for (say) transaction_buffers, but if you asked it to print (say) multixact_offsets_buffers, it would give a size in MB or kB. I'm sure such an inconsistency would bite us. So, digging around, I found that a good way to handle this is to remove the show_hook and instead call SetConfigOption() at the time the ShmemInit function is called, with the correct number of buffers determined. This is pretty much what is used for XLOGbuffers, and it works correctly as far as my testing shows.

Still with these auto-tuning GUCs, I noticed that the auto-tuning code would continue to grow the buffer sizes with shared_buffers to arbitrarily large values. I added an arbitrary maximum of 1024 (8 MB), which is much higher than the current value of 128; but if you have (say) 30 GB of shared_buffers (not uncommon these days), do you really need 30 MB of pg_clog cache? It seems mostly unnecessary ... and you can still set it manually that way if you need it. So, largely I just rewrote those small functions completely.

I also made the SGML documentation and postgresql.conf.sample match what the code actually does. The whole thing wasn't particularly consistent. I rewrote a bunch of code comments and moved stuff around to appear in alphabetical order, etc. More comment rewriting still pending.

-- Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 61038472c5..3e3119865a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,6 +2006,145 @@ include_dir 'conf.d'
+
+      commit_timestamp_buffers (integer)
+        commit_timestamp_buffers configuration parameter
+
+        Specifies the amount of memory to use to cache the contents of
+        pg_commit_ts (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 0, which requests
+        shared_buffers/512 up to 1024 blocks,
+        but not fewer than 16 blocks.
+        This parameter can only be set at server start.
+
+      multixact_members_buffers (integer)
+        multixact_members_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_multixact/members (see
+        ).
+        If this value is specified without units, it is taken as blocks,
+        that is BLCKSZ bytes, typically 8kB.
+        The default value is 32.
+        This parameter can only be set at server start.
+
+      multixact_offsets_buffers (integer)
+        multixact_offsets_buffers configuration parameter
+
+        Specifies the amount of shared memory to use to cache the contents
+        of pg_multixact/offsets (see
+
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2024-Jan-08, Dilip Kumar wrote:

> On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera wrote:
> >
> > The more I look at TransactionGroupUpdateXidStatus, the more I think
> > it's broken, and while we do have some tests, I don't have confidence
> > that they cover all possible cases.
> >
> > Or, at least, if this code is good, then it hasn't been sufficiently
> > explained.
>
> Any thought about a case in which you think it might be broken, I mean
> any vague thought might also help where you think it might not work as
> expected so that I can also think in that direction. It might be
> possible that I might not be thinking of some perspective that you are
> thinking and comments might be lacking from that point of view.

Eh, apologies. This email was an unfinished draft that I had lying around before the holidays, which I intended to discard but somehow kept, and just now I happened to press the wrong key combination and it ended up being sent instead.

We had some further discussion, after which I no longer think that there is a problem here, so please ignore this email. I'll come back to this patch later this week.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Jan 8, 2024 at 4:55 PM Alvaro Herrera wrote:
>
> The more I look at TransactionGroupUpdateXidStatus, the more I think
> it's broken, and while we do have some tests, I don't have confidence
> that they cover all possible cases.
>
> Or, at least, if this code is good, then it hasn't been sufficiently
> explained.

Any thought about a case in which you think it might be broken, or even any vague thought about where you think it might not work as expected, might help, so that I can also think in that direction. It might be that I am not thinking of some perspective that you are, and the comments might be lacking from that point of view.

> If we have multiple processes trying to write bits to clog, and they are
> using different banks, then the LWLockConditionalAcquire will be able to
> acquire the bank lock

Do you think there is a problem with multiple processes getting the lock? I mean, they are modifying different CLOG pages, so that can be done concurrently, right?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
The more I look at TransactionGroupUpdateXidStatus, the more I think it's broken, and while we do have some tests, I don't have confidence that they cover all possible cases.

Or, at least, if this code is good, then it hasn't been sufficiently explained.

If we have multiple processes trying to write bits to clog, and they are using different banks, then the LWLockConditionalAcquire will be able to acquire the bank lock

-- Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom." (Isaac Asimov)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Jan 3, 2024 at 12:08 AM Dilip Kumar wrote: > Yeah, this is indeed an interesting idea. So I think if we are > interested in working in this direction maybe this can be submitted in > a different thread, IMHO. Yeah, that's something quite different from the patch before us. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Jan 2, 2024 at 7:53 PM Robert Haas wrote: > > On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin > wrote: > > Just a side node. > > It seems like commit log is kind of antipattern of data contention. Even > > when we build a super-optimized SLRU. Nearby **bits** are written by > > different CPUs. > > I think that banks and locks are good thing. But also we could reorganize > > log so that > > status of transaction 0 is on a page 0 at bit offset 0 > > status of transaction 1 is on a page 1 at bit offset 0 > > status of transaction 2 is on a page 2 at bit offset 0 > > status of transaction 3 is on a page 3 at bit offset 0 > > status of transaction 4 is on a page 0 at bit offset 2 > > status of transaction 5 is on a page 1 at bit offset 2 > > status of transaction 6 is on a page 2 at bit offset 2 > > status of transaction 7 is on a page 3 at bit offset 2 > > etc... > > This is an interesting idea. A variant would be to stripe across > cachelines within the same page rather than across pages. If we do > stripe across pages as proposed here, we'd probably want to rethink > the way the SLRU is extended -- page at a time wouldn't really make > sense, but preallocating an entire file might. Yeah, this is indeed an interesting idea. So I think if we are interested in working in this direction maybe this can be submitted in a different thread, IMHO. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Jan 2, 2024 at 1:10 PM Andrey M. Borodin wrote: > > On 2 Jan 2024, at 19:23, Robert Haas wrote: > >> And it would be even better if page for transaction statuses would be > >> determined by backend id somehow. Or at least cache line. Can we allocate > >> a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each > >> backend? > > > > I don't understand how this could work. We need to be able to look up > > transaction status by XID, not backend ID. > > When GetNewTransactionId() is called we can reserve 256 xids in backend local > memory. This values will be reused by transactions or subtransactions of this > backend. Here 256 == sizeof(CacheLine). > This would ensure that different backends touch different cache lines. > > But this approach would dramatically increase xid consumption speed on > patterns where client reconnects after several transactions. So we can keep > unused xids in procarray for future reuse. > > I doubt we can find significant performance improvement here, because false > cache line sharing cannot be _that_ bad. Yeah, this seems way too complicated for what we'd potentially gain from it. An additional problem is that the xmin horizon computation assumes that XIDs are assigned in monotonically increasing fashion; breaking that would be messy. And even an occasional leak of XIDs could precipitate enough additional vacuuming to completely outweigh any gains we could hope to achieve here. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 2 Jan 2024, at 19:23, Robert Haas wrote:
>
> > And it would be even better if page for transaction statuses would be
> > determined by backend id somehow. Or at least cache line. Can we allocate a
> > range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each
> > backend?
>
> I don't understand how this could work. We need to be able to look up
> transaction status by XID, not backend ID.

When GetNewTransactionId() is called we can reserve 256 xids in backend-local memory. These values will be reused by transactions or subtransactions of this backend. Here 256 == sizeof(CacheLine). This would ensure that different backends touch different cache lines.

But this approach would dramatically increase xid consumption speed on patterns where the client reconnects after several transactions. So we could keep unused xids in the procarray for future reuse.

I doubt we can find a significant performance improvement here, because false cache line sharing cannot be _that_ bad.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Dec 22, 2023 at 8:14 AM Andrey M. Borodin wrote: > Just a side node. > It seems like commit log is kind of antipattern of data contention. Even when > we build a super-optimized SLRU. Nearby **bits** are written by different > CPUs. > I think that banks and locks are good thing. But also we could reorganize log > so that > status of transaction 0 is on a page 0 at bit offset 0 > status of transaction 1 is on a page 1 at bit offset 0 > status of transaction 2 is on a page 2 at bit offset 0 > status of transaction 3 is on a page 3 at bit offset 0 > status of transaction 4 is on a page 0 at bit offset 2 > status of transaction 5 is on a page 1 at bit offset 2 > status of transaction 6 is on a page 2 at bit offset 2 > status of transaction 7 is on a page 3 at bit offset 2 > etc... This is an interesting idea. A variant would be to stripe across cachelines within the same page rather than across pages. If we do stripe across pages as proposed here, we'd probably want to rethink the way the SLRU is extended -- page at a time wouldn't really make sense, but preallocating an entire file might. > And it would be even better if page for transaction statuses would be > determined by backend id somehow. Or at least cache line. Can we allocate a > range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each > backend? I don't understand how this could work. We need to be able to look up transaction status by XID, not backend ID. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 19 Dec 2023, at 10:34, Dilip Kumar wrote:

Just a side note. It seems like the commit log is kind of an antipattern of data contention, even when we build a super-optimized SLRU: nearby **bits** are written by different CPUs. I think that banks and locks are a good thing. But we could also reorganize the log so that

status of transaction 0 is on page 0 at bit offset 0
status of transaction 1 is on page 1 at bit offset 0
status of transaction 2 is on page 2 at bit offset 0
status of transaction 3 is on page 3 at bit offset 0
status of transaction 4 is on page 0 at bit offset 2
status of transaction 5 is on page 1 at bit offset 2
status of transaction 6 is on page 2 at bit offset 2
status of transaction 7 is on page 3 at bit offset 2
etc...

And it would be even better if the page for transaction statuses were determined by backend id somehow, or at least the cache line. Can we allocate a range (sizeof(cacheline)) of xids\subxids\multixacts\whatever for each backend?

This does not matter much because
0. The patch set in the current thread produces a robust SLRU anyway
1. One day we are going to throw away SLRU anyway

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 11:00 PM Robert Haas wrote:

> On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote:
> > certain sense they are competing for the same job. However, they do
> > aim to alleviate different TYPES of contention: the group XID update
> > stuff should be most valuable when lots of processes are trying to
> > update the same page, and the banks should be most valuable when there
> > is simultaneous access to a bunch of different pages. So I'm not
> > convinced that this patch is a reason to remove the group XID update
> > mechanism, but someone might argue otherwise.
>
> Hmm, but, on the other hand:
>
> Currently all readers and writers are competing for the same LWLock.
> But with this change, the readers will (mostly) no longer be competing
> with the writers. So, in theory, that might reduce lock contention
> enough to make the group update mechanism pointless.

Thanks for your feedback. I agree that with bank-wise locks we might not need group updates for some of the use cases you describe, where readers and writers are contending on the centralized lock, because in most cases the readers will be distributed across different banks. OTOH there are use cases where the writer commit is the bottleneck (on the SLRU lock), like pgbench simple-update or TPC-B, and there we will still benefit from the group update. During group update testing we have seen benefits in such a scenario [1] with high client counts. So, as per my understanding, by distributing the SLRU locks there are scenarios where we will not need the group update anymore, but OTOH there are also scenarios where we will still benefit from it.

[1] https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 12:53 PM Andrey M. Borodin wrote: > One page still accommodates 32K transaction statuses under one lock. It feels > like a lot. About 1 second of transactions on a typical installation. > > When the group commit was committed did we have a benchmark to estimate > efficiency of this technology? Can we repeat that test again? I think we did, but it might take some research to find it in the archives. If we can, I agree that repeating it feels like a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 18 Dec 2023, at 22:30, Robert Haas wrote: > > On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote: >> certain sense they are competing for the same job. However, they do >> aim to alleviate different TYPES of contention: the group XID update >> stuff should be most valuable when lots of processes are trying to >> update the same page, and the banks should be most valuable when there >> is simultaneous access to a bunch of different pages. So I'm not >> convinced that this patch is a reason to remove the group XID update >> mechanism, but someone might argue otherwise. > > Hmm, but, on the other hand: > > Currently all readers and writers are competing for the same LWLock. > But with this change, the readers will (mostly) no longer be competing > with the writers. So, in theory, that might reduce lock contention > enough to make the group update mechanism pointless. One page still accommodates 32K transaction statuses under one lock. It feels like a lot. About 1 second of transactions on a typical installation. When the group commit was committed did we have a benchmark to estimate efficiency of this technology? Can we repeat that test again? Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 18, 2023 at 12:04 PM Robert Haas wrote: > certain sense they are competing for the same job. However, they do > aim to alleviate different TYPES of contention: the group XID update > stuff should be most valuable when lots of processes are trying to > update the same page, and the banks should be most valuable when there > is simultaneous access to a bunch of different pages. So I'm not > convinced that this patch is a reason to remove the group XID update > mechanism, but someone might argue otherwise. Hmm, but, on the other hand: Currently all readers and writers are competing for the same LWLock. But with this change, the readers will (mostly) no longer be competing with the writers. So, in theory, that might reduce lock contention enough to make the group update mechanism pointless. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Dec 12, 2023 at 8:29 AM Alvaro Herrera wrote: > The problem I see is that the group update mechanism is designed around > contention of the global xact-SLRU control lock; it uses atomics to > coordinate a single queue when the lock is contended. So if we split up > the global SLRU control lock using banks, then multiple processes using > different bank locks might not contend. OK, this is fine, but what > happens if two separate groups of processes encounter contention on two > different bank locks? I think they will both try to update the same > queue, and coordinate access to that *using different bank locks*. I > don't see how can this work correctly. > > I suspect the first part of that algorithm, where atomics are used to > create the list without a lock, might work fine. But will each "leader" > process, each of which is potentially using a different bank lock, > coordinate correctly? Maybe this works correctly because only one > process will find the queue head not empty? If this is what happens, > then there needs to be comments about it. Without any explanation, > this seems broken and potentially dangerous, as some transaction commit > bits might become lost given high enough concurrency and bad luck. I don't want to be dismissive of this concern, but I took a look at this part of the patch set and I don't see a correctness problem. I think the idea of the mechanism is that we have a single linked list in shared memory that can accumulate those waiters. At some point a process pops the entire list of waiters, which allows a new group of waiters to begin accumulating. The process that pops the list must perform the updates for every process in the just-popped list without failing, else updates would be lost. In theory, there can be any number of lists that some process has popped and is currently working its way through at the same time, although in practice I bet it's quite rare for there to be more than one. 
The only correctness problem is if it's possible for a process that popped the list to error out before it finishes doing the updates that it "promised" to do by popping the list. Having individual bank locks doesn't really change anything here. That's just a matter of what lock has to be held in order to perform the update that was promised, and the algorithm described in the previous paragraph doesn't really care about that. Nor is there a deadlock hazard here as long as processes only take one lock at a time, which I believe is the case. The only potential issue that I see here is with performance. I've heard some questions about whether this machinery performs well even as it stands, but certainly if we divide up the lock into a bunch of bankwise locks then that should tend in the direction of making a mechanism like this less valuable, because both mechanisms are trying to alleviate contention, and so in a certain sense they are competing for the same job. However, they do aim to alleviate different TYPES of contention: the group XID update stuff should be most valuable when lots of processes are trying to update the same page, and the banks should be most valuable when there is simultaneous access to a bunch of different pages. So I'm not convinced that this patch is a reason to remove the group XID update mechanism, but someone might argue otherwise. A related concern is that, if by chance we do end up with multiple updaters from different pages in the same group, it will now be more expensive to sort that out because we'll have to potentially keep switching locks. So that could make the group XID update mechanism less performant than it is currently. I think that's probably not an issue because I think it should be a rare occurrence, as the comments say. A bit more cost in a code path that is almost never taken won't matter. But if that path is more commonly taken than I think, then maybe making it more expensive could hurt. 
It might be worth adding some debugging to see how often we actually go down that path in a highly stressed system. BTW: +* as well. The main reason for now allowing requesters of different pages now -> not -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Dec 14, 2023 at 4:36 PM Dilip Kumar wrote:

> On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin wrote:
> >
> > On 12 Dec 2023, at 18:28, Alvaro Herrera wrote:
> > >
> > > Andrey, do you have any stress tests or anything else that you used to
> > > gain confidence in this code?

I have done some more testing of the clog group update. The attached test file executes two concurrent scripts with pgbench: the first is a slow script which runs 10-second-long transactions, and the second is a very fast transaction at ~1 transaction per second. Along with that, I have also changed the bank size such that each bank contains just 1 page, i.e. 32k transactions per bank. I did it this way so that we do not need to keep long-running transactions open for very long in order to get transactions from different banks committed at the same time.

With this test I got that behavior, and the logs below show that multiple transaction ranges in different slru banks (considering 32k transactions per bank) are doing group updates at the same time. E.g., we can see that the xid ranges around 70600, 70548, and 70558, and the xid ranges around 755 and 752, are getting group updates by different leaders at nearly the same time. It runs fine over a long duration, but I am not sure how to validate the sanity of this kind of test.
2023-12-14 14:43:31.813 GMT [3306] LOG: group leader procno 606 updated status of procno 606 xid 70600
2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 for xid 70548 added for group update
2023-12-14 14:43:31.816 GMT [3326] LOG: procno 586 is group leader and got the lock
2023-12-14 14:43:31.816 GMT [3326] LOG: group leader procno 586 updated status of procno 586 xid 70548
2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 for xid 70558 added for group update
2023-12-14 14:43:31.818 GMT [3327] LOG: procno 585 is group leader and got the lock
2023-12-14 14:43:31.818 GMT [3327] LOG: group leader procno 585 updated status of procno 585 xid 70558
2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 for xid 752 added for group update
2023-12-14 14:43:31.829 GMT [3207] LOG: procno 669 for xid 755 added for group update
2023-12-14 14:43:31.829 GMT [3155] LOG: procno 687 is group leader and got the lock
2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 669 xid 755
2023-12-14 14:43:31.829 GMT [3155] LOG: group leader procno 687 updated status of procno 687 xid 752

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

# Goal of this script is to generate a scenario where some old long-running slow
# transactions get committed together with new transactions, such that they fall
# in different slru banks
rm -rf pgdata
./initdb -D pgdata
echo "max_wal_size=20GB" >> pgdata/postgresql.conf
echo "shared_buffers=20GB" >> pgdata/postgresql.conf
echo "checkpoint_timeout=40min" >> pgdata/postgresql.conf
echo "max_connections=700" >> pgdata/postgresql.conf
echo "maintenance_work_mem=1GB" >> pgdata/postgresql.conf
echo "subtrans_buffers=64" >> pgdata/postgresql.conf
echo "multixact_members_buffers=128" >> pgdata/postgresql.conf

#create slow_txn.sql script
cat > slow_txn.sql << EOF
BEGIN;
INSERT INTO test VALUES(1);
DELETE FROM test WHERE a=1;
select pg_sleep(10);
COMMIT;
EOF

#create fast_txn.sql script
cat > fast_txn.sql << EOF
BEGIN;
INSERT INTO test1 VALUES(1);
DELETE FROM test1 WHERE a=1;
COMMIT;
EOF

./pg_ctl -D pgdata -l logfile -c start
./psql -d postgres -c "create table test(a int)"
./psql -d postgres -c "create table test1(a int)"
./pgbench -i -s 1 postgres
./pgbench -f slow_txn.sql -c 28 -j 28 -P 1 -T 60 postgres &
./pgbench -f fast_txn.sql -c 100 -j 100 -P 1 -T 60 postgres
sleep(10);
./pg_ctl -D pgdata -l logfile -c stop
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 16:32, tender wang wrote:
>
> enable -O2, only one instruction:
> xor eax, eax

This is not fast code. This is how a friendly C compiler hints that the mask must be 127, not 128.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Andrey M. Borodin wrote on Thu, 14 Dec 2023, at 17:35:

> > On 14 Dec 2023, at 14:28, tender wang wrote:
> >
> > Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS'
> > operation in SimpleLruGetBankLock() with '& 127'?
>
> unsigned int GetBankno1(unsigned int pageno) {
>     return pageno & 127;
> }
>
> unsigned int GetBankno2(unsigned int pageno) {
>     return pageno % 128;
> }
>
> Generates with -O2:
>
> GetBankno1(unsigned int):
>     mov eax, edi
>     and eax, 127
>     ret
> GetBankno2(unsigned int):
>     mov eax, edi
>     and eax, 127
>     ret
>
> Compiler is smart enough with constants.

Yeah, that's true.

int GetBankno(long pageno)
{
    unsigned short bank_mask = 128;
    int bankno = (pageno & bank_mask) % 128;
    return bankno;
}

With -O2 enabled, only one instruction:

    xor eax, eax

But if we use '%' in both places, things change as below:

int GetBankno(long pageno)
{
    unsigned short bank_mask = 128;
    int bankno = (pageno % bank_mask) % 128;
    return bankno;
}

    mov rdx, rdi
    sar rdx, 63
    shr rdx, 57
    lea rax, [rdi+rdx]
    and eax, 127
    sub eax, edx

> Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 16:06, Dilip Kumar wrote:
>
> I have noticed a very good improvement with the addition of patch 0003.

Indeed, very impressive results! It’s almost a 2x performance gain in the high-contention scenario, on top of the previous improvements.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Dec 13, 2023 at 5:49 PM Andrey M. Borodin wrote:
>
> > On 12 Dec 2023, at 18:28, Alvaro Herrera wrote:
> >
> > Andrey, do you have any stress tests or anything else that you used to
> > gain confidence in this code?
>
> We are using only the first two steps of the patchset; these steps do not
> touch the locking stuff.
>
> We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks!

I have run this test [1]; instead of comparing against master I have compared the effect of (patch-1 = (0001+0002) slru buffer bank) vs (patch-2 = (0001+0002+0003) slru buffer bank + bank-wise lock), and here are the results of benchmark-1 and benchmark-2. I have noticed a very good improvement with the addition of patch 0003.

Machine information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127

Configuration:
max_wal_size=20GB
shared_buffers=20GB
checkpoint_timeout=40min
max_connections=700
maintenance_work_mem=1GB
subtrans_buffers=$variable
multixact_offsets_buffers=$variable
multixact_members_buffers=$variable

benchmark-1

version | subtrans buffers | multixact offs/memb buffers | tps (func + ballast)
--------+------------------+-----------------------------+---------------------
patch-1 | 64               | 64/128                      | 87 + 58
patch-2 | 64               | 64/128                      | 128 + 83
patch-1 | 1024             | 512/1024                    | 96 + 64
patch-2 | 1024             | 512/1024                    | 163 + 108

benchmark-2

version | subtrans buffers | multixact offs/memb buffers | tps (func)
--------+------------------+-----------------------------+-----------
patch-1 | 64               | 64/128                      | 10
patch-2 | 64               | 64/128                      | 12
patch-1 | 1024             | 512/1024                    | 44
patch-2 | 1024             | 512/1024                    | 72

[1] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 14:28, tender wang wrote:
>
> Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS'
> operation in SimpleLruGetBankLock() with '& 127'?

unsigned int GetBankno1(unsigned int pageno) {
    return pageno & 127;
}

unsigned int GetBankno2(unsigned int pageno) {
    return pageno % 128;
}

Generates with -O2:

GetBankno1(unsigned int):
    mov eax, edi
    and eax, 127
    ret
GetBankno2(unsigned int):
    mov eax, edi
    and eax, 127
    ret

Compiler is smart enough with constants.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
Andrey M. Borodin wrote on Thursday, 14 December 2023 at 17:02:
>
> > On 14 Dec 2023, at 08:12, Amul Sul wrote:
> >
> > + int bankno = pageno & ctl->bank_mask;
> >
> > I am a bit uncomfortable seeing it as a mask, why can't it be simply a
> > number of banks (num_banks) and get the bank number through modulus op
> > (pageno % num_banks) instead of bitwise & operation (pageno & ctl->bank_mask)
> > which is a bit difficult to read compared to modulus op which is quite
> > simple, straightforward and much common practice in hashing.
> >
> > Are there any advantages of using & over % ?

Using the Compiler Explorer [1] tool, '%' produces more assembly instructions than '&':

int GetBankno1(int pageno) {
    return pageno & 127;
}

int GetBankno2(int pageno) {
    return pageno % 127;
}

Under clang 13.0:

GetBankno1:                 # @GetBankno1
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov eax, dword ptr [rbp - 4]
    and eax, 127
    pop rbp
    ret
GetBankno2:                 # @GetBankno2
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov eax, dword ptr [rbp - 4]
    mov ecx, 127
    cdq
    idiv ecx
    mov eax, edx
    pop rbp
    ret

Under gcc 13.2:

GetBankno1:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    and eax, 127
    pop rbp
    ret
GetBankno2:
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov eax, DWORD PTR [rbp-4]
    movsx rdx, eax
    imul rdx, rdx, -2130574327
    shr rdx, 32
    add edx, eax
    mov ecx, edx
    sar ecx, 6
    cdq
    sub ecx, edx
    mov edx, ecx
    sal edx, 7
    sub edx, ecx
    sub eax, edx
    mov ecx, eax
    mov eax, ecx
    pop rbp
    ret

[1] https://godbolt.org/

> The instruction AND is ~20 times faster than IDIV [0]. This is a relatively
> hot function, worth sacrificing some readability to save ~ten nanoseconds
> on each check of the status of a transaction.

Now that AND is faster, can we replace the '% SLRU_MAX_BANKLOCKS' operation in SimpleLruGetBankLock() with '& 127'?

SimpleLruGetBankLock()
{
    int banklockno = (pageno & ctl->bank_mask) % SLRU_MAX_BANKLOCKS;  /* use '&' */

    return &(ctl->shared->bank_locks[banklockno].lock);
}

Thoughts?
> [0] https://www.agner.org/optimize/instruction_tables.pdf
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 14 Dec 2023, at 08:12, Amul Sul wrote:
>
> > + int bankno = pageno & ctl->bank_mask;
>
> I am a bit uncomfortable seeing it as a mask, why can't it be simply a number
> of banks (num_banks) and get the bank number through modulus op (pageno %
> num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a
> bit difficult to read compared to modulus op which is quite simple,
> straightforward and much common practice in hashing.
>
> Are there any advantages of using & over % ?

The instruction AND is ~20 times faster than IDIV [0]. This is a relatively hot function, worth sacrificing some readability to save ~ten nanoseconds on each check of the status of a transaction.

[0] https://www.agner.org/optimize/instruction_tables.pdf
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Dec 11, 2023 at 10:42 AM Dilip Kumar wrote:
> On Thu, Nov 30, 2023 at 3:30 PM Dilip Kumar wrote:
> >
> > On Wed, Nov 29, 2023 at 4:58 PM Dilip Kumar wrote:
>
> Here is the updated patch based on some comments by tender wang (those
> comments were sent only to me)

A few nitpicks:

+
+	/*
+	 * Mask for slotno banks, considering 1GB SLRU buffer pool size and the
+	 * SLRU_BANK_SIZE bits16 should be sufficient for the bank mask.
+	 */
+	bits16		bank_mask;
 } SlruCtlData;
...
...
+	int			bankno = pageno & ctl->bank_mask;

I am a bit uncomfortable seeing it as a mask, why can't it be simply a number of banks (num_banks) and get the bank number through modulus op (pageno % num_banks) instead of bitwise & operation (pageno & ctl->bank_mask) which is a bit difficult to read compared to modulus op which is quite simple, straightforward and much common practice in hashing.

Are there any advantages of using & over % ?

Also, a few places in the 0002 and 0003 patches need the bank number; it is better to have a macro for that.
---
 extern bool SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int64 segpage, void *data);
-
+extern bool check_slru_buffers(const char *name, int *newval);
 #endif							/* SLRU_H */

Add an empty line after the declaration, in the 0002 patch.
---
-TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
+TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn,
+						  int slotno)

Unrelated change for the 0003 patch.
---

Regards, Amul
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 12 Dec 2023, at 18:28, Alvaro Herrera wrote: > > Andrey, do you have any stress tests or anything else that you used to > gain confidence in this code? We are using only first two steps of the patchset, these steps do not touch locking stuff. We’ve got some confidence after Yura Sokolov’s benchmarks [0]. Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/e46cdea96979545b2d8a13b451d8b1ce61dc7238.camel%40postgrespro.ru#0ed2cad11470825d464093fe6b8ef6a3
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Dec 12, 2023 at 6:58 PM Alvaro Herrera wrote:
>
> [Added Andrey again in CC, because as I understand they are using this
> code or something like it in production. Please don't randomly remove
> people from CC lists.]

Oh, glad to know that. Yeah, I generally do not remove people, but I have noticed that in the mail chain some of the reviewers replied only to me and the hackers' list, and from that point onwards I lost track of the CC list.

> I've been looking at this some more, and I'm not confident in that the
> group clog update stuff is correct. I think the injection points test
> case was good enough to discover a problem, but it's hard to get peace
> of mind that there aren't other, more subtle problems.

Yeah, I agree.

> The problem I see is that the group update mechanism is designed around
> contention of the global xact-SLRU control lock; it uses atomics to
> coordinate a single queue when the lock is contended. So if we split up
> the global SLRU control lock using banks, then multiple processes using
> different bank locks might not contend. OK, this is fine, but what
> happens if two separate groups of processes encounter contention on two
> different bank locks? I think they will both try to update the same
> queue, and coordinate access to that *using different bank locks*. I
> don't see how can this work correctly.

Let's back up a bit and start from the current design with the centralized lock. With that, if one process is holding the lock, the other processes will try to perform a group update; but if there is already a group that still hasn't got the lock and is trying to update a different CLOG page than the one this process wants to update, then the process will not add itself to that group update and will instead fall back to the normal lock wait.
Now, in another situation, it may so happen that the group leader of the other group already got the control lock, and in such a case it would have cleared 'procglobal->clogGroupFirst', which means we will now start forming a different group. So logically, if we talk only about the optimization part, the assumption is that when we are trying to commit a lot of concurrent xids at a time, those xids are mostly in the same range and will fall on the same SLRU page, so the group update will help them. But if we get some out-of-range xid of some long-running transaction, it might not even go for the group update, as the page number will be different. The situation might actually be better here with a bank-wise lock, because if those xids fall in an altogether different bank they might not even contend.

Now, let's talk about correctness. I think even though we are getting processes that might be contending on different bank locks, we still ensure that in a group all the processes are trying to update the same SLRU page (i.e. the same bank also; we will talk about the exception later [1]). One of the processes becomes the leader, and as soon as the leader gets the lock it detaches the queue from 'procglobal->clogGroupFirst' by setting it to INVALID_PGPROCNO, so that other group-update requesters can form another parallel group. So here I do not see a problem with correctness.

I agree someone might say that since there is now a possibility that different groups get formed for different bank locks, we do not let another group get formed until we get the lock for our bank, because we do not clear 'procglobal->clogGroupFirst' before we get the lock. Other requesters might want to update pages in different banks, so why block them? But the thing is, the group-update design is optimized for the cases where all requesters are trying to update the status of xids generated near the same range.
> I suspect the first part of that algorithm, where atomics are used to
> create the list without a lock, might work fine. But will each "leader"
> process, each of which is potentially using a different bank lock,
> coordinate correctly? Maybe this works correctly because only one
> process will find the queue head not empty? If this is what happens,
> then there needs to be comments about it.

Yes, you got it right; I will try to comment on it better.

> Without any explanation,
> this seems broken and potentially dangerous, as some transaction commit
> bits might become lost given high enough concurrency and bad luck.
>
> Maybe this can be made to work by having one more lwlock that we use
> solely to coordinate this task.

Do you mean a different lock for adding/removing entries in the list instead of the atomic operation? I think then we would lose the benefit we got in the group update, by having contention on another lock.

[1] I think we already know about the exception case, and I have explained in the comments as well that in some cases we might add different clog page update requests in the same group, and for handling that exceptional case we are checking the respective bank lock for each page and if that
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
[Added Andrey again in CC, because as I understand they are using this code or something like it in production. Please don't randomly remove people from CC lists.] I've been looking at this some more, and I'm not confident in that the group clog update stuff is correct. I think the injection points test case was good enough to discover a problem, but it's hard to get peace of mind that there aren't other, more subtle problems. The problem I see is that the group update mechanism is designed around contention of the global xact-SLRU control lock; it uses atomics to coordinate a single queue when the lock is contended. So if we split up the global SLRU control lock using banks, then multiple processes using different bank locks might not contend. OK, this is fine, but what happens if two separate groups of processes encounter contention on two different bank locks? I think they will both try to update the same queue, and coordinate access to that *using different bank locks*. I don't see how can this work correctly. I suspect the first part of that algorithm, where atomics are used to create the list without a lock, might work fine. But will each "leader" process, each of which is potentially using a different bank lock, coordinate correctly? Maybe this works correctly because only one process will find the queue head not empty? If this is what happens, then there needs to be comments about it. Without any explanation, this seems broken and potentially dangerous, as some transaction commit bits might become lost given high enough concurrency and bad luck. Maybe this can be made to work by having one more lwlock that we use solely to coordinate this task. Though we would have to demonstrate that coordinating this task with a different lock works correctly in conjunction with the per-bank lwlock usage in the regular slru.c paths. Andrey, do you have any stress tests or anything else that you used to gain confidence in this code? 
-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Nov 29, 2023 at 3:29 PM Alvaro Herrera wrote: > > On 2023-Nov-29, tender wang wrote: > > > The v8-0001 patch failed to apply in my local repo as below: > > > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > > error: patch failed: src/backend/access/transam/multixact.c:1851 > > error: src/backend/access/transam/multixact.c: patch does not apply > > error: patch failed: src/backend/access/transam/subtrans.c:184 > > error: src/backend/access/transam/subtrans.c: patch does not apply > > error: patch failed: src/backend/commands/async.c:117 > > error: src/backend/commands/async.c: patch does not apply > > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > > error: src/backend/storage/lmgr/predicate.c: patch does not apply > > error: patch failed: src/include/commands/async.h:15 > > error: src/include/commands/async.h: patch does not apply > > Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. I will send a rebased version by tomorrow. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2023-Nov-29, tender wang wrote: > The v8-0001 patch failed to apply in my local repo as below: > > git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch > error: patch failed: src/backend/access/transam/multixact.c:1851 > error: src/backend/access/transam/multixact.c: patch does not apply > error: patch failed: src/backend/access/transam/subtrans.c:184 > error: src/backend/access/transam/subtrans.c: patch does not apply > error: patch failed: src/backend/commands/async.c:117 > error: src/backend/commands/async.c: patch does not apply > error: patch failed: src/backend/storage/lmgr/predicate.c:808 > error: src/backend/storage/lmgr/predicate.c: patch does not apply > error: patch failed: src/include/commands/async.h:15 > error: src/include/commands/async.h: patch does not apply Yeah, this patch series conflicts with today's commit 4ed8f0913bfd. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ Syntax error: function hell() needs an argument. Please choose what hell you want to involve.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
The v8-0001 patch failed to apply in my local repo as below:

git apply v8-0001-Make-all-SLRU-buffer-sizes-configurable.patch
error: patch failed: src/backend/access/transam/multixact.c:1851
error: src/backend/access/transam/multixact.c: patch does not apply
error: patch failed: src/backend/access/transam/subtrans.c:184
error: src/backend/access/transam/subtrans.c: patch does not apply
error: patch failed: src/backend/commands/async.c:117
error: src/backend/commands/async.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/predicate.c:808
error: src/backend/storage/lmgr/predicate.c: patch does not apply
error: patch failed: src/include/commands/async.h:15
error: src/include/commands/async.h: patch does not apply

My local head commit is 15c9ac36299. Is there something I missed?

Dilip Kumar wrote on Friday, 24 November 2023 at 17:08:
> On Fri, Nov 24, 2023 at 10:17 AM Dilip Kumar wrote:
> >
> > On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar wrote:
> > >
> > > Note: With this testing, we have found a bug in the bank-wise
> > > approach, basically we are clearing a procglobal->clogGroupFirst, even
> > > before acquiring the bank lock that means in most of the cases there
> > > will be a single process in each group as a group leader
> >
> > I realized that the bug fix I have done is not proper, so will send
> > the updated patch set with the proper fix soon.
>
> PFA, updated patch set fixes the bug found during the testing of the
> group update using the injection point. Also attached a patch to test
> the injection point, but for that we need to apply the injection point
> patches [1]
>
> [1] https://www.postgresql.org/message-id/ZWACtHPetBFIvP61%40paquier.xyz
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Nov 23, 2023 at 11:34 AM Dilip Kumar wrote: > > Note: With this testing, we have found a bug in the bank-wise > approach, basically we are clearing a procglobal->clogGroupFirst, even > before acquiring the bank lock that means in most of the cases there > will be a single process in each group as a group leader I realized that the bug fix I have done is not proper, so will send the updated patch set with the proper fix soon. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Tue, Nov 21, 2023 at 2:03 PM Dilip Kumar wrote: > > On Mon, Nov 20, 2023 at 4:42 PM Dilip Kumar wrote: > > > > On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin > > wrote: > > > > > > On 20 Nov 2023, at 13:51, Dilip Kumar wrote: > > > > > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > > > > > IMHO if we use the same lwlock tranche then the wait event will show > > > > the same wait event name, right? And that would be confusing for the > > > > user, whether we are waiting for Subtransaction or Multixact or > > > > anything else. Is my understanding no correct here? > > > > > > If we give to a user multiple GUCs to tweak, I think we should give a way > > > to understand which GUC to tweak when they observe wait times. > > PFA, updated patch set, I have worked on review comments by Alvaro and > Andrey. So the only open comments are about clog group commit > testing, for that my question was as I sent in the previous email > exactly what part we are worried about in the coverage report? > > The second point is, if we want to generate a group update we will > have to create the injection point after we hold the control lock so > that other processes go for group update and then for waking up the > waiting process who is holding the SLRU control lock in the exclusive > mode we would need to call a function ('test_injection_points_wake()') > to wake that up and for calling the function we would need to again > acquire the SLRU lock in read mode for visibility check in the catalog > for fetching the procedure row and now this wake up session will block > on control lock for the session which is waiting on injection point so > now it will create a deadlock. 
> Maybe with the bank-wise lock we can
> create a lot of transactions so that these 2 fall in different banks
> and then we can somehow test this, but then we will have to generate
> 16 * 4096 = 64k transactions so that the SLRU banks are different for
> the transaction which inserted the procedure row in the system table and
> the transaction in which we are trying to do the group commit.

I have attached a POC patch for testing the group update using the injection point framework. This is just for testing the group update part and is not yet a committable test. I have added a bunch of logs in the code so that we can see what's going on with the group update. From the below logs, we can see that multiple processes are getting accumulated for the group update and the leader is updating their xid status.

Note: With this testing, we have found a bug in the bank-wise approach: basically, we are clearing procglobal->clogGroupFirst even before acquiring the bank lock, which means in most cases there will be a single process in each group as a group leader (I think this is what Alvaro was pointing at in his coverage report). I have added this fix in this POC just for testing purposes, but in my next version I will add the fix to my proper patch version after a proper review and a bit more testing.
here is the output after running the test == 2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl LOG: procno 6 got the lock 2023-11-23 05:55:29.399 UTC [93367] 003_clog_group_commit.pl STATEMENT: SELECT txid_current(); 2023-11-23 05:55:29.406 UTC [93369] 003_clog_group_commit.pl LOG: statement: SELECT test_injection_points_attach('ClogGroupCommit', 'wait'); 2023-11-23 05:55:29.415 UTC [93371] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(1); 2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl LOG: procno 4 got the lock 2023-11-23 05:55:29.416 UTC [93371] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(1); 2023-11-23 05:55:29.424 UTC [93373] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(2); 2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl LOG: procno 3 for xid 128742 added for group update 2023-11-23 05:55:29.425 UTC [93373] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(2); 2023-11-23 05:55:29.431 UTC [93376] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.438 UTC [93378] 003_clog_group_commit.pl LOG: statement: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG: procno 2 for xid 128743 added for group update 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl LOG: procno 2 is follower and wait for group leader to update commit status of xid 128743 2023-11-23 05:55:29.438 UTC [93376] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(3); 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG: procno 1 for xid 128744 added for group update 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl STATEMENT: INSERT INTO test VALUES(4); 2023-11-23 05:55:29.439 UTC [93378] 003_clog_group_commit.pl LOG: procno 1 is
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Mon, Nov 20, 2023 at 2:37 PM Andrey M. Borodin wrote: > > On 20 Nov 2023, at 13:51, Dilip Kumar wrote: > > > > 2) Do we really need one separate lwlock tranche for each SLRU? > > > > IMHO if we use the same lwlock tranche then the wait event will show > > the same wait event name, right? And that would be confusing for the > > user, whether we are waiting for Subtransaction or Multixact or > > anything else. Is my understanding no correct here? > > If we give to a user multiple GUCs to tweak, I think we should give a way to > understand which GUC to tweak when they observe wait times. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 20 Nov 2023, at 13:51, Dilip Kumar wrote: > > 2) Do we really need one separate lwlock tranche for each SLRU? > > IMHO if we use the same lwlock tranche then the wait event will show > the same wait event name, right? And that would be confusing for the > user, whether we are waiting for Subtransaction or Multixact or > anything else. Is my understanding no correct here? If we give to a user multiple GUCs to tweak, I think we should give a way to understand which GUC to tweak when they observe wait times. Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Nov 17, 2023 at 7:28 PM Alvaro Herrera wrote:
>
> On 2023-Nov-17, Dilip Kumar wrote:

I think I need some more clarification for some of the review comments.

> > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera wrote:
> > >
> > > I just noticed that 0003 does some changes to
> > > TransactionGroupUpdateXidStatus() that haven't been adequately
> > > explained AFAICS. How do you know that these changes are safe?
> >
> > IMHO this is safe as well as logical to do w.r.t. performance. It's
> > safe because whenever we are updating any page in a group we are
> > acquiring the respective bank lock in exclusive mode and in extreme
> > cases if there are pages from different banks then we do switch the
> > lock as well before updating the pages from different groups.
>
> Looking at the coverage for this code,
> https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413
> it seems in our test suites we never hit the case where there is
> anything in the "nextidx" field for commit groups.

1) I was looking into your coverage report and I have attached a screenshot from the same; it seems we do hit the block where nextidx is not INVALID_PGPROCNO, which means there is some other process besides the group leader. Although I have already started exploring the injection points, I just wanted to be sure first what your main concern about the coverage is.

 470           :     /*
 471           :      * If the list was not empty, the leader will update the status of our
 472           :      * XID. It is impossible to have followers without a leader because the
 473           :      * first process that has added itself to the list will always have
 474           :      * nextidx as INVALID_PGPROCNO.
 475           :      */
 476        98 :     if (nextidx != INVALID_PGPROCNO)
 477           :     {
 478         2 :         int extraWaits = 0;
 479           :
 480           :         /* Sleep until the leader updates our XID status. */
 481         2 :         pgstat_report_wait_start(WAIT_EVENT_XACT_GROUP_UPDATE);
 482           :         for (;;)
 483           :         {
 484           :             /* acts as a read barrier */
 485         2 :             PGSemaphoreLock(proc->sem);
 486         2 :             if (!proc->clogGroupMember)
 487         2 :                 break;
 488         0 :             extraWaits++;
 489           :         }

2) Do we really need one separate lwlock tranche for each SLRU?

IMHO if we use the same lwlock tranche then the wait event will show the same wait event name, right? And that would be confusing for the user, whether we are waiting for Subtransaction or Multixact or anything else. Is my understanding not correct here?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sun, Nov 19, 2023 at 12:39 PM Andrey M. Borodin wrote:
>
> I’ve skimmed through the patch set. Here are some minor notes.

Thanks for the review.

> 1. Cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in
> SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical
> comments. I think a little copy-paste is OK.
> But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while
> SlruSelectLRUPage() does not. This is not related to the patch set, just
> code nearby.

Do you mean to say we need to modify the comments, or are you saying pgstat_count_slru_page_hit() is missing in SlruSelectLRUPage()? If it is the latter, then I can see the caller of SlruSelectLRUPage() is calling pgstat_count_slru_page_hit() and SlruRecentlyUsed().

> 2. Do we really want these functions doing all the same?
> extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
> extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
> extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
> extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
> extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
> extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
> extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);

I tried to de-duplicate these by doing all the work inside the check_slru_buffers() function. But I think it is hard to make them a single function, because there is no option to pass an SLRU name in the GUC check hook, and IMHO in the check hook we need to print the GUC name. Any suggestions on how we can avoid having so many functions?

> 3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice.
> I’d suggest truncating the prefix or the infix.
>
> I do not have a hard opinion on any of these items.
I prefer SimpleLruGetBankLock() so that it is consistent with other external functions starting with "SimpleLruGet"; are you fine with this name?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 17 Nov 2023, at 16:11, Dilip Kumar wrote:
>
> On Fri, Nov 17, 2023 at 1:09 PM Dilip Kumar wrote:
> >
> > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera wrote:
>
> PFA, updated patch version, this fixes the comment given by Alvaro and
> also improves some of the comments.

I’ve skimmed through the patch set. Here are some minor notes.

1. Cycles “for (slotno = bankstart; slotno < bankend; slotno++)” in SlruSelectLRUPage() and SimpleLruReadPage_ReadOnly() now have identical comments. I think a little copy-paste is OK. But SimpleLruReadPage_ReadOnly() does pgstat_count_slru_page_hit(), while SlruSelectLRUPage() does not. This is not related to the patch set, just code nearby.

2. Do we really want these functions doing all the same?

extern bool check_multixact_offsets_buffers(int *newval, void **extra, GucSource source);
extern bool check_multixact_members_buffers(int *newval, void **extra, GucSource source);
extern bool check_subtrans_buffers(int *newval, void **extra, GucSource source);
extern bool check_notify_buffers(int *newval, void **extra, GucSource source);
extern bool check_serial_buffers(int *newval, void **extra, GucSource source);
extern bool check_xact_buffers(int *newval, void **extra, GucSource source);
extern bool check_commit_ts_buffers(int *newval, void **extra, GucSource source);

3. The name SimpleLruGetSLRUBankLock() contains the meaning of SLRU twice. I’d suggest truncating the prefix or the infix.

I do not have a hard opinion on any of these items.

Best regards, Andrey Borodin.
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2023-Nov-18, Dilip Kumar wrote: > On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera > wrote: > > I wonder what's the deal with false sharing in the new > > bank_cur_lru_count array. Maybe instead of using LWLockPadded for > > bank_locks, we should have a new struct, with both the LWLock and the > > LRU counter; then pad *that* to the cacheline size. This way, both the > > lwlock and the counter come to the CPU running this code together. > > Actually, the array lengths of both LWLock and the LRU counter are > different so I don't think we can move them to a common structure. > The length of the *buffer_locks array is equal to the number of slots, > the length of the *bank_locks array is Min (number_of_banks, 128), and > the length of the *bank_cur_lru_count array is number_of_banks. Oh. > > Looking at the coverage for this code, > > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > > it seems in our test suites we never hit the case where there is > > anything in the "nextidx" field for commit groups. To be honest, I > > don't understand this group stuff, and so I'm doubly hesitant to go > > without any testing here. Maybe it'd be possible to use Michael > > Paquier's injection points somehow? > > Sorry, but I am not aware of "Michael Paquier's injection" Is it > something already in the repo? Can you redirect me to some of the > example test cases if we already have them? Then I will try it out. https://postgr.es/zvwufo_ykztjh...@paquier.xyz -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Nov 17, 2023 at 6:16 PM Alvaro Herrera wrote: Thanks for the review; all comments look fine to me. I am replying to those that need some clarification. > I wonder what's the deal with false sharing in the new > bank_cur_lru_count array. Maybe instead of using LWLockPadded for > bank_locks, we should have a new struct, with both the LWLock and the > LRU counter; then pad *that* to the cacheline size. This way, both the > lwlock and the counter come to the CPU running this code together. Actually, the array lengths of the LWLocks and the LRU counters are different, so I don't think we can move them to a common structure. The length of the *buffer_locks array is equal to the number of slots, the length of the *bank_locks array is Min(number_of_banks, 128), and the length of the *bank_cur_lru_count array is number_of_banks. > Looking at the coverage for this code, > https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 > it seems in our test suites we never hit the case where there is > anything in the "nextidx" field for commit groups. To be honest, I > don't understand this group stuff, and so I'm doubly hesitant to go > without any testing here. Maybe it'd be possible to use Michael > Paquier's injection points somehow? Sorry, but I am not aware of "Michael Paquier's injection points." Are they something already in the repo? Can you redirect me to some example test cases if we already have them? Then I will try them out. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On 2023-Nov-17, Dilip Kumar wrote: > On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera > wrote: > > > > I just noticed that 0003 does some changes to > > TransactionGroupUpdateXidStatus() that haven't been adequately > > explained AFAICS. How do you know that these changes are safe? > > IMHO this is safe as well as logical to do w.r.t. performance. It's > safe because whenever we are updating any page in a group we are > acquiring the respective bank lock in exclusive mode and in extreme > cases if there are pages from different banks then we do switch the > lock as well before updating the pages from different groups. Looking at the coverage for this code, https://coverage.postgresql.org/src/backend/access/transam/clog.c.gcov.html#413 it seems in our test suites we never hit the case where there is anything in the "nextidx" field for commit groups. To be honest, I don't understand this group stuff, and so I'm doubly hesitant to go without any testing here. Maybe it'd be possible to use Michael Paquier's injection points somehow? I think in the code comments where you use "w.r.t.", that acronym can be replaced with "for", which improves readability. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "All rings of power are equal, But some rings of power are more equal than others." (George Orwell's The Lord of the Rings)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
In SlruSharedData, a new comment is added that starts: "Instead of global counter we maintain a bank-wise lru counter because ..." You don't need to explain what we don't do. Just explain what we do do. So remove the words "Instead of a global counter" from there, because they offer no wisdom. Same with the phrase "so there is no point to ...". I think "The oldest page is therefore" should say "The oldest page *in the bank* is therefore", for extra clarity. I wonder what's the deal with false sharing in the new bank_cur_lru_count array. Maybe instead of using LWLockPadded for bank_locks, we should have a new struct, with both the LWLock and the LRU counter; then pad *that* to the cacheline size. This way, both the lwlock and the counter come to the CPU running this code together. Looking at SlruRecentlyUsed, which was a macro and is now a function. The comment about "multiple evaluation of arguments" no longer applies, so it needs to be removed; and it shouldn't talk about itself being a macro. Using "Size" as type for bank_mask looks odd. For a bitmask, maybe it'd be more appropriate to use bits64 if we do need a 64-bit mask (we don't have bits64, but it's easy to add a typedef). I bet we don't really need a 64-bit mask, and a 32-bit or even a 16-bit one is sufficient, given the other limitations on the number of buffers. I think SimpleLruReadPage should have this assert at start: + Assert(LWLockHeldByMe(SimpleLruGetSLRUBankLock(ctl, pageno))); Do we really need one separate lwlock tranche for each SLRU? -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Nov 16, 2023 at 3:11 PM Alvaro Herrera wrote: > > I just noticed that 0003 does some changes to > TransactionGroupUpdateXidStatus() that haven't been adequately > explained AFAICS. How do you know that these changes are safe? IMHO this is safe, and logical from a performance point of view. It's safe because whenever we update any page in a group we acquire the respective bank lock in exclusive mode, and in the extreme case where the group contains pages from different banks, we switch the lock before updating pages from a different bank. And we do not wake any process in a group until we have done the status update for all of them, so there cannot be any race condition either. It should not adversely affect performance, and it does not remove the need for group updates. The main use case of the group update is to optimize the situation where most processes are contending for a status update on the same page; processes waiting for status updates on different pages will go into different groups. In short, on a best-effort basis we try to have each group contain the processes waiting to update the same clog page, which means logically all the processes in the group will be waiting on the same bank lock. In the extreme situation where processes in a group are trying to update different pages, or even pages from different banks, we handle it by switching the lock. Someone may raise the concern that, when there are processes waiting for different bank locks, we could wake those processes up after releasing one lock; I think that is not required, because that is exactly the situation we are trying to avoid (processes updating different pages ending up in the same group), so there is no point in adding complexity to optimize that case. > 0001 contains one typo in the docs, "cotents". 
> > I'm not a fan of the fact that some CLOG sizing macros moved to clog.h, > leaving others in clog.c. Maybe add commentary cross-linking both. > Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to > the slru.h-defined limit of 131072 is not that bad, even if it's more > than could possibly be needed for xact_buffers; nobody is going to use > 64k buffers, since useful values are below a couple thousand anyhow. I agree that allowing xact_buffers to grow beyond 65536 up to the slru.h-defined limit of 131072 is not that bad, so I will change that in the next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
I just noticed that 0003 does some changes to TransactionGroupUpdateXidStatus() that haven't been adequately explained AFAICS. How do you know that these changes are safe? 0001 contains one typo in the docs, "cotents". I'm not a fan of the fact that some CLOG sizing macros moved to clog.h, leaving others in clog.c. Maybe add commentary cross-linking both. Alternatively, perhaps allowing xact_buffers to grow beyond 65536 up to the slru.h-defined limit of 131072 is not that bad, even if it's more than could possibly be needed for xact_buffers; nobody is going to use 64k buffers, since useful values are below a couple thousand anyhow. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ Tom: There seems to be something broken here. Teodor: I'm in sackcloth and ashes... Fixed. http://postgr.es/m/482d1632.8010...@sigaev.ru
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Fri, Nov 10, 2023 at 10:17:49AM +0530, Dilip Kumar wrote: > On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera wrote: >> The only point on which we do not have full consensus yet is the need to >> have one GUC per SLRU, and a lot of effort seems focused on trying to >> fix the problem without adding so many GUCs (for example, using shared >> buffers instead, or use a single "scaling" GUC). I think that hinders >> progress. Let's just add multiple GUCs, and users can leave most of >> them alone and only adjust the one with which they have a performance >> problems; it's not going to be the same one for everybody. > > +1 +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Nov 9, 2023 at 4:55 PM Alvaro Herrera wrote: > > IMO the whole area of SLRU buffering is in horrible shape and many users > are struggling with overall PG performance because of it. An > improvement doesn't have to be perfect -- it just has to be much better > than the current situation, which should be easy enough. We can > continue to improve later, using more scalable algorithms or ones that > allow us to raise the limits higher. I agree with this. > The only point on which we do not have full consensus yet is the need to > have one GUC per SLRU, and a lot of effort seems focused on trying to > fix the problem without adding so many GUCs (for example, using shared > buffers instead, or use a single "scaling" GUC). I think that hinders > progress. Let's just add multiple GUCs, and users can leave most of > them alone and only adjust the one with which they have a performance > problems; it's not going to be the same one for everybody. +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Thu, Nov 9, 2023 at 9:39 PM Robert Haas wrote: > > On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar wrote: > > Here is the updated version of the patch, here I have taken the > > approach suggested by Andrey and I discussed the same with Alvaro > > offlist and he also agrees with it. So the idea is that we will keep > > the bank size fixed which is 16 buffers per bank and the allowed GUC > > value for each slru buffer must be in multiple of the bank size. We > > have removed the centralized lock but instead of one lock per bank, we > > have kept the maximum limit on the number of bank locks which is 128. > > We kept the max limit as 128 because, in one of the operations (i.e. > > ActivateCommitTs), we need to acquire all the bank locks (but this is > > not a performance path at all) and at a time we can acquire a max of > > 200 LWlocks, so we think this limit of 128 is good. So now if the > > number of banks is <= 128 then we will be using one lock per bank > > otherwise the one lock may protect access of buffer in multiple banks. > > Just so I understand, I guess this means that an SLRU is limited to > 16*128 = 2k buffers = 16MB? Not really, because 128 is the maximum limit on the number of bank locks, not on the number of banks. So for example, if you have 16*128 = 2k buffers, then each lock will protect one bank; likewise, when you have 16*512 = 8k buffers, each lock will protect 4 banks. So in short, we can get the lock for each bank by a simple computation (banklockno = bankno % 128). > When we were talking about this earlier, I suggested fixing the number > of banks and allowing the number of buffers per bank to scale > depending on the setting. That seemed simpler than allowing both the > number of banks and the number of buffers to vary, and it might allow > the compiler to optimize some code better, by converting a calculation > like page_no%number_of_banks into a masking operation like page_no&0xf > or whatever. 
However, because it allows an individual bank to become > arbitrarily large, it more or less requires us to use a buffer mapping > table. Some of the performance problems mentioned could be alleviated > by omitting the hash table when the number of buffers per bank is > small, and we could also create the dynahash with a custom hash > function that just does modular arithmetic on the page number rather > than a real hashing operation. However, maybe we don't really need to > do any of that. I agree that dynahash is clunky on a good day. I > hadn't realized the impact would be so noticeable. Yes, one idea is to keep the number of banks fixed; with that, as you correctly pointed out, with a large number of buffers the bank size can get quite big and we would need a hash table for it. OTOH, what I am doing here is keeping the bank size fixed and small (16 buffers per bank), so we can have a large number of banks when the buffer pool is quite big. But I feel having more banks is not really a problem, as long as we do not grow the number of locks beyond a certain limit, since in some corner cases we need to acquire all the locks together and there is a cap on how many we can hold at once. So I like this idea of sharing locks across the banks: with that, 1) we can have enough locks that lock contention or cache invalidation due to a common lock should no longer be a problem, 2) we can keep a small bank size so that the sequential search within a bank is quite fast, making reads fast, and 3) with a small bank size the victim-buffer search, which has to be sequential, is also quite fast. > This proposal takes the opposite approach of fixing the number of > buffers per bank, letting the number of banks vary. I think that's > probably fine, although it does reduce the effective associativity of > the cache. If there are more hot buffers in a bank than the bank size, > the bank will be contended, even if other banks are cold. 
However, > given the way SLRUs are accessed, it seems hard to imagine this being > a real problem in practice. There aren't likely to be say 20 hot > buffers that just so happen to all be separated from one another by a > number of pages that is a multiple of the configured number of banks. > And in the seemingly very unlikely event that you have a workload that > behaves like that, you could always adjust the number of banks up or > down by one, and the problem would go away. So this seems OK to me. I agree with this > I also agree with a couple of points that Alvaro made, specifically > that (1) this doesn't have to be perfect, just better than now and (2) > separate GUCs for each SLRU is fine. On the latter point, it's worth > keeping in mind that the cost of a GUC that most people don't need to > tune is fairly low. GUCs like work_mem and shared_buffers are > "expensive" because everybody more or less needs to understand what > they are and how to set them and getting the right value can tricky -- > but a GUC like autovacuum_naptime is a lot cheaper, because almost > nobody needs to change
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Nov 8, 2023 at 6:41 AM Dilip Kumar wrote: > Here is the updated version of the patch, here I have taken the > approach suggested by Andrey and I discussed the same with Alvaro > offlist and he also agrees with it. So the idea is that we will keep > the bank size fixed which is 16 buffers per bank and the allowed GUC > value for each slru buffer must be in multiple of the bank size. We > have removed the centralized lock but instead of one lock per bank, we > have kept the maximum limit on the number of bank locks which is 128. > We kept the max limit as 128 because, in one of the operations (i.e. > ActivateCommitTs), we need to acquire all the bank locks (but this is > not a performance path at all) and at a time we can acquire a max of > 200 LWlocks, so we think this limit of 128 is good. So now if the > number of banks is <= 128 then we will be using one lock per bank > otherwise the one lock may protect access of buffer in multiple banks. Just so I understand, I guess this means that an SLRU is limited to 16*128 = 2k buffers = 16MB? When we were talking about this earlier, I suggested fixing the number of banks and allowing the number of buffers per bank to scale depending on the setting. That seemed simpler than allowing both the number of banks and the number of buffers to vary, and it might allow the compiler to optimize some code better, by converting a calculation like page_no%number_of_banks into a masking operation like page_no&0xf or whatever. However, because it allows an individual bank to become arbitrarily large, it more or less requires us to use a buffer mapping table. Some of the performance problems mentioned could be alleviated by omitting the hash table when the number of buffers per bank is small, and we could also create the dynahash with a custom hash function that just does modular arithmetic on the page number rather than a real hashing operation. However, maybe we don't really need to do any of that. 
I agree that dynahash is clunky on a good day. I hadn't realized the impact would be so noticeable. This proposal takes the opposite approach of fixing the number of buffers per bank, letting the number of banks vary. I think that's probably fine, although it does reduce the effective associativity of the cache. If there are more hot buffers in a bank than the bank size, the bank will be contended, even if other banks are cold. However, given the way SLRUs are accessed, it seems hard to imagine this being a real problem in practice. There aren't likely to be say 20 hot buffers that just so happen to all be separated from one another by a number of pages that is a multiple of the configured number of banks. And in the seemingly very unlikely event that you have a workload that behaves like that, you could always adjust the number of banks up or down by one, and the problem would go away. So this seems OK to me. I also agree with a couple of points that Alvaro made, specifically that (1) this doesn't have to be perfect, just better than now and (2) separate GUCs for each SLRU is fine. On the latter point, it's worth keeping in mind that the cost of a GUC that most people don't need to tune is fairly low. GUCs like work_mem and shared_buffers are "expensive" because everybody more or less needs to understand what they are and how to set them and getting the right value can be tricky -- but a GUC like autovacuum_naptime is a lot cheaper, because almost nobody needs to change it. It seems to me that these GUCs will fall into the latter category. Users can hopefully just ignore them except if they see contention on the SLRU bank locks -- and then they can consider increasing the number of banks for that particular SLRU. That seems simple enough. 
As with autovacuum_naptime, there is a danger that people will configure a ridiculous value of the parameter for no good reason and get bad results, so it would be nice if someday we had a magical system that just got all of this right without the user needing to configure anything. But in the meantime, it's better to have a somewhat manual system to relieve pressure on these locks than no system at all. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
IMO the whole area of SLRU buffering is in horrible shape and many users are struggling with overall PG performance because of it. An improvement doesn't have to be perfect -- it just has to be much better than the current situation, which should be easy enough. We can continue to improve later, using more scalable algorithms or ones that allow us to raise the limits higher. The only point on which we do not have full consensus yet is the need to have one GUC per SLRU, and a lot of effort seems focused on trying to fix the problem without adding so many GUCs (for example, using shared buffers instead, or using a single "scaling" GUC). I think that hinders progress. Let's just add multiple GUCs, and users can leave most of them alone and only adjust the one with which they have a performance problem; it's not going to be the same one for everybody. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Sallah, I said NO camels! That's FIVE camels; can't you count?" (Indiana Jones)
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
> On 8 Nov 2023, at 14:17, Ants Aasma wrote: > > Is there a particular reason why lock partitions need to be bigger? We have > one lock per buffer anyway, bankwise locks will increase the number of locks > < 10%. The problem had not been attracting much attention for some years, so my reasoning was that the solution should not have any cost at all. The initial patchset with banks did not add any memory footprint. > On 8 Nov 2023, at 14:17, Ants Aasma wrote: > > I am working on trying out a SIMD based LRU mechanism that uses a 16 entry > bank. FWIW I tried packing struct parts together to minimize the cache lines touched; see step 3 in [0]. So far I could not prove any performance benefit of this approach, but maybe your implementation will be more efficient. Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/93236d36-b91c-4dfa-af03-99c083840...@yandex-team.ru
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Sat, 4 Nov 2023 at 22:08, Andrey M. Borodin wrote: > On 30 Oct 2023, at 09:20, Dilip Kumar wrote: > > changed the logic of SlruAdjustNSlots() in 0002, such that now it > starts with the next power of 2 value of the configured slots and > keeps doubling the number of banks until we reach the number of banks > to the max SLRU_MAX_BANKS(128) and bank size is bigger than > SLRU_MIN_BANK_SIZE (8). By doing so, we will ensure we don't have too > many banks > > There was nothing wrong with having too many banks. Until bank-wise locks > and counters were added in later patchsets. > Having hashtable to find SLRU page in the buffer IMV is too slow. Some > comments on this approach can be found here [0]. > I'm OK with having HTAB for that if we are sure performance does not > degrade significantly, but I really doubt this is the case. > I even think SLRU buffers used HTAB in some ancient times, but I could not > find commit when it was changed to linear search. > > Maybe we could decouple locks and counters from SLRU banks? Banks were > meant to be small to exploit performance of local linear search. Lock > partitions have to be bigger for sure. > Is there a particular reason why lock partitions need to be bigger? We have one lock per buffer anyway; bankwise locks will increase the number of locks by less than 10%. I am working on trying out a SIMD-based LRU mechanism that uses a 16-entry bank. The data layout is: struct CacheBank { int page_numbers[16]; char access_age[16]; } The first part uses up one cache line, and the second line has 48 bytes of space left over that could fit an lwlock and the page_status and page_dirty arrays. Lookup + LRU maintenance takes 20 instructions with 14 cycles of latency, and the only branch is for found/not found. Hoping to have a working prototype of SLRU on top in the next couple of days. Regards, Ants Aasma
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Nov 8, 2023 at 10:52 AM Amul Sul wrote: Thanks for the review, Amul. > Here are some minor comments: > > + * By default, we'll use 1MB of for every 1GB of shared buffers, up to the > + * maximum value that slru.c will allow, but always at least 16 buffers. > */ > Size > CommitTsShmemBuffers(void) > { > - return Min(256, Max(4, NBuffers / 256)); > + /* Use configured value if provided. */ > + if (commit_ts_buffers > 0) > + return Max(16, commit_ts_buffers); > + return Min(256, Max(16, NBuffers / 256)); > > Do you mean "4MB of for every 1GB" in the comment? You are right. > -- > > diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h > index 5087cdce51..78d017ad85 100644 > --- a/src/include/access/commit_ts.h > +++ b/src/include/access/commit_ts.h > @@ -16,7 +16,6 @@ > #include "replication/origin.h" > #include "storage/sync.h" > > - > extern PGDLLIMPORT bool track_commit_timestamp; > > A spurious change. Will fix. > -- > > @@ -168,10 +180,19 @@ SimpleLruShmemSize(int nslots, int nlsns) > > if (nlsns > 0) > sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));/* > group_lsn[] */ > - > return BUFFERALIGN(sz) + BLCKSZ * nslots; > } > > Another spurious change in 0002 patch. Will fix. > -- > > +/* > + * The slru buffer mapping table is partitioned to reduce contention. To > + * determine which partition lock a given pageno requires, compute the > pageno's > + * hash code with SlruBufTableHashCode(), then apply SlruPartitionLock(). > + */ > > I didn't see SlruBufTableHashCode() & SlruPartitionLock() functions anywhere > in > your patches, is that outdated comment? Yes, I will fix it; actually, there are some major design changes in this area. 
> -- > > - sz += MAXALIGN(nslots * sizeof(LWLockPadded)); /* buffer_locks[] */ > - sz += MAXALIGN(SLRU_NUM_PARTITIONS * sizeof(LWLockPadded)); /* > part_locks[] */ > + sz += MAXALIGN((nslots + SLRU_NUM_PARTITIONS) * sizeof(LWLockPadded)); > /* locks[] */ > > I am a bit uncomfortable with these changes, merging parts and buffer locks > making it hard to understand the code. Not sure what we were getting out of > this? Yes, even I do not like this much because it is confusing. But the advantage is that we are using a single pointer for the locks, which means the next variable for the LRU counter will land in the same cache line, so frequent updates of the LRU counter should benefit from this. Although I don't have any numbers proving this. Currently, I want to focus on the base patches and keep this patch as an add-on; later, if we find it useful and want to pursue it, we will see how to make it more readable. > > Subject: [PATCH v4 5/5] Ensure slru buffer slots are in multiple of numbe of > partitions > > I think the 0005 patch can be merged to 0001. Yeah, in the next version it is done that way. Planning to post by end of day. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com