On Mon, 14/03/2022 at 14:57 +0300, Yura Sokolov wrote:
> On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:
> > At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.soko...@postgrespro.ru> 
> > wrote in 
> > > On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > > > I'd like to ask you to remove nalloced from partitions then add a
> > > > global atomic for the same use?
> > > 
> > > I really believe it should be global. I made it per-partition so as
> > > not to overcomplicate the first versions. Glad you brought it up.
> > > 
> > > I thought of protecting it with freeList[0].mutex, but probably an
> > > atomic is a better idea here. But which atomic to choose: uint64 or
> > > uint32? Based on sizeof(long)?
> > > Ok, I'll do it in the next version.
> > 
> > Current nentries is a long (= int64 on CentOS). And uint32 can support
> > roughly 2^32 * 8192 = 32TB shared buffers, which doesn't seem safe
> > enough.  So it would be uint64.
> > 
> > > The whole of get_hash_entry looks strange.
> > > Wouldn't it be better to cycle through partitions and only then go to
> > > get_hash_entry?
> > > Maybe there should be a bitmap of non-empty free lists? 32 bits for
> > > 32 partitions. But wouldn't the bitmap become a contention point itself?
> > 
> > The code puts significance on avoiding the contention caused by visiting
> > the freelists of other partitions, and perhaps assumes that freelist
> > shortage rarely happens.
> > 
> > I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> > 128kB shared buffers and saw that get_hash_entry never takes the
> > !element_alloc() path and always allocates a fresh entry, then
> > saturates at 30 newly allocated elements around the middle of a
> > 100-second run.
> > 
> > Then I tried the same with the patch, and I was surprised to see that
> > the rise in the number of newly allocated elements didn't stop and
> > went up to 511 elements by the end of the 100-second run.  So I found
> > that my concern was valid.  The change in dynahash actually causes a
> > continuous/repeated shortage of free list entries.  I'm not sure how
> > much impact changing get_hash_entry to prefer other freelists would
> > have on performance, though.
> 
> Well, it is quite strange that SharedBufHash is not allocated as
> HASH_FIXED_SIZE. Could you check what happens with this flag set?
> I'll try as well.
> 
> Another way to reduce the observed effect is to remember the freelist_idx
> of the reused entry. I didn't believe it mattered much, since entries
> migrate between freelists anyway, but probably due to some hot buffers
> there is a tendency to crowd a particular freelist.

Well, I did both. Everything looks ok.
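
To make the second point concrete: the freelist_idx now travels with the
element stashed by HASH_REUSE, so when the following HASH_ENTER finds a
duplicate, the stashed element is pushed back onto the freelist it came
from rather than onto the current hashcode's freelist. Simplified from
patch 2 below:

/* Stash filled by HASH_REUSE and consumed by the next HASH_ENTER. */
struct HASHREUSE
{
	HTAB	   *hashp;			/* table the element was removed from */
	HASHBUCKET	element;		/* the removed element itself */
	int			freelist_idx;	/* freelist the element belonged to */
};

static struct HASHREUSE DynaHashReuse = {NULL, NULL, 0};

On HASH_REUSE the element is unlinked from its bucket and parked in this
stash instead of being pushed onto freeList[freelist_idx]; on the next
HASH_ENTER it is either consumed for the new key or, if a duplicate entry
already exists, returned to freeList[DynaHashReuse.freelist_idx].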

> > By the way, there's the following comment in StrategyInitialize.
> > 
> > >        * Initialize the shared buffer lookup hashtable.
> > >        *
> > >        * Since we can't tolerate running out of lookup table entries, we must be
> > >        * sure to specify an adequate table size here.  The maximum steady-state
> > >        * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> > >        * a new entry before deleting the old.  In principle this could be
> > >        * happening in each partition concurrently, so we could need as many as
> > >        * NBuffers + NUM_BUFFER_PARTITIONS entries.
> > >        */
> > >       InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> > 
> > "but BufferAlloc() tries to insert a new entry before deleting the
> > old." becomes false with this patch, but we still need the additional
> > room for stashed entries.  It seems to need a fix.

I removed the whole paragraph, because a fixed-size table without extra
entries works just fine.
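
To spell out why the extra room is no longer needed: with patch 1 the old
mapping is always removed before the new one is inserted, and with patch 2
the removed entry itself is stashed and then consumed by the insert, so the
table never holds more than NBuffers live entries at any moment. A
condensed, illustrative sketch of the new ordering (the helper name is made
up for this mail; the real logic lives in BufferAlloc, patches 1 and 2):

/*
 * Illustrative sketch only: condenses the new BufferAlloc ordering.
 * Collision handling and the buffer-header locking details are omitted;
 * see patch 1 for the real code.
 */
static void
reuse_victim_mapping(BufferDesc *buf, BufferTag oldTag, uint32 oldHash,
					 BufferTag newTag, uint32 newHash, bool oldTagValid)
{
	/*
	 * 1. The victim's tag and flags are cleared under its header lock;
	 *    the buffer stays pinned, so nobody else can grab it.
	 */

	/* 2. Drop the old mapping; HASH_REUSE stashes the entry for step 4. */
	if (oldTagValid)
		BufTableDelete(&oldTag, oldHash, true);

	/* 3. Swap the old partition lock for the new one (never both at once). */

	/* 4. Insert under the new tag, consuming the stashed entry. */
	(void) BufTableInsert(&newTag, newHash, buf->buf_id);
}

Since delete always precedes insert, NBuffers entries are enough, and
HASH_FIXED_SIZE (which, as far as I remember, just makes element_alloc
refuse to allocate more elements) becomes safe to set.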

I lost access to the Xeon 8354H, so I returned to the old Xeon X5675.

128MB and 1GB shared buffers
pgbench with scale 100
select_only benchmark, unix sockets.

Notebook i7-1165G7:


  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      29614 |      29285 |      32413 |      32784 
      2 |      58541 |      60052 |      65851 |      65938 
      3 |      91126 |      90185 |     101404 |     101956 
      5 |     135809 |     133670 |     143783 |     143471 
      7 |     155547 |     153568 |     162566 |     162361 
     17 |     221794 |     218143 |     250562 |     250136 
     27 |     213742 |     211226 |     241806 |     242594 
     53 |     216067 |     214792 |     245868 |     246269 
     83 |     216610 |     218261 |     246798 |     250515 
    107 |     216169 |     216656 |     248424 |     250105 
    139 |     208892 |     215054 |     244630 |     246439 
    163 |     206988 |     212751 |     244061 |     248051 
    191 |     203842 |     214764 |     241793 |     245081 
    211 |     201304 |     213997 |     240863 |     246076 
    239 |     199313 |     211713 |     239639 |     243586 
    271 |     196712 |     211849 |     236231 |     243831 
    307 |     194879 |     209813 |     233811 |     241303 
    353 |     191279 |     210145 |     230896 |     241039 
    397 |     188509 |     207480 |     227812 |     240637 

X5675 1 socket:

  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      18590 |      18473 |      19652 |      19051 
      2 |      34899 |      34799 |      37242 |      37432 
      3 |      51484 |      51393 |      54750 |      54398 
      5 |      71037 |      70564 |      76482 |      75985 
      7 |      87391 |      86937 |      96185 |      95433 
     17 |     122609 |     123087 |     140578 |     140325 
     27 |     120051 |     120508 |     136318 |     136343 
     53 |     116851 |     117601 |     133338 |     133265 
     83 |     113682 |     116755 |     131841 |     132736 
    107 |     111925 |     116003 |     130661 |     132386 
    139 |     109338 |     115011 |     128319 |     131453 
    163 |     107661 |     114398 |     126684 |     130677 
    191 |     105000 |     113745 |     124850 |     129909 
    211 |     103607 |     113347 |     123469 |     129302 
    239 |     101820 |     112428 |     121752 |     128621 
    271 |     100060 |     111863 |     119743 |     127624 
    307 |      98554 |     111270 |     117650 |     126877 
    353 |      97530 |     110231 |     115904 |     125351 
    397 |      96122 |     109471 |     113609 |     124150 

X5675 2 sockets:

  conns |     master |         v8 |  master 1G |      v8 1G 
--------+------------+------------+------------+------------
      1 |      17815 |      17577 |      19321 |      19187 
      2 |      34312 |      35655 |      37121 |      36479 
      3 |      51868 |      52165 |      56048 |      54984 
      5 |      81704 |      82477 |      90945 |      90109 
      7 |     107937 |     105411 |     116015 |     115810 
     17 |     191339 |     190813 |     216899 |     215775 
     27 |     236541 |     238078 |     278507 |     278073 
     53 |     230323 |     231709 |     267226 |     267449 
     83 |     225560 |     227455 |     261996 |     262344 
    107 |     221317 |     224030 |     259694 |     259553 
    139 |     206945 |     219005 |     254817 |     256736 
    163 |     197723 |     220353 |     251631 |     257305 
    191 |     193243 |     219149 |     246960 |     256528 
    211 |     189603 |     218545 |     245362 |     255785 
    239 |     186382 |     217229 |     240006 |     255024 
    271 |     183141 |     216359 |     236927 |     253069 
    307 |     179275 |     215218 |     232571 |     252375 
    353 |     175559 |     213298 |     227244 |     250534 
    397 |     172916 |     211627 |     223513 |     248919 

Strange thing: both master and the patched version have a higher
peak tps on the X5675 at medium connection counts (17 or 27 clients)
than in the first October version [1], but lower tps at higher
connection counts (>= 191 clients).
I'll try to bisect this unfortunate change on master.

The October master was at commit 2d44dee0281a1abf and today's is at 7e12256b478b895.

(There is a small possibility that I tested with TCP sockets in October
and with UNIX sockets today, and that made the difference.)

[1] https://postgr.es/m/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

-------

regards
Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru


From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/4] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned,
so other processes cannot select it for eviction. Once the tag is cleared
and the buffer is removed from the old partition, other processes will not
find it. Therefore it is safe to release the old partition lock before
acquiring the new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 198 ++++++++++++++--------------
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * its tag first, so we need to hold exclusive lock on the old mapping
+		 * partition.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1383,40 +1307,117 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * We are the single pinner; we hold the buffer header lock and exclusive
+	 * partition lock (if the tag is valid). This means no other process can
+	 * inspect the buffer at the moment.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * But we will release the partition lock and the buffer header lock. We
+	 * must be sure no other backend will use this buffer until we reuse it
+	 * for a new tag. Therefore, we clear out the buffer's tag and flags and
+	 * remove it from the buffer table. The buffer also remains pinned, so
+	 * StrategyGetBuffer will not try to reuse it concurrently.
+	 *
+	 * We also reset the usage_count since any recent use of the old
+	 * content is no longer relevant.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete the old tag from the hash table if it was valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because, while we were writing, someone else allocated
+	 * another buffer for the same block we want to read in. In that case we
+	 * will have to return our buffer to the free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it back
+		 * on the free list.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now reuse victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1424,13 +1425,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1


From 51da98121aa2404d1e3e3a42f5f40fddb9877e61 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH 2/4] Add HASH_REUSE and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses a buffer for a
different tag.

HASH_REUSE acts like HASH_REMOVE, but stores the element to reuse in a
static variable instead of a freelist partition. A subsequent HASH_ENTER
may then use that element.

Unfortunately, FreeListData->nentries would still have to be manipulated
even in this case. So instead of manipulating nentries, we replace it with
nfree - the actual length of the free list - and nalloced - the number of
entries allocated for the table. This was suggested by Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
---
 src/backend/storage/buffer/buf_table.c |   7 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 142 +++++++++++++++++++++----
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   3 +-
 5 files changed, 129 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940faa..c189555751e 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If the reuse flag is true, the deleted entry is cached for reuse, and the
+ * caller must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7dbfc90aaa..a16da37fe3d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1340,7 +1340,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Delete the old tag from the hash table if it was valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1534,7 +1534,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d704..4d44276e3e6 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -98,6 +98,7 @@
 
 #include "access/xact.h"
 #include "common/hashfn.h"
+#include "port/atomics.h"
 #include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
@@ -138,8 +139,7 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,16 +147,30 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
+#if SIZEOF_LONG == 4
+typedef pg_atomic_uint32 nalloced_t;
+typedef uint32 nalloced_value_t;
+#define nalloced_read(a)	(long)pg_atomic_read_u32(a)
+#define nalloced_add(a, v)	pg_atomic_fetch_add_u32((a), (uint32)(v))
+#define nalloced_init(a)	pg_atomic_init_u32((a), 0)
+#else
+typedef pg_atomic_uint64 nalloced_t;
+typedef uint64 nalloced_value_t;
+#define nalloced_read(a)	(long)pg_atomic_read_u64(a)
+#define nalloced_add(a, v)	pg_atomic_fetch_add_u64((a), (uint64)(v))
+#define nalloced_init(a)	pg_atomic_init_u64((a), 0)
+#endif
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -170,7 +184,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -195,6 +209,7 @@ struct HASHHDR
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
 	int			nelem_alloc;	/* number of entries to allocate at once */
+	nalloced_t	nalloced;		/* number of entries allocated */
 
 #ifdef HASH_STATISTICS
 
@@ -254,6 +269,16 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for the element stashed by HASH_REUSE for later use by HASH_ENTER.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+	int			freelist_idx;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -293,6 +318,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ENTER
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL, 0};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -640,6 +671,8 @@ hdefault(HTAB *hashp)
 	hctl->ssize = DEF_SEGSIZE;
 	hctl->sshift = DEF_SEGSIZE_SHIFT;
 
+	nalloced_init(&hctl->nalloced);
+
 #ifdef HASH_STATISTICS
 	hctl->accesses = hctl->collisions = 0;
 #endif
@@ -932,6 +965,8 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores the removed element in a
+ *					static variable instead of the free list.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -943,6 +978,11 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  * HASH_ENTER_NULL cannot be used with the default palloc-based allocator,
  * since palloc internally ereports on out-of-memory.
  *
+ * If HASH_REUSE was called, the next dynahash operation must be HASH_ENTER
+ * on the same dynahash instance; otherwise an assertion will be triggered.
+ * HASH_ENTER will reuse the element stored by HASH_REUSE if no duplicate
+ * entry is found.
+ *
  * If foundPtr isn't NULL, then *foundPtr is set true if we found an
  * existing entry in the table, false otherwise.  This is needed in the
  * HASH_ENTER case, but is redundant with the return value otherwise.
@@ -1000,7 +1040,10 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		long		nentries;
+
+		nentries = nalloced_read(&hctl->nalloced) - hctl->freeList[0].nfree;
+		if (nentries > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1087,11 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* Check there is no unfinished HASH_REUSE + HASH_ENTER pair */
+	Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);
+	/* If HASH_REUSE was called, check it was on the same dynahash */
+	Assert(DynaHashReuse.element == NULL || DynaHashReuse.hashp == hashp);
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1105,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1084,6 +1129,22 @@ hash_search_with_hash_value(HTAB *hashp,
 			}
 			return NULL;
 
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and stash it for the following HASH_ENTER */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+				DynaHashReuse.freelist_idx = freelist_idx;
+
+				/* Caller must perform HASH_ENTER as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
 		case HASH_ENTER_NULL:
 			/* ENTER_NULL does not work with palloc-based allocator */
 			Assert(hashp->alloc != DynaHashAlloc);
@@ -1092,14 +1153,47 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_ENTER:
 			/* Return existing element if found, else create one */
 			if (currBucket != NULL)
+			{
+				if (likely(DynaHashReuse.element == NULL))
+					return (void *) ELEMENTKEY(currBucket);
+
+				freelist_idx = DynaHashReuse.freelist_idx;
+				/* if partitioned, must lock to touch nfree and freeList */
+				if (IS_PARTITIONED(hctl))
+					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+				/* add the record to the appropriate freelist. */
+				DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+				hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+				hctl->freeList[freelist_idx].nfree++;
+
+				if (IS_PARTITIONED(hctl))
+					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+				DynaHashReuse.freelist_idx = 0;
+
 				return (void *) ELEMENTKEY(currBucket);
+			}
 
 			/* disallow inserts if frozen */
 			if (hashp->frozen)
 				elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
 					 hashp->tabname);
 
-			currBucket = get_hash_entry(hashp, freelist_idx);
+			if (DynaHashReuse.element == NULL)
+			{
+				currBucket = get_hash_entry(hashp, freelist_idx);
+			}
+			else
+			{
+				currBucket = DynaHashReuse.element;
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+				DynaHashReuse.freelist_idx = 0;
+			}
+
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1301,7 +1395,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 	for (;;)
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
+		/* if partitioned, must lock to touch nfree and freeList */
 		if (IS_PARTITIONED(hctl))
 			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
 
@@ -1346,14 +1440,11 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 				if (newElement != NULL)
 				{
+					Assert(hctl->freeList[borrow_from_idx].nfree > 0);
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
+					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1456,10 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist, decrease nfree */
+	Assert(hctl->freeList[freelist_idx].nfree > 0);
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	hctl->freeList[freelist_idx].nfree--;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1474,9 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = nalloced_read(&hashp->hctl->nalloced);
+
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1486,7 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+			sum -= hashp->hctl->freeList[i].nfree;
 	}
 
 	return sum;
@@ -1739,6 +1833,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	nalloced_add(&hctl->nalloced, nelem);
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf0..2ffcde678a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -328,7 +328,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 854c3312414..1ffb616d99e 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,8 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1


From 2a086abe8c184a88cf984b2ef2cdc732aa64f1b7 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 14 Mar 2022 17:22:26 +0300
Subject: [PATCH 3/4] fixed BufTable

Since elements are now deleted before insertion in BufferAlloc, there is
no need for excess BufTable elements. And it looks like the table can
safely be declared as HASH_FIXED_SIZE.
---
 src/backend/storage/buffer/buf_table.c |  3 ++-
 src/backend/storage/buffer/freelist.c  | 13 +++----------
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index c189555751e..55bb491ad05 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -63,7 +63,8 @@ InitBufTable(int size)
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
 								  size, size,
 								  &info,
-								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION |
+								  HASH_FIXED_SIZE);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3b98e68d50f..f4733434a3b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -455,8 +455,8 @@ StrategyShmemSize(void)
 {
 	Size		size = 0;
 
-	/* size of lookup hash table ... see comment in StrategyInitialize */
-	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+	/* size of lookup hash table */
+	size = add_size(size, BufTableShmemSize(NBuffers));
 
 	/* size of the shared replacement strategy control block */
 	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
@@ -478,15 +478,8 @@ StrategyInitialize(bool init)
 
 	/*
 	 * Initialize the shared buffer lookup hashtable.
-	 *
-	 * Since we can't tolerate running out of lookup table entries, we must be
-	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
 	 */
-	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
+	InitBufTable(NBuffers);
 
 	/*
 	 * Get or create the shared strategy control block
-- 
2.35.1


From 649d69f8a3d175502f67c777688d635ba8d70d44 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Thu, 3 Mar 2022 01:14:58 +0300
Subject: [PATCH 4/4] reduce memory allocation for non-partitioned
 dynahash

A non-partitioned hash table doesn't use the 32 partitions of
HASHHDR->freeList.  Let's allocate just a single free list.
---
 src/backend/utils/hash/dynahash.c | 37 +++++++++++++++++--------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 4d44276e3e6..50c0e476432 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -181,18 +181,6 @@ typedef uint64 nalloced_value_t;
  */
 struct HASHHDR
 {
-	/*
-	 * The freelist can become a point of contention in high-concurrency hash
-	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nfree count, instead of just a single one.  Although the freelists
-	 * normally operate independently, we will scavenge entries from freelists
-	 * other than a hashcode's default freelist when necessary.
-	 *
-	 * If the hash table is not partitioned, only freeList[0] is used and its
-	 * spinlock is not used at all; callers' locking is assumed sufficient.
-	 */
-	FreeListData freeList[NUM_FREELISTS];
-
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
 	long		dsize;			/* directory size */
@@ -220,6 +208,18 @@ struct HASHHDR
 	long		accesses;
 	long		collisions;
 #endif
+
+	/*
+	 * The freelist can become a point of contention in high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nfree count, instead of just a single one.  Although the freelists
+	 * normally operate independently, we will scavenge entries from freelists
+	 * other than a hashcode's default freelist when necessary.
+	 *
+	 * If the hash table is not partitioned, only freeList[0] is used and its
+	 * spinlock is not used at all; callers' locking is assumed sufficient.
+	 */
+	FreeListData freeList[NUM_FREELISTS];
 };
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
@@ -294,7 +294,7 @@ static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
@@ -537,7 +537,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Assert(!(flags & HASH_PARTITION));
+		hashp->hctl = (HASHHDR *) hashp->alloc(offsetof(HASHHDR, freeList[1]));
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -546,7 +547,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
-	hdefault(hashp);
+	hdefault(hashp, (flags & HASH_PARTITION) != 0);
 
 	hctl = hashp->hctl;
 
@@ -654,11 +655,13 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
  * Set default HASHHDR parameters.
  */
 static void
-hdefault(HTAB *hashp)
+hdefault(HTAB *hashp, bool partition)
 {
 	HASHHDR    *hctl = hashp->hctl;
 
-	MemSet(hctl, 0, sizeof(HASHHDR));
+	MemSet(hctl, 0, partition ?
+		   sizeof(HASHHDR) :
+		   offsetof(HASHHDR, freeList[1]));
 
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
-- 
2.35.1
