On Fri, 2022-02-25 at 09:38 +0000, Simon Riggs wrote:
> On Fri, 25 Feb 2022 at 09:24, Yura Sokolov <y.soko...@postgrespro.ru> wrote:
> 
> > > This approach is cleaner than v1, but should also perform better
> > > because there will be a 1:1 relationship between a buffer and its
> > > dynahash entry, most of the time.
> > 
> > Thank you for suggestion. Yes, it is much clearer than my initial proposal.
> > 
> > Should I incorporate it to v4 patch? Perhaps, it could be a separate
> > commit in new version.
> 
> I don't insist that you do that, but since the API changes are a few
> hours work ISTM better to include in one patch for combined perf
> testing. It would be better to put all changes in this area into PG15
> than to split it across multiple releases.
> 
> > Why there is need for this? Which way backend could be forced to abort
> > between BufTableReuse and BufTableAssign in this code path? I don't
> > see any CHECK_FOR_INTERRUPTS on the way, but may be I'm missing
> > something.
> 
> Sounds reasonable.

OK, here is v4.
It consists of two commits: one for the BufferAlloc locking change and
another for avoiding dynahash's freelist.

The buffer locking patch is the same as in v2, with some comment changes;
i.e. it uses Lock+UnlockBufHdr.

For dynahash, HASH_REUSE and HASH_ASSIGN are added as suggested.
HASH_REUSE stores the deleted element in a per-process static variable.
HASH_ASSIGN uses this element instead of the freelist; if there is no
such stored element, it falls back to HASH_ENTER.
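
To illustrate, the intended call pattern in BufferAlloc looks roughly like
this (a simplified sketch only; the partition locking around it is elided,
see the attached patches for the real code):

    /* Under the old tag's partition lock: detach the entry and park it in
     * the per-process DynaHashReuse variable instead of a freelist. */
    BufTableDelete(&oldTag, oldHash, true);     /* reuse = true -> HASH_REUSE */

    /* ... release the old partition lock, acquire the new one ... */

    /* HASH_ASSIGN consumes the parked element; if nothing was parked it
     * behaves exactly like HASH_ENTER. */
    buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);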

I've implemented Robert Haas's suggestion to count the elements in the
freelists instead of nentries:

> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition.  So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).

https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com

It helps to avoid taking the freelist lock just to update counters.
I implemented it by replacing "nentries" with "nfree" and adding
"nalloced" to each freelist. It also makes "hash_update_hash_key" valid
for a key that migrates between partitions.
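
For illustration, here is a toy, self-contained model of that accounting
(made-up types and numbers, not patch code): the live entry count is derived
from the two counters instead of being maintained on every insert and delete.

    #include <stdio.h>

    #define NUM_FREELISTS 32

    /* Toy model of the patched FreeListData: how many entries were ever
     * allocated for this freelist, and how many currently sit on it. */
    struct freelist
    {
        long    nalloced;
        long    nfree;
    };

    /* Live entries = sum(nalloced) - sum(nfree); no nentries counter, and
     * hence no freelist lock is needed just to keep a counter accurate. */
    static long
    num_live_entries(const struct freelist *fl)
    {
        long    sum = 0;

        for (int i = 0; i < NUM_FREELISTS; i++)
            sum += fl[i].nalloced - fl[i].nfree;
        return sum;
    }

    int
    main(void)
    {
        struct freelist fl[NUM_FREELISTS] = {{0, 0}};

        fl[0].nalloced = 128;
        fl[0].nfree = 100;          /* 28 entries of this freelist in use */
        fl[7].nalloced = 64;
        fl[7].nfree = 64;           /* all still free */
        printf("%ld\n", num_live_entries(fl));  /* prints 28 */
        return 0;
    }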

I believe there is no need for a "nalloced" field in each freelist;
a single such field in HASHHDR should be enough. Moreover, it seems to
me the `element_alloc` function does not need to acquire the freelist
partition lock, since it is called only during initialization of a
shared hash table. Am I right?

I didn't go down this path in v4 for simplicity, but I can do it in v5
if approved.

To be honest, the "reuse" patch gives only a small improvement, but it
is still measurable at some connection counts.

I tried reducing the number of freelist partitions to 8, but it had a
mixed impact. Most of the time performance is the same, but sometimes
it is a bit lower. I didn't investigate the reasons; perhaps they are
not related to the buffer manager.

I didn't introduce new functions BufTableReuse and BufTableAssign,
since there is a single call to BufTableInsert and two calls to
BufTableDelete. So I reused these functions and just added a "reuse"
flag to BufTableDelete.
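
The resulting buf_table API change is therefore just the extra flag
(sketch of the declaration and its semantics, matching the second patch):

    /*
     * reuse = false: the entry goes back to a dynahash freelist, as before
     *                (plain HASH_REMOVE);
     * reuse = true:  the entry is parked for the immediately following
     *                BufTableInsert (HASH_REUSE path).
     */
    extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);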

Tests: simple_select on a Xeon 8354H, with 128MB and 1GB shared buffers,
at scale 100.

1 socket:
  conns |     master |   patch_v4 |  master 1G | patch_v4 1G 
--------+------------+------------+------------+------------
      1 |      41975 |      41540 |      52898 |      52213 
      2 |      77693 |      77908 |      97571 |      98371 
      3 |     114713 |     115522 |     142709 |     145226 
      5 |     188898 |     187617 |     239322 |     237269 
      7 |     261516 |     260006 |     329119 |     329449 
     17 |     521821 |     519473 |     672390 |     662106 
     27 |     555487 |     555697 |     674630 |     672736 
     53 |     868213 |     896539 |    1190734 |    1202505 
     83 |     868232 |     866029 |    1164997 |    1158719 
    107 |     850477 |     845685 |    1140597 |    1134502 
    139 |     816311 |     816808 |    1101471 |    1091258 
    163 |     794788 |     796517 |    1078445 |    1071568 
    191 |     765934 |     776185 |    1059497 |    1041944 
    211 |     738656 |     777365 |    1083356 |    1046422 
    239 |     713124 |     841337 |    1104629 |    1116668 
    271 |     692138 |     847803 |    1094432 |    1128971 
    307 |     682919 |     849239 |    1086306 |    1127051 
    353 |     679449 |     842125 |    1071482 |    1117471 
    397 |     676217 |     844015 |    1058937 |    1118628 

2 sockets:
  conns |     master |   patch_v4 |  master 1G | patch_v4 1G 
--------+------------+------------+------------+------------
      1 |      44317 |      44034 |      53920 |      53583 
      2 |      81193 |      78621 |      99138 |      97968 
      3 |     120755 |     115648 |     148102 |     147423 
      5 |     190007 |     188943 |     232078 |     231029 
      7 |     258602 |     260649 |     325545 |     318567 
     17 |     551814 |     552914 |     692312 |     697518 
     27 |     787353 |     786573 |    1023509 |    1022891 
     53 |     973880 |    1008534 |    1228274 |    1278194 
     83 |    1108442 |    1269777 |    1596292 |    1648156 
    107 |    1072188 |    1339634 |    1542401 |    1664476 
    139 |    1000446 |    1316372 |    1490757 |    1676127 
    163 |     967378 |    1257445 |    1461468 |    1655574 
    191 |     926010 |    1189591 |    1435317 |    1639313 
    211 |     909919 |    1149905 |    1417437 |    1632764 
    239 |     895944 |    1115681 |    1393530 |    1616329 
    271 |     880545 |    1090208 |    1374878 |    1609544 
    307 |     865560 |    1066798 |    1355164 |    1593769 
    353 |     857591 |    1046426 |    1330069 |    1584006 
    397 |     840374 |    1024711 |    1312257 |    1564872 

--------

regards

Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From 22d9613accc70eb2f9e799b87e976d64540f36b4 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH v4 2/2] Add HASH_REUSE+HASH_ASSIGN and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses a buffer for a
different tag.

HASH_REUSE acts like HASH_REMOVE, but stores the element to reuse in a
static variable instead of a freelist partition. HASH_ASSIGN then uses
that element.

Unfortunately, FreeListData->nentries still had to be manipulated even in
this case. So instead, nentries is replaced with nfree (the actual length
of the free list) and nalloced (the number of entries allocated for the
free list). This was suggested by Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com
---
 src/backend/storage/buffer/buf_table.c |   9 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 130 ++++++++++++++++++++-----
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   4 +-
 5 files changed, 120 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index caa03ae1233..3362c7127e9 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -128,7 +128,7 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_ENTER,
+									HASH_ASSIGN,
 									&found);
 
 	if (found)					/* found something already in the table */
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If the reuse flag is true, the deleted entry is cached for reuse, and
+ * the caller must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5d2781f4813..85b62463c0d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1334,7 +1334,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* Delete old tag from hash table if it were valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1528,7 +1528,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 6546e3c7c79..9eb07593da7 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -138,8 +138,9 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nalloced tracks the number of hashtable entries allocated for the
+ * freelist.
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,13 +148,15 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
+	long		nalloced;		/* number of entries initially allocated for
+								 * the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
@@ -170,7 +173,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -254,6 +257,15 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for reuse element.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -293,6 +305,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ASSIGN
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -932,6 +950,10 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores the removed element in a
+ *					static variable instead of the free list.
+ *		HASH_ASSIGN: same as HASH_ENTER, but reuses the element stored by
+ *					HASH_REUSE, if any.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -1000,7 +1022,8 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		if (hctl->freeList[0].nfree == 0 &&
+			hctl->freeList[0].nalloced > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1067,10 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* If HASH_REUSE was not called, HASH_ASSIGN falls back to HASH_ENTER */
+	if (action == HASH_ASSIGN && DynaHashReuse.element == NULL)
+		action = HASH_ENTER;
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1084,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1116,6 +1140,7 @@ hash_search_with_hash_value(HTAB *hashp,
 							 errmsg("out of memory")));
 			}
 
+	link_element:
 			/* link into hashbucket chain */
 			*prevBucketPtr = currBucket;
 			currBucket->link = NULL;
@@ -1132,6 +1157,63 @@ hash_search_with_hash_value(HTAB *hashp,
 			 */
 
 			return (void *) ELEMENTKEY(currBucket);
+
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* check there is no unfinished HASH_REUSE+HASH_ASSIGN pair */
+				Assert(DynaHashReuse.hashp == NULL);
+				Assert(DynaHashReuse.element == NULL);
+
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and store for HASH_ASSIGN */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+
+				/* Caller should call HASH_ASSIGN as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
+		case HASH_ASSIGN:
+			/* check HASH_REUSE was called for the same hash table */
+			Assert(DynaHashReuse.hashp == hashp);
+
+			/*
+			 * If an existing element is found, the reuse element has to be
+			 * returned to a freelist. It doesn't matter much which freelist,
+			 * since we already migrate elements between freelists anyway.
+			 */
+			if (currBucket != NULL)
+			{
+
+				/* if partitioned, must lock to touch nfree and freeList */
+				if (IS_PARTITIONED(hctl))
+					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+				/* add the record to the appropriate freelist. */
+				DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+				hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+				hctl->freeList[freelist_idx].nfree++;
+
+				if (IS_PARTITIONED(hctl))
+					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+
+				DynaHashReuse.element = NULL;
+				DynaHashReuse.hashp = NULL;
+
+				return (void *) ELEMENTKEY(currBucket);
+			}
+
+			currBucket = DynaHashReuse.element;
+
+			DynaHashReuse.element = NULL;
+			DynaHashReuse.hashp = NULL;
+
+			/* reuse HASH_ENTER code */
+			goto link_element;
 	}
 
 	elog(ERROR, "unrecognized hash action code: %d", (int) action);
@@ -1301,7 +1383,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 
 	for (;;)
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
+		/* if partitioned, must lock to touch nfree and freeList */
 		if (IS_PARTITIONED(hctl))
 			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
 
@@ -1347,13 +1429,9 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 				if (newElement != NULL)
 				{
 					hctl->freeList[borrow_from_idx].freeList = newElement->link;
+					hctl->freeList[borrow_from_idx].nfree--;
 					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
 					return newElement;
 				}
 
@@ -1365,9 +1443,9 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		}
 	}
 
-	/* remove entry from freelist, bump nentries */
+	/* remove entry from freelist, decrease nfree */
 	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	hctl->freeList[freelist_idx].nfree--;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1382,7 +1460,10 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = 0;
+
+	sum += hashp->hctl->freeList[0].nalloced;
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1473,10 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+		{
+			sum += hashp->hctl->freeList[i].nalloced;
+			sum -= hashp->hctl->freeList[i].nfree;
+		}
 	}
 
 	return sum;
@@ -1739,6 +1823,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	hctl->freeList[freelist_idx].nalloced += nelem;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7c6653311a5..d35ee1b4108 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -328,7 +328,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index d7af0239c8c..ba21c5a65a1 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,9 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE,
+	HASH_ASSIGN
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1

From c1b8e6d60030d5d02287ae731ab604feeafa7486 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v4 1/2] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned,
so other processes cannot select it for eviction. Once the tag is cleared
and the buffer is removed from the old partition, other processes will not
find it. Therefore it is safe to release the old partition lock before
acquiring the new partition lock.
---
 src/backend/storage/buffer/bufmgr.c | 189 +++++++++++++---------------
 1 file changed, 89 insertions(+), 100 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..5d2781f4813 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1288,93 +1288,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1382,40 +1305,113 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * We are the single pinner and we hold the buffer header lock and the
+	 * exclusive partition lock (if the tag is valid), so it is safe to clear
+	 * the tag: no other process can inspect the buffer at the moment.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it were valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will
+	 * have to return our buffer to the free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it back
+		 * on the free list.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now it is safe to use the victim buffer for the new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1423,13 +1419,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1

<<attachment: res.zip>>
