On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
> 1. Callers who use GetPageWithFreeSpace() rather than
> GetPageFreeSpaceExtended() will fail to find the new pages if the
> upper map levels haven't been updated by VACUUM.
>
> 2. Even callers who use GetPageFreeSpaceExtended() may fail to find
> the new pages.  This can happen in two separate ways, namely (a) the

Yeah, that's the issue: if the extended pages spill onto the next FSM page,
then the other waiters will not find those pages, and one by one all the
waiters will end up adding extra pages.  For example, if there are ~30
waiters, then total blocks extended = (25*(25+1)/2) * 20 =~ 6500 pages.
This does not happen on every extension, but it will happen whenever the
heap blocks cross over to a new FSM page.

- An FSM page can hold info for 4096 heap blocks, so roughly every 8th
  extend (assuming 512 blocks are extended at a time) will extend ~6500
  pages.
- Any new request arriving at RelationGetBufferForTuple will be able to
  find those pages, because by that time the backend which extended them
  will have set the new target block using RelationSetTargetBlock.  (There
  is still a chance that some blocks are left completely unused until
  vacuum comes along.)

I have changed the patch as per the suggestion (just a POC, because the
performance numbers are not that great).

Below is a performance comparison of base, the previous patch (v13), and
the latest patch (v14).  Performance of v14 is significantly lower than
v13; I guess mainly for the reasons below:

1. As per the above calculation, v13 extends ~6500 blocks (after every 8th
   extend), and that's why it performs well.
2. In v13, as soon as we extend a block we add it to the FSM, so it is
   immediately available to a new requester.  (In this patch I also tried
   adding the pages to the FSM one by one and updating the FSM tree up to
   the root only after all pages were added, but saw no significant
   improvement.)
3. fsm_update_recursive doesn't seem like the problem to me.  Does it?

Copy 10000 tuples, of 4 bytes each.
---------------------------------------------
Client    base    patch v13    patch v14
1          118          147          126
2          217          276          269
4          210          421          347
8          166          630          375
16         145          813          415
32         124          985          451
64                      974          455

Insert 1000 tuples of 1K size each.
---------------------------------------------
Client    base    patch v13    patch v14
1          117          124          119
2          111          126          119
4           51          128          124
8           43          149          131
16          40          217          120
32                      263          115
64                      248          109

Note: I think the 1-client numbers may just be run-to-run variance.

Does anyone see a problem in updating the FSM tree this way?  I have
debugged it and seen that we are able to get the pages properly from the
tree, and the same is visible in the performance numbers of v14 compared
to base.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..9ac1eb7 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,62 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber blockNum,
+				firstBlock,
+				lastBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller
+	 * numbers were insufficient.  512 is just an arbitrary cap to prevent
+	 * pathological results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	firstBlock = lastBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		lastBlock = blockNum;
+	}
+
+	/*
+	 * Put the pages in the freespace map so other backends can find them.
+	 * This is what will keep those other backends from also queueing up on
+	 * the relation extension lock.
+	 */
+	FreeSpaceMapBulkExtend(relation, firstBlock, lastBlock);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +289,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +364,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +497,46 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..5f45891 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,7 +109,10 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
-
+static bool fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2);
+static void fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+			  uint16 lastSlot, uint8 newValue);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
 
 /******** Public API ********/
 
@@ -189,6 +192,49 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Sets all the FSM entries for pages between firstBlock and lastBlock to
+ * BLCKSZ and then bubbles that up to the higher levels of the tree, all
+ * the way to the root.
+ */
+void
+FreeSpaceMapBulkExtend(Relation rel, BlockNumber firstBlock,
+					   BlockNumber lastBlock)
+{
+	int			new_cat = fsm_space_avail_to_cat(BLCKSZ);
+	FSMAddress	addr1,
+				addr2;
+	uint16		firstSlot;
+	uint16		lastSlot;
+	uint16		slot;
+	BlockNumber blockNum;
+
+	blockNum = firstBlock;
+	addr1 = fsm_get_location(firstBlock, &slot);
+
+	while (blockNum < lastBlock)
+	{
+		blockNum++;
+
+		addr2 = fsm_get_location(blockNum, &slot);
+
+		if (!fsm_addr_on_same_page(addr1, addr2)
+			|| blockNum == lastBlock)
+		{
+			/* This block is on the next FSM page, so update the current one. */
+			fsm_get_location(firstBlock, &firstSlot);
+			fsm_get_location(blockNum - 1, &lastSlot);
+			fsm_set_range(rel, addr1, firstSlot, lastSlot, new_cat);
+			fsm_update_recursive(rel, addr1, new_cat);
+
+			/*
+			 * Continue updating the FSM until the last page; make addr1 the
+			 * new starting address and continue the search from this page.
+			 */
+			addr1 = addr2;
+		}
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +834,56 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+/*
+ * Check whether two FSM addresses refer to the same FSM page.
+ */
+static bool
+fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2)
+{
+	if ((addr1.level != addr2.level)
+		|| (addr1.logpageno != addr2.logpageno))
+		return false;
+
+	return true;
+}
+
+/*
+ * Set the given value in the given FSM page for a range of slots.
+ */
+static void
+fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+			  uint16 lastSlot, uint8 newValue)
+{
+	Buffer		buf;
+	Page		page;
+	int			slot;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+	page = BufferGetPage(buf);
+
+	for (slot = firstSlot; slot <= lastSlot; slot++)
+		fsm_set_avail(page, slot, newValue);
+
+	MarkBufferDirtyHint(buf, false);
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Bubble the new category up from the given address to the root of the tree.
+ */
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ * ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ * RelationExtensionLockWaiterCount
+ *
+ * Count the number of requesters of the relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  * UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of requesters of the given locktag.
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock	   *partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..0f2a2be 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,7 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void FreeSpaceMapBulkExtend(Relation rel, BlockNumber firstBlock,
+					   BlockNumber lastBlock);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1, PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int	LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers