On Wed, Mar 2, 2016 at 10:31 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:

> 1. One option, as you suggested, is something like ProcArrayGroupClearXid,
> with some modification: if we wait for the request and extend only by that
> amount, we may face the context-switch problem again, so we could extend
> by some multiple of the request.
> (But we need to decide whether to give a block directly to the requester,
> add it to the FSM, or do both, i.e. give at least one block to the
> requester and put some multiple in the FSM.)
>
>
> 2. The other option is to analyse the current load on the extension lock
> and then speed up or slow down the rate of extension accordingly.
> A simple algorithm could look like this:
>

I have tried the group-extend approach:

1. We convert the extension lock to a TryLock; if we get the lock, we
extend by one block.
2. If we don't get the lock, we use the group-leader concept, where only
one process extends on behalf of all. The slight change from
ProcArrayGroupClearXid is that, in addition to satisfying the requesting
backends, we add some extra blocks to the FSM, say GroupSize * 10.
3. So we cannot measure the exact load, but the group size tells us the
contention level, and we extend by a multiple of it.

Performance Analysis:
---------------------
Performance scales with this approach. It is slightly lower than with the
previous patch, where we passed an extend_by_block parameter and extended
by multiple blocks directly, and I think that is expected: in the
group-extend case we add extra blocks only after contention occurs on the
lock, whereas the former patch extended extra blocks optimistically.

Test1(COPY)
-----
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo "COPY data from '/tmp/copybinary' WITH BINARY;" > copy_script

./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffers: 8GB   max_wal_size: 10GB   Storage: Magnetic Disk   WAL: SSD
-----------------------------------------------------------------------------
Client    Base    multi_extend by 20 pages    group_extend patch (group*10)
1          105             157                            149
2          217             255                            282
4          210             494                            452
8          166             702                            629
16         145             563                            561
32         124             480                            566

Test2(INSERT)
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)"
echo "insert into data select * from test_data;" >> insert_script

 shared_buffers=512GB  max_wal_size=20GB  checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client    Base    Multi-extend by 1000    group extend (group*10)    group extend (group*100)
1          117            118                     125                        122
2          111            140                     161                        159
4           51            190                     141                        187
8           43            254                     148                        172
16          40            243                     150                        173


* (group*10) means that, inside the code, the group leader sees how many
members of the group want blocks, satisfies the request of every member,
and then puts extra blocks in the FSM; the number of extra blocks to
extend is group_size * 10, where 10 is just some constant.


Summary:
------------
1. With the group-extend patch there is no configuration parameter saying
how many blocks to extend; that is decided by the current load on the
system (i.e. contention on the extension lock).
2. With a small multiplier, i.e. 10, we get a fairly good improvement over
the base code, but when the load is high, e.g. with a 1KB record size,
increasing the multiplier gives better results.


* Note: this is currently a POC patch. It has only one group-extend list,
so it can handle group extension for only one relation at a time.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..adb82ba 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -23,6 +23,7 @@
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
+#include "storage/proc.h"
 
 
 /*
@@ -168,6 +169,160 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 	}
 }
 
+
+static BlockNumber
+GroupExtendRelation(PGPROC *proc, Relation relation, BulkInsertState bistate)
+{
+	volatile PROC_HDR *procglobal = ProcGlobal;
+	uint32		nextidx;
+	uint32		wakeidx;
+	int			extraWaits = -1;
+	BlockNumber targetBlock;
+	int count = 0;
+
+	/* Add ourselves to the list of processes needing a group extend. */
+	proc->groupExtendMember = true;
+
+	while (true)
+	{
+		nextidx = pg_atomic_read_u32(&procglobal->extendGroupFirst);
+		pg_atomic_write_u32(&proc->extendGroupNext, nextidx);
+
+		if (pg_atomic_compare_exchange_u32(&procglobal->extendGroupFirst,
+										   &nextidx,
+										   (uint32) proc->pgprocno))
+			break;
+	}
+
+	/*
+	 * If the list was not empty, the leader will assign our block.  It is
+	 * impossible to have followers without a leader because the first process
+	 * that has added itself to the list will always have nextidx as
+	 * INVALID_PGPROCNO.
+	 */
+	if (nextidx != INVALID_PGPROCNO)
+	{
+		/* Sleep until the leader assigns us a block. */
+		for (;;)
+		{
+			/* acts as a read barrier */
+			PGSemaphoreLock(&proc->sem);
+			if (!proc->groupExtendMember)
+				break;
+			extraWaits++;
+		}
+
+		Assert(pg_atomic_read_u32(&proc->extendGroupNext) == INVALID_PGPROCNO);
+
+		/* Fix semaphore count for any absorbed wake ups */
+		while (extraWaits-- > 0)
+			PGSemaphoreUnlock(&proc->sem);
+
+		targetBlock = proc->blockNum;
+
+		proc->blockNum = InvalidBlockNumber;
+
+		return targetBlock;
+	}
+
+	/* We are the leader.  Acquire the lock on behalf of everyone. */
+	LockRelationForExtension(relation, ExclusiveLock);
+
+	/*
+	 * Now that we've got the lock, clear the list of processes waiting for
+	 * group extending
+	 */
+	while (true)
+	{
+		nextidx = pg_atomic_read_u32(&procglobal->extendGroupFirst);
+		if (pg_atomic_compare_exchange_u32(&procglobal->extendGroupFirst,
+										   &nextidx,
+										   INVALID_PGPROCNO))
+			break;
+	}
+
+	/* Remember head of list so we can perform wakeups after dropping lock. */
+	wakeidx = nextidx;
+
+
+	/* Walk the list and allocate one block for each member. */
+	while (nextidx != INVALID_PGPROCNO)
+	{
+		PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
+		Buffer buffer;
+
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+		proc->blockNum = BufferGetBlockNumber(buffer);
+
+		ReleaseBuffer(buffer);
+
+		/* Move to next proc in list. */
+		nextidx = pg_atomic_read_u32(&proc->extendGroupNext);
+
+		count ++;
+	}
+
+	count = count * 10;			/* extra FSM blocks: group size times a constant */
+
+	while (count-- > 0)
+	{
+		Buffer buffer;
+		Page		page;
+		Size		freespace;
+		BlockNumber blockNum;
+
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+
+	/* We're done with the lock now. */
+	UnlockRelationForExtension(relation, ExclusiveLock);
+
+	/*
+	 * Now that we've released the lock, go back and wake everybody up.  We
+	 * don't do this under the lock so as to keep lock hold times to a
+	 * minimum.  The system calls we need to perform to wake other processes
+	 * up are probably much slower than the simple memory writes we did while
+	 * holding the lock.
+	 */
+	while (wakeidx != INVALID_PGPROCNO)
+	{
+		PGPROC	   *proc = &ProcGlobal->allProcs[wakeidx];
+
+		wakeidx = pg_atomic_read_u32(&proc->extendGroupNext);
+		pg_atomic_write_u32(&proc->extendGroupNext, INVALID_PGPROCNO);
+
+		/* ensure all previous writes are visible before follower continues. */
+		pg_write_barrier();
+
+		proc->groupExtendMember = false;
+
+		if (proc != MyProc)
+			PGSemaphoreUnlock(&proc->sem);
+		else
+		{
+			targetBlock = proc->blockNum;
+			proc->blockNum = InvalidBlockNumber;
+		}
+	}
+
+	return targetBlock;
+}
+
 /*
  * RelationGetBufferForTuple
  *
@@ -238,6 +393,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
+	int			extraBlocks = 0;
 
 	len = MAXALIGN(len);		/* be conservative */
 
@@ -308,6 +464,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -441,27 +598,95 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	needLock = !RELATION_IS_LOCAL(relation);
 
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (TryLockRelationForExtension(relation, ExclusiveLock) != LOCKACQUIRE_NOT_AVAIL)
+		{
+			buffer = ReadBufferBI(relation, P_NEW, bistate);
+			/*
+			 * We can be certain that locking the otherBuffer first is OK, since
+			 * it must have a lower page number.
+			 */
+			if ((otherBuffer != InvalidBuffer) && !extraBlocks)
+				LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+			/*
+			 * Now acquire lock on the new page.
+			 */
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			UnlockRelationForExtension(relation, ExclusiveLock);
+		}
+		else
+		{
+			targetBlock = GroupExtendRelation(MyProc, relation, bistate);
+			goto loop;
+		}
+	}
+	else
+	{
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+		/*
+		 * We can be certain that locking the otherBuffer first is OK, since
+		 * it must have a lower page number.
+		 */
+		if ((otherBuffer != InvalidBuffer) && !extraBlocks)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	}
+
+#if 0
+	if (LOCKACQUIRE_NOT_AVAIL)
+
+	if (use_fsm)
+		extraBlocks = RelationGetExtendBlocks(relation) - 1;
+	else
+		extraBlocks = 0;
 	/*
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
 	 * rather than relying on the kernel to do it for us?
 	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	do
+	{
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/*
+		 * We can be certain that locking the otherBuffer first is OK, since
+		 * it must have a lower page number.
+		 */
+		if ((otherBuffer != InvalidBuffer) && !extraBlocks)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		if (extraBlocks)
+		{
+			Page		page;
+			Size		freespace;
+			BlockNumber blockNum;
+
+			page = BufferGetPage(buffer);
+			PageInit(page, BufferGetPageSize(buffer), 0);
+
+			freespace = PageGetHeapFreeSpace(page);
+			MarkBufferDirty(buffer);
+			blockNum = BufferGetBlockNumber(buffer);
+			UnlockReleaseBuffer(buffer);
+			RecordPageWithFreeSpace(relation, blockNum, freespace);
+		}
+
+	} while (extraBlocks--);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
@@ -471,7 +696,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	if (needLock)
 		UnlockRelationForExtension(relation, ExclusiveLock);
-
+#endif
 	/*
 	 * We need to initialize the empty new page.  Double-check that it really
 	 * is empty (this should never happen, but if it does we don't want to
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 9d16afb..08e5e07 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -340,6 +340,21 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 	(void) LockAcquire(&tag, lockmode, false, false);
 }
 
+/*
+ *		TryLockRelationForExtension
+ *		As above, but return without waiting if the lock is not available.
+ */
+LockAcquireResult
+TryLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockAcquire(&tag, lockmode, false, true);
+}
 /*
  *		UnlockRelationForExtension
  */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6453b88..a823eda 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -182,6 +182,7 @@ InitProcGlobal(void)
 	ProcGlobal->walwriterLatch = NULL;
 	ProcGlobal->checkpointerLatch = NULL;
 	pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
+	pg_atomic_init_u32(&ProcGlobal->extendGroupFirst, INVALID_PGPROCNO);
 
 	/*
 	 * Create and initialize all the PGPROC structures we'll need.  There are
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index dbcdd3f..5a0e60b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -152,6 +152,11 @@ struct PGPROC
 	 */
 	TransactionId	procArrayGroupMemberXid;
 
+	/* support for group relation extension */
+	bool		groupExtendMember;	/* true if waiting in a group extend */
+	pg_atomic_uint32	extendGroupNext;	/* next member of extend group */
+	uint32		blockNum;		/* block number assigned by the leader */
+
 	/* Per-backend LWLock.  Protects fields below (but not group fields). */
 	LWLock		backendLock;
 
@@ -223,6 +228,8 @@ typedef struct PROC_HDR
 	PGPROC	   *bgworkerFreeProcs;
 	/* First pgproc waiting for group XID clear */
 	pg_atomic_uint32 procArrayGroupFirst;
+	/* First pgproc waiting for group relation extend */
+	pg_atomic_uint32 extendGroupFirst;
 	/* WALWriter process's latch */
 	Latch	   *walwriterLatch;
 	/* Checkpointer process's latch */
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)