Hi hackers,

Recently I have finished my work on a patch for pg_buffercache contrib, I think it's time to share my results.


Introduction
============

I want to offer you the implementation that allows to decrease system workload by partially sacrificing (fully snapshot consistency) data consistency. Sometimes we do not need full data consistency, for example on quantitative rather than qualitative analysis of memory contents, or when we want to catch insufficient memory resources or how often relation is used.


Problem description
===================

Currently, the pg_buffercache v1.1 and prior takes an exclusive lock on all shared buffers, which greatly affects system performance. Usually we use pg_buffercache to find out why DB is working slower than expected or examine what occupies the entire memory. So when we run pg_buffercache on such system, we make it even slower.


Implementation
==============

Vanilla implementation contains loop which collecting statistic from whole shared memory acquire, read and release Spinlocks one by one, page by page while holding LWLock.

V1.2 implementation contains flexible loop which can collect shared memory statistic using three different methods: 1) with holding LWLock only on one partition of shared memory (semiconsistent method)
2) without LWLocks (nonconsistent method),
3) or in vanilia way (consistent method)

The aforementioned allow us to launch pg_buffercache in the three different ways.
Each of them have some advantages and some disadvantages:

Consistent:
+ 100% consistency of shared memory snapshot
- Slowdown the system with whole shared memory exclusive lock

Semiconsistent:
+ Faster than consistent method
+ Mostly doesn`t affect on the system load
- Speed of taking that snapshot is low
Nonconsistent:
The fastest
+ Doesn`t noticeably affects on the systems
- <3% lost of snapshot consistency

What works
==========

Actually, it work well even on big load, but of course there might be things I've
overlooked.
VIEW pg_buffercache_cons
VIEW pg_buffercache_noncons
VIEW pg_buffercache_semicons

Examples from docs in new realization:
SELECT c.relname, count(*) AS buffers FROM pg_buffercache_noncons b INNER JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) GROUP BY c.relname ORDER BY 2 DESC LIMIT 10;

SELECT c.relname, count(*) AS buffers FROM pg_buffercache_semicons b INNER JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database())) GROUP BY c.relname ORDER BY 2 DESC LIMIT 10;


Testing the implementation
======================

How implementation tested:
1) Start server
2) Make pgbench tps
        pgbench -c 250 -s 1000  -T 200 -P1
3) Compare how tps sags under load if:
        SELECT count(*) FROM pg_buffercache_cons;
        SELECT count(*) FROM pg_buffercache_semicons;
        SELECT count(*) FROM pg_buffercache_noncons;

This test was made on server (server parameters)
Model name:            Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz
CPU(s):                144
Socket(s):             4
Shared_buffers:        200GB


Results of testing
======================

Our DBA team obtained the following results:
Nonconsistent:
        * 10% faster then consistent method
        * doesn`t noticeably affects on the systems
* the maximum loss of accuracy was less then 3%* ( in most situation it is permissible accuracy loss )

Semiconsistent:
        * 5 time slower then nonconsistent
        * made less affects on system compared to consistent

Overall results:
Our clients was pleased with this implementation.
Implementation is made with backward compatibility, as a conclusion old pg_buffercache v1.1 queries will work well. Semiconsistent show results approaching to nonconsistent on SELECTONLY queries.

* this values were obtained from our DBA tests.

What can be better
===============

It is unclear how to optimize the semiconsistent method to make it faster, and reduce temporary effect that appears from time to time.


I will be glad to see your feedback!

---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index 065d3d6..8813c50 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -4,7 +4,7 @@ MODULE_big = pg_buffercache
 OBJS = pg_buffercache_pages.o $(WIN32RES)
 
 EXTENSION = pg_buffercache
-DATA = pg_buffercache--1.1.sql pg_buffercache--1.0--1.1.sql pg_buffercache--unpackaged--1.0.sql
+DATA = pg_buffercache--1.2.sql pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql pg_buffercache--unpackaged--1.0.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
 ifdef USE_PGXS
diff --git a/contrib/pg_buffercache/README b/contrib/pg_buffercache/README
new file mode 100644
index 0000000..35a7607
--- /dev/null
+++ b/contrib/pg_buffercache/README
@@ -0,0 +1,3 @@
+In version pg_buffercache--1.2 been added parameter (consistent = cons and semiconsistent = semicons)
+
+semiconsistent pg_buffercache works 10% faster and use LWLock free algorithm
diff --git a/contrib/pg_buffercache/pg_buffercache--1.1--1.2.sql b/contrib/pg_buffercache/pg_buffercache--1.1--1.2.sql
new file mode 100644
index 0000000..cd50f93
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.1--1.2.sql
@@ -0,0 +1,27 @@
+/* contrib/pg_buffercache/pg_buffercache--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache(text) UPDATE TO '1.2'" to load this file. \quit
+
+-- Register the function.
+CREATE FUNCTION pg_buffercache_pages(mode text)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C;
+
+-- Make special type pg_buffercache
+create type tbuffercache as 
+	(bufferid integer, relfilenode oid,
+	reltablespace oid, reldatabase oid,
+	relforknumber int2, relblocknumber int8,
+	isdirty bool, usagecount int2,	pinning_backends int4);
+
+-- Create a function for convenient access.
+create function pg_buffercache(mod text)
+returns setof tbuffercache
+as 'SELECT * FROM pg_buffercache_pages(mod) AS (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,	pinning_backends int4);'
+language 'sql';
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages(text) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache(text) FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache--1.1.sql b/contrib/pg_buffercache/pg_buffercache--1.1.sql
deleted file mode 100644
index f3b6482..0000000
--- a/contrib/pg_buffercache/pg_buffercache--1.1.sql
+++ /dev/null
@@ -1,21 +0,0 @@
-/* contrib/pg_buffercache/pg_buffercache--1.1.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
-
--- Register the function.
-CREATE FUNCTION pg_buffercache_pages()
-RETURNS SETOF RECORD
-AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
-LANGUAGE C;
-
--- Create a view for convenient access.
-CREATE VIEW pg_buffercache AS
-	SELECT P.* FROM pg_buffercache_pages() AS P
-	(bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
-	 relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
-	 pinning_backends int4);
-
--- Don't want these to be available to public.
-REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
-REVOKE ALL ON pg_buffercache FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache--1.2.sql b/contrib/pg_buffercache/pg_buffercache--1.2.sql
new file mode 100644
index 0000000..23d7b7d
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.2.sql
@@ -0,0 +1,50 @@
+/* contrib/pg_buffercache/pg_buffercache--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
+
+-- Register the function.
+CREATE FUNCTION pg_buffercache_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C;
+CREATE FUNCTION pg_buffercache_pages(mode text)
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_pages'
+LANGUAGE C;
+
+-- Make special type pg_buffercache
+create type tbuffercache as
+	(bufferid integer, relfilenode oid,
+	reltablespace oid, reldatabase oid,
+	relforknumber int2, relblocknumber int8,
+	isdirty bool, usagecount int2,	pinning_backends int4);
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache AS
+	SELECT P.* FROM pg_buffercache_pages() AS P
+	(bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+	 relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+	 pinning_backends int4);
+
+CREATE VIEW pg_buffercache_cons AS SELECT P.* FROM pg_buffercache_pages('cons') AS P (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,pinning_backends int4);
+
+CREATE VIEW pg_buffercache_noncons AS SELECT P.* FROM pg_buffercache_pages('noncons') AS P (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,pinning_backends int4);
+
+CREATE VIEW pg_buffercache_semicons AS SELECT P.* FROM pg_buffercache_pages('semicons') AS P (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,pinning_backends int4);
+
+-- Create a function for convenient access.
+create function pg_buffercache(mod text)
+returns setof tbuffercache
+as 'SELECT * FROM pg_buffercache_pages(mod) AS (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,	pinning_backends int4);' 
+language 'sql';
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache_pages(text) FROM PUBLIC;
+REVOKE ALL ON FUNCTION pg_buffercache(text) FROM PUBLIC;
+
+REVOKE ALL ON pg_buffercache_cons FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_noncons FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_semicons FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5494e2f..a4d664f 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
 # pg_buffercache extension
 comment = 'examine the shared buffer cache'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/pg_buffercache'
 relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 17b4b6f..0175936 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -53,6 +53,15 @@ typedef struct
 	BufferCachePagesRec *record;
 } BufferCachePagesContext;
 
+/*
+ * Struct for consistency control
+ */
+typedef enum
+{
+	CONS_SNAP,
+	SEMICONS_SNAP,
+	NONCONS_SNAP
+} ConsistencyType;
 
 /*
  * Function returning data from the shared buffer cache - buffer number,
@@ -70,10 +79,37 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 	TupleDesc	tupledesc;
 	TupleDesc	expected_tupledesc;
 	HeapTuple	tuple;
+	int			num_partit = 1;			/* Number of partition. */
+	ConsistencyType snaptype  = CONS_SNAP; /* Flag for consistency control. */
+
+	if (PG_NARGS() > 0)
+	{
+		text	   *type = PG_GETARG_TEXT_P(0);
+
+		if (VARSIZE_ANY_EXHDR(type) == 4 &&
+			strncmp(VARDATA(type), "cons", 4) == 0)
+			snaptype = CONS_SNAP;
+		else if (VARSIZE_ANY_EXHDR(type) == 8 &&
+				 strncmp(VARDATA(type), "semicons", 8) == 0)
+			{
+				snaptype = SEMICONS_SNAP;
+				num_partit = NUM_BUFFER_PARTITIONS;
+			}
+		else if (VARSIZE_ANY_EXHDR(type) == 7 &&
+				 strncmp(VARDATA(type), "noncons", 7) == 0)
+			snaptype = NONCONS_SNAP;
+		else
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("invalid consistency type"),
+					 errhint("Valid consistency types are \"cons\", \"semicons\" and \"noncons\".")));
+			PG_RETURN_INT64(0);		/* Placate compiler. */
+	}
 
 	if (SRF_IS_FIRSTCALL())
 	{
-		int			i;
+		int			i, j = 1;
 
 		funcctx = SRF_FIRSTCALL_INIT();
 
@@ -138,55 +174,69 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 		 * partitions of the buffer map.  Needless to say, this is horrible
 		 * for concurrency.  Must grab locks in increasing order to avoid
 		 * possible deadlocks.
+		 * So there is alternative by LWlock free and one partition Lock method
+		 * respectively named Nonconcistent and Semiconsistent
 		 */
-		for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
-			LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
 
-		/*
-		 * Scan through all the buffers, saving the relevant fields in the
-		 * fctx->record structure.
-		 */
-		for (i = 0; i < NBuffers; i++)
+		for (j = 0; j < num_partit; j++)
 		{
-			BufferDesc *bufHdr;
-			uint32		buf_state;
-
-			bufHdr = GetBufferDescriptor(i);
-			/* Lock each buffer header before inspecting. */
-			buf_state = LockBufHdr(bufHdr);
-
-			fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
-			fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode;
-			fctx->record[i].reltablespace = bufHdr->tag.rnode.spcNode;
-			fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode;
-			fctx->record[i].forknum = bufHdr->tag.forkNum;
-			fctx->record[i].blocknum = bufHdr->tag.blockNum;
-			fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
-			fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
-			if (buf_state & BM_DIRTY)
-				fctx->record[i].isdirty = true;
-			else
-				fctx->record[i].isdirty = false;
-
-			/* Note if the buffer is valid, and has storage created */
-			if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
-				fctx->record[i].isvalid = true;
-			else
-				fctx->record[i].isvalid = false;
-
-			UnlockBufHdr(bufHdr, buf_state);
+			if (snaptype == CONS_SNAP)
+				for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
+					LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
+			else if (snaptype == SEMICONS_SNAP)
+				LWLockAcquire(BufMappingPartitionLockByIndex(j), LW_SHARED);
+
+			for (i = 0; i < NBuffers; i++)
+			{
+				BufferDesc *bufHdr;
+				uint32		buf_state;
+
+				bufHdr = GetBufferDescriptor(i);
+				/* Lock each buffer header before inspecting. */
+
+				if (bufHdr->tag.forkNum == -1) 
+				{ 
+					fctx->record[i].blocknum = InvalidBlockNumber;
+					continue;
+				}
+
+				buf_state = LockBufHdr(bufHdr);
+
+				fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
+				fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode;
+				fctx->record[i].reltablespace = bufHdr->tag.rnode.spcNode;
+				fctx->record[i].reldatabase = bufHdr->tag.rnode.dbNode;
+				fctx->record[i].forknum = bufHdr->tag.forkNum;
+				fctx->record[i].blocknum = bufHdr->tag.blockNum;
+				fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+				fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+				if (buf_state & BM_DIRTY)
+					fctx->record[i].isdirty = true;
+				else
+					fctx->record[i].isdirty = false;
+
+				/* Note if the buffer is valid, and has storage created */
+				if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+					fctx->record[i].isvalid = true;
+				else
+					fctx->record[i].isvalid = false;
+
+				UnlockBufHdr(bufHdr, buf_state);
+			}
+			if (snaptype == CONS_SNAP)
+				for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
+					LWLockRelease(BufMappingPartitionLockByIndex(i));
+			else if (snaptype == SEMICONS_SNAP)
+				/*
+				 * And release locks.  We do this in reverse order for two reasons:
+				 * (1) Anyone else who needs more than one of the locks will be trying
+				 * to lock them in increasing order; we don't want to release the
+				 * other process until it can get all the locks it needs. (2) This
+				 * avoids O(N^2) behavior inside LWLockRelease.
+				 */
+				LWLockRelease(BufMappingPartitionLockByIndex(j));
 		}
-
-		/*
-		 * And release locks.  We do this in reverse order for two reasons:
-		 * (1) Anyone else who needs more than one of the locks will be trying
-		 * to lock them in increasing order; we don't want to release the
-		 * other process until it can get all the locks it needs. (2) This
-		 * avoids O(N^2) behavior inside LWLockRelease.
-		 */
-		for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
-			LWLockRelease(BufMappingPartitionLockByIndex(i));
 	}
 
 	funcctx = SRF_PERCALL_SETUP();
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to