Hi,

I've updated the patch to include the optimization described in the
previous post, i.e. if the number of relations is below a certain
threshold, a simple linear search (for loop) is used; for larger numbers
of relations the code switches to bsearch calls.
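
In miniature, the dispatch looks like this (a standalone C sketch with
made-up names, not code from the patch - the patch does the same thing
with RelFileNodeBackend entries and rnode_comparator()):

#include <stdio.h>
#include <stdlib.h>

#define BSEARCH_LIMIT 10	/* same threshold as in the patch */

static int
int_cmp(const void *a, const void *b)
{
	const int x = *(const int *) a;
	const int y = *(const int *) b;

	return (x > y) - (x < y);
}

/* Is 'key' among the 'n' sorted 'keys'? Walk the array while it's
 * small, switch to bsearch() once the threshold is exceeded. */
static int
key_is_present(int key, const int *keys, int n)
{
	if (n <= BSEARCH_LIMIT)
	{
		int		i;

		for (i = 0; i < n; i++)
			if (keys[i] == key)
				return 1;
		return 0;
	}
	return bsearch(&key, keys, n, sizeof(int), int_cmp) != NULL;
}

int
main(void)
{
	int		keys[] = {42, 7, 19, 3, 11};

	/* sort up front, just like the patch pg_qsort()s the rnode array
	 * before scanning the shared buffers */
	qsort(keys, 5, sizeof(int), int_cmp);

	printf("%d %d\n", key_is_present(11, keys, 5), key_is_present(5, keys, 5));
	return 0;
}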

This is controlled by a new constant BSEARCH_LIMIT, which is set to 10
in the patch. Then I've modified the 'drop-test' script to take one more
argument - the number of indexes for each table. So for example this

./drop-test.py 10000 100 3 'dbname=test'

means "create 10000 tables, 3 indexes for each of them, and then drop
them in batches of 100 tables."

Then I've run the test with 0, 1, 2, ..., 11 indexes for each table, for

(a) unpatched HEAD
(b) patch v3.1 (without the optimization)
(c) patch v3.3 (with BSEARCH_LIMIT=10)

and I got these results (the columns are the number of indexes per table):


1) dropping one-by-one
----------------------

This is the really interesting part - the issue with v3.1 is that when
each DROP involves only a single relation (a table with no indexes),
it's ~2x slower than unpatched PostgreSQL.

             0   1   2   3   4   5   6   7    8    9   10   11
--------------------------------------------------------------
unpatched   16  28  40  52  63  75  87  99  110  121  135  147
     v3.1   33  43  46  56  58  60  63  72   75   76   79   80
     v3.3   16  20  23  25  29  33  36  40   44   47   79   82

The values are durations in seconds, rounded to integer values. I've run
the test repeatedly and there's very little variance in the numbers.
Note that the jump in v3.3 between 9 and 10 indexes is exactly where the
number of relations per DROP (table + indexes) exceeds BSEARCH_LIMIT, so
the code switches to the bsearch path and behaves like v3.1.

The v3.3 fixes that, and is actually even faster than unpatched
PostgreSQL. How can that happen? I believe it's because the code is
rewritten from

   for each relation (r) in the drop list
      DropRelFileNodeAllBuffers (r)
         for each shared buffer
            check and invalidate

to

   copy the relations from drop list to array (a)
   DropRelFileNodeAllBuffers(a)
      for each shared buffer
          for each relation (r) in the array
              check and invalidate

In other words, the shared buffers are scanned once per batch instead of
once per relation, so each buffer header is touched only once. At least
that's the only explanation I was able to come up with.
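
A rough sanity check of that theory (back-of-the-envelope numbers, not
measured): with 2GB of shared buffers and the default 8kB block size,
NBuffers = 262144. Dropping one table with 3 indexes means 4 relations,
so per DROP:

   unpatched:  4 scans x 262144 buffer headers = ~1M header visits
   patched:    1 scan  x 262144 buffer headers, with up to 4 tag
               comparisons per header

The total number of RelFileNodeEquals() checks is about the same in
both cases, but the buffer headers are walked and inspected only once,
which is consistent with the measured numbers.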

Yet another interesting observation is that even v3.1 is about as fast
as the unpatched code once there are 3 or more indexes (or TOAST
tables), presumably because each DROP then covers enough relations to
amortize the per-batch overhead.

So my opinion is that the optimized patch works as expected, and that
even without the optimization the performance would be acceptable for
most real-world cases.

2) dropping in transaction
--------------------------

This is mostly to verify that the code did not break anything - with
batches of 100 tables the number of relations is far above
BSEARCH_LIMIT, so the linear-search shortcut should not kick in here at
all. And that seems
to be the case:

             0   1   2   3   4   5   6   7    8    9   10   11
--------------------------------------------------------------
unpatched   13  24  35  46  58  69  81  92  103  115  127  139
     v3.1    3   5   7   8  10  12  14  15   16   18   20   21
     v3.3    3   4   6   7   8  10  11  13   14   15   18   20

The differences between v3.1 and v3.3 are mostly due to rounding and noise.


Attached is the v3.3 patch and the testing script I've been using for
the tests above. Feel free to run the tests on your own hardware, with
your own shared buffers size etc. I've run them on a 4-core i5-2500 CPU
with 2GB of shared buffers.

Tomas


#!/usr/bin/env python

import datetime
import psycopg2
import sys

if __name__ == '__main__':
	
	# four arguments are required, so argv must have at least 5 entries
	if len(sys.argv) < 5:
		print "ERROR: not enough parameters"
		print "HINT:  drop-test.py num-of-tables drop-num num-of-indexes 'connection string'"
		sys.exit(1)
	
	ntables = int(sys.argv[1])
	nlimit  = int(sys.argv[2])
	nindex  = int(sys.argv[3])
	connstr = str(sys.argv[4])
	debug   = False
	
	conn = psycopg2.connect(connstr)
	cur  = conn.cursor()
	
	# print 'creating %s tables' % (ntables,)
	start = datetime.datetime.now()
	for i in range(ntables):
		cur.execute('CREATE TABLE tab_%s (id INT)' % (i,))
		for j in range(nindex):
			cur.execute('CREATE INDEX idx_%s_%s ON tab_%s(id)' % (i,j,i))
		conn.commit()
		if (i % 1000 == 0) and debug:
			print '  tables created: %s' % (i,)
	conn.commit()
	end = datetime.datetime.now()
	# print '  all tables created in %s seconds' % ((end-start).total_seconds(),)

	# set to autocommit mode
	conn.autocommit = True
	start = datetime.datetime.now()
	# print 'dropping %s tables one by one ...' % (ntables,)
	for i in range(ntables):
		cur.execute('DROP TABLE tab_%s' % (i,))
		if (i % 1000 == 0) and debug:
			print '  tables dropped: %s' % (i,)

	end = datetime.datetime.now()
	print 'dropped one-by-one in %s seconds' % ((end-start).total_seconds(),)

	# cancel the autocommit mode
	conn.autocommit = False
	
	# recreate tables
	# print 'creating %s tables' % (ntables,)
	start = datetime.datetime.now()
	for i in range(ntables):
		cur.execute('CREATE TABLE tab_%s (id INT)' % (i,))
		for j in range(nindex):
			cur.execute('CREATE INDEX idx_%s_%s ON tab_%s(id)' % (i,j,i))
		conn.commit()
		if (i % 1000 == 0) and debug:
			print '  tables created: %s' % (i,)

	conn.commit()
	end = datetime.datetime.now()
	# print '  all tables created in %s seconds' % ((end-start).total_seconds(),)
	
	# drop the tables in batches
	# print 'dropping %s tables in batches by %s...' % (ntables,nlimit)
	start = datetime.datetime.now()
	for i in range(ntables):
		cur.execute('DROP TABLE tab_%s' % (i,))
		# commit after every nlimit drops (i is 0-based)
		if (i + 1) % nlimit == 0:
			conn.commit()
			if debug:
				print '  tables dropped: %s' % (i + 1,)
	conn.commit()
	end = datetime.datetime.now()
	print 'dropped in a TX in %s seconds' % ((end-start).total_seconds(),)
	
	conn.close()
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 8dc622a..8c12a43 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -312,6 +312,11 @@ smgrDoPendingDeletes(bool isCommit)
        PendingRelDelete *pending;
        PendingRelDelete *prev;
        PendingRelDelete *next;
+    
+       SMgrRelation   *srels = palloc(sizeof(SMgrRelation));
+       int                     nrels = 0,
+                               i = 0,
+                               maxrels = 1;
 
        prev = NULL;
        for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -335,14 +340,31 @@ smgrDoPendingDeletes(bool isCommit)
                                SMgrRelation srel;
 
                                srel = smgropen(pending->relnode, pending->backend);
-                               smgrdounlink(srel, false);
-                               smgrclose(srel);
+                               
+                               /* extend the array if needed (double the size) */
+                               if (maxrels <= nrels) {
+                                       maxrels *= 2;
+                                       srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+                               }
+                               
+                               srels[nrels++] = srel;
                        }
                        /* must explicitly free the list entry */
                        pfree(pending);
                        /* prev does not change */
                }
        }
+
+       if (nrels > 0)
+       {
+               smgrdounlinkall(srels, nrels, false);
+               
+               for (i = 0; i < nrels; i++)
+                       smgrclose(srels[i]);
+       }
+       
+       pfree(srels);
+
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dddb6c0..7a968f8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -62,6 +62,7 @@
 #define BUF_WRITTEN                            0x01
 #define BUF_REUSABLE                   0x02
 
+#define BSEARCH_LIMIT                  10
 
 /* GUC variables */
 bool           zero_damaged_pages = false;
@@ -108,6 +109,7 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);
 
+static int rnode_comparator(const void * p1, const void * p2);
 
 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
@@ -2094,31 +2096,64 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
  * --------------------------------------------------------------------
  */
 void
-DropRelFileNodeAllBuffers(RelFileNodeBackend rnode)
+DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes)
 {
-       int                     i;
+       int         i, j;
+
+       /* sort the list of rnodes */
+       pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator);
 
        /* If it's a local relation, it's localbuf.c's problem. */
-       if (RelFileNodeBackendIsTemp(rnode))
+       for (i = 0; i < nnodes; i++)
        {
-               if (rnode.backend == MyBackendId)
-                       DropRelFileNodeAllLocalBuffers(rnode.node);
-               return;
+               if (RelFileNodeBackendIsTemp(rnodes[i]))
+               {
+                       if (rnodes[i].backend == MyBackendId)
+                               DropRelFileNodeAllLocalBuffers(rnodes[i].node);
+               }
        }
 
        for (i = 0; i < NBuffers; i++)
        {
+               RelFileNodeBackend *rnode = NULL;
                volatile BufferDesc *bufHdr = &BufferDescriptors[i];
-
+ 
                /*
                 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
                 * and saves some cycles.
                 */
-               if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+
+               /*
+                * For a low number of relations to drop, just use a simple walk through
+                * to save the bsearch overhead. The BSEARCH_LIMIT is rather a guess
+                * than an exactly determined value, as it depends on many factors (CPU
+                * and RAM speeds, amount of shared buffers etc.).
+                */
+               if (nnodes <= BSEARCH_LIMIT)
+               {
+                       for (j = 0; j < nnodes; j++)
+                       {
+                               if (RelFileNodeEquals(bufHdr->tag.rnode, rnodes[j].node))
+                               {
+                                       rnode = &rnodes[j];
+                                       break;
+                               }
+                       }
+               }
+               else
+               {
+                       rnode = bsearch((const void *) &(bufHdr->tag.rnode),
+                                       rnodes,
+                                       nnodes, sizeof(RelFileNodeBackend),
+                                       rnode_comparator);
+               }
+
+               /* buffer does not belong to any of the relations */
+               if (rnode == NULL)
                        continue;
 
                LockBufHdr(bufHdr);
-               if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+               if (RelFileNodeEquals(bufHdr->tag.rnode, rnode->node))
                        InvalidateBuffer(bufHdr);       /* releases spinlock */
                else
                        UnlockBufHdr(bufHdr);
@@ -2953,3 +2988,31 @@ local_buffer_write_error_callback(void *arg)
                pfree(path);
        }
 }
+
+/*
+ * Used to sort relfilenode array (ordered by [relnode, dbnode, spcnode]), so
+ * that it's suitable for bsearch.
+ */
+static int
+rnode_comparator(const void * p1, const void * p2)
+{
+       RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1;
+       RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2;
+
+       if (n1.node.relNode < n2.node.relNode)
+               return -1;
+       else if (n1.node.relNode > n2.node.relNode)
+               return 1;
+
+       if (n1.node.dbNode < n2.node.dbNode)
+               return -1;
+       else if (n1.node.dbNode > n2.node.dbNode)
+               return 1;
+
+       if (n1.node.spcNode < n2.node.spcNode)
+               return -1;
+       else if (n1.node.spcNode > n2.node.spcNode)
+               return 1;
+       else
+               return 0;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5dff8b3..bbce945 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -390,7 +390,7 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
         * Get rid of any remaining buffers for the relation.  bufmgr will just
         * drop them without bothering to write the contents.
         */
-       DropRelFileNodeAllBuffers(rnode);
+       DropRelFileNodeAllBuffers(&rnode, 1);
 
        /*
         * It'd be nice to tell the stats collector to forget it immediately, too.
@@ -419,6 +419,68 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
        (*(smgrsw[which].smgr_unlink)) (rnode, InvalidForkNumber, isRedo);
 }
 
+void smgrdounlinkall(SMgrRelation * rels, int nrels, bool isRedo)
+{
+       int i = 0;
+       RelFileNodeBackend *rnodes;
+       ForkNumber  forknum;
+
+       /* initialize an array which contains all relations to be dropped */
+       rnodes = palloc(sizeof(RelFileNodeBackend) * nrels);
+       for (i = 0; i < nrels; i++)
+       {
+               RelFileNodeBackend rnode = rels[i]->smgr_rnode;
+               int                     which = rels[i]->smgr_which;
+
+               rnodes[i] = rnode;
+
+               /* Close the forks at smgr level */
+               for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+                       (*(smgrsw[which].smgr_close)) (rels[i], forknum);
+       }
+
+       /*
+        * Get rid of any remaining buffers for the relation.  bufmgr will just
+        * drop them without bothering to write the contents.
+        */
+       DropRelFileNodeAllBuffers(rnodes, nrels);
+
+       /*
+        * It'd be nice to tell the stats collector to forget it immediately, too.
+        * But we can't because we don't know the OID (and in cases involving
+        * relfilenode swaps, it's not always clear which table OID to forget,
+        * anyway).
+        */
+
+       /*
+        * Send a shared-inval message to force other backends to close any
+        * dangling smgr references they may have for this rel.  We should do this
+        * before starting the actual unlinking, in case we fail partway through
+        * that step.  Note that the sinval message will eventually come back to
+        * this backend, too, and thereby provide a backstop that we closed our
+        * own smgr rel.
+        */
+       for (i = 0; i < nrels; i++)
+               CacheInvalidateSmgr(rnodes[i]);
+
+       /*
+        * Delete the physical file(s).
+        *
+        * Note: smgr_unlink must treat deletion failure as a WARNING, not an
+        * ERROR, because we've already decided to commit or abort the current
+        * xact.
+        */
+
+       for (i = 0; i < nrels; i++)
+       {
+               int     which = rels[i]->smgr_which;
+               for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+                       (*(smgrsw[which].smgr_unlink)) (rnodes[i], forknum, isRedo);
+       }
+
+       pfree(rnodes);
+}
+
 /*
  *     smgrdounlinkfork() -- Immediately unlink one fork of a relation.
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 51eb77b..1deee42 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -188,7 +188,7 @@ extern void FlushRelationBuffers(Relation rel);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                                           ForkNumber forkNum, BlockNumber firstDelBlock);
-extern void DropRelFileNodeAllBuffers(RelFileNodeBackend rnode);
+extern void DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes);
 extern void DropDatabaseBuffers(Oid dbid);
 
 #define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 9eccf76..272476b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,6 +85,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdounlinkall(SMgrRelation * rels, int nrels, bool isRedo);
 extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                   BlockNumber blocknum, char *buffer, bool skipFsync);