In a recent discussion[1] with Simon Riggs, there was some talk of changing the bgwriter. To summarize the problem: the bgwriter currently scans the entire T1+T2 buffer lists and returns a list of all the currently dirty buffers. It then selects a subset of that list (computed using bgwriter_percent and bgwriter_maxpages) to flush to disk. Not only does this mean we can end up scanning a significant portion of shared_buffers on every invocation of the bgwriter, we also do the scan while holding the BufMgrLock, which likely hurts scalability.
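
To make the current selection step concrete, here is a minimal sketch of the limit arithmetic, taken from the pre-patch BufferSync() in the diff below. The function name pages_to_write is made up for illustration and does not exist in the tree; only the arithmetic comes from the existing code.

```c
#include <assert.h>

/*
 * Sketch of the pre-patch limit computation in BufferSync(): given the
 * number of dirty buffers found by the full scan, decide how many of
 * them to write.  A percent/maxpages value of -1 means "no limit" (the
 * checkpoint case).  pages_to_write is a hypothetical name.
 */
int
pages_to_write(int num_dirty, int percent, int maxpages)
{
	/* Either limit set to zero disables writing entirely. */
	if (percent == 0 || maxpages == 0)
		return 0;

	if (percent > 0)
	{
		assert(percent <= 100);
		/* Round any fraction up to the next whole buffer. */
		num_dirty = (num_dirty * percent + 99) / 100;
	}
	if (maxpages > 0 && num_dirty > maxpages)
		num_dirty = maxpages;

	return num_dirty;
}
```

With the current defaults (bgwriter_percent=1, bgwriter_maxpages=100) and 1000 dirty buffers, this yields 10 writes, yet the full list scan is still needed just to learn that there are 1000 dirty buffers.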

I think a fix for this in some fashion is warranted for 8.0. Possible solutions:

(1) Special-case bgwriter_percent=100. The only reason we need to return a list of all the dirty buffers is so that we can choose n% of them to satisfy bgwriter_percent. That is obviously unnecessary if we have bgwriter_percent=100. I think this change won't help most users, *unless* we also make bgwriter_percent=100 the default.

(2) Remove bgwriter_percent. I have yet to hear anyone argue that there's an actual need for bgwriter_percent in tuning bgwriter behavior, and one less GUC var is a good thing, all else being equal. This is effectively the same as #1 with the default changed, only with less flexibility.

(3) Change the meaning of bgwriter_percent, per Simon's proposal. Make it mean "the percentage of the buffer pool to scan, at most, to look for dirty buffers". I don't think this is workable, at least not at this point in the release cycle, because it means we might not smooth out the checkpoint load, which is one of the primary goals of the bgwriter. (In this proposal the bgwriter would only ever consider writing out a small subset of the total shared buffer cache: the least-recently-used n%, with 2% being a suggested default.) Some variant of this might be worth exploring for 8.1 though.

A patch (implementing #2) is attached -- any benchmark results would be helpful. Increasing shared_buffers (to 10,000 or more) should make the problem noticeable.
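
For anyone who wants the gist of #2 without reading the diff: the scan can stop as soon as it has collected maxpages dirty buffers, rather than walking the whole list first. A rough standalone sketch of that idea follows. It uses a plain array in LRU order instead of the real T1/T2 lists, and collect_dirty with all of its parameters is hypothetical, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Collect up to max_buffers dirty entries from an array that is assumed
 * to be in LRU-to-MRU order, stopping as soon as the limit is reached.
 * The number of entries examined is stored in *visited so the saving
 * over a full scan can be observed.  Hypothetical helper, not the
 * patch's StrategyDirtyBufferList().
 */
int
collect_dirty(const bool *is_dirty, int nbuffers, int max_buffers,
			  int *out_ids, int *visited)
{
	int			num_dirty = 0;
	int			i;

	assert(max_buffers >= 0);
	*visited = 0;

	for (i = 0; i < nbuffers && num_dirty < max_buffers; i++)
	{
		(*visited)++;
		if (is_dirty[i])
			out_ids[num_dirty++] = i;
	}

	return num_dirty;
}
```

The work done under the lock is then bounded by how far into the list the maxpages-th dirty buffer sits, not by the size of shared_buffers. If few buffers are dirty, the scan can still cover the whole list.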

Opinions on which route is the best, or on some alternative solution? My inclination is toward #2, but I'm not dead-set on it.

-Neil

[1] http://archives.postgresql.org/pgsql-hackers/2004-12/msg00386.php
Index: doc/src/sgml/runtime.sgml
===================================================================
RCS file: /var/lib/cvs/pgsql/doc/src/sgml/runtime.sgml,v
retrieving revision 1.296
diff -c -r1.296 runtime.sgml
*** doc/src/sgml/runtime.sgml	13 Dec 2004 18:05:09 -0000	1.296
--- doc/src/sgml/runtime.sgml	14 Dec 2004 04:52:26 -0000
***************
*** 1350,1382 ****
          <para>
           Specifies the delay between activity rounds for the
           background writer.  In each round the writer issues writes
!          for some number of dirty buffers (controllable by the
!          following parameters).  The selected buffers will always be
!          the least recently used ones among the currently dirty
!          buffers.  It then sleeps for <varname>bgwriter_delay</>
!          milliseconds, and repeats.  The default value is 200. Note
!          that on many systems, the effective resolution of sleep
!          delays is 10 milliseconds; setting <varname>bgwriter_delay</>
!          to a value that is not a multiple of 10 may have the same
!          results as setting it to the next higher multiple of 10.
!          This option can only be set at server start or in the
!          <filename>postgresql.conf</filename> file.
!         </para>
!        </listitem>
!       </varlistentry>
! 
!       <varlistentry id="guc-bgwriter-percent" xreflabel="bgwriter_percent">
!        <term><varname>bgwriter_percent</varname> (<type>integer</type>)</term>
!        <indexterm>
!         <primary><varname>bgwriter_percent</> configuration parameter</primary>
!        </indexterm>
!        <listitem>
!         <para>
!          In each round, no more than this percentage of the currently
!          dirty buffers will be written (rounding up any fraction to
!          the next whole number of buffers).  The default value is
!          1. This option can only be set at server start or in the
!          <filename>postgresql.conf</filename> file.
          </para>
         </listitem>
        </varlistentry>
--- 1350,1367 ----
          <para>
           Specifies the delay between activity rounds for the
           background writer.  In each round the writer issues writes
!          for some number of dirty buffers (controllable by
!          <varname>bgwriter_maxpages</varname>).  The selected buffers
!          will always be the least recently used ones among the
!          currently dirty buffers.  It then sleeps for
!          <varname>bgwriter_delay</> milliseconds, and repeats.  The
!          default value is 200. Note that on many systems, the
!          effective resolution of sleep delays is 10 milliseconds;
!          setting <varname>bgwriter_delay</> to a value that is not a
!          multiple of 10 may have the same results as setting it to the
!          next higher multiple of 10.  This option can only be set at
!          server start or in the <filename>postgresql.conf</filename>
!          file.
          </para>
         </listitem>
        </varlistentry>
***************
*** 1398,1409 ****
       </variablelist>
  
       <para>
!       Smaller values of <varname>bgwriter_percent</varname> and
!       <varname>bgwriter_maxpages</varname> reduce the extra I/O load
!       caused by the background writer, but leave more work to be done
!       at checkpoint time.  To reduce load spikes at checkpoints,
!       increase the values.  To disable background writing entirely,
!       set <varname>bgwriter_percent</varname> and/or
        <varname>bgwriter_maxpages</varname> to zero.
       </para>
      </sect3>
--- 1383,1396 ----
       </variablelist>
  
       <para>
!       Decreasing <varname>bgwriter_maxpages</varname> or increasing
!       <varname>bgwriter_delay</varname> will reduce the extra I/O load
!       caused by the background writer, but will leave more work to be
!       done at checkpoint time. To reduce load spikes at checkpoints,
!       increase the number of pages written per round
!       (<varname>bgwriter_maxpages</varname>) or reduce the delay
!       between rounds (<varname>bgwriter_delay</varname>). To disable
!       background writing entirely, set
        <varname>bgwriter_maxpages</varname> to zero.
       </para>
      </sect3>
Index: src/backend/catalog/index.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/catalog/index.c,v
retrieving revision 1.242
diff -c -r1.242 index.c
*** src/backend/catalog/index.c	1 Dec 2004 19:00:39 -0000	1.242
--- src/backend/catalog/index.c	14 Dec 2004 04:32:39 -0000
***************
*** 1062,1068 ****
  		/* Send out shared cache inval if necessary */
  		if (!IsBootstrapProcessingMode())
  			CacheInvalidateHeapTuple(pg_class, tuple);
! 		BufferSync(-1, -1);
  	}
  	else if (dirty)
  	{
--- 1062,1068 ----
  		/* Send out shared cache inval if necessary */
  		if (!IsBootstrapProcessingMode())
  			CacheInvalidateHeapTuple(pg_class, tuple);
! 		BufferSync(-1);
  	}
  	else if (dirty)
  	{
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.147
diff -c -r1.147 dbcommands.c
*** src/backend/commands/dbcommands.c	18 Nov 2004 01:14:26 -0000	1.147
--- src/backend/commands/dbcommands.c	14 Dec 2004 04:40:19 -0000
***************
*** 332,338 ****
  	 * up-to-date for the copy.  (We really only need to flush buffers for
  	 * the source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync(-1, -1);
  
  	/*
  	 * Close virtual file descriptors so the kernel has more available for
--- 332,338 ----
  	 * up-to-date for the copy.  (We really only need to flush buffers for
  	 * the source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync(-1);
  
  	/*
  	 * Close virtual file descriptors so the kernel has more available for
***************
*** 1206,1212 ****
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync(-1, -1);
  
  #ifndef WIN32
  
--- 1206,1212 ----
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync(-1);
  
  #ifndef WIN32
  
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.11
diff -c -r1.11 bgwriter.c
*** src/backend/postmaster/bgwriter.c	5 Nov 2004 17:11:28 -0000	1.11
--- src/backend/postmaster/bgwriter.c	14 Dec 2004 04:44:26 -0000
***************
*** 116,122 ****
   * GUC parameters
   */
  int			BgWriterDelay = 200;
- int			BgWriterPercent = 1;
  int			BgWriterMaxPages = 100;
  
  int			CheckPointTimeout = 300;
--- 116,121 ----
***************
*** 372,378 ****
  			n = 1;
  		}
  		else
! 			n = BufferSync(BgWriterPercent, BgWriterMaxPages);
  
  		/*
  		 * Nap for the configured time or sleep for 10 seconds if there
--- 371,377 ----
  			n = 1;
  		}
  		else
! 			n = BufferSync(BgWriterMaxPages);
  
  		/*
  		 * Nap for the configured time or sleep for 10 seconds if there
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.182
diff -c -r1.182 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	24 Nov 2004 02:56:17 -0000	1.182
--- src/backend/storage/buffer/bufmgr.c	14 Dec 2004 04:40:18 -0000
***************
*** 671,717 ****
   *
   * This is called at checkpoint time to write out all dirty shared buffers,
   * and by the background writer process to write out some of the dirty blocks.
!  * percent/maxpages should be -1 in the former case, and limit values (>= 0)
   * in the latter.
   *
   * Returns the number of buffers written.
   */
  int
! BufferSync(int percent, int maxpages)
  {
  	BufferDesc **dirty_buffers;
  	BufferTag  *buftags;
  	int			num_buffer_dirty;
  	int			i;
  
! 	/* If either limit is zero then we are disabled from doing anything... */
! 	if (percent == 0 || maxpages == 0)
  		return 0;
  
  	/*
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
  	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
! 											   NBuffers);
! 
! 	/*
! 	 * If called by the background writer, we are usually asked to only
! 	 * write out some portion of dirty buffers now, to prevent the IO
! 	 * storm at checkpoint time.
! 	 */
! 	if (percent > 0)
! 	{
! 		Assert(percent <= 100);
! 		num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
! 	}
! 	if (maxpages > 0 && num_buffer_dirty > maxpages)
! 		num_buffer_dirty = maxpages;
  
  	/* Make sure we can handle the pin inside the loop */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 671,710 ----
   *
   * This is called at checkpoint time to write out all dirty shared buffers,
   * and by the background writer process to write out some of the dirty blocks.
!  * maxpages should be -1 in the former case, and a limit value (>= 0)
   * in the latter.
   *
   * Returns the number of buffers written.
   */
  int
! BufferSync(int maxpages)
  {
  	BufferDesc **dirty_buffers;
  	BufferTag  *buftags;
  	int			num_buffer_dirty;
  	int			i;
  
! 	/* If maxpages is zero then we're effectively disabled */
! 	if (maxpages == 0)
  		return 0;
  
+ 	/* If -1, flush all dirty buffers */
+ 	if (maxpages == -1)
+ 		maxpages = NBuffers;
+ 
  	/*
+ 	 * Get a list of up to "maxpages" dirty buffers, starting from LRU and
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(maxpages * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(maxpages * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
  	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
! 											   maxpages);
! 	Assert(num_buffer_dirty <= maxpages);
  
  	/* Make sure we can handle the pin inside the loop */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
***************
*** 947,953 ****
  void
  FlushBufferPool(void)
  {
! 	BufferSync(-1, -1);
  	smgrsync();
  }
  
--- 940,946 ----
  void
  FlushBufferPool(void)
  {
! 	BufferSync(-1);
  	smgrsync();
  }
  
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.48
diff -c -r1.48 freelist.c
*** src/backend/storage/buffer/freelist.c	16 Sep 2004 16:58:31 -0000	1.48
--- src/backend/storage/buffer/freelist.c	14 Dec 2004 04:22:02 -0000
***************
*** 753,810 ****
  	int			num_buffer_dirty = 0;
  	int			cdb_id_t1;
  	int			cdb_id_t2;
- 	int			buf_id;
- 	BufferDesc *buf;
  
  	/*
! 	 * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
! 	 * dirty buffers found in that order to the list. The ARC strategy
! 	 * keeps all used buffers including pinned ones in the T1 or T2 list.
! 	 * So we cannot miss any dirty buffers.
  	 */
  	cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
  	cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
  
  	while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
  	{
  		if (cdb_id_t1 >= 0)
  		{
  			buf_id = StrategyCDB[cdb_id_t1].buf_id;
- 			buf = &BufferDescriptors[buf_id];
- 
- 			if (buf->flags & BM_VALID)
- 			{
- 				if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
- 				{
- 					buffers[num_buffer_dirty] = buf;
- 					buftags[num_buffer_dirty] = buf->tag;
- 					num_buffer_dirty++;
- 					if (num_buffer_dirty >= max_buffers)
- 						break;
- 				}
- 			}
- 
  			cdb_id_t1 = StrategyCDB[cdb_id_t1].next;
  		}
! 
! 		if (cdb_id_t2 >= 0)
  		{
  			buf_id = StrategyCDB[cdb_id_t2].buf_id;
! 			buf = &BufferDescriptors[buf_id];
  
! 			if (buf->flags & BM_VALID)
  			{
! 				if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
! 				{
! 					buffers[num_buffer_dirty] = buf;
! 					buftags[num_buffer_dirty] = buf->tag;
! 					num_buffer_dirty++;
! 					if (num_buffer_dirty >= max_buffers)
! 						break;
! 				}
  			}
- 
- 			cdb_id_t2 = StrategyCDB[cdb_id_t2].next;
  		}
  	}
  
--- 753,797 ----
  	int			num_buffer_dirty = 0;
  	int			cdb_id_t1;
  	int			cdb_id_t2;
  
  	/*
! 	 * Traverse the T1 and T2 list from LRU to MRU in "parallel" and
! 	 * add all dirty buffers found in that order to the list. The ARC
! 	 * strategy keeps all used buffers including pinned ones in the T1
! 	 * or T2 list.  So we cannot miss any dirty buffers.
  	 */
  	cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
  	cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
  
  	while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
  	{
+ 		int			buf_id;
+ 		BufferDesc *buf;
+ 
  		if (cdb_id_t1 >= 0)
  		{
  			buf_id = StrategyCDB[cdb_id_t1].buf_id;
  			cdb_id_t1 = StrategyCDB[cdb_id_t1].next;
  		}
! 		else
  		{
+ 			Assert(cdb_id_t2 >= 0);
  			buf_id = StrategyCDB[cdb_id_t2].buf_id;
! 			cdb_id_t2 = StrategyCDB[cdb_id_t2].next;
! 		}
! 
! 		buf = &BufferDescriptors[buf_id];
  
! 		if (buf->flags & BM_VALID)
! 		{
! 			if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
  			{
! 				buffers[num_buffer_dirty] = buf;
! 				buftags[num_buffer_dirty] = buf->tag;
! 				num_buffer_dirty++;
! 				if (num_buffer_dirty >= max_buffers)
! 					break;
  			}
  		}
  	}
  
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.250
diff -c -r1.250 guc.c
*** src/backend/utils/misc/guc.c	24 Nov 2004 19:51:03 -0000	1.250
--- src/backend/utils/misc/guc.c	14 Dec 2004 04:44:40 -0000
***************
*** 1249,1263 ****
  	},
  
  	{
- 		{"bgwriter_percent", PGC_SIGHUP, RESOURCES,
- 			gettext_noop("Background writer percentage of dirty buffers to flush per round"),
- 			NULL
- 		},
- 		&BgWriterPercent,
- 		1, 0, 100, NULL, NULL
- 	},
- 
- 	{
  		{"bgwriter_maxpages", PGC_SIGHUP, RESOURCES,
  			gettext_noop("Background writer maximum number of pages to flush per round"),
  			NULL
--- 1249,1254 ----
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.134
diff -c -r1.134 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample	5 Nov 2004 19:16:16 -0000	1.134
--- src/backend/utils/misc/postgresql.conf.sample	14 Dec 2004 04:54:47 -0000
***************
*** 96,106 ****
  #vacuum_cost_page_dirty = 20	# 0-10000 credits
  #vacuum_cost_limit = 200	# 0-10000 credits
  
! # - Background writer -
  
  #bgwriter_delay = 200		# 10-10000 milliseconds between rounds
! #bgwriter_percent = 1		# 0-100% of dirty buffers in each round
! #bgwriter_maxpages = 100	# 0-1000 buffers max per round
  
  
  #---------------------------------------------------------------------------
--- 96,105 ----
  #vacuum_cost_page_dirty = 20	# 0-10000 credits
  #vacuum_cost_limit = 200	# 0-10000 credits
  
! # - Background Writer -
  
  #bgwriter_delay = 200		# 10-10000 milliseconds between rounds
! #bgwriter_maxpages = 100	# max buffers written per round, 0 disables
  
  
  #---------------------------------------------------------------------------
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /var/lib/cvs/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.3
diff -c -r1.3 bgwriter.h
*** src/include/postmaster/bgwriter.h	29 Aug 2004 04:13:09 -0000	1.3
--- src/include/postmaster/bgwriter.h	14 Dec 2004 04:44:44 -0000
***************
*** 18,24 ****
  
  /* GUC options */
  extern int	BgWriterDelay;
- extern int	BgWriterPercent;
  extern int	BgWriterMaxPages;
  extern int	CheckPointTimeout;
  extern int	CheckPointWarning;
--- 18,23 ----
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /var/lib/cvs/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.88
diff -c -r1.88 bufmgr.h
*** src/include/storage/bufmgr.h	16 Oct 2004 18:57:26 -0000	1.88
--- src/include/storage/bufmgr.h	14 Dec 2004 04:40:09 -0000
***************
*** 150,156 ****
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern int	BufferSync(int percent, int maxpages);
  
  extern void InitLocalBuffer(void);
  
--- 150,156 ----
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern int	BufferSync(int maxpages);
  
  extern void InitLocalBuffer(void);
  