Here's an updated WIP patch for load distributed checkpoints.

I added a spinlock to protect the signaling fields between bgwriter and backends. The current non-locking approach becomes hard to reason about as the patch adds two new flags, both of which matter more than the existing ckpt_time_warn flag.

In fact, I think there's a small race condition in CVS HEAD:

1. pg_start_backup() is called, which calls RequestCheckpoint
2. RequestCheckpoint takes note of the old value of ckpt_started
3. bgwriter wakes up from pg_usleep, and sees that we've exceeded checkpoint_timeout.
4. bgwriter increases ckpt_started to note that a new checkpoint has started
5. RequestCheckpoint signals bgwriter to start a new checkpoint
6. bgwriter calls CreateCheckpoint, with the force-flag set to false because this checkpoint was triggered by timeout
7. RequestCheckpoint sees that ckpt_started has increased, and starts to wait for ckpt_done to reach the new value.
8. CreateCheckpoint finishes immediately, because there was no XLOG activity since the last checkpoint.
9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns.
10. pg_start_backup() continues, with potentially the same redo location and thus history filename as previous backup.

Now I admit that the chances of that happening are extremely small; people don't usually issue two pg_start_backup calls without *any* WAL-logged activity in between them, for example. But as we add the new flags, avoiding scenarios like that becomes harder.

Since the last patch, I've done some cleanup and refactoring, added a bunch of comments, and written user documentation.

I haven't yet changed GetInsertRecPtr to use the almost-up-to-date value protected by info_lck, as Simon suggested, and I still need to do some correctness testing. After that, I'm done with the patch.

PS. In case you're wondering what took me so long since the last revision: I've spent a lot of time reviewing HOT.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml	7 Jun 2007 19:19:56 -0000	1.126
--- doc/src/sgml/config.sgml	19 Jun 2007 14:24:31 -0000
***************
*** 1565,1570 ****
--- 1565,1608 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-checkpoint-smoothing" xreflabel="checkpoint_smoothing">
+       <term><varname>checkpoint_smoothing</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_smoothing</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the target length of checkpoints, as a fraction of 
+         the checkpoint interval. The default is 0.3.
+ 
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-checkpoint-rate" xreflabel="checkpoint_rate">
+       <term><varname>checkpoint_rate</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_rate</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the minimum I/O rate used to flush dirty buffers during a
+         checkpoint, when there are few dirty buffers in the buffer cache.
+         The default is 512 KB/s.
+ 
+         Note: the accuracy of this setting depends on
+         <varname>bgwriter_delay</varname>. This value is converted internally
+         to pages / bgwriter_delay, so if for example the minimum allowed
+         bgwriter_delay setting of 10ms is used, the effective minimum
+         checkpoint I/O rate is 1 page / 10 ms, or 800 KB/s.
+ 
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
        <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
        <indexterm>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml	31 Jan 2007 20:56:19 -0000	1.43
--- doc/src/sgml/wal.sgml	19 Jun 2007 14:26:45 -0000
***************
*** 217,225 ****
    </para>
  
    <para>
     There will be at least one WAL segment file, and will normally
     not be more than 2 * <varname>checkpoint_segments</varname> + 1
!    files.  Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
     Ordinarily, when old log segment files are no longer needed, they
--- 217,245 ----
    </para>
  
    <para>
+    If there are a lot of dirty buffers in the buffer cache, flushing them
+    all at checkpoint time causes a heavy burst of I/O that can disrupt other
+    activity in the system. To avoid that, the checkpoint I/O can be distributed
+    over a longer period of time, defined with
+    <varname>checkpoint_smoothing</varname>. It's given as a fraction of the
+    checkpoint interval, as defined by <varname>checkpoint_timeout</varname>
+    and <varname>checkpoint_segments</varname>. The WAL segment consumption
+    and elapsed time are monitored and the I/O rate is adjusted during the
+    checkpoint so that it finishes when the given fraction of elapsed time
+    or WAL segments has passed, whichever comes first. However, that could lead
+    to unnecessarily prolonged checkpoints when there are few dirty buffers
+    in the cache. To avoid that, <varname>checkpoint_rate</varname> can be used
+    to set a minimum I/O rate. Note that prolonging checkpoints
+    affects recovery time: the longer the checkpoint takes, the more WAL
+    needs to be kept around and replayed in recovery.
+   </para>
+ 
+   <para>
     There will be at least one WAL segment file, and will normally
     not be more than 2 * <varname>checkpoint_segments</varname> + 1
!    files, though there can be more if a large 
!    <varname>checkpoint_smoothing</varname> setting is used.  
!    Each segment file is normally 16 MB (though this size can be
     altered when building the server).  You can use this to estimate space
     requirements for <acronym>WAL</acronym>.
     Ordinarily, when old log segment files are no longer needed, they
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c	31 May 2007 15:13:01 -0000	1.272
--- src/backend/access/transam/xlog.c	20 Jun 2007 10:44:40 -0000
***************
*** 398,404 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 1608,1614 ****
  						if (XLOG_DEBUG)
  							elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 						RequestCheckpoint(false, true);
  					}
  				}
  			}
--- 1608,1614 ----
  						if (XLOG_DEBUG)
  							elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 						RequestXLogFillCheckpoint();
  					}
  				}
  			}
***************
*** 5110,5116 ****
  		 * the rule that TLI only changes in shutdown checkpoints, which
  		 * allows some extra error checking in xlog_redo.
  		 */
! 		CreateCheckPoint(true, true);
  
  		/*
  		 * Close down recovery environment
--- 5110,5116 ----
  		 * the rule that TLI only changes in shutdown checkpoints, which
  		 * allows some extra error checking in xlog_redo.
  		 */
! 		CreateCheckPoint(true, true, true);
  
  		/*
  		 * Close down recovery environment
***************
*** 5319,5324 ****
--- 5319,5340 ----
  }
  
  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+ 	XLogCtlInsert  *Insert = &XLogCtl->Insert;
+ 	XLogRecPtr		recptr;
+ 
+ 	LWLockAcquire(WALInsertLock, LW_SHARED);
+ 	INSERT_RECPTR(recptr, Insert, Insert->curridx);
+ 	LWLockRelease(WALInsertLock);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
***************
*** 5383,5389 ****
  	ereport(LOG,
  			(errmsg("shutting down")));
  
! 	CreateCheckPoint(true, true);
  	ShutdownCLOG();
  	ShutdownSUBTRANS();
  	ShutdownMultiXact();
--- 5399,5405 ----
  	ereport(LOG,
  			(errmsg("shutting down")));
  
! 	CreateCheckPoint(true, true, true);
  	ShutdownCLOG();
  	ShutdownSUBTRANS();
  	ShutdownMultiXact();
***************
*** 5395,5405 ****
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool force)
  {
  	CheckPoint	checkPoint;
  	XLogRecPtr	recptr;
--- 5411,5424 ----
  /*
   * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
+  * If immediate is true, we try to finish the checkpoint as fast as we can,
+  * ignoring the checkpoint_smoothing parameter.
+  *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
  void
! CreateCheckPoint(bool shutdown, bool immediate, bool force)
  {
  	CheckPoint	checkPoint;
  	XLogRecPtr	recptr;
***************
*** 5591,5597 ****
  	 */
  	END_CRIT_SECTION();
  
! 	CheckPointGuts(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
--- 5610,5616 ----
  	 */
  	END_CRIT_SECTION();
  
! 	CheckPointGuts(checkPoint.redo, immediate);
  
  	START_CRIT_SECTION();
  
***************
*** 5693,5708 ****
  /*
   * Flush all data in shared memory to disk, and fsync
   *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 	FlushBufferPool();			/* performs all required fsyncs */
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
--- 5712,5730 ----
  /*
   * Flush all data in shared memory to disk, and fsync
   *
+  * If immediate is true, try to finish as quickly as possible, ignoring
+  * the GUC variables to throttle checkpoint I/O.
+  *
   * This is the common code shared between regular checkpoints and
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 	FlushBufferPool(immediate);		/* performs all required fsyncs */
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
***************
*** 5710,5716 ****
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckpoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
--- 5732,5738 ----
  /*
   * Set a recovery restart point if appropriate
   *
!  * This is similar to CreateCheckPoint, but is used during WAL recovery
   * to establish a point from which recovery can roll forward without
   * replaying the entire recovery log.  This function is called each time
   * a checkpoint record is read from XLOG; it must determine whether a
***************
*** 5751,5757 ****
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
--- 5773,5779 ----
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo, true);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
***************
*** 6177,6183 ****
  		 * have different checkpoint positions and hence different history
  		 * file names, even if nothing happened in between.
  		 */
! 		RequestCheckpoint(true, false);
  
  		/*
  		 * Now we need to fetch the checkpoint record location, and also its
--- 6199,6205 ----
  		 * have different checkpoint positions and hence different history
  		 * file names, even if nothing happened in between.
  		 */
! 		RequestLazyCheckpoint();
  
  		/*
  		 * Now we need to fetch the checkpoint record location, and also its
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.233
diff -c -r1.233 bootstrap.c
*** src/backend/bootstrap/bootstrap.c	7 Mar 2007 13:35:02 -0000	1.233
--- src/backend/bootstrap/bootstrap.c	19 Jun 2007 15:29:51 -0000
***************
*** 489,495 ****
  
  	/* Perform a checkpoint to ensure everything's down to disk */
  	SetProcessingMode(NormalProcessing);
! 	CreateCheckPoint(true, true);
  
  	/* Clean up and exit */
  	cleanup();
--- 489,495 ----
  
  	/* Perform a checkpoint to ensure everything's down to disk */
  	SetProcessingMode(NormalProcessing);
! 	CreateCheckPoint(true, true, true);
  
  	/* Clean up and exit */
  	cleanup();
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.195
diff -c -r1.195 dbcommands.c
*** src/backend/commands/dbcommands.c	1 Jun 2007 19:38:07 -0000	1.195
--- src/backend/commands/dbcommands.c	20 Jun 2007 09:36:24 -0000
***************
*** 404,410 ****
  	 * up-to-date for the copy.  (We really only need to flush buffers for the
  	 * source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync();
  
  	/*
  	 * Once we start copying subdirectories, we need to be able to clean 'em
--- 404,410 ----
  	 * up-to-date for the copy.  (We really only need to flush buffers for the
  	 * source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync(true);
  
  	/*
  	 * Once we start copying subdirectories, we need to be able to clean 'em
***************
*** 507,513 ****
  		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
  		 * we can avoid this.
  		 */
! 		RequestCheckpoint(true, false);
  
  		/*
  		 * Close pg_database, but keep lock till commit (this is important to
--- 507,513 ----
  		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
  		 * we can avoid this.
  		 */
! 		RequestImmediateCheckpoint();
  
  		/*
  		 * Close pg_database, but keep lock till commit (this is important to
***************
*** 661,667 ****
  	 * open files, which would cause rmdir() to fail.
  	 */
  #ifdef WIN32
! 	RequestCheckpoint(true, false);
  #endif
  
  	/*
--- 661,667 ----
  	 * open files, which would cause rmdir() to fail.
  	 */
  #ifdef WIN32
! 	RequestImmediateCheckpoint();
  #endif
  
  	/*
***************
*** 1427,1433 ****
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync();
  
  		/*
  		 * Copy this subdirectory to the new location
--- 1427,1433 ----
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync(true);
  
  		/*
  		 * Copy this subdirectory to the new location
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.38
diff -c -r1.38 bgwriter.c
*** src/backend/postmaster/bgwriter.c	27 May 2007 03:50:39 -0000	1.38
--- src/backend/postmaster/bgwriter.c	20 Jun 2007 12:58:20 -0000
***************
*** 44,49 ****
--- 44,50 ----
  #include "postgres.h"
  
  #include <signal.h>
+ #include <sys/time.h>
  #include <time.h>
  #include <unistd.h>
  
***************
*** 59,64 ****
--- 60,66 ----
  #include "storage/pmsignal.h"
  #include "storage/shmem.h"
  #include "storage/smgr.h"
+ #include "storage/spin.h"
  #include "tcop/tcopprot.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 112,122 ****
  {
  	pid_t		bgwriter_pid;	/* PID of bgwriter (0 if not started) */
  
! 	sig_atomic_t ckpt_started;	/* advances when checkpoint starts */
! 	sig_atomic_t ckpt_done;		/* advances when checkpoint done */
! 	sig_atomic_t ckpt_failed;	/* advances when checkpoint fails */
  
! 	sig_atomic_t ckpt_time_warn;	/* warn if too soon since last ckpt? */
  
  	int			num_requests;	/* current # of requests */
  	int			max_requests;	/* allocated array size */
--- 114,128 ----
  {
  	pid_t		bgwriter_pid;	/* PID of bgwriter (0 if not started) */
  
! 	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
  
! 	int			ckpt_started;	/* advances when checkpoint starts */
! 	int			ckpt_done;		/* advances when checkpoint done */
! 	int			ckpt_failed;	/* advances when checkpoint fails */
! 
! 	bool	ckpt_rqst_time_warn;	/* warn if too soon since last ckpt */
! 	bool	ckpt_rqst_immediate;	/* an immediate ckpt has been requested */
! 	bool	ckpt_rqst_force;		/* checkpoint even if no WAL activity */
  
  	int			num_requests;	/* current # of requests */
  	int			max_requests;	/* allocated array size */
***************
*** 131,136 ****
--- 137,143 ----
  int			BgWriterDelay = 200;
  int			CheckPointTimeout = 300;
  int			CheckPointWarning = 30;
+ double		CheckPointSmoothing = 0.3;
  
  /*
   * Flags set by interrupt handlers for later service in the main loop.
***************
*** 146,154 ****
--- 153,176 ----
  
  static bool ckpt_active = false;
  
+ /* Current time and WAL insert location when checkpoint was started */
+ static time_t		ckpt_start_time;
+ static XLogRecPtr	ckpt_start_recptr;
+ 
+ static double		ckpt_cached_elapsed;
+ 
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;
  
+ /* Prototypes for private functions */
+ 
+ static void RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force);
+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(void);
+ static bool IsCheckpointOnSchedule(double progress);
+ static bool ImmediateCheckpointRequested(void);
+ 
+ /* Signal handlers */
  
  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
***************
*** 170,175 ****
--- 192,198 ----
  
  	Assert(BgWriterShmem != NULL);
  	BgWriterShmem->bgwriter_pid = MyProcPid;
+ 	SpinLockInit(&BgWriterShmem->ckpt_lck);
  	am_bg_writer = true;
  
  	/*
***************
*** 281,288 ****
--- 304,314 ----
  			/* use volatile pointer to prevent code rearrangement */
  			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
  
+ 			SpinLockAcquire(&BgWriterShmem->ckpt_lck);
  			bgs->ckpt_failed++;
  			bgs->ckpt_done = bgs->ckpt_started;
+ 			SpinLockRelease(&bgs->ckpt_lck);
+ 
  			ckpt_active = false;
  		}
  
***************
*** 328,337 ****
  	for (;;)
  	{
  		bool		do_checkpoint = false;
- 		bool		force_checkpoint = false;
  		time_t		now;
  		int			elapsed_secs;
- 		long		udelay;
  
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
--- 354,361 ----
***************
*** 354,360 ****
  		{
  			checkpoint_requested = false;
  			do_checkpoint = true;
- 			force_checkpoint = true;
  			BgWriterStats.m_requested_checkpoints++;
  		}
  		if (shutdown_requested)
--- 378,383 ----
***************
*** 377,387 ****
  		 */
  		now = time(NULL);
  		elapsed_secs = now - last_checkpoint_time;
! 		if (elapsed_secs >= CheckPointTimeout)
  		{
  			do_checkpoint = true;
! 			if (!force_checkpoint)
! 				BgWriterStats.m_timed_checkpoints++;
  		}
  
  		/*
--- 400,409 ----
  		 */
  		now = time(NULL);
  		elapsed_secs = now - last_checkpoint_time;
! 		if (!do_checkpoint && elapsed_secs >= CheckPointTimeout)
  		{
  			do_checkpoint = true;
! 			BgWriterStats.m_timed_checkpoints++;
  		}
  
  		/*
***************
*** 390,395 ****
--- 412,445 ----
  		 */
  		if (do_checkpoint)
  		{
+ 			/* use volatile pointer to prevent code rearrangement */
+ 			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+ 			bool time_warn;
+ 			bool immediate;
+ 			bool force;
+ 
+ 			/*
+ 			 * Atomically check the request flags to figure out what
+ 			 * kind of a checkpoint we should perform, and increase the 
+ 			 * started-counter to acknowledge that we've started
+ 			 * a new checkpoint.
+ 			 */
+ 
+ 			SpinLockAcquire(&bgs->ckpt_lck);
+ 
+ 			time_warn = bgs->ckpt_rqst_time_warn;
+ 			bgs->ckpt_rqst_time_warn = false;
+ 
+ 			immediate = bgs->ckpt_rqst_immediate;
+ 			bgs->ckpt_rqst_immediate = false;
+ 
+ 			force = bgs->ckpt_rqst_force;
+ 			bgs->ckpt_rqst_force = false;
+ 
+ 			bgs->ckpt_started++;
+ 
+ 			SpinLockRelease(&bgs->ckpt_lck);
+ 
  			/*
  			 * We will warn if (a) too soon since last checkpoint (whatever
  			 * caused it) and (b) somebody has set the ckpt_time_warn flag
***************
*** 397,417 ****
  			 * implementation will not generate warnings caused by
  			 * CheckPointTimeout < CheckPointWarning.
  			 */
! 			if (BgWriterShmem->ckpt_time_warn &&
  				elapsed_secs < CheckPointWarning)
  				ereport(LOG,
  						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
  								elapsed_secs),
  						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
! 			BgWriterShmem->ckpt_time_warn = false;
  
  			/*
  			 * Indicate checkpoint start to any waiting backends.
  			 */
  			ckpt_active = true;
- 			BgWriterShmem->ckpt_started++;
  
! 			CreateCheckPoint(false, force_checkpoint);
  
  			/*
  			 * After any checkpoint, close all smgr files.	This is so we
--- 447,474 ----
  			 * implementation will not generate warnings caused by
  			 * CheckPointTimeout < CheckPointWarning.
  			 */
! 			if (time_warn &&
  				elapsed_secs < CheckPointWarning)
  				ereport(LOG,
  						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
  								elapsed_secs),
  						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
! 
  
  			/*
  			 * Indicate checkpoint start to any waiting backends.
  			 */
  			ckpt_active = true;
  
! 			ckpt_start_recptr = GetInsertRecPtr();
! 			ckpt_start_time = now;
! 			ckpt_cached_elapsed = 0;
! 
! 			elog(DEBUG1, "CHECKPOINT: start");
! 
! 			CreateCheckPoint(false, immediate, force);
! 
! 			elog(DEBUG1, "CHECKPOINT: end");
  
  			/*
  			 * After any checkpoint, close all smgr files.	This is so we
***************
*** 422,428 ****
  			/*
  			 * Indicate checkpoint completion to any waiting backends.
  			 */
! 			BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
  			ckpt_active = false;
  
  			/*
--- 479,487 ----
  			/*
  			 * Indicate checkpoint completion to any waiting backends.
  			 */
! 			SpinLockAcquire(&bgs->ckpt_lck);
! 			bgs->ckpt_done = bgs->ckpt_started;
! 			SpinLockRelease(&bgs->ckpt_lck);
  			ckpt_active = false;
  
  			/*
***************
*** 433,446 ****
  			last_checkpoint_time = now;
  		}
  		else
! 			BgBufferSync();
  
  		/*
! 		 * Check for archive_timeout, if so, switch xlog files.  First we do a
! 		 * quick check using possibly-stale local state.
  		 */
! 		if (XLogArchiveTimeout > 0 &&
! 			(int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
  		{
  			/*
  			 * Update local state ... note that last_xlog_switch_time is the
--- 492,530 ----
  			last_checkpoint_time = now;
  		}
  		else
! 		{
! 			BgAllSweep();
! 			BgLruSweep();
! 		}
  
  		/*
! 		 * Check for archive_timeout and switch xlog files if necessary.
  		 */
! 		CheckArchiveTimeout();
! 
! 		/* Nap for the configured time. */
! 		BgWriterNap();
! 	}
! }
! 
! /*
!  * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
!  *		if needed
!  */
! static void
! CheckArchiveTimeout(void)
! {
! 	time_t		now;
! 
! 	if (XLogArchiveTimeout <= 0)
! 		return;
! 
! 	now = time(NULL);
! 
! 	/* First we do a quick check using possibly-stale local state. */
! 	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
! 		return;
! 
  		{
  			/*
  			 * Update local state ... note that last_xlog_switch_time is the
***************
*** 450,459 ****
  
  			last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
  
- 			/* if we did a checkpoint, 'now' might be stale too */
- 			if (do_checkpoint)
- 				now = time(NULL);
- 
  			/* Now we can do the real check */
  			if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
  			{
--- 534,539 ----
***************
*** 478,483 ****
--- 558,572 ----
  				last_xlog_switch_time = now;
  			}
  		}
+ }
+ 
+ /*
+  * BgWriterNap -- Nap for the configured time or until a signal is received.
+  */
+ static void
+ BgWriterNap(void)
+ {
+ 	long		udelay;
  
  		/*
  		 * Send off activity statistics to the stats collector
***************
*** 496,502 ****
  		 * We absorb pending requests after each short sleep.
  		 */
  		if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
! 			(bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0))
  			udelay = BgWriterDelay * 1000L;
  		else if (XLogArchiveTimeout > 0)
  			udelay = 1000000L;	/* One second */
--- 585,592 ----
  		 * We absorb pending requests after each short sleep.
  		 */
  		if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
! 			(bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0) ||
! 			ckpt_active)
  			udelay = BgWriterDelay * 1000L;
  		else if (XLogArchiveTimeout > 0)
  			udelay = 1000000L;	/* One second */
***************
*** 505,522 ****
  
  		while (udelay > 999999L)
  		{
! 			if (got_SIGHUP || checkpoint_requested || shutdown_requested)
  				break;
  			pg_usleep(1000000L);
  			AbsorbFsyncRequests();
  			udelay -= 1000000L;
  		}
  
! 		if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
  			pg_usleep(udelay);
  	}
  }
  
  
  /* --------------------------------
   *		signal handler routines
--- 595,766 ----
  
  		while (udelay > 999999L)
  		{
! 			/* If a checkpoint is active, postpone reloading the config
! 			 * until the checkpoint is finished, and ignore
! 			 * non-immediate checkpoint requests.
! 			 */
! 			if (shutdown_requested ||
! 				(!ckpt_active && (got_SIGHUP || checkpoint_requested)) ||
! 				(ckpt_active && ImmediateCheckpointRequested()))
  				break;
+ 
  			pg_usleep(1000000L);
  			AbsorbFsyncRequests();
  			udelay -= 1000000L;
  		}
  
! 
! 		if (!(shutdown_requested ||
! 			  (!ckpt_active && (got_SIGHUP || checkpoint_requested)) ||
! 			  (ckpt_active && ImmediateCheckpointRequested())))
  			pg_usleep(udelay);
+ }
+ 
+ /*
+  * Returns true if an immediate checkpoint request is pending.
+  */
+ static bool
+ ImmediateCheckpointRequested(void)
+ {
+ 	if (checkpoint_requested)
+ 	{
+ 		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+ 
+ 		/*
+ 		 * We're only looking at a single field, so we don't need to
+ 		 * acquire the lock in this case.
+ 		 */
+ 		if (bgs->ckpt_rqst_immediate)
+ 			return true;
  	}
+ 	return false;
  }
  
+ /*
+  * CheckpointWriteDelay -- periodic sleep in checkpoint write phase
+  *
+  * During checkpoint, this is called periodically by the buffer manager while 
+  * writing out dirty buffers from the shared buffer cache. We estimate whether
+  * we've made enough progress to finish this checkpoint in time
+  * before the next one is due, taking checkpoint_smoothing into account.
+  * If so, we perform one round of normal bgwriter activity including LRU-
+  * cleaning of buffer cache, switching xlog segment if archive_timeout has
+  * passed, and sleeping for BgWriterDelay msecs.
+  *
+  * 'progress' is an estimate of how much of the writes has been done, as a
+  * fraction between 0.0 meaning none, and 1.0 meaning all done.
+  */
+ void
+ CheckpointWriteDelay(double progress)
+ {
+ 	/*
+ 	 * Return immediately if we should finish the checkpoint ASAP.
+ 	 */
+ 	if (!am_bg_writer || CheckPointSmoothing <= 0 || shutdown_requested ||
+ 		ImmediateCheckpointRequested())
+ 		return;
+ 
+ 	elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f", progress);
+ 
+ 	/* Take a nap and perform the usual bgwriter duties, unless we're behind
+ 	 * schedule, in which case we just try to catch up as quickly as possible.
+ 	 */
+ 	if (IsCheckpointOnSchedule(progress))
+ 	{
+ 		CheckArchiveTimeout();
+ 		BgLruSweep();
+ 		BgWriterNap();
+ 	}
+ }
+ 
+ /*
+  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+  *		 in time?
+  *
+  * Compares the current progress against the time/segments elapsed since last
+  * checkpoint, and returns true if the progress we've made this far is greater
+  * than the elapsed time/segments.
+  *
+  * If another checkpoint has already been requested, always return false.
+  */
+ static bool
+ IsCheckpointOnSchedule(double progress)
+ {
+ 	struct timeval	now;
+ 	XLogRecPtr		recptr;
+ 	double			progress_in_time,
+ 					progress_in_xlog;
+ 
+ 	Assert(ckpt_active);
+ 
+ 	/* scale progress according to CheckPointSmoothing */
+ 	progress *= CheckPointSmoothing;
+ 
+ 	/*
+ 	 * Check against the cached value first. Only do the more expensive 
+ 	 * calculations once we reach the target previously calculated. Since
+ 	 * neither time nor the WAL insert pointer moves backwards, a freshly
+ 	 * calculated value can only be greater than or equal to the cached value.
+ 	 */
+ 	if (progress < ckpt_cached_elapsed)
+ 	{
+ 		elog(DEBUG2, "IsCheckpointOnSchedule: Still behind cached=%.3f, progress=%.3f",
+ 			 ckpt_cached_elapsed, progress);
+ 		return false;
+ 	}
+ 
+ 	gettimeofday(&now, NULL);
+ 	
+ 	/*
+ 	 * Check progress against time elapsed and checkpoint_timeout.
+ 	 */
+ 	progress_in_time = ((double) (now.tv_sec - ckpt_start_time) +
+ 		now.tv_usec / 1000000.0) / CheckPointTimeout;
+ 
+ 	if (progress < progress_in_time)
+ 	{
+ 		elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_timeout, time=%.3f, progress=%.3f",
+ 			 progress_in_time, progress);
+ 
+ 		ckpt_cached_elapsed = progress_in_time;
+ 
+ 		return false;
+ 	}
+ 
+ 	/*
+ 	 * Check progress against WAL segments written and checkpoint_segments.
+ 	 *
+ 	 * We compare the current WAL insert location against the location 
+ 	 * computed before calling CreateCheckPoint. The code in XLogInsert that
+ 	 * actually triggers a checkpoint when checkpoint_segments is exceeded
+ 	 * compares against RedoRecptr, so this is not completely accurate.
+ 	 * However, it's good enough for our purposes; we're only calculating
+ 	 * an estimate anyway.
+ 	 */
+ 	recptr = GetInsertRecPtr();
+ 	progress_in_xlog =
+ 		(((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+ 		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+ 		CheckPointSegments;
+ 
+ 	if (progress < progress_in_xlog)
+ 	{
+ 		elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_segments, xlog=%.3f, progress=%.3f",
+ 			 progress_in_xlog, progress);
+ 
+ 		ckpt_cached_elapsed = progress_in_xlog;
+ 
+ 		return false;
+ 	}
+ 
+ 
+ 	/* It looks like we're on schedule. */
+ 
+ 	elog(DEBUG2, "IsCheckpointOnSchedule: on schedule, time=%.3f, xlog=%.3f progress=%.3f",
+ 		 progress_in_time, progress_in_xlog, progress);
+ 
+ 	return true;
+ }
  
  /* --------------------------------
   *		signal handler routines
***************
*** 618,625 ****
  }
  
  /*
   * RequestCheckpoint
!  *		Called in backend processes to request an immediate checkpoint
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
--- 862,910 ----
  }
  
  /*
+  * RequestImmediateCheckpoint
+  *		Called in backend processes to request an immediate checkpoint.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestImmediateCheckpoint(void)
+ {
+ 	RequestCheckpoint(true, false, true, true);
+ }
+ 
+ /*
+  * RequestLazyCheckpoint
+  *		Called in backend processes to request a lazy checkpoint.
+  *
+  * This is essentially the same as RequestImmediateCheckpoint, except
+  * that this form obeys the checkpoint_smoothing GUC variable, and
+  * can therefore take much longer.
+  *
+  * Returns when the checkpoint is finished.
+  */
+ void
+ RequestLazyCheckpoint(void)
+ {
+ 	RequestCheckpoint(true, false, false, true);
+ }
+ 
+ /*
+  * RequestXLogFillCheckpoint
+  *		Signals the bgwriter that we've reached checkpoint_segments
+  *
+  * Unlike RequestImmediateCheckpoint and RequestLazyCheckpoint, this
+  * returns immediately without waiting for the checkpoint to finish.
+  */
+ void
+ RequestXLogFillCheckpoint(void)
+ {
+ 	RequestCheckpoint(false, true, false, false);
+ }
+ 
+ /*
   * RequestCheckpoint
!  *		Common subroutine for all the above Request*Checkpoint variants.
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
***************
*** 628,648 ****
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
   */
! void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! 	sig_atomic_t old_failed = bgs->ckpt_failed;
! 	sig_atomic_t old_started = bgs->ckpt_started;
  
  	/*
  	 * If in a standalone backend, just do it ourselves.
  	 */
  	if (!IsPostmasterEnvironment)
  	{
! 		CreateCheckPoint(false, true);
  
  		/*
  		 * After any checkpoint, close all smgr files.	This is so we won't
--- 913,942 ----
   * If warnontime is true, and it's "too soon" since the last checkpoint,
   * the bgwriter will log a warning.  This should be true only for checkpoints
   * caused due to xlog filling, else the warning will be misleading.
+  *
+  * If immediate is true, the checkpoint should be finished ASAP.
+  *
+  * If force is true, force a checkpoint even if no XLOG activity has occurred
+  * since the last one.
   */
! static void
! RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! 	int old_failed, old_started;
  
  	/*
  	 * If in a standalone backend, just do it ourselves.
  	 */
  	if (!IsPostmasterEnvironment)
  	{
! 		/*
! 		 * There's no point in doing lazy checkpoints in a standalone
! 		 * backend, because there are no other backends the checkpoint could
! 		 * disrupt.
! 		 */
! 		CreateCheckPoint(false, true, true);
  
  		/*
  		 * After any checkpoint, close all smgr files.	This is so we won't
***************
*** 653,661 ****
  		return;
  	}
  
! 	/* Set warning request flag if appropriate */
  	if (warnontime)
! 		bgs->ckpt_time_warn = true;
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
--- 947,974 ----
  		return;
  	}
  
! 	/*
! 	 * Atomically set the request flags, and take a snapshot of the counters.
! 	 * This ensures that when we see that ckpt_started > old_started,
! 	 * we know the flags we set here have been seen by bgwriter.
! 	 *
! 	 * Note that we effectively OR the flags with any existing flags, to
! 	 * avoid overriding a "stronger" request by another backend.
! 	 */
! 	SpinLockAcquire(&bgs->ckpt_lck);
! 
! 	old_failed = bgs->ckpt_failed;
! 	old_started = bgs->ckpt_started;
! 
! 	/* Set request flags as appropriate */
  	if (warnontime)
! 		bgs->ckpt_rqst_time_warn = true;
! 	if (immediate)
! 		bgs->ckpt_rqst_immediate = true;
! 	if (force)
! 		bgs->ckpt_rqst_force = true;
! 
! 	SpinLockRelease(&bgs->ckpt_lck);
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
***************
*** 674,701 ****
  	 */
  	if (waitforit)
  	{
! 		while (bgs->ckpt_started == old_started)
  		{
  			CHECK_FOR_INTERRUPTS();
  			pg_usleep(100000L);
  		}
- 		old_started = bgs->ckpt_started;
  
  		/*
! 		 * We are waiting for ckpt_done >= old_started, in a modulo sense.
! 		 * This is a little tricky since we don't know the width or signedness
! 		 * of sig_atomic_t.  We make the lowest common denominator assumption
! 		 * that it is only as wide as "char".  This means that this algorithm
! 		 * will cope correctly as long as we don't sleep for more than 127
! 		 * completed checkpoints.  (If we do, we will get another chance to
! 		 * exit after 128 more checkpoints...)
  		 */
! 		while (((signed char) (bgs->ckpt_done - old_started)) < 0)
  		{
  			CHECK_FOR_INTERRUPTS();
  			pg_usleep(100000L);
  		}
! 		if (bgs->ckpt_failed != old_failed)
  			ereport(ERROR,
  					(errmsg("checkpoint request failed"),
  					 errhint("Consult recent messages in the server log for details.")));
--- 987,1031 ----
  	 */
  	if (waitforit)
  	{
! 		int new_started, new_failed;
! 
! 		/* Wait for a new checkpoint to start. */
! 		for (;;)
  		{
+ 			SpinLockAcquire(&bgs->ckpt_lck);
+ 			new_started = bgs->ckpt_started;
+ 			SpinLockRelease(&bgs->ckpt_lck);
+ 
+ 			if (new_started != old_started)
+ 				break;
+ 
  			CHECK_FOR_INTERRUPTS();
  			pg_usleep(100000L);
  		}
  
  		/*
! 		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
! 		 * This algorithm will cope correctly as long as we don't sleep for
! 		 * more than INT_MAX completed checkpoints.  (If we do, we will get
! 		 * another chance to exit after INT_MAX more checkpoints...)
  		 */
! 		for (;;)
  		{
+ 			int new_done;
+ 
+ 			SpinLockAcquire(&bgs->ckpt_lck);
+ 			new_done = bgs->ckpt_done;
+ 			new_failed = bgs->ckpt_failed;
+ 			SpinLockRelease(&bgs->ckpt_lck);
+ 
+ 			if (new_done - new_started >= 0)
+ 				break;
+ 
  			CHECK_FOR_INTERRUPTS();
  			pg_usleep(100000L);
  		}
! 
! 		if (new_failed != old_failed)
  			ereport(ERROR,
  					(errmsg("checkpoint request failed"),
  					 errhint("Consult recent messages in the server log for details.")));
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.220
diff -c -r1.220 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	30 May 2007 20:11:58 -0000	1.220
--- src/backend/storage/buffer/bufmgr.c	20 Jun 2007 12:47:43 -0000
***************
*** 32,38 ****
   *
   * BufferSync() -- flush all dirty buffers in the buffer pool.
   *
!  * BgBufferSync() -- flush some dirty buffers in the buffer pool.
   *
   * InitBufferPool() -- Init the buffer module.
   *
--- 32,40 ----
   *
   * BufferSync() -- flush all dirty buffers in the buffer pool.
   *
!  * BgAllSweep() -- write out some dirty buffers in the pool.
!  *
!  * BgLruSweep() -- write out some lru dirty buffers in the pool.
   *
   * InitBufferPool() -- Init the buffer module.
   *
***************
*** 74,79 ****
--- 76,82 ----
  double		bgwriter_all_percent = 0.333;
  int			bgwriter_lru_maxpages = 5;
  int			bgwriter_all_maxpages = 5;
+ int			checkpoint_rate = 512; /* in pages/s */
  
  
  long		NDirectFileRead;	/* some I/O's are direct file access. bypass
***************
*** 645,651 ****
  	 * at 1 so that the buffer can survive one clock-sweep pass.)
  	 */
  	buf->tag = newTag;
! 	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
  	buf->flags |= BM_TAG_VALID;
  	buf->usage_count = 1;
  
--- 648,654 ----
  	 * at 1 so that the buffer can survive one clock-sweep pass.)
  	 */
  	buf->tag = newTag;
! 	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR);
  	buf->flags |= BM_TAG_VALID;
  	buf->usage_count = 1;
  
***************
*** 1000,1037 ****
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(void)
  {
! 	int			buf_id;
  	int			num_to_scan;
  	int			absorb_counter;
  
  	/*
  	 * Find out where to start the circular scan.
  	 */
! 	buf_id = StrategySyncStart();
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  
  	/*
! 	 * Loop over all buffers.
  	 */
  	num_to_scan = NBuffers;
  	absorb_counter = WRITES_PER_ABSORB;
  	while (num_to_scan-- > 0)
  	{
! 		if (SyncOneBuffer(buf_id, false))
  		{
  			BgWriterStats.m_buf_written_checkpoints++;
  
  			/*
  			 * If in bgwriter, absorb pending fsync requests after each
  			 * WRITES_PER_ABSORB write operations, to prevent overflow of the
  			 * fsync request queue.  If not in bgwriter process, this is a
  			 * no-op.
  			 */
  			if (--absorb_counter <= 0)
  			{
--- 1003,1127 ----
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
+  * If 'immediate' is true, write them all ASAP, otherwise throttle the
+  * I/O rate according to the checkpoint_rate GUC variable, and perform
+  * normal bgwriter duties periodically.
   */
  void
! BufferSync(bool immediate)
  {
! 	int			buf_id, start_id;
  	int			num_to_scan;
+ 	int			num_to_write;
+ 	int			num_written;
  	int			absorb_counter;
+ 	int			num_written_since_nap;
+ 	int			writes_per_nap;
+ 
+ 	/*
+ 	 * Convert checkpoint_rate (pages/s) to the number of writes to perform
+ 	 * in each period of BgWriterDelay (milliseconds). The result is an
+ 	 * integer, so we lose some precision. Many other factors also affect
+ 	 * the real rate: the granularity of the OS timer used for BgWriterDelay,
+ 	 * whether any of the writes block, and time spent in CheckpointWriteDelay
+ 	 * performing normal bgwriter duties.
+ 	 */
+ 	writes_per_nap = Max(1, checkpoint_rate * BgWriterDelay / 1000);
  
  	/*
  	 * Find out where to start the circular scan.
  	 */
! 	start_id = StrategySyncStart();
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  
  	/*
! 	 * Loop over all buffers, and mark the ones that need to be written with
! 	 * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
! 	 * can estimate how much work needs to be done.
! 	 *
! 	 * This allows us to only write those pages that were dirty when the
! 	 * checkpoint began, and haven't been flushed to disk since. Whenever a
! 	 * page with BM_CHECKPOINT_NEEDED is written out by normal backends or
! 	 * the bgwriter LRU-scan, the flag is cleared, and any pages dirtied after
! 	 * this scan don't have the flag set.
! 	 */
! 	num_to_scan = NBuffers;
! 	num_to_write = 0;
! 	buf_id = start_id;
! 	while (num_to_scan-- > 0)
! 	{
! 		volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
! 
! 		/*
! 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
! 		 * SyncOneBuffer.
! 		 */
! 		LockBufHdr(bufHdr);
! 
! 		if (bufHdr->flags & BM_DIRTY)
! 		{
! 			bufHdr->flags |= BM_CHECKPOINT_NEEDED;
! 			num_to_write++;
! 		}
! 
! 		UnlockBufHdr(bufHdr);
! 
! 		if (++buf_id >= NBuffers)
! 			buf_id = 0;
! 	}
! 
! 	elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers);
! 
! 	/*
! 	 * Loop over all buffers again, and write the ones (still) marked with
! 	 * BM_CHECKPOINT_NEEDED.
  	 */
  	num_to_scan = NBuffers;
+ 	num_written = num_written_since_nap = 0;
  	absorb_counter = WRITES_PER_ABSORB;
+ 	buf_id = start_id;
  	while (num_to_scan-- > 0)
  	{
! 		volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
! 		bool needs_flush;
! 
! 		/* No need to acquire the buffer header lock here, since we're
! 		 * only looking at a single bit. It's possible that someone
! 		 * else writes the buffer and clears the flag right after we
! 		 * check, but that doesn't matter. This assumes that no one
! 		 * clears the flag and sets it again while holding the buffer
! 		 * header lock, expecting no one to see the intermediary state.
! 		 */
! 		needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0;
! 
! 		if (needs_flush && SyncOneBuffer(buf_id, false))
  		{
  			BgWriterStats.m_buf_written_checkpoints++;
+ 			num_written++;
+ 
+ 			/*
+ 			 * Perform normal bgwriter duties and sleep to throttle
+ 			 * our I/O rate.
+ 			 */
+ 			if (!immediate && ++num_written_since_nap >= writes_per_nap)
+ 			{
+ 				num_written_since_nap = 0;
+ 				CheckpointWriteDelay((double) (num_written) / num_to_write);
+ 			}
  
  			/*
  			 * If in bgwriter, absorb pending fsync requests after each
  			 * WRITES_PER_ABSORB write operations, to prevent overflow of the
  			 * fsync request queue.  If not in bgwriter process, this is a
  			 * no-op.
+ 			 *
+ 			 * AbsorbFsyncRequests is also called inside CheckpointWriteDelay,
+ 			 * so this is partially redundant. However, we can't totally trust
+ 			 * on the call in CheckpointWriteDelay, because it's only made
+ 			 * before sleeping. In case CheckpointWriteDelay doesn't sleep,
+ 			 * we need to absorb pending requests ourselves.
  			 */
  			if (--absorb_counter <= 0)
  			{
***************
*** 1045,1059 ****
  }
  
  /*
!  * BgBufferSync -- Write out some dirty buffers in the pool.
   *
   * This is called periodically by the background writer process.
   */
  void
! BgBufferSync(void)
  {
  	static int	buf_id1 = 0;
- 	int			buf_id2;
  	int			num_to_scan;
  	int			num_written;
  
--- 1135,1152 ----
  }
  
  /*
!  * BgAllSweep -- Write out some dirty buffers in the pool.
   *
+  * Runs the bgwriter all-sweep algorithm to write dirty buffers to
+  * minimize work at checkpoint time.
   * This is called periodically by the background writer process.
+  *
+  * XXX: Is this really needed with load distributed checkpoints?
   */
  void
! BgAllSweep(void)
  {
  	static int	buf_id1 = 0;
  	int			num_to_scan;
  	int			num_written;
  
***************
*** 1063,1072 ****
  	/*
  	 * To minimize work at checkpoint time, we want to try to keep all the
  	 * buffers clean; this motivates a scan that proceeds sequentially through
! 	 * all buffers.  But we are also charged with ensuring that buffers that
! 	 * will be recycled soon are clean when needed; these buffers are the ones
! 	 * just ahead of the StrategySyncStart point.  We make a separate scan
! 	 * through those.
  	 */
  
  	/*
--- 1156,1162 ----
  	/*
  	 * To minimize work at checkpoint time, we want to try to keep all the
  	 * buffers clean; this motivates a scan that proceeds sequentially through
! 	 * all buffers. 
  	 */
  
  	/*
***************
*** 1098,1103 ****
--- 1188,1210 ----
  		}
  		BgWriterStats.m_buf_written_all += num_written;
  	}
+ }
+ 
+ /*
+  * BgLruSweep -- Write out some lru dirty buffers in the pool.
+  */
+ void
+ BgLruSweep(void)
+ {
+ 	int			buf_id2;
+ 	int			num_to_scan;
+ 	int			num_written;
+ 
+ 	/*
+ 	 * The purpose of this sweep is to ensure that buffers that
+ 	 * will be recycled soon are clean when needed; these buffers are the ones
+ 	 * just ahead of the StrategySyncStart point. 
+ 	 */
  
  	/*
  	 * This loop considers only unpinned buffers close to the clock sweep
***************
*** 1341,1349 ****
   * flushed.
   */
  void
! FlushBufferPool(void)
  {
! 	BufferSync();
  	smgrsync();
  }
  
--- 1448,1459 ----
   * flushed.
   */
  void
! FlushBufferPool(bool immediate)
  {
! 	elog(DEBUG1, "CHECKPOINT: write phase");
! 	BufferSync(immediate || CheckPointSmoothing <= 0);
! 
! 	elog(DEBUG1, "CHECKPOINT: sync phase");
  	smgrsync();
  }
  
***************
*** 2132,2138 ****
  	Assert(buf->flags & BM_IO_IN_PROGRESS);
  	buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
  	if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
! 		buf->flags &= ~BM_DIRTY;
  	buf->flags |= set_flag_bits;
  
  	UnlockBufHdr(buf);
--- 2242,2248 ----
  	Assert(buf->flags & BM_IO_IN_PROGRESS);
  	buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
  	if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
! 		buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
  	buf->flags |= set_flag_bits;
  
  	UnlockBufHdr(buf);
Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.280
diff -c -r1.280 utility.c
*** src/backend/tcop/utility.c	30 May 2007 20:12:01 -0000	1.280
--- src/backend/tcop/utility.c	20 Jun 2007 09:36:31 -0000
***************
*** 1089,1095 ****
  				ereport(ERROR,
  						(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  						 errmsg("must be superuser to do CHECKPOINT")));
! 			RequestCheckpoint(true, false);
  			break;
  
  		case T_ReindexStmt:
--- 1089,1095 ----
  				ereport(ERROR,
  						(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  						 errmsg("must be superuser to do CHECKPOINT")));
! 			RequestImmediateCheckpoint();
  			break;
  
  		case T_ReindexStmt:
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.396
diff -c -r1.396 guc.c
*** src/backend/utils/misc/guc.c	8 Jun 2007 18:23:52 -0000	1.396
--- src/backend/utils/misc/guc.c	20 Jun 2007 10:14:06 -0000
***************
*** 1487,1492 ****
--- 1487,1503 ----
  		30, 0, INT_MAX, NULL, NULL
  	},
  
+ 
+ 	{
+ 		{"checkpoint_rate", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Minimum I/O rate used to write dirty buffers during checkpoints."),
+ 			NULL,
+ 			GUC_UNIT_BLOCKS
+ 		},
+ 		&checkpoint_rate,
+ 		512, 0, 100000, NULL, NULL
+ 	},
+ 
  	{
  		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
  			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
***************
*** 1866,1871 ****
--- 1877,1891 ----
  		0.1, 0.0, 100.0, NULL, NULL
  	},
  
+ 	{
+ 		{"checkpoint_smoothing", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."),
+ 			NULL
+ 		},
+ 		&CheckPointSmoothing,
+ 		0.3, 0.0, 0.9, NULL, NULL
+ 	},
+ 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.216
diff -c -r1.216 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample	3 Jun 2007 17:08:15 -0000	1.216
--- src/backend/utils/misc/postgresql.conf.sample	20 Jun 2007 10:03:17 -0000
***************
*** 168,173 ****
--- 168,175 ----
  
  #checkpoint_segments = 3		# in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min		# range 30s-1h
+ #checkpoint_smoothing = 0.3		# fraction of checkpoint interval, range 0.0-0.9
+ #checkpoint_rate = 512			# min. checkpoint write rate, in pages/s
  #checkpoint_warning = 30s		# 0 is off
  
  # - Archiving -
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.78
diff -c -r1.78 xlog.h
*** src/include/access/xlog.h	30 May 2007 20:12:02 -0000	1.78
--- src/include/access/xlog.h	19 Jun 2007 14:10:07 -0000
***************
*** 171,179 ****
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  
  #endif   /* XLOG_H */
--- 171,180 ----
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool immediate, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  
  #endif   /* XLOG_H */
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h	5 Jan 2007 22:19:57 -0000	1.9
--- src/include/postmaster/bgwriter.h	20 Jun 2007 09:27:20 -0000
***************
*** 20,29 ****
  extern int	BgWriterDelay;
  extern int	CheckPointTimeout;
  extern int	CheckPointWarning;
  
  extern void BackgroundWriterMain(void);
  
! extern void RequestCheckpoint(bool waitforit, bool warnontime);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
--- 20,33 ----
  extern int	BgWriterDelay;
  extern int	CheckPointTimeout;
  extern int	CheckPointWarning;
+ extern double CheckPointSmoothing;
  
  extern void BackgroundWriterMain(void);
  
! extern void RequestImmediateCheckpoint(void);
! extern void RequestLazyCheckpoint(void);
! extern void RequestXLogFillCheckpoint(void);
! extern void CheckpointWriteDelay(double progress);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.90
diff -c -r1.90 buf_internals.h
*** src/include/storage/buf_internals.h	30 May 2007 20:12:03 -0000	1.90
--- src/include/storage/buf_internals.h	12 Jun 2007 11:42:23 -0000
***************
*** 35,40 ****
--- 35,41 ----
  #define BM_IO_ERROR				(1 << 4)		/* previous I/O failed */
  #define BM_JUST_DIRTIED			(1 << 5)		/* dirtied since write started */
  #define BM_PIN_COUNT_WAITER		(1 << 6)		/* have waiter for sole pin */
+ #define BM_CHECKPOINT_NEEDED	(1 << 7)		/* this needs to be written in checkpoint */
  
  typedef bits16 BufFlags;
  
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.104
diff -c -r1.104 bufmgr.h
*** src/include/storage/bufmgr.h	30 May 2007 20:12:03 -0000	1.104
--- src/include/storage/bufmgr.h	20 Jun 2007 10:28:43 -0000
***************
*** 36,41 ****
--- 36,42 ----
  extern double bgwriter_all_percent;
  extern int	bgwriter_lru_maxpages;
  extern int	bgwriter_all_maxpages;
+ extern int	checkpoint_rate;
  
  /* in buf_init.c */
  extern DLLIMPORT char *BufferBlocks;
***************
*** 136,142 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 137,143 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
***************
*** 161,168 ****
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(void);
! extern void BgBufferSync(void);
  
  extern void AtProcExit_LocalBuffers(void);
  
--- 162,170 ----
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
! extern void BgAllSweep(void);
! extern void BgLruSweep(void);
  
  extern void AtProcExit_LocalBuffers(void);
  