Here's an updated WIP patch for load distributed checkpoints.
I added a spinlock to protect the signaling fields between bgwriter and
backends. The existing non-locking approach becomes hard to reason about as the
patch adds two new flags, both of which matter more than the existing
ckpt_time_warn flag.
In fact, I think there's a small race condition in CVS HEAD:
1. pg_start_backup() is called, which calls RequestCheckpoint
2. RequestCheckpoint takes note of the old value of ckpt_started
3. bgwriter wakes up from pg_usleep, and sees that we've exceeded
checkpoint_timeout.
4. bgwriter increases ckpt_started to note that a new checkpoint has started
5. RequestCheckpoint signals bgwriter to start a new checkpoint
6. bgwriter calls CreateCheckpoint, with the force-flag set to false
because this checkpoint was triggered by timeout
7. RequestCheckpoint sees that ckpt_started has increased, and starts to
wait for ckpt_done to reach the new value.
8. CreateCheckpoint finishes immediately, because there was no XLOG
activity since last checkpoint.
9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns.
10. pg_start_backup() continues, with potentially the same redo location
and thus history filename as previous backup.
Now I admit that the chances of that happening are extremely small;
people don't usually do two pg_start_backup calls without *any*
WAL-logged activity in between them, for example. But as we add the new
flags, avoiding scenarios like that becomes harder.
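To illustrate the locking scheme in outline (a simplified sketch only; the
names CkptSignal, request_checkpoint and begin_checkpoint are made up for the
example, the real code in the patch below uses BgWriterShmemStruct and
RequestCheckpoint):

#include "postgres.h"
#include "storage/spin.h"

typedef struct
{
	slock_t		ckpt_lck;			/* protects the fields below */
	int			ckpt_started;		/* advances when a checkpoint starts */
	int			ckpt_done;			/* advances when a checkpoint finishes */
	bool		ckpt_rqst_immediate;	/* finish ASAP */
	bool		ckpt_rqst_force;		/* checkpoint even if no WAL activity */
	bool		ckpt_rqst_time_warn;	/* warn if too soon since last ckpt */
} CkptSignal;

/*
 * Backend side: set the request flags and snapshot the started-counter
 * under the same spinlock. Once ckpt_started has advanced past the
 * snapshot, we know the bgwriter has seen (and consumed) our flags, so we
 * can't mistake an already-running checkpoint for the one we requested.
 */
static int
request_checkpoint(volatile CkptSignal *s, bool immediate, bool force)
{
	int			old_started;

	SpinLockAcquire(&s->ckpt_lck);
	old_started = s->ckpt_started;
	if (immediate)
		s->ckpt_rqst_immediate = true;
	if (force)
		s->ckpt_rqst_force = true;
	SpinLockRelease(&s->ckpt_lck);

	return old_started;			/* caller waits until ckpt_started != this */
}

/*
 * Bgwriter side: read-and-clear the flags and advance the counter
 * atomically, so a request can neither be lost nor consumed twice.
 */
static void
begin_checkpoint(volatile CkptSignal *s, bool *immediate, bool *force)
{
	SpinLockAcquire(&s->ckpt_lck);
	*immediate = s->ckpt_rqst_immediate;
	*force = s->ckpt_rqst_force;
	s->ckpt_rqst_immediate = false;
	s->ckpt_rqst_force = false;
	s->ckpt_started++;
	SpinLockRelease(&s->ckpt_lck);
}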
Since the last patch, I've done some cleanup and refactoring, and added a
bunch of comments as well as user documentation.
I haven't yet changed GetInsertRecPtr to use the almost-up-to-date value
protected by info_lck, per Simon's suggestion, and I still need to do some
correctness testing. After that, I'm done with the patch.
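Roughly, that change would look something like this (a sketch only, assuming
the almost-up-to-date value Simon meant is the insert position kept in
XLogCtl->LogwrtRqst.Write under info_lck, instead of taking WALInsertLock as
the version in the patch below does):

XLogRecPtr
GetInsertRecPtr(void)
{
	/* use volatile pointer to prevent code rearrangement */
	volatile XLogCtlData *xlogctl = XLogCtl;
	XLogRecPtr	recptr;

	SpinLockAcquire(&xlogctl->info_lck);
	recptr = xlogctl->LogwrtRqst.Write;
	SpinLockRelease(&xlogctl->info_lck);

	return recptr;
}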
PS. In case you're wondering what took so long since the last revision, I've
spent a lot of time reviewing HOT.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml 7 Jun 2007 19:19:56 -0000 1.126
--- doc/src/sgml/config.sgml 19 Jun 2007 14:24:31 -0000
***************
*** 1565,1570 ****
--- 1565,1608 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-checkpoint-smoothing" xreflabel="checkpoint_smoothing">
+ <term><varname>checkpoint_smoothing</varname> (<type>floating point</type>)</term>
+ <indexterm>
+ <primary><varname>checkpoint_smoothing</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies the target length of checkpoints, as a fraction of
+ the checkpoint interval. The default is 0.3.
+
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-checkpoint-rate" xreflabel="checkpoint_rate">
+ <term><varname>checkpoint_rate</varname> (<type>floating point</type>)</term>
+ <indexterm>
+ <primary><varname>checkpoint_rate</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies the minimum I/O rate used to flush dirty buffers during a
+ checkpoint, when there are not many dirty buffers in the buffer cache.
+ The default is 512 KB/s.
+
+ Note: the accuracy of this setting depends on
+ <varname>bgwriter_delay</varname>. The value is converted internally
+ to pages per bgwriter_delay, so if, for example, the minimum allowed
+ bgwriter_delay setting of 10ms is used, the effective minimum
+ checkpoint I/O rate is 1 page / 10 ms, or 800 KB/s.
+
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
<term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
<indexterm>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml 31 Jan 2007 20:56:19 -0000 1.43
--- doc/src/sgml/wal.sgml 19 Jun 2007 14:26:45 -0000
***************
*** 217,225 ****
</para>
<para>
There will be at least one WAL segment file, and will normally
not be more than 2 * <varname>checkpoint_segments</varname> + 1
! files. Each segment file is normally 16 MB (though this size can be
altered when building the server). You can use this to estimate space
requirements for <acronym>WAL</acronym>.
Ordinarily, when old log segment files are no longer needed, they
--- 217,245 ----
</para>
<para>
+ If there are a lot of dirty buffers in the buffer cache, flushing them
+ all at checkpoint time causes a heavy burst of I/O that can disrupt other
+ activity in the system. To avoid that, the checkpoint I/O can be distributed
+ over a longer period of time, defined with
+ <varname>checkpoint_smoothing</varname>. It is given as a fraction of the
+ checkpoint interval, as defined by <varname>checkpoint_timeout</varname>
+ and <varname>checkpoint_segments</varname>. The WAL segment consumption
+ and elapsed time are monitored, and the I/O rate is adjusted during the
+ checkpoint so that it finishes when the given fraction of elapsed time
+ or WAL segments has passed, whichever is sooner. However, that could lead
+ to unnecessarily prolonged checkpoints when there are not many dirty buffers
+ in the cache. To avoid that, <varname>checkpoint_rate</varname> can be used
+ to set the minimum I/O rate. Note that prolonging checkpoints
+ affects recovery time, because the longer a checkpoint takes, the more
+ WAL needs to be kept around and replayed in recovery.
+ </para>
+
+ <para>
There will be at least one WAL segment file, and will normally
not be more than 2 * <varname>checkpoint_segments</varname> + 1
! files, though there can be more if a large
! <varname>checkpoint_smoothing</varname> setting is used.
! Each segment file is normally 16 MB (though this size can be
altered when building the server). You can use this to estimate space
requirements for <acronym>WAL</acronym>.
Ordinarily, when old log segment files are no longer needed, they
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.272
diff -c -r1.272 xlog.c
*** src/backend/access/transam/xlog.c 31 May 2007 15:13:01 -0000 1.272
--- src/backend/access/transam/xlog.c 20 Jun 2007 10:44:40 -0000
***************
*** 398,404 ****
static void exitArchiveRecovery(TimeLineID endTLI,
uint32 endLogId, uint32 endLogSeg);
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);
static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
XLogRecPtr *lsn, BkpBlock *bkpb);
--- 398,404 ----
static void exitArchiveRecovery(TimeLineID endTLI,
uint32 endLogId, uint32 endLogSeg);
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);
static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 1608,1614 ****
if (XLOG_DEBUG)
elog(LOG, "time for a checkpoint, signaling bgwriter");
#endif
! RequestCheckpoint(false, true);
}
}
}
--- 1608,1614 ----
if (XLOG_DEBUG)
elog(LOG, "time for a checkpoint, signaling bgwriter");
#endif
! RequestXLogFillCheckpoint();
}
}
}
***************
*** 5110,5116 ****
* the rule that TLI only changes in shutdown checkpoints, which
* allows some extra error checking in xlog_redo.
*/
! CreateCheckPoint(true, true);
/*
* Close down recovery environment
--- 5110,5116 ----
* the rule that TLI only changes in shutdown checkpoints, which
* allows some extra error checking in xlog_redo.
*/
! CreateCheckPoint(true, true, true);
/*
* Close down recovery environment
***************
*** 5319,5324 ****
--- 5319,5340 ----
}
/*
+ * GetInsertRecPtr -- Returns the current insert position.
+ */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+ XLogCtlInsert *Insert = &XLogCtl->Insert;
+ XLogRecPtr recptr;
+
+ LWLockAcquire(WALInsertLock, LW_SHARED);
+ INSERT_RECPTR(recptr, Insert, Insert->curridx);
+ LWLockRelease(WALInsertLock);
+
+ return recptr;
+ }
+
+ /*
* Get the time of the last xlog segment switch
*/
time_t
***************
*** 5383,5389 ****
ereport(LOG,
(errmsg("shutting down")));
! CreateCheckPoint(true, true);
ShutdownCLOG();
ShutdownSUBTRANS();
ShutdownMultiXact();
--- 5399,5405 ----
ereport(LOG,
(errmsg("shutting down")));
! CreateCheckPoint(true, true, true);
ShutdownCLOG();
ShutdownSUBTRANS();
ShutdownMultiXact();
***************
*** 5395,5405 ****
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*
* If force is true, we force a checkpoint regardless of whether any XLOG
* activity has occurred since the last one.
*/
void
! CreateCheckPoint(bool shutdown, bool force)
{
CheckPoint checkPoint;
XLogRecPtr recptr;
--- 5411,5424 ----
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*
+ * If immediate is true, we try to finish the checkpoint as fast as we can,
+ * ignoring checkpoint_smoothing parameter.
+ *
* If force is true, we force a checkpoint regardless of whether any XLOG
* activity has occurred since the last one.
*/
void
! CreateCheckPoint(bool shutdown, bool immediate, bool force)
{
CheckPoint checkPoint;
XLogRecPtr recptr;
***************
*** 5591,5597 ****
*/
END_CRIT_SECTION();
! CheckPointGuts(checkPoint.redo);
START_CRIT_SECTION();
--- 5610,5616 ----
*/
END_CRIT_SECTION();
! CheckPointGuts(checkPoint.redo, immediate);
START_CRIT_SECTION();
***************
*** 5693,5708 ****
/*
* Flush all data in shared memory to disk, and fsync
*
* This is the common code shared between regular checkpoints and
* recovery restartpoints.
*/
static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
{
CheckPointCLOG();
CheckPointSUBTRANS();
CheckPointMultiXact();
! FlushBufferPool(); /* performs all required fsyncs */
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
--- 5712,5730 ----
/*
* Flush all data in shared memory to disk, and fsync
*
+ * If immediate is true, try to finish as quickly as possible, ignoring
+ * the GUC variables to throttle checkpoint I/O.
+ *
* This is the common code shared between regular checkpoints and
* recovery restartpoints.
*/
static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
{
CheckPointCLOG();
CheckPointSUBTRANS();
CheckPointMultiXact();
! FlushBufferPool(immediate); /* performs all required fsyncs */
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
}
***************
*** 5710,5716 ****
/*
* Set a recovery restart point if appropriate
*
! * This is similar to CreateCheckpoint, but is used during WAL recovery
* to establish a point from which recovery can roll forward without
* replaying the entire recovery log. This function is called each time
* a checkpoint record is read from XLOG; it must determine whether a
--- 5732,5738 ----
/*
* Set a recovery restart point if appropriate
*
! * This is similar to CreateCheckPoint, but is used during WAL recovery
* to establish a point from which recovery can roll forward without
* replaying the entire recovery log. This function is called each time
* a checkpoint record is read from XLOG; it must determine whether a
***************
*** 5751,5757 ****
/*
* OK, force data out to disk
*/
! CheckPointGuts(checkPoint->redo);
/*
* Update pg_control so that any subsequent crash will restart from this
--- 5773,5779 ----
/*
* OK, force data out to disk
*/
! CheckPointGuts(checkPoint->redo, true);
/*
* Update pg_control so that any subsequent crash will restart from this
***************
*** 6177,6183 ****
* have different checkpoint positions and hence different history
* file names, even if nothing happened in between.
*/
! RequestCheckpoint(true, false);
/*
* Now we need to fetch the checkpoint record location, and also its
--- 6199,6205 ----
* have different checkpoint positions and hence different history
* file names, even if nothing happened in between.
*/
! RequestLazyCheckpoint();
/*
* Now we need to fetch the checkpoint record location, and also its
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.233
diff -c -r1.233 bootstrap.c
*** src/backend/bootstrap/bootstrap.c 7 Mar 2007 13:35:02 -0000 1.233
--- src/backend/bootstrap/bootstrap.c 19 Jun 2007 15:29:51 -0000
***************
*** 489,495 ****
/* Perform a checkpoint to ensure everything's down to disk */
SetProcessingMode(NormalProcessing);
! CreateCheckPoint(true, true);
/* Clean up and exit */
cleanup();
--- 489,495 ----
/* Perform a checkpoint to ensure everything's down to disk */
SetProcessingMode(NormalProcessing);
! CreateCheckPoint(true, true, true);
/* Clean up and exit */
cleanup();
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.195
diff -c -r1.195 dbcommands.c
*** src/backend/commands/dbcommands.c 1 Jun 2007 19:38:07 -0000 1.195
--- src/backend/commands/dbcommands.c 20 Jun 2007 09:36:24 -0000
***************
*** 404,410 ****
* up-to-date for the copy. (We really only need to flush buffers for the
* source database, but bufmgr.c provides no API for that.)
*/
! BufferSync();
/*
* Once we start copying subdirectories, we need to be able to clean 'em
--- 404,410 ----
* up-to-date for the copy. (We really only need to flush buffers for the
* source database, but bufmgr.c provides no API for that.)
*/
! BufferSync(true);
/*
* Once we start copying subdirectories, we need to be able to clean 'em
***************
*** 507,513 ****
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
*/
! RequestCheckpoint(true, false);
/*
* Close pg_database, but keep lock till commit (this is important to
--- 507,513 ----
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
*/
! RequestImmediateCheckpoint();
/*
* Close pg_database, but keep lock till commit (this is important to
***************
*** 661,667 ****
* open files, which would cause rmdir() to fail.
*/
#ifdef WIN32
! RequestCheckpoint(true, false);
#endif
/*
--- 661,667 ----
* open files, which would cause rmdir() to fail.
*/
#ifdef WIN32
! RequestImmediateCheckpoint();
#endif
/*
***************
*** 1427,1433 ****
* up-to-date for the copy. (We really only need to flush buffers for
* the source database, but bufmgr.c provides no API for that.)
*/
! BufferSync();
/*
* Copy this subdirectory to the new location
--- 1427,1433 ----
* up-to-date for the copy. (We really only need to flush buffers for
* the source database, but bufmgr.c provides no API for that.)
*/
! BufferSync(true);
/*
* Copy this subdirectory to the new location
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.38
diff -c -r1.38 bgwriter.c
*** src/backend/postmaster/bgwriter.c 27 May 2007 03:50:39 -0000 1.38
--- src/backend/postmaster/bgwriter.c 20 Jun 2007 12:58:20 -0000
***************
*** 44,49 ****
--- 44,50 ----
#include "postgres.h"
#include <signal.h>
+ #include <sys/time.h>
#include <time.h>
#include <unistd.h>
***************
*** 59,64 ****
--- 60,66 ----
#include "storage/pmsignal.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
+ #include "storage/spin.h"
#include "tcop/tcopprot.h"
#include "utils/guc.h"
#include "utils/memutils.h"
***************
*** 112,122 ****
{
pid_t bgwriter_pid; /* PID of bgwriter (0 if not started) */
! sig_atomic_t ckpt_started; /* advances when checkpoint starts */
! sig_atomic_t ckpt_done; /* advances when checkpoint done */
! sig_atomic_t ckpt_failed; /* advances when checkpoint fails */
! sig_atomic_t ckpt_time_warn; /* warn if too soon since last ckpt? */
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
--- 114,128 ----
{
pid_t bgwriter_pid; /* PID of bgwriter (0 if not started) */
! slock_t ckpt_lck; /* protects all the ckpt_* fields */
! int ckpt_started; /* advances when checkpoint starts */
! int ckpt_done; /* advances when checkpoint done */
! int ckpt_failed; /* advances when checkpoint fails */
!
! bool ckpt_rqst_time_warn; /* warn if too soon since last ckpt */
! bool ckpt_rqst_immediate; /* an immediate ckpt has been requested */
! bool ckpt_rqst_force; /* checkpoint even if no WAL activity */
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
***************
*** 131,136 ****
--- 137,143 ----
int BgWriterDelay = 200;
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
+ double CheckPointSmoothing = 0.3;
/*
* Flags set by interrupt handlers for later service in the main loop.
***************
*** 146,154 ****
--- 153,176 ----
static bool ckpt_active = false;
+ /* Current time and WAL insert location when checkpoint was started */
+ static time_t ckpt_start_time;
+ static XLogRecPtr ckpt_start_recptr;
+
+ static double ckpt_cached_elapsed;
+
static time_t last_checkpoint_time;
static time_t last_xlog_switch_time;
+ /* Prototypes for private functions */
+
+ static void RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force);
+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(void);
+ static bool IsCheckpointOnSchedule(double progress);
+ static bool ImmediateCheckpointRequested(void);
+
+ /* Signal handlers */
static void bg_quickdie(SIGNAL_ARGS);
static void BgSigHupHandler(SIGNAL_ARGS);
***************
*** 170,175 ****
--- 192,198 ----
Assert(BgWriterShmem != NULL);
BgWriterShmem->bgwriter_pid = MyProcPid;
+ SpinLockInit(&BgWriterShmem->ckpt_lck);
am_bg_writer = true;
/*
***************
*** 281,288 ****
--- 304,314 ----
/* use volatile pointer to prevent code rearrangement */
volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+ SpinLockAcquire(&BgWriterShmem->ckpt_lck);
bgs->ckpt_failed++;
bgs->ckpt_done = bgs->ckpt_started;
+ SpinLockRelease(&bgs->ckpt_lck);
+
ckpt_active = false;
}
***************
*** 328,337 ****
for (;;)
{
bool do_checkpoint = false;
- bool force_checkpoint = false;
time_t now;
int elapsed_secs;
- long udelay;
/*
* Emergency bailout if postmaster has died. This is to avoid the
--- 354,361 ----
***************
*** 354,360 ****
{
checkpoint_requested = false;
do_checkpoint = true;
- force_checkpoint = true;
BgWriterStats.m_requested_checkpoints++;
}
if (shutdown_requested)
--- 378,383 ----
***************
*** 377,387 ****
*/
now = time(NULL);
elapsed_secs = now - last_checkpoint_time;
! if (elapsed_secs >= CheckPointTimeout)
{
do_checkpoint = true;
! if (!force_checkpoint)
! BgWriterStats.m_timed_checkpoints++;
}
/*
--- 400,409 ----
*/
now = time(NULL);
elapsed_secs = now - last_checkpoint_time;
! if (!do_checkpoint && elapsed_secs >= CheckPointTimeout)
{
do_checkpoint = true;
! BgWriterStats.m_timed_checkpoints++;
}
/*
***************
*** 390,395 ****
--- 412,445 ----
*/
if (do_checkpoint)
{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+ bool time_warn;
+ bool immediate;
+ bool force;
+
+ /*
+ * Atomically check the request flags to figure out what
+ * kind of a checkpoint we should perform, and increase the
+ * started-counter to acknowledge that we've started
+ * a new checkpoint.
+ */
+
+ SpinLockAcquire(&bgs->ckpt_lck);
+
+ time_warn = bgs->ckpt_rqst_time_warn;
+ bgs->ckpt_rqst_time_warn = false;
+
+ immediate = bgs->ckpt_rqst_immediate;
+ bgs->ckpt_rqst_immediate = false;
+
+ force = bgs->ckpt_rqst_force;
+ bgs->ckpt_rqst_force = false;
+
+ bgs->ckpt_started++;
+
+ SpinLockRelease(&bgs->ckpt_lck);
+
/*
* We will warn if (a) too soon since last checkpoint (whatever
* caused it) and (b) somebody has set the ckpt_time_warn flag
***************
*** 397,417 ****
* implementation will not generate warnings caused by
* CheckPointTimeout < CheckPointWarning.
*/
! if (BgWriterShmem->ckpt_time_warn &&
elapsed_secs < CheckPointWarning)
ereport(LOG,
(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs),
errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
! BgWriterShmem->ckpt_time_warn = false;
/*
* Indicate checkpoint start to any waiting backends.
*/
ckpt_active = true;
- BgWriterShmem->ckpt_started++;
! CreateCheckPoint(false, force_checkpoint);
/*
* After any checkpoint, close all smgr files. This is so we
--- 447,474 ----
* implementation will not generate warnings caused by
* CheckPointTimeout < CheckPointWarning.
*/
! if (time_warn &&
elapsed_secs < CheckPointWarning)
ereport(LOG,
(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
elapsed_secs),
errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
!
/*
* Indicate checkpoint start to any waiting backends.
*/
ckpt_active = true;
! ckpt_start_recptr = GetInsertRecPtr();
! ckpt_start_time = now;
! ckpt_cached_elapsed = 0;
!
! elog(DEBUG1, "CHECKPOINT: start");
!
! CreateCheckPoint(false, immediate, force);
!
! elog(DEBUG1, "CHECKPOINT: end");
/*
* After any checkpoint, close all smgr files. This is so we
***************
*** 422,428 ****
/*
* Indicate checkpoint completion to any waiting backends.
*/
! BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
ckpt_active = false;
/*
--- 479,487 ----
/*
* Indicate checkpoint completion to any waiting backends.
*/
! SpinLockAcquire(&bgs->ckpt_lck);
! bgs->ckpt_done = bgs->ckpt_started;
! SpinLockRelease(&bgs->ckpt_lck);
ckpt_active = false;
/*
***************
*** 433,446 ****
last_checkpoint_time = now;
}
else
! BgBufferSync();
/*
! * Check for archive_timeout, if so, switch xlog files. First we do a
! * quick check using possibly-stale local state.
*/
! if (XLogArchiveTimeout > 0 &&
! (int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
{
/*
* Update local state ... note that last_xlog_switch_time is the
--- 492,530 ----
last_checkpoint_time = now;
}
else
! {
! BgAllSweep();
! BgLruSweep();
! }
/*
! * Check for archive_timeout and switch xlog files if necessary.
*/
! CheckArchiveTimeout();
!
! /* Nap for the configured time. */
! BgWriterNap();
! }
! }
!
! /*
! * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
! * if needed
! */
! static void
! CheckArchiveTimeout(void)
! {
! time_t now;
!
! if (XLogArchiveTimeout <= 0)
! return;
!
! now = time(NULL);
!
! /* First we do a quick check using possibly-stale local state. */
! if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
! return;
!
{
/*
* Update local state ... note that last_xlog_switch_time is the
***************
*** 450,459 ****
last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
- /* if we did a checkpoint, 'now' might be stale too */
- if (do_checkpoint)
- now = time(NULL);
-
/* Now we can do the real check */
if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
{
--- 534,539 ----
***************
*** 478,483 ****
--- 558,572 ----
last_xlog_switch_time = now;
}
}
+ }
+
+ /*
+ * BgWriterNap -- Nap for the configured time or until a signal is received.
+ */
+ static void
+ BgWriterNap(void)
+ {
+ long udelay;
/*
* Send off activity statistics to the stats collector
***************
*** 496,502 ****
* We absorb pending requests after each short sleep.
*/
if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
! (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0))
udelay = BgWriterDelay * 1000L;
else if (XLogArchiveTimeout > 0)
udelay = 1000000L; /* One second */
--- 585,592 ----
* We absorb pending requests after each short sleep.
*/
if ((bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0) ||
! (bgwriter_lru_percent > 0.0 && bgwriter_lru_maxpages > 0) ||
! ckpt_active)
udelay = BgWriterDelay * 1000L;
else if (XLogArchiveTimeout > 0)
udelay = 1000000L; /* One second */
***************
*** 505,522 ****
while (udelay > 999999L)
{
! if (got_SIGHUP || checkpoint_requested || shutdown_requested)
break;
pg_usleep(1000000L);
AbsorbFsyncRequests();
udelay -= 1000000L;
}
! if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
pg_usleep(udelay);
}
}
/* --------------------------------
* signal handler routines
--- 595,766 ----
while (udelay > 999999L)
{
! /* If a checkpoint is active, postpone reloading the config
! * until the checkpoint is finished, and don't care about
! * non-immediate checkpoint requests.
! */
! if (shutdown_requested ||
! (!ckpt_active && (got_SIGHUP || checkpoint_requested)) ||
! (ckpt_active && ImmediateCheckpointRequested()))
break;
+
pg_usleep(1000000L);
AbsorbFsyncRequests();
udelay -= 1000000L;
}
!
! if (!(shutdown_requested ||
! (!ckpt_active && (got_SIGHUP || checkpoint_requested)) ||
! (ckpt_active && ImmediateCheckpointRequested())))
pg_usleep(udelay);
+ }
+
+ /*
+ * Returns true if an immediate checkpoint request is pending.
+ */
+ static bool
+ ImmediateCheckpointRequested()
+ {
+ if (checkpoint_requested)
+ {
+ volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+ /*
+ * We're only looking at a single field, so we don't need to
+ * acquire the lock in this case.
+ */
+ if (bgs->ckpt_rqst_immediate)
+ return true;
}
+ return false;
}
+ /*
+ * CheckpointWriteDelay -- periodic sleep in the checkpoint write phase
+ *
+ * During checkpoint, this is called periodically by the buffer manager while
+ * writing out dirty buffers from the shared buffer cache. We estimate if we've
+ * made enough progress so that we're going to finish this checkpoint in time
+ * before the next one is due, taking checkpoint_smoothing into account.
+ * If so, we perform one round of normal bgwriter activity including LRU-
+ * cleaning of buffer cache, switching xlog segment if archive_timeout has
+ * passed, and sleeping for BgWriterDelay msecs.
+ *
+ * 'progress' is an estimate of how much of the writing has been done, as a
+ * fraction between 0.0 meaning none, and 1.0 meaning all done.
+ */
+ void
+ CheckpointWriteDelay(double progress)
+ {
+ /*
+ * Return immediately if we should finish the checkpoint ASAP.
+ */
+ if (!am_bg_writer || CheckPointSmoothing <= 0 || shutdown_requested ||
+ ImmediateCheckpointRequested())
+ return;
+
+ elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f", progress);
+
+ /* Take a nap and perform the usual bgwriter duties, unless we're behind
+ * schedule, in which case we just try to catch up as quickly as possible.
+ */
+ if (IsCheckpointOnSchedule(progress))
+ {
+ CheckArchiveTimeout();
+ BgLruSweep();
+ BgWriterNap();
+ }
+ }
+
+ /*
+ * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+ * in time?
+ *
+ * Compares the current progress against the time/segments elapsed since last
+ * checkpoint, and returns true if the progress we've made this far is greater
+ * than the elapsed time/segments.
+ *
+ * If another checkpoint has already been requested, always return false.
+ */
+ static bool
+ IsCheckpointOnSchedule(double progress)
+ {
+ struct timeval now;
+ XLogRecPtr recptr;
+ double progress_in_time,
+ progress_in_xlog;
+
+ Assert(ckpt_active);
+
+ /* scale progress according to CheckPointSmoothing */
+ progress *= CheckPointSmoothing;
+
+ /*
+ * Check against the cached value first. Only do the more expensive
+ * calculations once we reach the target previously calculated. Since
+ * neither time nor the WAL insert pointer moves backwards, a freshly
+ * calculated value can only be greater than or equal to the cached value.
+ */
+ if (progress < ckpt_cached_elapsed)
+ {
+ elog(DEBUG2, "IsCheckpointOnSchedule: Still behind cached=%.3f, progress=%.3f",
+ ckpt_cached_elapsed, progress);
+ return false;
+ }
+
+ gettimeofday(&now, NULL);
+
+ /*
+ * Check progress against time elapsed and checkpoint_timeout.
+ */
+ progress_in_time = ((double) (now.tv_sec - ckpt_start_time) +
+ now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+ if (progress < progress_in_time)
+ {
+ elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_timeout, time=%.3f, progress=%.3f",
+ progress_in_time, progress);
+
+ ckpt_cached_elapsed = progress_in_time;
+
+ return false;
+ }
+
+ /*
+ * Check progress against WAL segments written and checkpoint_segments.
+ *
+ * We compare the current WAL insert location against the location
+ * computed before calling CreateCheckPoint. The code in XLogInsert that
+ * actually triggers a checkpoint when checkpoint_segments is exceeded
+ * compares against RedoRecptr, so this is not completely accurate.
+ * However, it's good enough for our purposes; we're only calculating
+ * an estimate anyway.
+ */
+ recptr = GetInsertRecPtr();
+ progress_in_xlog =
+ (((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+ ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+ CheckPointSegments;
+
+ if (progress < progress_in_xlog)
+ {
+ elog(DEBUG2, "IsCheckpointOnSchedule: Behind checkpoint_segments, xlog=%.3f, progress=%.3f",
+ progress_in_xlog, progress);
+
+ ckpt_cached_elapsed = progress_in_xlog;
+
+ return false;
+ }
+
+
+ /* It looks like we're on schedule. */
+
+ elog(DEBUG2, "IsCheckpointOnSchedule: on schedule, time=%.3f, xlog=%.3f progress=%.3f",
+ progress_in_time, progress_in_xlog, progress);
+
+ return true;
+ }
/* --------------------------------
* signal handler routines
***************
*** 618,625 ****
}
/*
* RequestCheckpoint
! * Called in backend processes to request an immediate checkpoint
*
* If waitforit is true, wait until the checkpoint is completed
* before returning; otherwise, just signal the request and return
--- 862,910 ----
}
/*
+ * RequestImmediateCheckpoint
+ * Called in backend processes to request an immediate checkpoint.
+ *
+ * Returns when the checkpoint is finished.
+ */
+ void
+ RequestImmediateCheckpoint()
+ {
+ RequestCheckpoint(true, false, true, true);
+ }
+
+ /*
+ * RequestLazyCheckpoint
+ * Called in backend processes to request a lazy checkpoint.
+ *
+ * This is essentially the same as RequestImmediateCheckpoint, except
+ * that this form obeys the checkpoint_smoothing GUC variable, and
+ * can therefore take a lot longer.
+ *
+ * Returns when the checkpoint is finished.
+ */
+ void
+ RequestLazyCheckpoint()
+ {
+ RequestCheckpoint(true, false, false, true);
+ }
+
+ /*
+ * RequestXLogFillCheckpoint
+ * Signals the bgwriter that we've reached checkpoint_segments
+ *
+ * Unlike RequestImmediateCheckpoint and RequestLazyCheckpoint, this returns
+ * immediately without waiting for the checkpoint to finish.
+ */
+ void
+ RequestXLogFillCheckpoint()
+ {
+ RequestCheckpoint(false, true, false, false);
+ }
+
+ /*
* RequestCheckpoint
! * Common subroutine for all the above Request*Checkpoint variants.
*
* If waitforit is true, wait until the checkpoint is completed
* before returning; otherwise, just signal the request and return
***************
*** 628,648 ****
* If warnontime is true, and it's "too soon" since the last checkpoint,
* the bgwriter will log a warning. This should be true only for checkpoints
* caused due to xlog filling, else the warning will be misleading.
*/
! void
! RequestCheckpoint(bool waitforit, bool warnontime)
{
/* use volatile pointer to prevent code rearrangement */
volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! sig_atomic_t old_failed = bgs->ckpt_failed;
! sig_atomic_t old_started = bgs->ckpt_started;
/*
* If in a standalone backend, just do it ourselves.
*/
if (!IsPostmasterEnvironment)
{
! CreateCheckPoint(false, true);
/*
* After any checkpoint, close all smgr files. This is so we won't
--- 913,942 ----
* If warnontime is true, and it's "too soon" since the last checkpoint,
* the bgwriter will log a warning. This should be true only for checkpoints
* caused due to xlog filling, else the warning will be misleading.
+ *
+ * If immediate is true, the checkpoint should be finished ASAP.
+ *
+ * If force is true, force a checkpoint even if no XLOG activity has occurred
+ * since the last one.
*/
! static void
! RequestCheckpoint(bool waitforit, bool warnontime, bool immediate, bool force)
{
/* use volatile pointer to prevent code rearrangement */
volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! int old_failed, old_started;
/*
* If in a standalone backend, just do it ourselves.
*/
if (!IsPostmasterEnvironment)
{
! /*
! * There's no point in doing lazy checkpoints in a standalone
+ * backend, because there are no other backends the checkpoint could
! * disrupt.
! */
! CreateCheckPoint(false, true, true);
/*
* After any checkpoint, close all smgr files. This is so we won't
***************
*** 653,661 ****
return;
}
! /* Set warning request flag if appropriate */
if (warnontime)
! bgs->ckpt_time_warn = true;
/*
* Send signal to request checkpoint. When waitforit is false, we
--- 947,974 ----
return;
}
! /*
! * Atomically set the request flags, and take a snapshot of the counters.
! * This ensures that when we see that ckpt_started > old_started,
! * we know the flags we set here have been seen by bgwriter.
! *
! * Note that we effectively OR the flags with any existing flags, to
! * avoid overriding a "stronger" request by another backend.
! */
! SpinLockAcquire(&bgs->ckpt_lck);
!
! old_failed = bgs->ckpt_failed;
! old_started = bgs->ckpt_started;
!
! /* Set request flags as appropriate */
if (warnontime)
! bgs->ckpt_rqst_time_warn = true;
! if (immediate)
! bgs->ckpt_rqst_immediate = true;
! if (force)
! bgs->ckpt_rqst_force = true;
!
! SpinLockRelease(&bgs->ckpt_lck);
/*
* Send signal to request checkpoint. When waitforit is false, we
***************
*** 674,701 ****
*/
if (waitforit)
{
! while (bgs->ckpt_started == old_started)
{
CHECK_FOR_INTERRUPTS();
pg_usleep(100000L);
}
- old_started = bgs->ckpt_started;
/*
! * We are waiting for ckpt_done >= old_started, in a modulo sense.
! * This is a little tricky since we don't know the width or signedness
! * of sig_atomic_t. We make the lowest common denominator assumption
! * that it is only as wide as "char". This means that this algorithm
! * will cope correctly as long as we don't sleep for more than 127
! * completed checkpoints. (If we do, we will get another chance to
! * exit after 128 more checkpoints...)
*/
! while (((signed char) (bgs->ckpt_done - old_started)) < 0)
{
CHECK_FOR_INTERRUPTS();
pg_usleep(100000L);
}
! if (bgs->ckpt_failed != old_failed)
ereport(ERROR,
(errmsg("checkpoint request failed"),
errhint("Consult recent messages in the server log for details.")));
--- 987,1031 ----
*/
if (waitforit)
{
! int new_started, new_failed;
!
! /* Wait for a new checkpoint to start. */
! for(;;)
{
+ SpinLockAcquire(&bgs->ckpt_lck);
+ new_started = bgs->ckpt_started;
+ SpinLockRelease(&bgs->ckpt_lck);
+
+ if (new_started != old_started)
+ break;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(100000L);
}
/*
! * We are waiting for ckpt_done >= new_started, in a modulo sense.
! * This algorithm will cope correctly as long as we don't sleep for
! * more than INT_MAX completed checkpoints. (If we do, we will get
! * another chance to exit after INT_MAX more checkpoints...)
*/
! for(;;)
{
+ int new_done;
+
+ SpinLockAcquire(&bgs->ckpt_lck);
+ new_done = bgs->ckpt_done;
+ new_failed = bgs->ckpt_failed;
+ SpinLockRelease(&bgs->ckpt_lck);
+
+ if(new_done - new_started >= 0)
+ break;
+
CHECK_FOR_INTERRUPTS();
pg_usleep(100000L);
}
!
! if (new_failed != old_failed)
ereport(ERROR,
(errmsg("checkpoint request failed"),
errhint("Consult recent messages in the server log for details.")));
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.220
diff -c -r1.220 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 30 May 2007 20:11:58 -0000 1.220
--- src/backend/storage/buffer/bufmgr.c 20 Jun 2007 12:47:43 -0000
***************
*** 32,38 ****
*
* BufferSync() -- flush all dirty buffers in the buffer pool.
*
! * BgBufferSync() -- flush some dirty buffers in the buffer pool.
*
* InitBufferPool() -- Init the buffer module.
*
--- 32,40 ----
*
* BufferSync() -- flush all dirty buffers in the buffer pool.
*
! * BgAllSweep() -- write out some dirty buffers in the pool.
! *
! * BgLruSweep() -- write out some lru dirty buffers in the pool.
*
* InitBufferPool() -- Init the buffer module.
*
***************
*** 74,79 ****
--- 76,82 ----
double bgwriter_all_percent = 0.333;
int bgwriter_lru_maxpages = 5;
int bgwriter_all_maxpages = 5;
+ int checkpoint_rate = 512; /* in pages/s */
long NDirectFileRead; /* some I/O's are direct file access. bypass
***************
*** 645,651 ****
* at 1 so that the buffer can survive one clock-sweep pass.)
*/
buf->tag = newTag;
! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
buf->flags |= BM_TAG_VALID;
buf->usage_count = 1;
--- 648,654 ----
* at 1 so that the buffer can survive one clock-sweep pass.)
*/
buf->tag = newTag;
! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR);
buf->flags |= BM_TAG_VALID;
buf->usage_count = 1;
***************
*** 1000,1037 ****
* BufferSync -- Write out all dirty buffers in the pool.
*
* This is called at checkpoint time to write out all dirty shared buffers.
*/
void
! BufferSync(void)
{
! int buf_id;
int num_to_scan;
int absorb_counter;
/*
* Find out where to start the circular scan.
*/
! buf_id = StrategySyncStart();
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
! * Loop over all buffers.
*/
num_to_scan = NBuffers;
absorb_counter = WRITES_PER_ABSORB;
while (num_to_scan-- > 0)
{
! if (SyncOneBuffer(buf_id, false))
{
BgWriterStats.m_buf_written_checkpoints++;
/*
* If in bgwriter, absorb pending fsync requests after each
* WRITES_PER_ABSORB write operations, to prevent overflow of the
* fsync request queue. If not in bgwriter process, this is a
* no-op.
*/
if (--absorb_counter <= 0)
{
--- 1003,1127 ----
* BufferSync -- Write out all dirty buffers in the pool.
*
* This is called at checkpoint time to write out all dirty shared buffers.
+ * If 'immediate' is true, write them all ASAP, otherwise throttle the
+ * I/O rate according to the checkpoint_rate GUC variable, and perform
+ * normal bgwriter duties periodically.
*/
void
! BufferSync(bool immediate)
{
! int buf_id, start_id;
int num_to_scan;
+ int num_to_write;
+ int num_written;
int absorb_counter;
+ int num_written_since_nap;
+ int writes_per_nap;
+
+ /*
+ * Convert checkpoint_rate (pages/s) to the number of writes to perform in
+ * one BgWriterDelay period, but always at least one. The result is an
+ * integer, so we lose some precision here. There are a lot of other
+ * factors as well that affect the real rate, for example the granularity
+ * of the OS timer used for BgWriterDelay, whether any of the writes block,
+ * and time spent in CheckpointWriteDelay performing normal bgwriter duties.
+ */
+ writes_per_nap = Max(1, checkpoint_rate * BgWriterDelay / 1000);
/*
* Find out where to start the circular scan.
*/
! start_id = StrategySyncStart();
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
! * Loop over all buffers, and mark the ones that need to be written with
! * BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we
! * can estimate how much work needs to be done.
! *
! * This allows us to only write those pages that were dirty when the
! * checkpoint began, and haven't been flushed to disk since. Whenever a
! * page with BM_CHECKPOINT_NEEDED is written out by normal backends or
! * the bgwriter LRU-scan, the flag is cleared, and any pages dirtied after
! * this scan don't have the flag set.
! */
! num_to_scan = NBuffers;
! num_to_write = 0;
! buf_id = start_id;
! while (num_to_scan-- > 0)
! {
! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
!
! /*
! * Header spinlock is enough to examine BM_DIRTY, see comment in
! * SyncOneBuffer.
! */
! LockBufHdr(bufHdr);
!
! if (bufHdr->flags & BM_DIRTY)
! {
! bufHdr->flags |= BM_CHECKPOINT_NEEDED;
! num_to_write++;
! }
!
! UnlockBufHdr(bufHdr);
!
! if (++buf_id >= NBuffers)
! buf_id = 0;
! }
!
! elog(DEBUG1, "CHECKPOINT: %d / %d buffers to write", num_to_write, NBuffers);
!
! /*
! * Loop over all buffers again, and write the ones (still) marked with
! * BM_CHECKPOINT_NEEDED.
*/
num_to_scan = NBuffers;
+ num_written = num_written_since_nap = 0;
absorb_counter = WRITES_PER_ABSORB;
+ buf_id = start_id;
while (num_to_scan-- > 0)
{
! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
! bool needs_flush;
!
! /* We don't need to acquire the lock here, because we're
! * only looking at a single bit. It's possible that someone
! * else writes the buffer and clears the flag right after we
! * check, but that doesn't matter. This assumes that no-one
! * clears the flag and sets it again while holding the buffer header
! * lock, expecting no-one to see the intermediate state.
! */
! needs_flush = (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0;
!
! if (needs_flush && SyncOneBuffer(buf_id, false))
{
BgWriterStats.m_buf_written_checkpoints++;
+ num_written++;
+
+ /*
+ * Perform normal bgwriter duties and sleep to throttle
+ * our I/O rate.
+ */
+ if (!immediate && ++num_written_since_nap >= writes_per_nap)
+ {
+ num_written_since_nap = 0;
+ CheckpointWriteDelay((double) (num_written) / num_to_write);
+ }
/*
* If in bgwriter, absorb pending fsync requests after each
* WRITES_PER_ABSORB write operations, to prevent overflow of the
* fsync request queue. If not in bgwriter process, this is a
* no-op.
+ *
+ * AbsorbFsyncRequests is also called inside CheckpointWriteDelay,
+ * so this is partially redundant. However, we can't totally trust
+ * on the call in CheckpointWriteDelay, because it's only made
+ * before sleeping. In case CheckpointWriteDelay doesn't sleep,
+ * we need to absorb pending requests ourselves.
*/
if (--absorb_counter <= 0)
{
***************
*** 1045,1059 ****
}
/*
! * BgBufferSync -- Write out some dirty buffers in the pool.
*
* This is called periodically by the background writer process.
*/
void
! BgBufferSync(void)
{
static int buf_id1 = 0;
- int buf_id2;
int num_to_scan;
int num_written;
--- 1135,1152 ----
}
/*
! * BgAllSweep -- Write out some dirty buffers in the pool.
*
+ * Runs the bgwriter all-sweep algorithm to write dirty buffers to
+ * minimize work at checkpoint time.
* This is called periodically by the background writer process.
+ *
+ * XXX: Is this really needed with load distributed checkpoints?
*/
void
! BgAllSweep(void)
{
static int buf_id1 = 0;
int num_to_scan;
int num_written;
***************
*** 1063,1072 ****
/*
* To minimize work at checkpoint time, we want to try to keep all the
* buffers clean; this motivates a scan that proceeds sequentially through
! * all buffers. But we are also charged with ensuring that buffers that
! * will be recycled soon are clean when needed; these buffers are the ones
! * just ahead of the StrategySyncStart point. We make a separate scan
! * through those.
*/
/*
--- 1156,1162 ----
/*
* To minimize work at checkpoint time, we want to try to keep all the
* buffers clean; this motivates a scan that proceeds sequentially through
! * all buffers.
*/
/*
***************
*** 1098,1103 ****
--- 1188,1210 ----
}
BgWriterStats.m_buf_written_all += num_written;
}
+ }
+
+ /*
+ * BgLruSweep -- Write out some lru dirty buffers in the pool.
+ */
+ void
+ BgLruSweep(void)
+ {
+ int buf_id2;
+ int num_to_scan;
+ int num_written;
+
+ /*
+ * The purpose of this sweep is to ensure that buffers that
+ * will be recycled soon are clean when needed; these buffers are the ones
+ * just ahead of the StrategySyncStart point.
+ */
/*
* This loop considers only unpinned buffers close to the clock sweep
***************
*** 1341,1349 ****
* flushed.
*/
void
! FlushBufferPool(void)
{
! BufferSync();
smgrsync();
}
--- 1448,1459 ----
* flushed.
*/
void
! FlushBufferPool(bool immediate)
{
! elog(DEBUG1, "CHECKPOINT: write phase");
! BufferSync(immediate || CheckPointSmoothing <= 0);
!
! elog(DEBUG1, "CHECKPOINT: sync phase");
smgrsync();
}
***************
*** 2132,2138 ****
Assert(buf->flags & BM_IO_IN_PROGRESS);
buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
! buf->flags &= ~BM_DIRTY;
buf->flags |= set_flag_bits;
UnlockBufHdr(buf);
--- 2242,2248 ----
Assert(buf->flags & BM_IO_IN_PROGRESS);
buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
! buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
buf->flags |= set_flag_bits;
UnlockBufHdr(buf);
Index: src/backend/tcop/utility.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tcop/utility.c,v
retrieving revision 1.280
diff -c -r1.280 utility.c
*** src/backend/tcop/utility.c 30 May 2007 20:12:01 -0000 1.280
--- src/backend/tcop/utility.c 20 Jun 2007 09:36:31 -0000
***************
*** 1089,1095 ****
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("must be superuser to do CHECKPOINT")));
! RequestCheckpoint(true, false);
break;
case T_ReindexStmt:
--- 1089,1095 ----
ereport(ERROR,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
errmsg("must be superuser to do CHECKPOINT")));
! RequestImmediateCheckpoint();
break;
case T_ReindexStmt:
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.396
diff -c -r1.396 guc.c
*** src/backend/utils/misc/guc.c 8 Jun 2007 18:23:52 -0000 1.396
--- src/backend/utils/misc/guc.c 20 Jun 2007 10:14:06 -0000
***************
*** 1487,1492 ****
--- 1487,1503 ----
30, 0, INT_MAX, NULL, NULL
},
+
+ {
+ {"checkpoint_rate", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Minimum I/O rate used to write dirty buffers during checkpoints."),
+ NULL,
+ GUC_UNIT_BLOCKS
+ },
+ &checkpoint_rate,
+ 100, 0.0, 100000, NULL, NULL
+ },
+
{
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
***************
*** 1866,1871 ****
--- 1877,1891 ----
0.1, 0.0, 100.0, NULL, NULL
},
+ {
+ {"checkpoint_smoothing", PGC_SIGHUP, WAL_CHECKPOINTS,
+ gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."),
+ NULL
+ },
+ &CheckPointSmoothing,
+ 0.3, 0.0, 0.9, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.216
diff -c -r1.216 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample 3 Jun 2007 17:08:15 -0000 1.216
--- src/backend/utils/misc/postgresql.conf.sample 20 Jun 2007 10:03:17 -0000
***************
*** 168,173 ****
--- 168,175 ----
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
+ #checkpoint_smoothing = 0.3 # checkpoint duration, range 0.0 - 0.9
+ #checkpoint_rate = 512.0KB # min. checkpoint write rate per second
#checkpoint_warning = 30s # 0 is off
# - Archiving -
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.78
diff -c -r1.78 xlog.h
*** src/include/access/xlog.h 30 May 2007 20:12:02 -0000 1.78
--- src/include/access/xlog.h 19 Jun 2007 14:10:07 -0000
***************
*** 171,179 ****
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool force);
extern void XLogPutNextOid(Oid nextOid);
extern XLogRecPtr GetRedoRecPtr(void);
extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
#endif /* XLOG_H */
--- 171,180 ----
extern void StartupXLOG(void);
extern void ShutdownXLOG(int code, Datum arg);
extern void InitXLOGAccess(void);
! extern void CreateCheckPoint(bool shutdown, bool immediate, bool force);
extern void XLogPutNextOid(Oid nextOid);
extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
#endif /* XLOG_H */
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.9
diff -c -r1.9 bgwriter.h
*** src/include/postmaster/bgwriter.h 5 Jan 2007 22:19:57 -0000 1.9
--- src/include/postmaster/bgwriter.h 20 Jun 2007 09:27:20 -0000
***************
*** 20,29 ****
extern int BgWriterDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
extern void BackgroundWriterMain(void);
! extern void RequestCheckpoint(bool waitforit, bool warnontime);
extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
extern void AbsorbFsyncRequests(void);
--- 20,33 ----
extern int BgWriterDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
+ extern double CheckPointSmoothing;
extern void BackgroundWriterMain(void);
! extern void RequestImmediateCheckpoint(void);
! extern void RequestLazyCheckpoint(void);
! extern void RequestXLogFillCheckpoint(void);
! extern void CheckpointWriteDelay(double progress);
extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
extern void AbsorbFsyncRequests(void);
Index: src/include/storage/buf_internals.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.90
diff -c -r1.90 buf_internals.h
*** src/include/storage/buf_internals.h 30 May 2007 20:12:03 -0000 1.90
--- src/include/storage/buf_internals.h 12 Jun 2007 11:42:23 -0000
***************
*** 35,40 ****
--- 35,41 ----
#define BM_IO_ERROR (1 << 4) /* previous I/O failed */
#define BM_JUST_DIRTIED (1 << 5) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1 << 6) /* have waiter for sole pin */
+ #define BM_CHECKPOINT_NEEDED (1 << 7) /* this needs to be written in checkpoint */
typedef bits16 BufFlags;
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.104
diff -c -r1.104 bufmgr.h
*** src/include/storage/bufmgr.h 30 May 2007 20:12:03 -0000 1.104
--- src/include/storage/bufmgr.h 20 Jun 2007 10:28:43 -0000
***************
*** 36,41 ****
--- 36,42 ----
extern double bgwriter_all_percent;
extern int bgwriter_lru_maxpages;
extern int bgwriter_all_maxpages;
+ extern int checkpoint_rate;
/* in buf_init.c */
extern DLLIMPORT char *BufferBlocks;
***************
*** 136,142 ****
extern void ResetBufferUsage(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 137,143 ----
extern void ResetBufferUsage(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
***************
*** 161,168 ****
extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
! extern void BufferSync(void);
! extern void BgBufferSync(void);
extern void AtProcExit_LocalBuffers(void);
--- 162,170 ----
extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
! extern void BgAllSweep(void);
! extern void BgLruSweep(void);
extern void AtProcExit_LocalBuffers(void);