Re: [PATCHES] Load distributed checkpoint V4

2007-04-23 Thread Greg Smith

On Thu, 19 Apr 2007, Heikki Linnakangas wrote:

In the sync phase, we sleep between each fsync until enough time/segments 
have passed, assuming that the time to fsync is proportional to the file 
length. I'm not sure that's a very good assumption.


I've been making scatter plots of fsync time vs. amount written to the 
database for a couple of months now, and while there's a trend there, it's 
not a linear one based on data written.  Under Linux, to make a useful 
prediction about how long an fsync will take, you first need to consider 
how much dirty data is already in the OS cache (the "Dirty:" figure in 
/proc/meminfo) before the write begins, relative to the kernel parameters 
that control write behavior.  Combine that with some knowledge of the 
caching behavior of the controller/disk combination you're using, and it's 
just barely possible to make a reasonable estimate.  With any less 
information than that, you really have very little basis on which to guess 
how long it's going to take.
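
For anyone who wants to watch that figure, here's a minimal sketch 
(Linux-only, untested, not part of the patch) of pulling the "Dirty:" 
value out of /proc/meminfo:

#include <stdio.h>

/* Return the current dirty page cache size in kB, or -1 on failure.
 * Illustrative only: scans /proc/meminfo for the "Dirty:" line. */
static long
linux_dirty_kb(void)
{
    FILE   *f = fopen("/proc/meminfo", "r");
    char    line[128];
    long    kb = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}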


Other operating systems are going to give completely different behavior 
here, which of course makes the problem even worse.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load distributed checkpoint V4

2007-04-19 Thread Heikki Linnakangas

ITAGAKI Takahiro wrote:

Here is an updated version of LDC patch (V4).


Thanks! I'll start testing.


- Progress of a checkpoint is controlled based not only on checkpoint_timeout
  but also on checkpoint_segments. -- Now it works better with a large
  checkpoint_timeout and a small checkpoint_segments.


Great, much better now. I like the concept of "progress" used in the 
calculations. We might want to call GetCheckpointProgress something 
else, though. It doesn't return the amount of progress made, but rather 
the amount of progress we should have made by that point; if we've done 
less than that, we're in danger of not completing the checkpoint in time.
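
To illustrate what I mean (all names hypothetical, not the patch's actual 
code), the value it returns is more of a target than a measurement:

extern int  CheckPointTimeout;   /* checkpoint_timeout, in seconds */
extern int  CheckPointSegments;  /* checkpoint_segments */

/* Fraction of the checkpoint interval already consumed: whichever of
 * elapsed time and used-up WAL segments is further along wins.  We're
 * "in danger" whenever actual write progress is below this value. */
static double
CheckpointTargetProgress(double elapsed_secs, int segs_consumed)
{
    double  by_time = elapsed_secs / (double) CheckPointTimeout;
    double  by_segs = (double) segs_consumed / (double) CheckPointSegments;

    return (by_time > by_segs) ? by_time : by_segs;
}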



We can control the delay of checkpoints using three parameters:
checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
If we set all of these values to zero, checkpoints behave as they did before.


The nap and sync phases are pretty straightforward. The write phase, 
however, behaves a bit differently; I'll go through each in turn.


In the nap phase, we just sleep until enough time/segments have passed, 
where "enough" is defined by checkpoint_nap_percent. However, if we're 
already past checkpoint_write_percent at the beginning of the nap, I 
think we should clamp the nap time so that the sleep doesn't eat up the 
time remaining before the next checkpoint is due.
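
Something like this is what I have in mind for the clamp (untested 
sketch, names made up, declarations approximate):

extern int    CheckPointTimeout;        /* checkpoint_timeout, in seconds */
extern double checkpoint_nap_percent;   /* GUC from the patch */
extern void   pg_usleep(long microsec); /* backend microsecond sleep */

/* Nap between the write and sync phases, but never sleep past the end
 * of the checkpoint interval. */
static void
CheckpointNap(double elapsed_secs)
{
    double  nap_secs  = CheckPointTimeout * (checkpoint_nap_percent / 100.0);
    double  left_secs = CheckPointTimeout - elapsed_secs;

    if (left_secs <= 0.0)
        return;                     /* budget already used up: skip the nap */
    if (nap_secs > left_secs)
        nap_secs = left_secs;       /* clamp to the time remaining */

    pg_usleep((long) (nap_secs * 1000000.0));
}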


In the sync phase, we sleep between each fsync until enough 
time/segments have passed, assuming that the time to fsync is 
proportional to the file length. I'm not sure that's a very good 
assumption. We might have one huge file with only very little changed 
data, for example a logging table that is just occasionally appended to. 
If we begin by fsyncing that, it'll take a very short time to finish, 
and we'll then sleep for a long time. If we then have another large file 
to fsync, but that one has all pages dirty, we risk running out of time 
because of the unnecessarily long sleep. The segmentation of relations 
limits the risk of that, though, by capping the maximum file size, and I 
don't really have any better suggestions.
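
In other words, the pacing amounts to something like this (hypothetical 
sketch; as said, file length is only a rough proxy for fsync cost):

#include <sys/types.h>

extern int    CheckPointTimeout;         /* checkpoint_timeout, in seconds */
extern double checkpoint_sync_percent;   /* GUC from the patch */
extern void   pg_usleep(long microsec);  /* backend microsecond sleep */

/* After fsyncing one segment, sleep for that segment's share of the
 * sync-phase budget, the share being its length relative to the total
 * number of bytes being synced. */
static void
SyncPhaseSleep(off_t file_bytes, off_t total_bytes)
{
    double  budget_secs = CheckPointTimeout * (checkpoint_sync_percent / 100.0);
    double  share = (double) file_bytes / (double) total_bytes;

    pg_usleep((long) (budget_secs * share * 1000000.0));
}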


In the write phase, bgwriter_all_maxpages is also factored into the 
sleeps. On each iteration, we write bgwriter_all_maxpages pages and then 
sleep for bgwriter_delay msecs. checkpoint_write_percent only controls 
the maximum amount of time we try to spend in the write phase: we skip 
the sleeps if we're exceeding checkpoint_write_percent, but the phase 
can very well finish earlier. IOW, bgwriter_all_maxpages is the *minimum* 
number of pages to write between sleeps. If it's not set, we use 
WRITERS_PER_ABSORB, which is hardcoded to 1000.
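
So roughly (untested, helper names made up):

#include <stdbool.h>

extern int  bgwriter_all_maxpages;       /* pages to write per iteration */
extern int  BgWriterDelay;               /* bgwriter_delay, in msecs */
extern void pg_usleep(long microsec);    /* backend microsecond sleep */
extern int  WriteDirtyBuffers(int n);    /* hypothetical helper */
extern bool PastWriteBudget(void);       /* hypothetical helper */

#define WRITERS_PER_ABSORB 1000          /* fallback when the GUC is unset */

/* One possible shape of the write phase: write at least N pages per
 * iteration, then sleep, unless we've fallen behind the schedule
 * implied by checkpoint_write_percent. */
static void
CheckpointWritePhase(void)
{
    int     n = (bgwriter_all_maxpages > 0) ?
                bgwriter_all_maxpages : WRITERS_PER_ABSORB;

    while (WriteDirtyBuffers(n) > 0)
    {
        if (!PastWriteBudget())
            pg_usleep(BgWriterDelay * 1000L);
    }
}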


The approach of writing a minimum of N pages per iteration seems sound to 
me. By setting N we can control the maximum impact of a checkpoint under 
normal circumstances. If there's very little work to do, it doesn't make 
sense to stretch the write of, say, 10 buffers across a 15 min period; 
it's indeed better to finish the checkpoint earlier. It's similar to 
vacuum_cost_limit in that sense. But using bgwriter_all_maxpages for it 
doesn't feel right; we should at least name it differently. The default 
of 1000 is a bit high as well: with the default bgwriter_delay that adds 
up to 39MB/s. That's OK for a decent I/O subsystem, but the default 
really should be something that still leaves room for other I/O on a 
small single-disk server.
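
For reference, that figure follows from the default bgwriter_delay of 
200ms and the 8kB page size:

  1000 pages x 8kB/page = 8000kB per iteration
  1000ms / 200ms        = 5 iterations per second
  8000kB x 5            = 40000kB/s, or roughly 39MB/s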


Should we try doing something similar for the sync phase? If there's 
only 2 small files to fsync, there's no point sleeping for 5 minutes 
between them just to use up the checkpoint_sync_percent budget.


Should we give a warning if you set the *_percent settings so that they 
exceed 100%?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[PATCHES] Load distributed checkpoint V4

2007-04-18 Thread ITAGAKI Takahiro
Here is an updated version of LDC patch (V4).

- Refactor the codes to minimize the impact of changes.
- Progress of a checkpoint is controlled based not only on checkpoint_timeout
  but also on checkpoint_segments. -- Now it works better with a large
  checkpoint_timeout and a small checkpoint_segments.

We can control the delay of checkpoints using three parameters:
checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
If we set all of these values to zero, checkpoints behave as they did before.


Heikki Linnakangas <[EMAIL PROTECTED]> wrote:

> I'd suggest rearranging the code so that BgBufferSync and mdsync would 
> basically stay like they are without the patch; the signature wouldn't 
> change. To do the naps during a checkpoint, inject calls to new 
> functions like CheckpointWriteNap() and CheckpointFsyncNap() inside 
> BgBufferSync and mdsync. Those nap functions would check if enough 
> progress has been made since last call and sleep if so.

Yeah, it makes LDC less intrusive. Now the code flow in checkpoints stays
as it was, and the nap functions are called periodically in BufferSync()
and smgrsync(). But the signatures of some functions needed small changes:
the argument 'immediate' was added.
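
To illustrate the shape of the change (sketch only, hypothetical helper,
not the actual patch code):

#include <stdbool.h>

extern int  BgWriterDelay;              /* bgwriter_delay, in msecs */
extern void pg_usleep(long microsec);   /* backend microsecond sleep */
extern bool EnoughProgressMade(void);   /* hypothetical schedule check */

/* Called periodically from inside BufferSync() and smgrsync().  The
 * new 'immediate' argument lets a shutdown or otherwise urgent
 * checkpoint skip all throttling. */
static void
CheckpointWriteNap(bool immediate)
{
    if (immediate)
        return;                     /* urgent checkpoint: never sleep */

    if (EnoughProgressMade())       /* ahead of schedule: we can sleep */
        pg_usleep(BgWriterDelay * 1000L);
}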

> The nap-functions would call 
> BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed 
> since last call to it.

Only LRU buffers are written in the nap and sync phases in the new patch.
The ALL activity of the bgwriter was primarily designed to write dirty
buffers ahead of checkpoints, so those writes are not needed *in*
checkpoints.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



LDC_v4.patch
Description: Binary data
