On 05.06.2013 23:16, Josh Berkus wrote:
For limiting the time required to recover after a crash,
checkpoint_segments is awkward because it's difficult to calculate how
long recovery will take, given checkpoint_segments=X. A bulk load can
use up segments really fast, and recovery will be fast, while segments
full of random deletions can need a lot of random I/O to replay, and
take a long time. IMO checkpoint_timeout is a much better way to control
that, although it's not perfect either.

This is true, but I don't see that your proposal changes this at all
(for better or for worse).

Right, it doesn't. I explained this to justify that it's OK to replace checkpoint_segments with max_wal_size. If someone is trying to use checkpoint_segments to limit the time required to recover after a crash, he might find the current checkpoint_segments setting more intuitive than my proposed max_wal_size: checkpoint_segments means "perform a checkpoint every X segments", so you know that after a crash you will have to replay at most X segments (except that checkpoint_completion_target already complicates that). With max_wal_size, the relationship is not as clear.
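
To put rough numbers on that (a back-of-the-envelope illustration only, assuming the default 16 MB segment size): with checkpoint_segments = 10 and checkpoint_completion_target = 0.9, a crash just before a spread checkpoint finishes has to replay back to the redo point of the *previous* checkpoint, so roughly

        (1 + checkpoint_completion_target) * checkpoint_segments
        = (1 + 0.9) * 10 = 19 segments, i.e. about 300 MB

rather than the 10 segments (160 MB) the naive reading suggests.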

What I tried to argue is that I don't think that's a serious concern.

I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the max. disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are
still some segments left to consume while the checkpoint runs), a
checkpoint is triggered.
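
To make that trigger concrete, here is a rough sketch of the kind of check I have in mind (the function, variable names, and headroom calculation are made up for illustration, not actual code):

        #include <stdbool.h>
        #include <stdint.h>

        /*
         * Illustrative only: decide whether to start a checkpoint, given how
         * much WAL has been written since the redo point of the last
         * completed checkpoint.  We trigger early enough that the WAL still
         * written while the spread checkpoint runs fits under max_wal_size.
         */
        static bool
        checkpoint_needed(uint64_t wal_since_redo, uint64_t max_wal_size,
                          double completion_target)
        {
                /*
                 * If each cycle consumes C bytes of WAL, a spread checkpoint
                 * finishes roughly completion_target * C bytes into the next
                 * cycle, so WAL on disk peaks near (1 + completion_target) * C.
                 * Solving for C keeps that peak under max_wal_size.
                 */
                uint64_t        trigger = (uint64_t) (max_wal_size / (1.0 + completion_target));

                return wal_since_redo >= trigger;
        }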

Refinement of the proposal:

1. max_wal_size is a hard limit

I'd like to punt on that until later. Making it a hard limit would be a much bigger patch, and it needs a lot of discussion of how it should behave (switch to read-only mode, progressively slow down WAL writes, or what?) and how to implement it.

But I think there's a clear evolution path here; with current checkpoint_segments, it's not sensible to treat that as a hard limit. Once we have something like max_wal_size, defined in MB, it's much more sensible. So turning it into a hard limit could be a follow-up patch, if someone wants to step up to the plate.

2. checkpointing targets 50% of (max_wal_size - wal_keep_segments)
    to avoid lockup if a checkpoint takes longer than expected.

Will also have to factor in checkpoint_completion_target.

Hmm, haven't thought about that. I think a better unit to set
wal_keep_segments in would also be MB, not segments.
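
To put illustrative numbers on the 50% rule above (the values are made up, with wal_keep_segments expressed in MB as suggested): with max_wal_size = 1024 MB and 256 MB worth of wal_keep_segments, checkpoints would be triggered so that WAL usage hovers around

        256 MB + 0.5 * (1024 MB - 256 MB) = 640 MB

leaving the remaining ~384 MB as slack in case a checkpoint takes longer than expected.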

Well, the ideal unit from the user's point of view is *time*, not space.
That is, the user wants the master to keep, say, "8 hours of
transaction logs", not some number of MB.  I don't want to complicate
this proposal by trying to deliver that, though.

OTOH, if you specify it in terms of time, then you don't have any limit on the amount of disk space required.

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop to avoid completely filling the disk if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles.

"based on"; can you give me your algorithmic thinking here?  I'm
thinking we should have some calculation of last cycle size and peak
cycle size so that bursty workloads aren't compromised.

Yeah, something like that :-). I was thinking of letting the estimate decrease like a moving average, but react to any increases immediately. Same thing we do in bgwriter to track buffer allocations:

        /*
         * Track a moving average of recent buffer allocations.  Here, rather than
         * a true average we want a fast-attack, slow-decline behavior: we
         * immediately follow any increase.
         */
        if (smoothed_alloc <= (float) recent_alloc)
                smoothed_alloc = recent_alloc;
        else
                smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
                        smoothing_samples;
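
Transposed to this proposal, it might look something like the sketch below (the variable and function names are hypothetical, just to show the shape of the estimator):

        /*
         * Hypothetical sketch: smooth the number of WAL segments consumed per
         * checkpoint cycle with the same fast-attack, slow-decline behavior as
         * above: jump up immediately on any increase, decay slowly otherwise.
         * The result would drive how many segments to keep preallocated or
         * recycled for the next cycle.
         */
        #define ESTIMATE_SMOOTHING_SAMPLES 16

        static double segments_per_cycle_est = 0;

        static void
        update_segment_estimate(int segments_used_last_cycle)
        {
                if (segments_per_cycle_est <= (double) segments_used_last_cycle)
                        segments_per_cycle_est = segments_used_last_cycle;
                else
                        segments_per_cycle_est +=
                                ((double) segments_used_last_cycle -
                                 segments_per_cycle_est) / ESTIMATE_SMOOTHING_SAMPLES;
        }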


- Heikki

