Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Alvaro Herrera
Gregory Stark wrote:

> I can imagine a scenario where you have a system that's very busy for 60s and
> then idle for 60s repeatedly. And for some reason you configure a
> checkpoint_timeout on the order of 20m or so (assuming you're travelling
> precisely 60mph). 

Is that Scottish m?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Gregory Stark
"Greg Smith" <[EMAIL PROTECTED]> writes:

> If you write them twice, so what? You didn't even get to that point as an
> option until all the important stuff was taken care of and the system was
> near idle.

Well even if it's near idle you were still occupying the i/o system for a few
milliseconds. If someone else came in with a block request at that time you
extended their response time by that much. 

> The elimination of the all-scan background writer means that true hot and
> dirty spots in the buffer cache, like popular index blocks on a heavily
> updated table that never get a zero usage_count, are never going to be
> written out other than as part of the checkpoint process.

If they're really popular blocks on a heavily updated table then writing
them out early doesn't buy us anything.

The case where early writes help us is when the blocks weren't really popular,
but we never got around to writing them out, and then when we do need to
write them out the system is busy. In that case we wasted a chance to write
them out when the system was more idle.

But don't forget that we're still going through the OS's buffers. That will
smooth out a lot of the i/o we generate anyways. Just because Postgres is idle
doesn't mean the OS isn't busy flushing the buffers we wrote out when it was
busy.

> That's OK for now, but I'd like it to be the case that one day the
> database's I/O scheduling would eventually get to those, in order to
> optimize performance in the kind of bursty scenarios I've been mentioning
> lately.

I think the feeling is that the bursty scenarios are really corner cases.
Heikki described a case where you're dirtying tons of blocks without
triggering any WAL. That can happen but it's pretty peculiar.

I can imagine a scenario where you have a system that's very busy for 60s and
then idle for 60s repeatedly. And for some reason you configure a
checkpoint_timeout on the order of 20m or so (assuming you're travelling
precisely 60mph). 

In that scenario bgwriter's lru scan has to fight to keep up for 60s while
it's mostly writing out dirty pages that it could have flushed out during the
idle time. Effectively you're only making use of half the i/o bandwidth since
bgwriter doesn't do any work for half the duty cycle.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Greg Smith

On Tue, 26 Jun 2007, Tom Lane wrote:

> I'm not impressed with the idea of writing buffers because we might need
> them someday; that just costs extra I/O due to re-dirtying in too many
> scenarios.


This is kind of an interesting statement to me because it really 
highlights the difference between how I think about this problem and how 
you see it.  As far as I'm concerned, there's a hierarchy of I/O the 
database needs to finish that goes like this:


1) Client back-end writes (blocked until a buffer appears)
2) LRU writes so (1) doesn't happen
3) Checkpoint writes
4) Dirty pages with a non-zero usage count

In my view of the world, there should be one parameter for a target rate 
of how much I/O you can stand under normal use, and the background writer 
should work its way as far down this chain as it can until it meets that. 
If there's plenty of clean buffers for the expected new allocations and 
there's no checkpoint going on, by all means write out some buffers we 
might re-dirty if there's I/O to spare.  If you write them twice, so what? 
You didn't even get to that point as an option until all the important 
stuff was taken care of and the system was near idle.
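
To make that ordering concrete, here's a minimal standalone sketch (plain C; 
the struct, numbers and function names are all hypothetical, not backend 
code) of spending a single per-cycle I/O budget down that hierarchy until it 
runs out:

/*
 * Illustrative sketch only -- not PostgreSQL code, all names hypothetical.
 * It just shows one per-cycle I/O budget being spent top-down through the
 * tiers above: LRU cleaning first so tier 1 never blocks, then checkpoint
 * writes, then the "might need it someday" dirty buffers last.
 */
#include <stdio.h>

typedef struct
{
    int lru_dirty;          /* dirty buffers just ahead of the clock hand */
    int checkpoint_dirty;   /* dirty buffers the checkpoint must flush */
    int other_dirty;        /* dirty buffers with usage_count > 0 */
} DirtyCounts;

static int
spend(int *pending, int budget)
{
    int n = (*pending < budget) ? *pending : budget;

    *pending -= n;
    return budget - n;      /* budget left over for the next tier */
}

int
main(void)
{
    DirtyCounts d = {5, 40, 200};
    int budget = 100;       /* "how much I/O you can stand" per cycle */

    budget = spend(&d.lru_dirty, budget);           /* tier 2 */
    budget = spend(&d.checkpoint_dirty, budget);    /* tier 3 */
    budget = spend(&d.other_dirty, budget);         /* tier 4, only if idle enough */

    printf("left over: %d, still dirty: %d/%d/%d\n",
           budget, d.lru_dirty, d.checkpoint_dirty, d.other_dirty);
    return 0;
}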


The elimination of the all-scan background writer means that true hot and 
dirty spots in the buffer cache, like popular index blocks on a heavily 
updated table that never get a zero usage_count, are never going to be 
written out other than as part of the checkpoint process.  That's OK for 
now, but I'd like it to be the case that one day the database's I/O 
scheduling would eventually get to those, in order to optimize performance 
in the kind of bursty scenarios I've been mentioning lately.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Greg Smith

On Tue, 26 Jun 2007, Tom Lane wrote:


> I have no doubt that there are scenarios such as you are thinking about,
> but it definitely seems like a corner case that doesn't justify keeping
> the all-buffers scan.  That scan is costing us extra I/O in ordinary
> non-corner cases, so it's not free to keep it.


And scenarios I'm concerned about but can't diagram as easily fall into 
this category as well.  I agree that an LDC-enabled config would ship with 
the all-buffers scan turned off as redundant and wasteful, in which case the 
only cost of keeping it is code baggage.  But the fact that there are corner 
cases floating around this area is what makes me feel that removing it 
altogether is still a bit premature.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> To recap, the sequence is:

> 1. COPY FROM
> 2. checkpoint
> 3. VACUUM

> Now you have buffer cache full of dirty buffers with usage_count=1,

Well, it won't be very full, because VACUUM works in a limited number of
buffers (and did even before the BufferAccessStrategy patch).

I have no doubt that there are scenarios such as you are thinking about,
but it definitely seems like a corner case that doesn't justify keeping
the all-buffers scan.  That scan is costing us extra I/O in ordinary
non-corner cases, so it's not free to keep it.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Greg Smith

On Mon, 25 Jun 2007, Tom Lane wrote:

> right now, BgBufferSync starts over from the current clock-sweep point 
> on each call --- that is, each bgwriter cycle.  So it can't really be 
> made to write very many buffers without excessive CPU work.  Maybe we 
> should redefine it to have some static state carried across bgwriter 
> cycles


The LRU portion restarts like that, and it's certainly not optimal.  But 
the auto-tuning LRU patch that's already near application makes this much 
less of an issue because it only does real work when buffers have been 
allocated, so the sweep point will have moved along.  I'll add this idea 
to my list of things that would be nice to have as part of a larger 
rewrite; I think it's more trouble than it's worth to chase right now.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Heikki Linnakangas

Tom Lane wrote:

> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Who's "we"?  AFAICS, CVS HEAD will treat a large copy the same as any
>>> other large heapscan.
>>
>> Umm, I'm talking about populating a table with COPY *FROM*. That's not a 
>> heap scan at all.
>
> No wonder we're failing to communicate.  I assumed you were talking
> about copy-to-file.  Copy-from-file is going to be creating WAL entries
> hence the no-checkpoint case doesn't apply anyway, no?


We are indeed having trouble communicating :(.

No, I'm not talking about the new non-WAL-logged COPY optimization. COPY 
FROM *would* create WAL entries, and the next checkpoint would clean 
them. So far so good. But then you run VACUUM, as you often do after 
loading a table, to set all hint bits. That will *not* generate WAL, and 
next checkpoint is skipped.


To recap, the sequence is:

1. COPY FROM
2. checkpoint
3. VACUUM

Now you have buffer cache full of dirty buffers with usage_count=1, and 
no WAL activity since last checkpoint. They will not be flushed until:

a) WAL activity happens and next checkpoint comes
b) database is shut down, or manual CHECKPOINT is issued
c) clock hand advances and decrements the usage_counts

It's a corner case. Probably not a problem in practice; you usually run 
CREATE INDEX after loading a table, for example. But it exists.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Who's "we"?  AFAICS, CVS HEAD will treat a large copy the same as any
>> other large heapscan.

> Umm, I'm talking about populating a table with COPY *FROM*. That's not a 
> heap scan at all.

No wonder we're failing to communicate.  I assumed you were talking
about copy-to-file.  Copy-from-file is going to be creating WAL entries
hence the no-checkpoint case doesn't apply anyway, no?

[ thinks ... ]  Oh, you must be positing the case where the recently
added skip-WAL-if-table-is-new-in-this-transaction optimization applies.
Well, that thing could probably do with some more work anyway (I wonder
why it's using shared buffers at all anymore).  I don't really think
that case should be allowed to drive our thinking about how the bgwriter
should work.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Heikki Linnakangas

Tom Lane wrote:

> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> (Note that COPY per se will not trigger this behavior anyway, since it
>>> will act in a limited number of buffers because of the recent buffer
>>> access strategy patch.)
>>
>> Actually we dropped it from COPY, because it didn't seem to improve 
>> performance in the tests we ran.
>
> Who's "we"?  AFAICS, CVS HEAD will treat a large copy the same as any
> other large heapscan.


Umm, I'm talking about populating a table with COPY *FROM*. That's not a 
heap scan at all.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> (Note that COPY per se will not trigger this behavior anyway, since it
>> will act in a limited number of buffers because of the recent buffer
>> access strategy patch.)

> Actually we dropped it from COPY, because it didn't seem to improve 
> performance in the tests we ran.

Who's "we"?  AFAICS, CVS HEAD will treat a large copy the same as any
other large heapscan.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Heikki Linnakangas

Tom Lane wrote:

> (Note that COPY per se will not trigger this behavior anyway, since it
> will act in a limited number of buffers because of the recent buffer
> access strategy patch.)


Actually we dropped it from COPY, because it didn't seem to improve 
performance in the tests we ran.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> This argument supposes that the bgwriter will do nothing while the COPY
>> is proceeding.

> It will clean buffers ahead of the COPY, but it won't write the buffers 
> COPY leaves behind since they have usage_count=1.

Yeah, and they don't *need* to be written until the clock sweep has
passed over them once.  I'm not impressed with the idea of writing
buffers because we might need them someday; that just costs extra
I/O due to re-dirtying in too many scenarios.

(Note that COPY per se will not trigger this behavior anyway, since it
will act in a limited number of buffers because of the recent buffer
access strategy patch.)

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Heikki Linnakangas

Tom Lane wrote:

> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> One pathological case is a COPY of a table slightly smaller than 
>> shared_buffers. That will fill the buffer cache. If you then have a 
>> checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer 
>> cache will be full of pages with just hint-bit-updates, but no WAL 
>> activity since last checkpoint.
>
> This argument supposes that the bgwriter will do nothing while the COPY
> is proceeding.


It will clean buffers ahead of the COPY, but it won't write the buffers 
COPY leaves behind since they have usage_count=1.


Let me demonstrate this with an imaginary example with shared_buffers=4:

buf_id  usage_count dirty
1   0   f
2   0   f
3   0   f
4   0   f

After COPY

buf_id  usage_count dirty
1   1   t
2   1   t
3   1   t
4   1   t

CHECKPOINT:

buf_id  usage_count dirty
1   1   f
2   1   f
3   1   f
4   1   f

VACUUM:

buf_id  usage_count dirty
1   1   t
2   1   t
3   1   t
4   1   t

As soon as a backend asks for a buffer, the situation is defused as the 
backend will do a full clock sweep and decrement the usage_count of each 
buffer to 0, letting the bgwriter lru-scan clean them.
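
To illustrate, here's a tiny standalone clock-sweep sketch (plain C, 
hypothetical names; the real logic lives in StrategyGetBuffer) showing how a 
single victim search knocks every usage_count from 1 down to 0, at which 
point the buffers become candidates the LRU scan may clean:

/*
 * Illustrative clock-sweep sketch, plain C with hypothetical names.
 * Every failed visit decrements usage_count, so one trip around the ring
 * turns the usage_count=1 buffers from the example above into
 * usage_count=0 candidates that the LRU scan is allowed to clean.
 */
#include <stdio.h>

#define NBUFFERS 4

static int usage_count[NBUFFERS] = {1, 1, 1, 1};
static int hand = 0;

static int
sweep_for_victim(void)
{
    for (;;)
    {
        int buf = hand;

        hand = (hand + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;             /* found a replacement candidate */
        usage_count[buf]--;         /* give it one more trip around */
    }
}

int
main(void)
{
    printf("victim: buffer %d\n", sweep_for_victim());
    printf("usage counts now: %d %d %d %d\n",
           usage_count[0], usage_count[1], usage_count[2], usage_count[3]);
    return 0;
}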


Having the buffer cache full of dirty buffers is not a problem on its 
own, so this only becomes a performance issue if you then issue another 
large COPY etc. that needs those buffers, and you now have to write them 
at the busy time.


This is a corner case that might not be worth worrying about. It's also 
mitigated by the fact that the OS cache is most likely clean after a 
period of idle time, and should be able to absorb the write burst.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> ... that's what the LRU scan is for.

> Yeah, except the LRU scan is not doing a very good job at that. It will 
> ignore buffers with usage_count > 0, and it only scans 
> bgwriter_lru_percent buffers ahead of the clock hand.

Which is exactly in sync with which buffers are actually candidates for
replacement.  It could be looking further ahead of the clock hand, as
per my previous suggestion, but there is no need to push out buffers
with usage_count > 0 until after the sweep has decremented that to 0.
Doing so would tend to create excess I/O --- it makes it more likely
that multiple page-dirtying events on a page will result in separate
writes instead of a single write.

> One pathological case is a COPY of a table slightly smaller than 
> shared_buffers. That will fill the buffer cache. If you then have a 
> checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer 
> cache will be full of pages with just hint-bit-updates, but no WAL 
> activity since last checkpoint.

This argument supposes that the bgwriter will do nothing while the COPY
is proceeding.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Heikki Linnakangas

Tom Lane wrote:

> Anyway, if there are no XLOG records since the last checkpoint, there's
> probably nothing in shared buffers that needs flushing.  There might be
> some dirty hint-bits, but the only reason to push those out is to make
> some free buffers available, and doing that is not checkpoint's job (nor
> the all-buffers scan's job); that's what the LRU scan is for.


Yeah, except the LRU scan is not doing a very good job at that. It will 
ignore buffers with usage_count > 0, and it only scans 
bgwriter_lru_percent buffers ahead of the clock hand.
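
As an illustration of that limitation, here's a small standalone sketch 
(plain C, hypothetical names, not the real bgwriter code): the hand never 
moves, every buffer is dirty but recently used, and the scan cleans nothing 
however many times it runs:

/*
 * Sketch of the limitation described above.  The LRU scan only inspects a
 * fixed percentage of buffers just ahead of the clock hand and skips
 * anything with usage_count > 0, so while the hand is idle it keeps
 * rescanning the same recently-used buffers and writes nothing.
 */
#include <stdio.h>

#define NBUFFERS 100

static int usage_count[NBUFFERS];
static int is_dirty[NBUFFERS];
static int clock_hand = 0;

static int
lru_scan(double lru_percent)
{
    int to_scan = (int) (NBUFFERS * lru_percent / 100.0);
    int written = 0;

    for (int i = 0; i < to_scan; i++)
    {
        int buf = (clock_hand + i) % NBUFFERS;

        if (usage_count[buf] == 0 && is_dirty[buf])
        {
            is_dirty[buf] = 0;      /* pretend we wrote it out */
            written++;
        }
    }
    return written;
}

int
main(void)
{
    /* buffer cache left full of dirty, recently-used pages after a burst */
    for (int i = 0; i < NBUFFERS; i++)
    {
        usage_count[i] = 1;
        is_dirty[i] = 1;
    }
    printf("cleaned: %d\n", lru_scan(5.0));     /* 0 */
    printf("cleaned: %d\n", lru_scan(5.0));     /* still 0 */
    return 0;
}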


One pathological case is a COPY of a table slightly smaller than 
shared_buffers. That will fill the buffer cache. If you then have a 
checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer 
cache will be full of pages with just hint-bit-updates, but no WAL 
activity since last checkpoint.


But let's fix the LRU scan rather than work around its deficiencies.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Hmm.  But if we're going to do that, we might as well have a checkpoint
>> for our troubles, no?  The reason for the current design is the
>> assumption that a bgwriter_all scan is less burdensome than a
>> checkpoint, but that is no longer true given this rewrite.

> Per comments in CreateCheckPoint, another reason is to skip the extra 
> checkpoints, to avoid writing two checkpoint records to the same page and 
> risking losing both on a crash:

That's not the reason for the current design --- that logic existed long
before bgwriter did.

Anyway, if there are no XLOG records since the last checkpoint, there's
probably nothing in shared buffers that needs flushing.  There might be
some dirty hint-bits, but the only reason to push those out is to make
some free buffers available, and doing that is not checkpoint's job (nor
the all-buffers scan's job); that's what the LRU scan is for.

The only reason that the all-buffers scan was put in was to try to make
the next checkpoint cheaper.  That reason falls to the ground with LDC.
What we have left is merely excess I/O.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Heikki Linnakangas

Tom Lane wrote:

> Greg Smith <[EMAIL PROTECTED]> writes:
>> The way transitions between completely idle and all-out bursts happen were 
>> one problematic area I struggled with.  Since the LRU point doesn't move 
>> during the idle parts, and the lingering buffers have a usage_count>0, the 
>> LRU scan won't touch them; the only way to clear out a bunch of dirty 
>> buffers leftover from the last burst is with the all-scan.
>
> One thing that might be worth changing is that right now, BgBufferSync
> starts over from the current clock-sweep point on each call --- that is,
> each bgwriter cycle.  So it can't really be made to write very many
> buffers without excessive CPU work.  Maybe we should redefine it to have
> some static state carried across bgwriter cycles, such that it would
> write at most N dirty buffers per call, but scan through X percent of
> the buffers, possibly across several calls, before returning to the (by
> now probably advanced) clock-sweep point.  This would allow a larger
> value of X to be used than is currently practical.  You might wish to
> recheck the clock sweep point on each iteration just to make sure the
> scan hasn't fallen behind it, but otherwise I don't see any downside.
> The scenario where somebody re-dirties a buffer that was cleaned by the
> bgwriter scan isn't a problem, because that buffer will also have had its
> usage_count increased and thereby not be a candidate for replacement.


Something along those lines could be useful. I've thought of that 
before, but it never occurred to me that if a page in front of the clock 
hand is re-dirtied, it's no longer a candidate for replacement anyway...


I'm going to leave the all- and lru- bgwriter scans alone for now to get 
this LDC patch finished. We still have the bgwriter autotuning patch in 
the queue. Let's think about this in the context of that patch.


>> As a general comment on this subject, a lot of the work in LDC presumes 
>> you have an accurate notion of how close the next checkpoint is.
>
> Yeah; this is one reason I was interested in carrying some write-speed
> state across checkpoints instead of having the calculation start from
> scratch each time.  That wouldn't help systems that sit idle a long time
> and suddenly go nuts, but it seems to me that smoothing the write rate
> across more than one checkpoint could help if the fluctuations occur
> over a timescale of a few checkpoints.


Hmm. This problem only applies to checkpoints triggered by 
checkpoint_segments; time tends to move forward at a constant rate.


I'm not worried about small fluctuations or bursts. As I argued earlier, 
the OS still buffers the writes and should give some extra smoothing of 
the physical writes. I believe bursts of say 50-100 MB would easily fit 
in OS cache, as long as there's enough gap between them. I haven't 
tested that, though.


Here's a proposal for an algorithm to smooth bigger bursts:

The basic design is the same as before. We keep track of elapsed time 
and elapsed xlogs, and based on them we estimate how much progress in 
flushing the buffers we should've made by now, and then we catch up 
until we reach that. The estimate for time is the same. The estimate for 
xlogs gets more complicated:


Let's have a few definitions first:

Ro = elapsed segments / elapsed time, from previous checkpoint cycle. 
For example, 1.25 means that the checkpoint was triggered by 
checkpoint_segments, and we had spent 1/1.25 =  80% of 
checkpoint_timeout when we reached checkpoint_segments. 0.25 would mean 
that checkpoint was triggered by checkpoint_timeout, and we had spent 
25% of checkpoint_segments by then.


Rn = elapsed segments / elapsed time this far from current in-progress 
checkpoint.


t = elapsed time, as a fraction of checkpoint_timeout (0.0 - 1.0, though 
could be > 1 if next checkpoint is already due)
s = elapsed xlog segments, as a fraction of checkpoint_segments (0.0 - 
1.0, though could again be > 1 if next checkpoint is already due)


R = estimate for WAL segment consumption rate, as checkpoint_segments / 
checkpoint_timeout


R = Ro * (1 - t) + Rn * t

In other words, at the beginning of the checkpoint, we give more weight 
to the state carried over from previous checkpoint. As we go forward, 
more weight is given to the rate calculated from current cycle.


From R, we extrapolate how much progress we should've done by now:

Target progress = R * t

This would require saving just one number from previous cycle (Rn), and 
there is no requirement to call the estimation function at steady time 
intervals, for example. It gives pretty steady I/O rate even if there's 
big bursts in WAL activity, but still reacts to changes in the rate.
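
A minimal sketch of that estimate (plain C, hypothetical names; just the 
arithmetic described above, not patch code):

/*
 * Minimal sketch of the estimate above.  Ro is segments/time from the
 * previous cycle, Rn the same ratio measured so far in the current one,
 * t the elapsed fraction of checkpoint_timeout.
 */
#include <stdio.h>

static double
target_progress(double Ro, double Rn, double t)
{
    /* early in the cycle the carried-over rate dominates; later the
     * rate measured in the current cycle takes over */
    double R = Ro * (1.0 - t) + Rn * t;

    return R * t;   /* fraction of the flush work we should have done by now */
}

int
main(void)
{
    double Ro = 1.25;   /* previous cycle hit checkpoint_segments at 80% of timeout */
    double Rn = 0.10;   /* current cycle has been nearly idle so far */

    for (double t = 0.25; t <= 1.0; t += 0.25)
        printf("t=%.2f  target progress=%.3f\n", t, target_progress(Ro, Rn, t));
    return 0;
}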


I'm not convinced this is worth the effort, though. First of all, this 
is only a problem if you use checkpoint_segments to control your 
checkpoints, so you can lower checkpoint_timeout to do more work during 
the idle periods. Secondly, with the optimization of not flushing 
buffers during checkpoint that were dirtied after the start of 

Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-26 Thread Heikki Linnakangas

Tom Lane wrote:

> Hmm.  But if we're going to do that, we might as well have a checkpoint
> for our troubles, no?  The reason for the current design is the
> assumption that a bgwriter_all scan is less burdensome than a
> checkpoint, but that is no longer true given this rewrite.


Per comments in CreateCheckPoint, another reason is to skip the extra 
checkpoints, to avoid writing two checkpoint records to the same page and 
risking losing both on a crash:



 * If this isn't a shutdown or forced checkpoint, and we have not inserted
 * any XLOG records since the start of the last checkpoint, skip the
 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
 * when the system is idle. That wastes log space, and more importantly it
 * exposes us to possible loss of both current and previous checkpoint
 * records if the machine crashes just as we're writing the update.
 * (Perhaps it'd make even more sense to checkpoint only when the previous
 * checkpoint record is in a different xlog page?)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Tom Lane
Greg Smith <[EMAIL PROTECTED]> writes:
> The way transitions between completely idle and all-out bursts happen were 
> one problematic area I struggled with.  Since the LRU point doesn't move 
> during the idle parts, and the lingering buffers have a usage_count>0, the 
> LRU scan won't touch them; the only way to clear out a bunch of dirty 
> buffers leftover from the last burst is with the all-scan.

One thing that might be worth changing is that right now, BgBufferSync
starts over from the current clock-sweep point on each call --- that is,
each bgwriter cycle.  So it can't really be made to write very many
buffers without excessive CPU work.  Maybe we should redefine it to have
some static state carried across bgwriter cycles, such that it would
write at most N dirty buffers per call, but scan through X percent of
the buffers, possibly across several calls, before returning to the (by
now probably advanced) clock-sweep point.  This would allow a larger
value of X to be used than is currently practical.  You might wish to
recheck the clock sweep point on each iteration just to make sure the
scan hasn't fallen behind it, but otherwise I don't see any downside.
The scenario where somebody re-dirties a buffer that was cleaned by the
bgwriter scan isn't a problem, because that buffer will also have had its
usage_count increased and thereby not be a candidate for replacement.
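
Here's a rough standalone sketch of that idea (plain C, hypothetical names 
and toy data, not the actual BgBufferSync): the scan position survives 
across calls as static state, each call writes at most a fixed number of 
buffers, and the scan jumps forward if the clock sweep has overtaken it:

/*
 * Rough sketch only.  One long scan is shared across several bgwriter
 * cycles; each cycle writes at most max_writes buffers, and the scan
 * restarts from the sweep point whenever it has fallen behind it.
 */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 16

static bool dirty[NBUFFERS];
static long sweep_point = 0;    /* monotonically advancing clock-sweep position */
static long scan_pos = 0;       /* bgwriter scan position, kept across cycles */

static void
bgwriter_cycle(int max_writes, int scan_quota)
{
    int written = 0;

    if (scan_pos < sweep_point)
        scan_pos = sweep_point;     /* scan fell behind the sweep: catch up */

    while (scan_quota-- > 0 && written < max_writes)
    {
        int buf = (int) (scan_pos % NBUFFERS);

        if (dirty[buf])
        {
            dirty[buf] = false;     /* pretend we flushed it */
            written++;
        }
        scan_pos++;
    }
    printf("cycle wrote %d buffer(s), scan position now %ld\n", written, scan_pos);
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        dirty[i] = (i % 2 == 0);    /* every other buffer starts dirty */

    bgwriter_cycle(3, 8);           /* one long scan is shared... */
    sweep_point = 9;                /* ...even after the sweep advances */
    bgwriter_cycle(3, 8);
    return 0;
}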

> As a general comment on this subject, a lot of the work in LDC presumes 
> you have an accurate notion of how close the next checkpoint is.

Yeah; this is one reason I was interested in carrying some write-speed
state across checkpoints instead of having the calculation start from
scratch each time.  That wouldn't help systems that sit idle a long time
and suddenly go nuts, but it seems to me that smoothing the write rate
across more than one checkpoint could help if the fluctuations occur
over a timescale of a few checkpoints.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Greg Smith

On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> Please describe the class of transactions and the service guarantees so 
> that we can reproduce that, and figure out what's the best solution.


I'm confident you're already moving in that direction by noticing how the 
90th percentile numbers were kind of weird with your 150 warehouse DBT2 
tests, and I already mentioned how that could be usefully fleshed out by 
more tests during beta.  That number is the kind of service guarantee I'm 
talking about--if before 90% of transactions were <4.5ms, but now that 
number is closer to 6ms, that could be considered worse performance by 
some service metrics even if the average and worst-case performance were 
improved.


The only thing I can think of if you wanted to make the problem more like 
what I was seeing would be switching the transaction mix on that around to 
do more UPDATEs relative to the other types of transactions; having more 
of those seemed to aggrevate my LDC-related issues because they leave a 
different pattern of dirty+used buffers around than other operations.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Greg Smith

On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> It only scans bgwriter_lru_percent buffers ahead of the clock hand. If the 
> hand isn't moving, it keeps scanning the same buffers over and over again. 
> You can crank it all the way up to 100%, though, in which case it would work, 
> but that starts to get expensive CPU-wise.


In addition to being a CPU pig, that still won't necessarily work because of 
the way the LRU writer ignores things with a non-zero usage_count.  If 
it's dirty, it's probably been used recently as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Greg Smith

On Mon, 25 Jun 2007, Heikki Linnakangas wrote:

> Greg, is this the kind of workload you're having, or is there some other 
> scenario you're worried about?


The way transitions between completely idle and all-out bursts happen were 
one problematic area I struggled with.  Since the LRU point doesn't move 
during the idle parts, and the lingering buffers have a usage_count>0, the 
LRU scan won't touch them; the only way to clear out a bunch of dirty 
buffers leftover from the last burst is with the all-scan.  Ideally, you 
want those to write during idle periods so you're completely clean when 
the next burst comes.  My plan for the code I wanted to put into 8.4 one 
day was to have something like the current all-scan that defers to the LRU 
and checkpoint, such that if neither of them are doing anything it would 
go searching for buffers it might blow out.  Because the all-scan mainly 
gets in the way under heavy load right now I've only found mild settings 
helpful, but if it had a bit more information about what else was going on 
it could run much harder during slow spots.  That's sort of the next stage 
to the auto-tuning LRU writer code in the grand design floating through my 
head.


As a general comment on this subject, a lot of the work in LDC presumes 
you have an accurate notion of how close the next checkpoint is.  On 
systems that can dirty buffers and write WAL really fast, I've found hyper 
bursty workloads are a challenge for it to cope with.  You can go from 
thinking you have all sorts of time to stream the data out to discovering 
the next checkpoint is coming up fast in only seconds.  In that situation, 
you'd have been better off had you been writing faster during the period 
preceding the burst when the code thought it should be "smooth"[1]. 
That falls into the category of things I haven't found a good way for 
other people to test (I happened to have an internal bursty app that 
aggravated this area to use).


[1] This is actually a reference to "Yacht Rock", one of my favorite web 
sites:  http://www.channel101.com/shows/show.php?show_id=152


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>>> If you have a system with a very bursty transaction rate, it's possible 
>>> that when it's time for a checkpoint, there hasn't been any WAL logged 
>>> activity since last checkpoint, so we skip it. When that happens, the 
>>> buffer cache might still be full of dirty pages, because of hint bit 
>>> updates. That still isn't a problem on it's own, but if you then do a 
>>> huge batch update, you have to flush those dirty pages at that point. It 
>>> would be better to trickle flush those dirty pages during the idle period.
>> 
>> But wouldn't the LRU-based scan accomplish that?

> It only scans bgwriter_lru_percent buffers ahead of the clock hand. If 
> the hand isn't moving, it keeps scanning the same buffers over and over 
> again. You can crank it all the way up to 100%, though, in which case it 
> would work, but that starts to get expensive CPU-wise.

Hmm.  But if we're going to do that, we might as well have a checkpoint
for our troubles, no?  The reason for the current design is the
assumption that a bgwriter_all scan is less burdensome than a
checkpoint, but that is no longer true given this rewrite.  AFAICS all
the bgwriter_all scan will accomplish is induce extra I/O in most
scenarios.  And it's certainly extra code in an area that's already been
rendered pretty darn baroque by this patch.

Also, the design assumption here is that the bgwriter_lru scan is
supposed to keep enough buffers clean to satisfy the system's need for
new buffers.  If it's not able to do so, it's not apparent to me that
the bgwriter_all scan helps --- the latter is most likely flushing the
wrong buffers, ie, those not close in front of the clock sweep.  You'd
be well advised to increase lru_percent anyway to keep more buffers
clean, if this scenario is worth optimizing rather than others for your
usage.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Heikki Linnakangas

Tom Lane wrote:

> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> On further thought, there is one workload where removing the non-LRU 
>> part would be counterproductive:
>>
>> If you have a system with a very bursty transaction rate, it's possible 
>> that when it's time for a checkpoint, there hasn't been any WAL logged 
>> activity since last checkpoint, so we skip it. When that happens, the 
>> buffer cache might still be full of dirty pages, because of hint bit 
>> updates. That still isn't a problem on it's own, but if you then do a 
>> huge batch update, you have to flush those dirty pages at that point. It 
>> would be better to trickle flush those dirty pages during the idle period.
>
> But wouldn't the LRU-based scan accomplish that?


It only scans bgwriter_lru_percent buffers ahead of the clock hand. If 
the hand isn't moving, it keeps scanning the same buffers over and over 
again. You can crank it all the way up to 100%, though, in which case it 
would work, but that starts to get expensive CPU-wise.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> On further thought, there is one workload where removing the non-LRU 
> part would be counterproductive:

> If you have a system with a very bursty transaction rate, it's possible 
> that when it's time for a checkpoint, there hasn't been any WAL logged 
> activity since last checkpoint, so we skip it. When that happens, the 
> buffer cache might still be full of dirty pages, because of hint bit 
> updates. That still isn't a problem on it's own, but if you then do a 
> huge batch update, you have to flush those dirty pages at that point. It 
> would be better to trickle flush those dirty pages during the idle period.

But wouldn't the LRU-based scan accomplish that?

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Heikki Linnakangas

Tom Lane wrote:

> I agree with removing the non-LRU
> part of the bgwriter's write logic though; that should simplify matters
> a bit and cut down the overall I/O volume.


On further thought, there is one workload where removing the non-LRU 
part would be counterproductive:


If you have a system with a very bursty transaction rate, it's possible 
that when it's time for a checkpoint, there hasn't been any WAL logged 
activity since last checkpoint, so we skip it. When that happens, the 
buffer cache might still be full of dirty pages, because of hint bit 
updates. That still isn't a problem on it's own, but if you then do a 
huge batch update, you have to flush those dirty pages at that point. It 
would be better to trickle flush those dirty pages during the idle period.


So we might still want to keep the non-LRU scan. Or alternatively, when 
the checkpoint is a no-op, we call BufferSync nevertheless. That's 
effectively the same thing, except that BufferSync would be controlled 
by the logic to estimate the deadline until next checkpoint, instead of 
the current bgwriter_all_*-settings.


Greg, is this the kind of workload you're having, or is there some other 
scenario you're worried about?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Simon Riggs
On Mon, 2007-06-25 at 12:56 +0200, Magnus Hagander wrote:

> Didn't we already add other features that make recovery much *faster* than
> before? In that case, are they fast enough to neutralise this increased
> time (a guesstimate, of course)?
> 
> Or did I mess that up with stuff we added for 8.2? :-)

Only with full_page_writes on, but yes you're right, the predicted
recovery times are changing for 8.3 anyway.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Heikki Linnakangas

Magnus Hagander wrote:

> On Mon, Jun 25, 2007 at 10:15:07AM +0100, Simon Riggs wrote:
>> As you say, we can put comments in the release notes to advise people of
>> 50% increase in recovery time if the parameters stay the same. That
>> would be balanced by the comment that checkpoints are now considerably
>> smoother than before and more frequent checkpoints are unlikely to be a
>> problem, so it is OK to reduce the parameters from the settings you used
>> in previous releases.
>
> Didn't we already add other features that make recovery much *faster* than
> before? 

Yes, we no longer read in a page just to overwrite it with a full page 
image.

> In that case, are they fast enough to neutralise this increased
> time (a guesstimate, of course)?


It depends a lot on your workload. Under the right circumstances, it 
will more than offset any increase caused by LDC, but sometimes it won't 
make any difference at all (for example with full_page_writes=off).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Magnus Hagander
On Mon, Jun 25, 2007 at 10:15:07AM +0100, Simon Riggs wrote:
> On Mon, 2007-06-25 at 01:33 -0400, Greg Smith wrote:
> > On Sun, 24 Jun 2007, Simon Riggs wrote:
> > 
> > > Greg can't choose to use checkpoint_segments as the limit and then 
> > > complain about unbounded recovery time, because that was clearly a 
> > > conscious choice.
> > 
> > I'm complaining 
> 
> I apologise for using that emotive phrase.
> 
> > only because everyone seems content to wander in a 
> > direction where the multiplier on checkpoint_segments for how many 
> > segments are actually active at once will go up considerably, which can 
> > make a known problem (recovery time) worse. 
> 
> +50% more. Recovery time is a consideration that can be adjusted for. We
> have done nothing to make recovery rate worse; the additional WAL leads
> to an increased recovery time *only* if you keep the same parameter
> settings. There is no reason to keep them the same, nor do we promise
> that parameters will keep the exact meaning they had previous releases.
> 
> As you say, we can put comments in the release notes to advise people of
> 50% increase in recovery time if the parameters stay the same. That
> would be balanced by the comment that checkpoints are now considerably
> smoother than before and more frequent checkpoints are unlikely to be a
> problem, so it is OK to reduce the parameters from the settings you used
> in previous releases.

Didn't we already add other features that make recovery much *faster* than
before? In that case, are they fast enough to neutralise this increased
time (a guesstimate, of course)?

Or did I mess that up with stuff we added for 8.2? :-)

//Magnus




Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Simon Riggs
On Mon, 2007-06-25 at 01:33 -0400, Greg Smith wrote:
> On Sun, 24 Jun 2007, Simon Riggs wrote:
> 
> > Greg can't choose to use checkpoint_segments as the limit and then 
> > complain about unbounded recovery time, because that was clearly a 
> > conscious choice.
> 
> I'm complaining 

I apologise for using that emotive phrase.

> only because everyone seems content to wander in a 
> direction where the multiplier on checkpoint_segments for how many 
> segments are actually active at once will go up considerably, which can 
> make a known problem (recovery time) worse. 

+50% more. Recovery time is a consideration that can be adjusted for. We
have done nothing to make recovery rate worse; the additional WAL leads
to an increased recovery time *only* if you keep the same parameter
settings. There is no reason to keep them the same, nor do we promise
that parameters will keep the exact meaning they had previous releases.

As you say, we can put comments in the release notes to advise people of
50% increase in recovery time if the parameters stay the same. That
would be balanced by the comment that checkpoints are now considerably
smoother than before and more frequent checkpoints are unlikely to be a
problem, so it is OK to reduce the parameters from the settings you used
in previous releases.

I don't see any reason why we would want unsmooth checkpoints; similarly
we don't want unsmooth bg writing, unsmooth archiving etc.. Performance
will increase, response times will drop and people will be happy.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-25 Thread Heikki Linnakangas

Greg Smith wrote:
> LDC certainly makes things better in almost every case.  My "allegiance" 
> comes from having seen a class of transactions where LDC made things 
> worse on a fast/overloaded system, in that it made some types of service 
> guarantees harder to meet, and I just don't know who else might run into 
> problems in that area.


Please describe the class of transactions and the service guarantees so 
that we can reproduce that, and figure out what's the best solution.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Greg Smith

On Mon, 25 Jun 2007, Tom Lane wrote:

> I'm not sure why you hold such strong allegiance to the status quo.  We 
> know that the status quo isn't working very well.


Don't get me wrong here; I am a big fan of this patch, think it's an 
important step forward, and it's exactly the fact that I'm so shell 
shocked from abuse by the status quo that I'm still mixed up in this mess 
(I really should be ignoring the lot of you and writing new code instead).


LDC certainly makes things better in almost every case.  My "allegiance" 
comes from having seen a class of transactions where LDC made things worse 
on a fast/overloaded system, in that it made some types of service 
guarantees harder to meet, and I just don't know who else might run into 
problems in that area.  I'm worried that if it's not adjustable, you're 
introducing a risk that you'll take a step backward for some of this 
code's users, and that will be hard to undo given the way releases are 
structured here.


I spent some time trading stocks for a living.  There are sometimes 
situations you can get into there where there is a tiny chance that 
something very bad can happen with a trade, and many people get wiped out 
by such things.  If it's possible in that situation to remove that risk 
with something inexpensive, you do it, even though the net expected value 
of the change might be slightly negative.  This seems like such a 
situation to me.  If it's possible to take away the risk of other people 
running into an unforseen problem with the LDC patch just by keeping a 
knob that's already there, unless that's an expensive operation my opinion 
is that you should pick a good default but not remove it yet.


> And if you think that the current code had enormous amounts of testing 
> before it went in, I've got to disillusion you :-(


It's having been on the painful receiving end of that fact that makes me 
so paranoid now :)


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Greg Smith

On Sun, 24 Jun 2007, Simon Riggs wrote:

> Greg can't choose to use checkpoint_segments as the limit and then 
> complain about unbounded recovery time, because that was clearly a 
> conscious choice.


I'm complaining only because everyone seems content to wander in a 
direction where the multiplier on checkpoint_segments for how many 
segments are actually active at once will go up considerably, which can 
make a known problem (recovery time) worse.  I'm thinking about things 
like how the release notes for 8.3 are going to address this.  Is the plan 
to say something like "whatever you set checkpoint_segments to before so 
you were happy with the crash recovery time, make sure to re-run all those 
tests again because we made a big change there, it may take a lot longer 
now with the same value, and too bad if you don't like it because there's 
no way to get the old behavior back.  Suck it up, decrease 
checkpoint_segments, and get used to having more checkpoints if you have 
to keep the same recovery time".


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Tom Lane
Greg Smith <[EMAIL PROTECTED]> writes:
> I am not a fan of introducing a replacement feature based on what I 
> consider too limited testing, and I don't feel this one has been beat on 
> long yet enough to start pruning features that would allow better backward 
> compatibility/transitioning.  I think that's introducing an unnecessary 
> risk to the design.

While it's certainly true that we could do with wider testing of this
patch, I'm not sure why you hold such strong allegiance to the status
quo.  We know that the status quo isn't working very well.  And if you
think that the current code had enormous amounts of testing before it
went in, I've got to disillusion you :-(

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Greg Smith

On Sun, 24 Jun 2007, Simon Riggs wrote:


> I can't see why anyone would want to turn off smoothing: If they are
> doing many writes, then they will be affected by the sharp dive at
> checkpoint, which happens *every* checkpoint.


There are service-level agreement situations where a short and sharp 
disruption is more acceptable than a more prolonged one.  As some of the 
overloaded I/O tests are starting to show, the LDC may be a backward step 
for someone in that sort of environment.


I am not a fan of introducing a replacement feature based on what I 
consider too limited testing, and I don't feel this one has been beat on 
long yet enough to start pruning features that would allow better backward 
compatibility/transitioning.  I think that's introducing an unnecessary 
risk to the design.


> We won't need to set checkpoint_segments so high, since performance is 
> smoothed across checkpoints by LDC and it's OK to allow them more 
> frequently. So this concern need not apply with LDC.


Performance *should* be smoothed across checkpoints by LDC and my 
concern *may* not apply.  I think assuming it will always help based on 
the limited number of test results presented so far is extrapolation.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Simon Riggs
On Fri, 2007-06-22 at 16:57 -0400, Greg Smith wrote:
> If you're not, I think you should be.  Keeping that replay interval
> time down was one of the reasons why the people I was working with
> were displeased with the implications of the very spread out style of
> some LDC tunings.  They were already unhappy with the implied recovery
> time of how high they had to set checkpoint_settings for good
> performance, and making it that much bigger aggrevates the issue.
> Given a knob where the LDC can be spread out a bit but not across the
> entire interval, that makes it easier to control how much expansion
> there is relative to the current behavior. 

We won't need to set checkpoint_segments so high, since performance is
smoothed across checkpoints by LDC and it's OK to allow them more
frequently. So this concern need not apply with LDC.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Simon Riggs
On Fri, 2007-06-22 at 16:21 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:

> > 3. Recovery will take longer, because the distance last committed redo 
> > ptr will lag behind more.
> 
> True, you'd have to replay 1.5 checkpoint intervals on average instead
> of 0.5 (more or less, assuming checkpoints had been short).  I don't
> think we're in the business of optimizing crash recovery time though.

Agreed, though only for crash recovery. Most archive recoveries will not
change noticeably - half a checkpoint isn't very long in PITR terms.

> It might be that getting rid of checkpoint_smoothing is an overreaction,
> and that we should keep that parameter.  IIRC Greg had worried about
> being able to turn this behavior off, so we'd still need at least a
> bool, and we might as well expose the fraction instead.  (Still don't
> like the name "smoothing" though.) 

ISTM we don't need the checkpoint_smoothing parameter.

I can't see why anyone would want to turn off smoothing: If they are
doing many writes, then they will be affected by the sharp dive at
checkpoint, which happens *every* checkpoint. The increase in crash
recovery time is an easily acceptable trade-off for what is gained; if
not, they can always make checkpoint_timeout smaller.

If they aren't doing many writes they don't care what the setting of
checkpoint_smoothing is and recovery will be very fast anyway.

>I agree with removing the non-LRU
> part of the bgwriter's write logic though; that should simplify matters
> a bit and cut down the overall I/O volume.

Yes, agreed.


-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Heikki Linnakangas

Simon Riggs wrote:

> On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote:
>> However, I think shortening the checkpoint interval is a perfectly valid 
>> solution to that.
>
> Agreed. That's what checkpoint_timeout is for. Greg can't choose to use
> checkpoint_segments as the limit and then complain about unbounded
> recovery time, because that was clearly a conscious choice.


No, by checkpoint interval, I meant both checkpoint_timeout and 
checkpoint_segments. You can decrease checkpoint_segments as well to 
shorten recovery time.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-24 Thread Simon Riggs
On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote:

> However, I think shortening the checkpoint interval is a perfectly valid 
> solution to that. 

Agreed. That's what checkpoint_timeout is for. Greg can't choose to use
checkpoint_segments as the limit and then complain about unbounded
recovery time, because that was clearly a conscious choice.

> In any case, while people 
> sometimes complain that we have a large WAL footprint, it's not usually 
> a problem.

IMHO its a huge problem. Turning on full_page_writes means that the
amount of WAL generated varies fairly linearly with number of blocks
touched, which means large databases become a problem. 

Suzuki-san's team had results that showed this was a problem also.

> This is off-topic, but at PGCon in May, Itagaki-san and his colleagues 
> whose names I can't remember, pointed out to me very clearly that our 
> recovery is *slow*. So slow, that in the benchmarks they were running, 
> their warm stand-by slave couldn't keep up with the master generating 
> the WAL, even though both are running on the same kind of hardware.

> The reason is simple: There can be tens of backends doing I/O and 
> generating WAL, but in recovery we serialize them. If you have decent 
> I/O hardware that could handle for example 10 concurrent random I/Os, at 
> recovery we'll be issuing them one at a time. That's a scalability 
> issue, and doesn't show up on a laptop or a small server with a single disk.

The results showed that the current recovery isn't scalable beyond a
certain point, not that it was slow per se. The effect isn't noticeable
on systems generating fewer writes (of any size) or ones where the cache
hit ratio is high. It isn't accurate to say laptops and small servers
only, but it would be accurate to say hi volume I/O bound OLTP is the
place where the scalability of recovery does not match the performance
scalability of the master server.

Yes, we need to make recovery more scalable by de-serializing I/O.

Slony also serializes changes onto the Slave nodes, so the general
problem of scalability of recovery solutions needs to be tackled.

> That's one of the first things I'm planning to tackle when the 8.4 dev 
> cycle opens. And I'm planning to look at recovery times in general; I've 
> never even measured it before so who knows what comes up.

I'm planning on working on recovery also, as is Florian. Let's make sure
we coordinate what we do to avoid patch conflicts.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-23 Thread Greg Smith
This message is going to come off as kind of angry, and I hope you don't 
take that personally.  I'm very frustrated with this whole area right now 
but am unable to do anything to improve that situation.


On Fri, 22 Jun 2007, Tom Lane wrote:

If you've got specific evidence why any of these things need to be 
parameterized, let's see it.


All I'm trying to suggest here is that you might want to pause and 
consider whether you want to make a change that might break existing, 
happily working installations based just on the small number of tests that 
have been done on this patch so far.  A nice stack of DBT2 results is very 
informative, but the DBT2 workload is not everybody's workload.


Did you see anybody else predicting issues with the LDC patch on 
overloaded systems as are starting to be seen in the 150 warehouse/90% 
latency figures in Heikki's most recent results?  The way I remember that, 
it was just me pushing to expose that problem, because I knew it was there 
from my unfortunately private tests, but it was difficult to encounter the 
issue on other types of benchmarks (thanks again to Greg Stark and Heikki 
for helping with that).  But that's fine: if you want to blow off the rest 
of my suggestions now just because the other things I'm worried about are 
also very hard problems to expose and I can't hand you a smoking gun, 
that's your decision.


Personally I think that we have a bad track record of exposing GUC 
variables as a substitute for understanding performance issues at the 
start, and this approach isn't doing any favors for DBAs.


I think this project has an awful track record of introducing new GUC 
variables and never having a plan to follow through with a process to 
figure out how they should be set.  The almost complete lack of 
standardization and useful tools for collecting performance information 
about this database boggles my mind, and you're never going to get the 
performance related sections of the GUC streamlined without it.


We were just talking about the mess that is effective_cache_size recently. 
As a more topical example here, the background writer was officially 
released in early 2005, with a bizarre collection of tunables.  I had to 
help hack on that code myself, over two years later, to even start 
exposing the internal statistics data needed to optimize it correctly. 
The main reason I can't prove some of my concerns is that I got so 
side-tracked adding the infrastructure needed to even show they exist that 
I wasn't able to nail down exactly what was going on well enough to 
generate a public test case before the project that exposed the issues 
wrapped up.


Right at the moment the best thing to do seems to be to enable LDC with 
a low minimum write rate and a high target duration, and remove the 
thereby-obsoleted "all buffers" scan of the existing bgwriter logic.


I have reason to believe there's a set of use cases where a more 
accelerated LDC approach than the one everyone seems to be leaning toward is 
appropriate, which would then reinvigorate the need for the all-scan BGW 
component.  I have a whole new design for the non-LRU background writer 
that fixes most of what's wrong with it, which I'm waiting for 8.4 to pass 
out and get feedback on, but if everybody is hell-bent on just yanking the 
whole thing out in preference to these really lazy checkpoints, go ahead 
and do what you want.  My life would be easier if I just tossed all that 
out and forgot about the whole thing, and I'm real close to doing just 
that right now.



Did anyone else ever notice that when a new xlog segment is created,
the write to clear it out doesn't happen via direct I/O like the rest
of the xlog writes do?

It's not supposed to matter, because that path isn't supposed to be
taken often.


Yes, but during the situation it does happen in--when checkpoints take so 
much longer than expected that more segments have to be created, or in an 
archive logger failure--it badly impacts an already unpleasant situation.



there's a whole class of issues involving recycling xlog segments this
would introduce I would be really unhappy with the implications of.

Really?  Name one.


You already mentioned expansion of the log segments used, which is a 
primary issue.  Acting like all the additional segments used for some of 
the more extreme checkpoint spreading approaches are without cost is 
completely unrealistic IMHO.  In the situation I just described above, I 
also noticed the way O_DIRECT sync writes get mixed with buffered WAL 
writes seems to cause some weird I/O scheduling issues in Linux that can 
make worst-case latency degrade.  But since I can't prove that, I guess I 
might as well not even mention that either.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Greg Smith wrote:

On Fri, 22 Jun 2007, Tom Lane wrote:

Greg had worried about being able to turn this behavior off, so we'd 
still need at least a bool, and we might as well expose the fraction 
instead.  I agree with removing the non-LRU part of the bgwriter's 
write logic though


If you accept that being able to turn LDC off is important, people who 
aren't turning it on may still want the existing bgwriter_all_* settings 
so they can keep running things the way they are now.  It's certainly 
reasonable to skip that code path when doing things the LDC way.


Well, if you don't want anything to change, don't upgrade.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Greg Smith wrote:
True, you'd have to replay 1.5 checkpoint intervals on average instead 
of 0.5 (more or less, assuming checkpoints had been short).  I don't 
think we're in the business of optimizing crash recovery time though.


If you're not, I think you should be.  Keeping that replay interval time 
down was one of the reasons why the people I was working with were 
displeased with the implications of the very spread out style of some 
LDC tunings.  They were already unhappy with the implied recovery time 
of how high they had to set checkpoint_segments for good performance, 
and making it that much bigger aggravates the issue.  Given a knob where 
the LDC can be spread out a bit but not across the entire interval, that 
makes it easier to control how much expansion there is relative to the 
current behavior.


I agree on that one: we *should* optimize crash recovery time. It may 
not be the most important thing on earth, but it's a significant 
consideration for some systems.


However, I think shortening the checkpoint interval is a perfectly valid 
solution to that. It does lead to more full page writes, but in 8.3 more 
full page writes can actually make the recovery go faster, not slower, 
because we no longer read in the previous contents of the page when 
we restore it from a full page image. In any case, while people 
sometimes complain that we have a large WAL footprint, it's not usually 
a problem.


This is off-topic, but at PGCon in May, Itagaki-san and his colleagues 
whose names I can't remember, pointed out to me very clearly that our 
recovery is *slow*. So slow, that in the benchmarks they were running, 
their warm stand-by slave couldn't keep up with the master generating 
the WAL, even though both are running on the same kind of hardware.


The reason is simple: There can be tens of backends doing I/O and 
generating WAL, but in recovery we serialize them. If you have decent 
I/O hardware that could handle for example 10 concurrent random I/Os, at 
recovery we'll be issuing them one at a time. That's a scalability 
issue, and doesn't show up on a laptop or a small server with a single disk.


That's one of the first things I'm planning to tackle when the 8.4 dev 
cycle opens. And I'm planning to look at recovery times in general; I've 
never even measured it before so who knows what comes up.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Greg Smith

On Fri, 22 Jun 2007, Tom Lane wrote:

Greg had worried about being able to turn this behavior off, so we'd 
still need at least a bool, and we might as well expose the fraction 
instead.  I agree with removing the non-LRU part of the bgwriter's write 
logic though


If you accept that being able to turn LDC off is important, people who 
aren't turning it on may still want the existing bgwriter_all_* settings 
so they can keep running things the way they are now.  It's certainly 
reasonable to skip that code path when doing things the LDC way.


True, you'd have to replay 1.5 checkpoint intervals on average instead 
of 0.5 (more or less, assuming checkpoints had been short).  I don't 
think we're in the business of optimizing crash recovery time though.


If you're not, I think you should be.  Keeping that replay interval time 
down was one of the reasons why the people I was working with were 
displeased with the implications of the very spread out style of some LDC 
tunings.  They were already unhappy with the implied recovery time of how 
high they had to set checkpoint_settings for good performance, and making 
it that much bigger aggrevates the issue.  Given a knob where the LDC can 
be spread out a bit but not across the entire interval, that makes it 
easier to control how much expansion there is relative to the current 
behavior.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Tom Lane
Greg Smith <[EMAIL PROTECTED]> writes:
> As the person who was complaining about corner cases I'm not in a position 
> to talk more explicitly about, I can at least summarize my opinion of how 
> I feel everyone should be thinking about this patch and you can take what 
> you want from that.

Sorry, but we're not going to make decisions on the basis of evidence
you won't show us.  Right at the moment the best thing to do seems to
be to enable LDC with a low minimum write rate and a high target
duration, and remove the thereby-obsoleted "all buffers" scan of the
existing bgwriter logic.  If you've got specific evidence why any of
these things need to be parameterized, let's see it.  Personally I think
that we have a bad track record of exposing GUC variables as a
substitute for understanding performance issues at the start, and this
approach isn't doing any favors for DBAs.

> Whether or not I think this is an awesome idea, the very idea of a change 
> that big at this point gives me the willies.  Just off the top of my head, 
> there's a whole class of issues involving recycling xlog segments this 
> would introduce I would be really unhappy with the implications of.

Really?  Name one.  AFAICS this should not affect xlog recycling in the
slightest (other than that we have to bump up the target number of live
segments from 2x to 3x the checkpoint distance).

> Did anyone else ever notice that when a new xlog segment is created,
> the write to clear it out doesn't happen via direct I/O like the rest
> of the xlog writes do?

It's not supposed to matter, because that path isn't supposed to be
taken often.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Ok, if we approach this from the idea that there will be *no* GUC 
> variables at all to control this, and we remove the bgwriter_all_* 
> settings as well, does anyone see a reason why that would be bad? Here's 
> the ones mentioned this far:

> 1. we need to keep 2x as many WAL segments around as before.

No, only 50% more.  We've always kept WAL back to the
checkpoint-before-last, so the peak usage has been about 2 checkpoint
distances (plus a bit), and now it'd tend to be about 3.
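For a rough illustration (example figures only): with checkpoint_segments = 64
and 16 MB segments, one checkpoint distance is about 64 * 16 MB = 1 GB of WAL,
so the peak amount of WAL kept around would go from roughly 2 GB to roughly 3 GB.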

> 2. pg_start_backup will need to wait for a long time.

> 3. Recovery will take longer, because the last committed redo ptr will 
> lag further behind.

True, you'd have to replay 1.5 checkpoint intervals on average instead
of 0.5 (more or less, assuming checkpoints had been short).  I don't
think we're in the business of optimizing crash recovery time though.

It might be that getting rid of checkpoint_smoothing is an overreaction,
and that we should keep that parameter.  IIRC Greg had worried about
being able to turn this behavior off, so we'd still need at least a
bool, and we might as well expose the fraction instead.  (Still don't
like the name "smoothing" though.)  I agree with removing the non-LRU
part of the bgwriter's write logic though; that should simplify matters
a bit and cut down the overall I/O volume.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Greg Smith

On Fri, 22 Jun 2007, Tom Lane wrote:


Yeah, I'm not sure that we've thought through the interactions with the
existing bgwriter behavior.


The entire background writer mess needs a rewrite, and the best way to 
handle that is going to shift considerably with LDC applied.


As the person who was complaining about corner cases I'm not in a position 
to talk more explicitly about, I can at least summarize my opinion of how 
I feel everyone should be thinking about this patch and you can take what 
you want from that.  In the context of getting an 8.3 release finalized, I 
think you should be building a LDC implementation that accomplishes the 
main goal of spreading the checkpoints out, which is clearly working, and 
erring on the side of more knobs in cases where it's unsure if they're 
needed or not.  It's clear what my position is on which non-obvious knobs I 
think are important for some people.


Which is worse:  putting in a tuning setting that it's discovered 
everybody just uses the same value for, or discovering after release that 
there's a use-case for that setting and by hard-coding it you've made the 
DBAs for a class of applications you didn't think about miserable?  In the 
cases where there's good evidence so far of the right setting, just make 
that the default, and the only harm is GUC bloat.


Nothing should be done that changes the existing behavior if the LDC 
feature is turned off, so anything more obtrusive to the background writer 
is right out.  Make reducing the knobs, optimizing the default behavior, 
and rewriting the background writer to better fit into its new context a 
major goal of 8.4.  I know I've got a whole notebook full of stuff on that 
topic I've been ignoring as not to distract you guys from getting 8.3 
done.


That's the low risk plan, and the design/beta/release period here is short 
enough that I think going too experimental beyond that is a bad idea.  To 
pick an example, when I read this idea from Heikki:


You would have a checkpoint running practically all the time, and you 
would use checkpoint_timeout/checkpoint_segments to control how long it 
takes... If we do that, we should remove bgwriter_all_* settings


Whether or not I think this is an awesome idea, the very idea of a change 
that big at this point gives me the willies.  Just off the top of my head, 
there's a whole class of issues involving recycling xlog segments this 
would introduce I would be really unhappy with the implications of.  Did 
anyone else ever notice that when a new xlog segment is created, the write 
to clear it out doesn't happen via direct I/O like the rest of the xlog 
writes do?  That write goes through the regular buffered path instead. 
The checkpoint logging patch I submitted logs when this happens 
specifically because that particular issue messed with some operations and 
I found it important to be aware of.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Tom Lane wrote:

Maybe I misread the patch, but I thought that if someone requested an
immediate checkpoint, the checkpoint-in-progress would effectively flip
to immediate mode.  So that could be handled by offering an immediate vs
extended checkpoint option in pg_start_backup.  I'm not sure it's a
problem though, since as previously noted you probably want
pg_start_backup to be noninvasive.  Also, one could do a manual
CHECKPOINT command then immediately pg_start_backup if one wanted
as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?)


Yeah, that's possible.

and recovery would need to process on average 1.5x as much WAL as before. 
Though with LDC, you should get away with shorter checkpoint intervals 
than before, because the checkpoints aren't as invasive.


No, you still want a pretty long checkpoint interval, because of the
increase in WAL traffic due to more page images being dumped when the
interval is short.

If we do that, we should remove bgwriter_all_* settings. They wouldn't 
do much because we would have checkpoint running all the time, writing 
out dirty pages.


Yeah, I'm not sure that we've thought through the interactions with the
existing bgwriter behavior.


I searched the archives a bit for the discussions when the current 
bgwriter settings were born, and found this thread:


http://archives.postgresql.org/pgsql-hackers/2004-12/msg00784.php

The idea of Load Distributed Checkpoints certainly isn't new :).

Ok, if we approach this from the idea that there will be *no* GUC 
variables at all to control this, and we remove the bgwriter_all_* 
settings as well, does anyone see a reason why that would be bad? Here's 
the ones mentioned this far:


1. we need to keep 2x as many WAL segments around as before.

2. pg_start_backup will need to wait for a long time.

3. Recovery will take longer, because the last committed redo ptr will 
lag further behind.


1. and 3. can be alleviated by using a smaller 
checkpoint_timeout/segments, though as you pointed out that leads to 
higher WAL traffic. 2. is not a big deal, and we can add an 'immediate' 
parameter to pg_start_backup if necessary.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> I still think you've not demonstrated a need to expose this parameter.

> Greg Smith wanted to explicitly control the I/O rate, and let the 
> checkpoint duration vary. I personally think that fixing the checkpoint 
> duration is better because it's easier to tune.

> But if we only do that, you might end up with ridiculously long 
> checkpoints when there's not many dirty pages. If we want to avoid that, 
> we need some way of telling what's a safe minimum rate to write at, 
> because that can vary greatly depending on your hardware.

As long as the minimum rate is at least 1/bgdelay, I don't think this is
a big problem.

> But maybe we don't care about prolonging checkpoints, and don't really 
> need any GUCs at all. We could then just hardcode writes_per_nap to some 
> low value, and target duration close to 1.0. You would have a checkpoint 
> running practically all the time, and you would use 
> checkpoint_timeout/checkpoint_segments to control how long it takes. I'm 
> a bit worried about jumping to such a completely different regime, 
> though. For example, pg_start_backup needs to create a new checkpoint, 
> so it would need to wait on average 1.5 * checkpoint_timeout/segments, 

Maybe I misread the patch, but I thought that if someone requested an
immediate checkpoint, the checkpoint-in-progress would effectively flip
to immediate mode.  So that could be handled by offering an immediate vs
extended checkpoint option in pg_start_backup.  I'm not sure it's a
problem though, since as previously noted you probably want
pg_start_backup to be noninvasive.  Also, one could do a manual
CHECKPOINT command then immediately pg_start_backup if one wanted
as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?)

> and recovery would need to process on average 1.5x as much WAL as before. 
> Though with LDC, you should get away with shorter checkpoint intervals 
> than before, because the checkpoints aren't as invasive.

No, you still want a pretty long checkpoint interval, because of the
increase in WAL traffic due to more page images being dumped when the
interval is short.

> If we do that, we should remove bgwriter_all_* settings. They wouldn't 
> do much because we would have checkpoint running all the time, writing 
> out dirty pages.

Yeah, I'm not sure that we've thought through the interactions with the
existing bgwriter behavior.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

(BTW, the patch seems
a bit schizoid about whether checkpoint_rate is int or float.)


Yeah, I've gone back and forth on the data type. I wanted it to be a 
float, but guc code doesn't let you specify a float in KB, so I switched 
it to int.


I seriously question trying to claim that it's blocks at all, seeing
that the *actual* units are pages per unit time.  Pretending that it's
a memory unit does more to obscure the truth than document it.


Hmm. What I wanted to avoid is that the I/O rate you get then depends on 
your bgwriter_delay, so if you change that you need to change 
checkpoint_min_rate as well.


Now we already have that issue with bgwriter_all/lru_maxpages, and I 
don't like it there either. If you think it's better to let the user 
define it directly as pages/bgwriter_delay, fine.



And checkpoint_rate really needs to be named checkpoint_min_rate, if
it's going to be a minimum.  However, I question whether we need it at
all,


Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, 
  using the non-broken formula above, we get:



(512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.



So I think that's fine.


"Fine?"  That's 12x the value you have actually tested.  That's enough
of a change to invalidate all your performance testing IMHO.  


I'll reschedule the tests to be sure, after we settle on how we want to 
control this feature.



I still think you've not demonstrated a need to expose this parameter.


Greg Smith wanted to explicitly control the I/O rate, and let the 
checkpoint duration vary. I personally think that fixing the checkpoint 
duration is better because it's easier to tune.


But if we only do that, you might end up with ridiculously long 
checkpoints when there's not many dirty pages. If we want to avoid that, 
we need some way of telling what's a safe minimum rate to write at, 
because that can vary greatly depending on your hardware.


But maybe we don't care about prolonging checkpoints, and don't really 
need any GUCs at all. We could then just hardcode writes_per_nap to some 
low value, and target duration close to 1.0. You would have a checkpoint 
running practically all the time, and you would use 
checkpoint_timeout/checkpoint_segments to control how long it takes. I'm 
a bit worried about jumping to such a completely different regime, 
though. For example, pg_start_backup needs to create a new checkpoint, 
so it would need to wait on average 1.5 * checkpoint_timeout/segments, 
and recovery would need to process on average 1.5x as much WAL as before. 
Though with LDC, you should get away with shorter checkpoint intervals 
than before, because the checkpoints aren't as invasive.


If we do that, we should remove bgwriter_all_* settings. They wouldn't 
do much because we would have checkpoint running all the time, writing 
out dirty pages.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> (BTW, the patch seems
>> a bit schizoid about whether checkpoint_rate is int or float.)

> Yeah, I've gone back and forth on the data type. I wanted it to be a 
> float, but guc code doesn't let you specify a float in KB, so I switched 
> it to int.

I seriously question trying to claim that it's blocks at all, seeing
that the *actual* units are pages per unit time.  Pretending that it's
a memory unit does more to obscure the truth than document it.

>> What's bugging me about this is that we are either going to be writing
>> at checkpoint_rate if ahead of schedule, or max possible rate if behind;
>> that's not "smoothing" to me, that's introducing some pretty bursty
>> behavior.

> That sounds a lot more complex. The burstiness at that level shouldn't 
> matter much. The OS is still going to cache the writes, and should even 
> out the bursts.

With the changes you're proposing here, the burstiness would be quite
severe.  OTOH, if writes_per_nap is always 1, then bufmgr is going to
recheck the delay situation after every page, so what you have actually
tested is as granular as it could get.

>> And checkpoint_rate really needs to be named checkpoint_min_rate, if
>> it's going to be a minimum.  However, I question whether we need it at
>> all,

> Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, 
>   using the non-broken formula above, we get:

> (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.

> So I think that's fine.

"Fine?"  That's 12x the value you have actually tested.  That's enough
of a change to invalidate all your performance testing IMHO.  I still
think you've not demonstrated a need to expose this parameter.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Bruce Momjian
Tom Lane wrote:
> And checkpoint_rate really needs to be named checkpoint_min_rate, if
> it's going to be a minimum.  However, I question whether we need it at
> all, because as the code stands, with the default BgWriterDelay you
> would have to increase checkpoint_rate to 4x its proposed default before
> writes_per_nap moves off its minimum of 1.  This says to me that the
> system's tested behavior has been so insensitive to checkpoint_rate
> that we probably need not expose such a parameter at all; just hardwire
> the minimum writes_per_nap at 1.

I agree on starting with as simple a GUC as possible and expanding it as
there is need in the field.  99% of people are never going to test this
value as much as you are.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Tom Lane wrote:

1. checkpoint_rate is used thusly:

writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);

where writes_per_nap is the max number of dirty blocks to write before
taking a bgwriter nap.  Now surely this is completely backward: if
BgWriterDelay is increased, the number of writes to allow per nap
decreases?  If you think checkpoint_rate is expressed in some kind of
physical bytes/sec unit, that cannot be right; the number of blocks
per nap has to increase if the naps get longer.  


Uh, you're right, that's just plain wrong. checkpoint_rate is in 
pages/s, so that line should be


writes_per_nap = Min(1, checkpoint_rate * BgWriterDelay / 1000)


(BTW, the patch seems
a bit schizoid about whether checkpoint_rate is int or float.)


Yeah, I've gone back and forth on the data type. I wanted it to be a 
float, but guc code doesn't let you specify a float in KB, so I switched 
it to int.



2. checkpoint_smoothing is used thusly:

/* scale progress according to CheckPointSmoothing */
progress *= CheckPointSmoothing;

where the progress value being scaled is the fraction so far completed
of the total number of dirty pages we expect to have to write.  This
is then compared against estimates of the total fraction of the time-
between-checkpoints that has elapsed; if less, we are behind schedule
and should not nap, if more, we are ahead of schedule and may nap.



(This is a bit odd, but I guess it's all right because it's equivalent
to dividing the elapsed-time estimate by CheckPointSmoothing, which
seems a more natural way of thinking about what needs to happen.)


Yeah, it's a bit unnatural. I did it that way so we don't have to divide 
all three of the estimates: cached_elapsed, progress_in_time and 
progress_in_xlog. Maybe it's not worth micro-optimizing that.



What's bugging me about this is that we are either going to be writing
at checkpoint_rate if ahead of schedule, or max possible rate if behind;
that's not "smoothing" to me, that's introducing some pretty bursty
behavior.  ISTM that actual "smoothing" would involve adjusting
writes_per_nap up or down according to whether we are ahead or behind
schedule, so as to have a finer degree of control over the I/O rate.
(I'd also consider saving the last writes_per_nap value across
checkpoints so as to have a more nearly accurate starting value next
time.)


That sounds a lot more complex. The burstiness at that level shouldn't 
matter much. The OS is still going to cache the writes, and should even 
out the bursts.


Assuming time/xlogs elapse at a steady rate, we will write some multiple 
of writes_per_nap pages between each sleep. With a small writes_per_nap, 
writing just writes_per_nap isn't enough to catch up after we fall 
behind, so we'll write more than that between each sleep. That means 
that on each iteration, we'll write either N*writes_per_nap, or 
(N+1)*writes_per_nap. At worst, that means either writes_per_nap or 
2*writes_per_nap pages on each iteration. That's not too bad, I think.
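As a rough sketch of that behavior (simplified and hypothetical, not the actual
patch code; the helpers and globals below are made up for illustration):

#include <stdbool.h>

extern int  writes_per_nap;                   /* pages to write between naps */
extern int  BgWriterDelay;                    /* bgwriter_delay, in milliseconds */
extern void write_next_dirty_buffer(void);
extern bool checkpoint_ahead_of_schedule(void);
extern void nap_for_ms(int ms);

static void
checkpoint_write_loop(int num_dirty_buffers)
{
    int written_since_check = 0;
    int i;

    for (i = 0; i < num_dirty_buffers; i++)
    {
        write_next_dirty_buffer();

        if (++written_since_check >= writes_per_nap)
        {
            written_since_check = 0;

            /*
             * Nap only when ahead of schedule; when behind, keep writing,
             * which is what produces the N or N+1 multiples of
             * writes_per_nap described above.
             */
            if (checkpoint_ahead_of_schedule())
                nap_for_ms(BgWriterDelay);
        }
    }
}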



In any case I still concur with Takahiro-san that "smoothing" doesn't
seem the most appropriate name for the parameter.  Something along the
lines of "checkpoint_completion_target" would convey more about what it
does, I think.


Ok, I'm not wedded to smoothing.


And checkpoint_rate really needs to be named checkpoint_min_rate, if
it's going to be a minimum.  However, I question whether we need it at
all, because as the code stands, with the default BgWriterDelay you
would have to increase checkpoint_rate to 4x its proposed default before
writes_per_nap moves off its minimum of 1. 


Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, 
 using the non-broken formula above, we get:


(512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.

So I think that's fine. (looking at the patch, I see that the default in 
guc.c is actually 100 pages / s, contrary to documentation; needs to be 
fixed)
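Written out as a tiny standalone sketch (the function name is made up; it just
reproduces the arithmetic above):

#include <stdio.h>

/* Illustrative only: convert a minimum checkpoint rate given in KB/s into
 * whole pages per bgwriter nap. */
static int
pages_per_nap(double min_rate_kb_per_s, int page_size_bytes, int bgwriter_delay_ms)
{
    double pages_per_second = min_rate_kb_per_s * 1024.0 / page_size_bytes;

    return (int) (pages_per_second * bgwriter_delay_ms / 1000.0);
}

int
main(void)
{
    /* 512 KB/s, 8 kB pages, 200 ms delay => prints 12 (12.8 truncated) */
    printf("%d\n", pages_per_nap(512.0, 8192, 200));
    return 0;
}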



This says to me that the
system's tested behavior has been so insensitive to checkpoint_rate
that we probably need not expose such a parameter at all; just hardwire
the minimum writes_per_nap at 1.


I've set checkpoint_rate to a small value in my tests on purpose to 
control the feature with the other parameter. That must be why I haven't 
noticed the bogus calculation of it before.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> So the question is, why in the heck would anyone want the behavior that
>> "checkpoints take exactly X time"??

> Because it's easier to tune. You don't need to know how much checkpoint 
> I/O you can tolerate. The system will use just enough I/O bandwidth to 
> meet the deadline, but not more than that.

Uh, not really.  After consuming more caffeine and reading the patch
more carefully, I think there are several problems here:

1. checkpoint_rate is used thusly:

writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay);

where writes_per_nap is the max number of dirty blocks to write before
taking a bgwriter nap.  Now surely this is completely backward: if
BgWriterDelay is increased, the number of writes to allow per nap
decreases?  If you think checkpoint_rate is expressed in some kind of
physical bytes/sec unit, that cannot be right; the number of blocks
per nap has to increase if the naps get longer.  (BTW, the patch seems
a bit schizoid about whether checkpoint_rate is int or float.)

2. checkpoint_smoothing is used thusly:

/* scale progress according to CheckPointSmoothing */
progress *= CheckPointSmoothing;

where the progress value being scaled is the fraction so far completed
of the total number of dirty pages we expect to have to write.  This
is then compared against estimates of the total fraction of the time-
between-checkpoints that has elapsed; if less, we are behind schedule
and should not nap, if more, we are ahead of schedule and may nap.
(This is a bit odd, but I guess it's all right because it's equivalent
to dividing the elapsed-time estimate by CheckPointSmoothing, which
seems a more natural way of thinking about what needs to happen.)
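In code, the test being described might look roughly like this (a simplified,
hypothetical sketch, not the patch itself; it uses both the elapsed-time and
elapsed-WAL estimates, per the patch's documentation):

#include <stdbool.h>

/*
 * Illustrative only.  "Progress" is the fraction of the expected dirty
 * pages already written, scaled by the smoothing fraction; it is compared
 * against how far we are into the checkpoint interval, measured by both
 * elapsed time and consumed WAL.  If progress is smaller we are behind
 * schedule and should not nap; if larger we may nap.
 */
static bool
checkpoint_may_nap(double buffers_written, double buffers_to_write,
                   double elapsed_time_fraction, double elapsed_xlog_fraction,
                   double checkpoint_smoothing)
{
    double progress = (buffers_written / buffers_to_write) * checkpoint_smoothing;
    double elapsed = (elapsed_time_fraction > elapsed_xlog_fraction) ?
                     elapsed_time_fraction : elapsed_xlog_fraction;

    return progress >= elapsed;
}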

What's bugging me about this is that we are either going to be writing
at checkpoint_rate if ahead of schedule, or max possible rate if behind;
that's not "smoothing" to me, that's introducing some pretty bursty
behavior.  ISTM that actual "smoothing" would involve adjusting
writes_per_nap up or down according to whether we are ahead or behind
schedule, so as to have a finer degree of control over the I/O rate.
(I'd also consider saving the last writes_per_nap value across
checkpoints so as to have a more nearly accurate starting value next
time.)

In any case I still concur with Takahiro-san that "smoothing" doesn't
seem the most appropriate name for the parameter.  Something along the
lines of "checkpoint_completion_target" would convey more about what it
does, I think.

And checkpoint_rate really needs to be named checkpoint_min_rate, if
it's going to be a minimum.  However, I question whether we need it at
all, because as the code stands, with the default BgWriterDelay you
would have to increase checkpoint_rate to 4x its proposed default before
writes_per_nap moves off its minimum of 1.  This says to me that the
system's tested behavior has been so insensitive to checkpoint_rate
that we probably need not expose such a parameter at all; just hardwire
the minimum writes_per_nap at 1.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
The main tuning knob is checkpoint_smoothing, which is defined as a 
fraction of the checkpoint interval (both checkpoint_timeout and 
checkpoint_segments are taken into account). Normally, the write phase 
of a checkpoint takes exactly that much time


So the question is, why in the heck would anyone want the behavior that
"checkpoints take exactly X time"??  The useful part of this whole patch
is to cap the write rate at something that doesn't interfere too much
with foreground queries.  I don't see why people wouldn't prefer
"checkpoints can take any amount of time up to the checkpoint interval,
but we do our best not to exceed Y writes/second".


Because it's easier to tune. You don't need to know how much checkpoint 
I/O you can tolerate. The system will use just enough I/O bandwidth to 
meet the deadline, but not more than that.



Basically I don't see what useful values checkpoint_smoothing would have
other than 0 and 1.  You might as well make it a bool.


Well that's one option. It feels like a good thing to be able to control 
  how much headroom you have until the next checkpoint, but maybe we 
can just hardcode it close to 1. It's also good to avoid spreading the 
checkpoints unnecessarily, to keep recovery times lower, but you can 
control that with the min rate setting as well.


There's another possible strategy: keep the I/O rate constant, but vary 
the length of the checkpoint. checkpoint_rate allows you to do that.


But only from the lower side.


Now how would you replace checkpoint_smoothing with a max I/O rate?


I don't see why you think that's hard.  It looks to me like the
components of the decision are the same numbers in any case: you have to
estimate your progress towards checkpoint completion, your available
time till next checkpoint, and your write rate.  Then you either delay
or not.


Let me put it this way: If you define a min and a max I/O rate, when 
would the max I/O rate limit take effect? If there are few dirty buffers 
in the pool, so that writing at the min rate you'll finish the checkpoint 
before the next one is due, that's the rate you'll use. If there are more, 
you'll need to write fast enough to finish the checkpoint in time, 
regardless of the max rate. Or would you let the next checkpoint slip and 
keep writing at the max rate? That seems like a footgun if someone sets it 
to too low a value.


Or are you thinking that we have just one setting: checkpoint_rate? You 
describe it as a maximum, but I've been thinking of it as a minimum 
because you *will* exceed it if the next checkpoint is due soon.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> I don't think you understand how the settings work. Did you read the 
> documentation? If you did, it's apparently not adequate.

I did read the documentation, and I'm not complaining that I don't
understand it.  I'm complaining that I don't like the presented API
because it's self-inconsistent.  You've got two parameters that are in
effect upper and lower bounds for the checkpoint write rate, but they
are named inconsistently and not even measured in the same kind of unit.
Nor do I agree that the inconsistency buys any ease of use.

> The main tuning knob is checkpoint_smoothing, which is defined as a 
> fraction of the checkpoint interval (both checkpoint_timeout and 
> checkpoint_segments are taken into account). Normally, the write phase 
> of a checkpoint takes exactly that much time.

So the question is, why in the heck would anyone want the behavior that
"checkpoints take exactly X time"??  The useful part of this whole patch
is to cap the write rate at something that doesn't interfere too much
with foreground queries.  I don't see why people wouldn't prefer
"checkpoints can take any amount of time up to the checkpoint interval,
but we do our best not to exceed Y writes/second".

Basically I don't see what useful values checkpoint_smoothing would have
other than 0 and 1.  You might as well make it a bool.

> There's another possible strategy: keep the I/O rate constant, but vary 
> the length of the checkpoint. checkpoint_rate allows you to do that.

But only from the lower side.

> Now how would you replace checkpoint_smoothing with a max I/O rate?

I don't see why you think that's hard.  It looks to me like the
components of the decision are the same numbers in any case: you have to
estimate your progress towards checkpoint completion, your available
time till next checkpoint, and your write rate.  Then you either delay
or not.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

I tend to agree with whoever said upthread that the combination of GUC
variables proposed here is confusing and ugly.  It'd make more sense to
have min and max checkpoint rates in KB/s, with the max checkpoint rate
only honored when we are predicting we'll finish before the next
checkpoint time.


Really? I thought everyone is happy with the current combination, and 
that it was just the old trio of parameters controlling the write, nap 
and sync phases that was ugly. One particularly nice thing about 
defining the duration as a fraction of checkpoint interval is that we 
can come up with a reasonable default value that doesn't depend on your 
hardware.


That argument would hold some water if you weren't introducing a
hardware-dependent min rate in the same patch.  Do we need the min rate
at all?  If so, why can't it be in the same units as the max (ie, a
fraction of checkpoint)?


How would a min and max rate work?


Pretty much the same as the code does now, no?  You either delay, or not.


I don't think you understand how the settings work. Did you read the 
documentation? If you did, it's apparently not adequate.


The main tuning knob is checkpoint_smoothing, which is defined as a 
fraction of the checkpoint interval (both checkpoint_timeout and 
checkpoint_segments are taken into account). Normally, the write phase 
of a checkpoint takes exactly that much time.


So the length of a checkpoint stays the same regardless of how many 
dirty buffers there is (assuming you don't exceed the bandwidth of your 
hardware), but the I/O rate varies.


There's another possible strategy: keep the I/O rate constant, but vary 
the length of the checkpoint. checkpoint_rate allows you to do that.


I'm envisioning we set the defaults so that checkpoint_smoothing is the 
effective control in a relatively busy system, and checkpoint_rate 
ensures that we don't unnecessarily prolong checkpoints on an idle system.


Now how would you replace checkpoint_smoothing with a max I/O rate? The 
only real alternative way of defining it is directly as a time and/or 
segments, similar to checkpoint_timeout and checkpoint_segments, but 
then we'd really need two knobs instead of one.


Though maybe we could just hard-code it to 0.8, for example, and tell 
people to tune checkpoint_rate if they want more aggressive checkpoints.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> I tend to agree with whoever said upthread that the combination of GUC
>> variables proposed here is confusing and ugly.  It'd make more sense to
>> have min and max checkpoint rates in KB/s, with the max checkpoint rate
>> only honored when we are predicting we'll finish before the next
>> checkpoint time.

> Really? I thought everyone is happy with the current combination, and 
> that it was just the old trio of parameters controlling the write, nap 
> and sync phases that was ugly. One particularly nice thing about 
> defining the duration as a fraction of checkpoint interval is that we 
> can come up with a reasonable default value that doesn't depend on your 
> hardware.

That argument would hold some water if you weren't introducing a
hardware-dependent min rate in the same patch.  Do we need the min rate
at all?  If so, why can't it be in the same units as the max (ie, a
fraction of checkpoint)?

> How would a min and max rate work?

Pretty much the same as the code does now, no?  You either delay, or not.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

In fact, I think there's a small race condition in CVS HEAD:


Yeah, probably --- the original no-locking design didn't have any side
flags.  The reason you need the lock is for a backend to be sure that
a newly-started checkpoint is using its requested flags.  But the
detection of checkpoint termination is still the same.


Actually, the race condition I outlined isn't related to the flags. It's 
possible because RequestCheckpoint doesn't guarantee that a checkpoint 
is performed when there's been no WAL activity since the last one.


I did use a new force-flag to fix it, but I'm sure there is other ways.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
I added a spinlock to protect the signaling fields between bgwriter and 
backends. The current non-locking approach gets really difficult as the 
patch adds two new flags, and both are more important than the existing 
ckpt_time_warn flag.


That may be, but you could minimize the cost and notational ugliness by
not using the spinlock where you don't have to.  Put the sig_atomic_t
fields back the way they were, and many of the uses of the spinlock go
away.  All you really need it for is the places where the additional
flags are set or read.


I find it easier to understand if it's used whenever any of the fields 
are accessed. You don't need to read and write ckpt_failed and 
ckpt_started/ckpt_done in a specific order anymore, for example.



Some other comments:

I tend to agree with whoever said upthread that the combination of GUC
variables proposed here is confusing and ugly.  It'd make more sense to
have min and max checkpoint rates in KB/s, with the max checkpoint rate
only honored when we are predicting we'll finish before the next
checkpoint time.


Really? I thought everyone is happy with the current combination, and 
that it was just the old trio of parameters controlling the write, nap 
and sync phases that was ugly. One particularly nice thing about 
defining the duration as a fraction of checkpoint interval is that we 
can come up with a reasonable default value that doesn't depend on your 
hardware.


How would a min and max rate work?

Anyone else have an opinion on the parameters?


The flow of control and responsibility between xlog.c, bgwriter.c and
bufmgr.c seems to have reached the breaking point of unintelligibility.
Can you think of a refactoring that would simplify it?  We might have
to resign ourselves to this much messiness, but I'd rather not...


The problem we're trying to solve is doing a checkpoint while running 
the normal bgwriter activity at the same time. The normal way to do two 
things simultaneously is to have two different processes (or threads). I 
thought about having a separate checkpoint process, but I gave up on 
that thought because the communication needed between backends, bgwriter 
and the checkpointer seems like a mess. The checkpointer would need the 
pendingOpsTable so that it can do the fsyncs, and it would also need to 
receive the forget-messages to that table. We could move that table 
entirely to the checkpointer, but since bgwriter is presumably doing 
most of the writes, there would be a lot of chatter between bgwriter and 
the checkpointer.


The current approach is like co-operative multitasking. BufferSync 
yields control to bgwriter every now and then.


The division of labor between xlog.c and other modules is not that bad, 
IMO. CreateCheckPoint is the main entry point to create a checkpoint, 
and it calls other modules to do their stuff, including bufmgr.c.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-21 Thread Heikki Linnakangas

ITAGAKI Takahiro wrote:

The only thing I don't understand is the naming of 'checkpoint_smoothing'.
Can users imagine the unit of 'smoothing' is a fraction?

You explain the parameter with the word 'fraction'.
Why don't you simply name it 'checkpoint_fraction' ?
| Specifies the target length of checkpoints, as a fraction of
| the checkpoint interval. The default is 0.3.


I chose checkpoint_smoothing because it tells you what the parameter is 
for. If you want more smoothing, tune it up, and if you want less, tune 
it down. checkpoint_fraction makes you wonder what you can do with it 
and why you would change it.



Sorry if I'm missing discussions about the naming.


No, I chose _smoothing on my own. People didn't like 
checkpoint_write_percent either (including me).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-20 Thread ITAGAKI Takahiro
Heikki Linnakangas <[EMAIL PROTECTED]> wrote:

> Here's an updated WIP patch for load distributed checkpoints.
> Since last patch, I did some clean up and refactoring, and added a bunch 
> of comments, and user documentation.

The only thing I don't understand is the naming of 'checkpoint_smoothing'.
Can users imagine the unit of 'smoothing' is a fraction?

You explain the parameter with the word 'fraction'.
Why don't you simply name it 'checkpoint_fraction' ?
| Specifies the target length of checkpoints, as a fraction of
| the checkpoint interval. The default is 0.3.

Sorry if I'm missing discussions about the naming.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-20 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> I added a spinlock to protect the signaling fields between bgwriter and 
> backends. The current non-locking approach gets really difficult as the 
> patch adds two new flags, and both are more important than the existing 
> ckpt_time_warn flag.

That may be, but you could minimize the cost and notational ugliness by
not using the spinlock where you don't have to.  Put the sig_atomic_t
fields back the way they were, and many of the uses of the spinlock go
away.  All you really need it for is the places where the additional
flags are set or read.

> In fact, I think there's a small race condition in CVS HEAD:

Yeah, probably --- the original no-locking design didn't have any side
flags.  The reason you need the lock is for a backend to be sure that
a newly-started checkpoint is using its requested flags.  But the
detection of checkpoint termination is still the same.

Some other comments:

I tend to agree with whoever said upthread that the combination of GUC
variables proposed here is confusing and ugly.  It'd make more sense to
have min and max checkpoint rates in KB/s, with the max checkpoint rate
only honored when we are predicting we'll finish before the next
checkpoint time.

The flow of control and responsibility between xlog.c, bgwriter.c and
bufmgr.c seems to have reached the breaking point of unintelligibility.
Can you think of a refactoring that would simplify it?  We might have
to resign ourselves to this much messiness, but I'd rather not...

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PATCHES] Load Distributed Checkpoints, take 3

2007-06-20 Thread Heikki Linnakangas

Here's an updated WIP patch for load distributed checkpoints.

I added a spinlock to protect the signaling fields between bgwriter and 
backends. The current non-locking approach gets really difficult as the 
patch adds two new flags, and both are more important than the existing 
ckpt_time_warn flag.


In fact, I think there's a small race condition in CVS HEAD:

1. pg_start_backup() is called, which calls RequestCheckpoint
2. RequestCheckpoint takes note of the old value of ckpt_started
3. bgwriter wakes up from pg_usleep, and sees that we've exceeded 
checkpoint_timeout.

4. bgwriter increases ckpt_started to note that a new checkpoint has started
5. RequestCheckpoint signals bgwriter to start a new checkpoint
6. bgwriter calls CreateCheckpoint, with the force-flag set to false 
because this checkpoint was triggered by timeout
7. RequestCheckpoint sees that ckpt_started has increased, and starts to 
wait for ckpt_done to reach the new value.
8. CreateCheckpoint finishes immediately, because there was no XLOG 
activity since last checkpoint.

9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns.
10. pg_start_backup() continues, with potentially the same redo location 
and thus history filename as previous backup.


Now I admit that the chances for that to happen are extremely small, 
people don't usually do two pg_start_backup calls without *any* WAL 
logged activity in between them, for example. But as we add the new 
flags, avoiding scenarios like that becomes harder.
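For reference, the counter handshake this race lives in works roughly as
follows (a simplified, hypothetical sketch; the helper functions are made up
and this is not the actual bgwriter code):

#include <signal.h>

extern void signal_bgwriter_checkpoint(void);   /* made-up helper */
extern void short_sleep(void);                  /* made-up helper */

typedef struct
{
    volatile sig_atomic_t ckpt_started;   /* advanced when a checkpoint begins */
    volatile sig_atomic_t ckpt_done;      /* advanced when it completes */
} CkptSignals;

static void
request_checkpoint_and_wait(CkptSignals *shared)
{
    sig_atomic_t old_started = shared->ckpt_started;
    sig_atomic_t new_started;

    signal_bgwriter_checkpoint();

    /* Wait for some checkpoint to start... */
    while (shared->ckpt_started == old_started)
        short_sleep();
    new_started = shared->ckpt_started;

    /*
     * ...and then wait for it to finish.  The race above is that the
     * checkpoint we end up waiting for may be a timeout-triggered one the
     * bgwriter had already started on its own, which can finish immediately
     * if no WAL has been written since the previous checkpoint.
     */
    while (shared->ckpt_done != new_started)
        short_sleep();
}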


Since last patch, I did some clean up and refactoring, and added a bunch 
of comments, and user documentation.


I haven't yet changed GetInsertRecPtr to use the almost up-to-date value 
protected by the info_lck per Simon's suggestion, and I need to do some 
correctness testing. After that, I'm done with the patch.


Ps. In case you wonder what took me so long since last revision, I've 
spent a lot of time reviewing HOT.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: doc/src/sgml/config.sgml
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.126
diff -c -r1.126 config.sgml
*** doc/src/sgml/config.sgml	7 Jun 2007 19:19:56 -	1.126
--- doc/src/sgml/config.sgml	19 Jun 2007 14:24:31 -
***
*** 1565,1570 
--- 1565,1608 

   
  
+  
+   checkpoint_smoothing (floating point)
+   
+checkpoint_smoothing configuration parameter
+   
+   
+
+ Specifies the target length of checkpoints, as a fraction of 
+ the checkpoint interval. The default is 0.3.
+ 
+ This parameter can only be set in the postgresql.conf
+ file or on the server command line.
+
+   
+  
+ 
+   checkpoint_rate (floating point)
+   
+checkpoint_rate configuration parameter
+   
+   
+
+ Specifies the minimum I/O rate used to flush dirty buffers during a
+ checkpoint, when there's not many dirty buffers in the buffer cache.
+ The default is 512 KB/s.
+ 
+ Note: the accuracy of this setting depends on
+ bgwriter_delay.
+ 
+ This parameter can only be set in the postgresql.conf
+ file or on the server command line.
+
+   
+  
+ 
   
checkpoint_warning (integer)

Index: doc/src/sgml/wal.sgml
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.43
diff -c -r1.43 wal.sgml
*** doc/src/sgml/wal.sgml	31 Jan 2007 20:56:19 -	1.43
--- doc/src/sgml/wal.sgml	19 Jun 2007 14:26:45 -
***
*** 217,225 

  

 There will be at least one WAL segment file, and will normally
 not be more than 2 * checkpoint_segments + 1
!files.  Each segment file is normally 16 MB (though this size can be
 altered when building the server).  You can use this to estimate space
 requirements for WAL.
 Ordinarily, when old log segment files are no longer needed, they
--- 217,245 

  

+If there is a lot of dirty buffers in the buffer cache, flushing them
+all at checkpoint will cause a heavy burst of I/O that can disrupt other
+activity in the system. To avoid that, the checkpoint I/O can be distributed
+over a longer period of time, defined with
+checkpoint_smoothing. It's given as a fraction of the
+checkpoint interval, as defined by checkpoint_timeout
+and checkpoint_segments. The WAL segment consumption
+and elapsed time is monitored and the I/O rate is adjusted during
+checkpoint so that it's finished when the given fraction of elapsed time
+or WAL segments has passed, whichever is sooner. However, that could lead
+to unnecessarily prolonged checkpoints when there's not many dirty buffers
+in the cache. To avoid that, checkpoint_rate can be used