Greg Smith wrote:
True, you'd have to replay 1.5 checkpoint intervals on average instead of 0.5 (more or less, assuming checkpoints had been short). I don't think we're in the business of optimizing crash recovery time though.

If you're not, I think you should be. Keeping that replay interval time down was one of the reasons why the people I was working with were displeased with the implications of the very spread-out style of some LDC tunings. They were already unhappy with the implied recovery time of how high they had to set checkpoint_settings for good performance, and making it that much bigger aggravates the issue. Given a knob where the LDC can be spread out a bit but not across the entire interval, that makes it easier to control how much expansion there is relative to the current behavior.
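To put rough numbers on the 0.5 versus 1.5 figures above, here is a minimal sketch of the arithmetic. It is not from the thread; it assumes checkpoints start at fixed intervals, WAL is generated at a constant rate, a crash is equally likely at any moment, and replay begins at the start of the last checkpoint that had completed. The function name and the spread_fraction parameter (the share of the interval a checkpoint takes to finish) are made up for illustration.

```python
def mean_replay_intervals(spread_fraction, samples=100_000):
    """Average amount of WAL to replay after a crash, in checkpoint intervals.

    spread_fraction is a hypothetical knob: 0.0 means an instantaneous
    checkpoint, 1.0 means the checkpoint is spread across the whole interval.
    """
    if not 0.0 <= spread_fraction <= 1.0:
        raise ValueError("spread_fraction must be between 0 and 1")
    total = 0.0
    for i in range(samples):
        # Crash offset within the current interval, evenly sampled in [0, 1).
        u = (i + 0.5) / samples
        if u >= spread_fraction:
            # The checkpoint begun in this interval has completed,
            # so replay starts where it began.
            total += u
        else:
            # It is still in progress, so replay starts at the beginning
            # of the previous checkpoint, one full interval earlier.
            total += u + 1.0
    return total / samples


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        print(fraction, round(mean_replay_intervals(fraction), 3))
```

Under these assumptions the average works out to 0.5 plus the spread fraction, so a knob that limits the spread to part of the interval bounds the extra replay accordingly.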

I agree on that one: we *should* optimize crash recovery time. It may not be the most important thing on earth, but it's a significant consideration for some systems.

However, I think shortening the checkpoint interval is a perfectly valid solution to that. It does lead to more full page writes, but in 8.3 more full page writes can actually make the recovery go faster, not slower, because we no longer read in the previous contents of the page when we restore it from a full page image. In any case, while people sometimes complain that we have a large WAL footprint, it's not usually a problem.

This is off-topic, but at PGCon in May, Itagaki-san and his colleagues, whose names I can't remember, pointed out to me very clearly that our recovery is *slow*. So slow that in the benchmarks they were running, their warm stand-by slave couldn't keep up with the master generating the WAL, even though both were running on the same kind of hardware.

The reason is simple: There can be tens of backends doing I/O and generating WAL, but in recovery we serialize them. If you have decent I/O hardware that could handle, for example, 10 concurrent random I/Os, at recovery we'll be issuing them one at a time. That's a scalability issue, and it doesn't show up on a laptop or a small server with a single disk.
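A back-of-the-envelope sketch of why that matters, not a measurement: it assumes every replayed record needs one random read, all reads take the same time, and the hardware can service a fixed number of them at once. The function name and the figures in the usage lines are made up for illustration.

```python
import math


def replay_io_seconds(random_reads, seconds_per_read, concurrency=1):
    """Wall-clock time to issue the reads in batches of `concurrency`."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # Each batch of up to `concurrency` reads completes in one read latency.
    batches = math.ceil(random_reads / concurrency)
    return batches * seconds_per_read


if __name__ == "__main__":
    reads = 10_000      # hypothetical number of random page reads during replay
    latency = 0.008     # hypothetical 8 ms per random read
    print("one at a time:", replay_io_seconds(reads, latency))
    print("10 concurrent:", replay_io_seconds(reads, latency, concurrency=10))
```

In this simplified model the serialized replay takes about ten times as long as the same reads issued ten at a time, which is the gap between what the master's backends can generate and what a single recovery process can absorb.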

That's one of the first things I'm planning to tackle when the 8.4 dev cycle opens. And I'm planning to look at recovery times in general; I've never even measured it before so who knows what comes up.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com