Hi,
On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith <g...@2ndquadrant.com> wrote:

> On 7/26/13 9:14 AM, didier wrote:
>
>> During recovery you have to load the log in cache first before applying
>> WAL.
>
> Checkpoints exist to bound recovery time after a crash.  That is their
> only purpose.  What you're suggesting moves a lot of work into the recovery
> path, which will slow down how long it takes to process.

Yes, it's slower, but you're sequentially reading only one file, at most the
size of your buffer cache; moreover, it takes constant time.

Let's say you make a checkpoint and crash just after, with a next-to-empty
WAL. Now recovery is very fast, but you have to repopulate your cache with
random reads from requests. With the snapshot it's slower, but you read,
sequentially again, a lot of hot cache you will need later when the db starts
serving requests.

Of course the worst case is a crash just before a checkpoint: most of the
snapshot data is stale and will be overwritten by WAL ops. But if the WAL
recovery is CPU bound, loading from the snapshot may be done concurrently
while replaying the WAL.

> More work at recovery time means someone who uses the default of
> checkpoint_timeout='5 minutes', expecting that crash recovery won't take
> very long, will discover it does take a longer time now.  They'll be forced
> to shrink the value to get the same recovery time as they do currently.
> You might need to make checkpoint_timeout 3 minutes instead, if crash
> recovery now has all this extra work to deal with.  And when the time
> between checkpoints drops, it will slow the fundamental efficiency of
> checkpoint processing down.  You will end up writing out more data in the
> end.

Yes, it's a trade-off: now you're paying the price at checkpoint time, every
time; with the log you're paying only once, at recovery.

> The interval between checkpoints and recovery time are all related.  If
> you let any one side of the current requirements slip, it makes the rest
> easier to deal with.  Those are all trade-offs though, not improvements.
> And this particular one is already an option.
>
> If you want less checkpoint I/O per capita and don't care about recovery
> time, you don't need a code change to get it.  Just make checkpoint_timeout
> huge.  A lot of checkpoint I/O issues go away if you only do a checkpoint
> per hour, because instead of random writes you're getting sequential ones
> to the WAL.  But when you crash, expect to be down for a significant chunk
> of an hour, as you go back to sort out all of the work postponed before.

It's not the same: it's a snapshot, saved and loaded in constant time, unlike
the WAL.

Didier
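
To make the trade-off discussed above a bit more concrete, here is a rough back-of-envelope sketch in Python. It is not PostgreSQL code: the function names and every throughput figure are hypothetical, chosen only for illustration. It encodes two claims from this thread: the snapshot load costs a fixed time bounded by the buffer cache size, while WAL replay time grows with the amount of WAL written since the last checkpoint. Whether the two can really overlap during recovery is an assumption, not something that has been measured.

```python
def wal_replay_seconds(wal_mb, replay_mb_per_s):
    """Time to replay the WAL written since the last checkpoint."""
    return wal_mb / replay_mb_per_s


def snapshot_load_seconds(cache_mb, seq_read_mb_per_s):
    """Time to read the cache snapshot sequentially (bounded by cache size)."""
    return cache_mb / seq_read_mb_per_s


def recovery_seconds(wal_mb, replay_mb_per_s,
                     cache_mb=0.0, seq_read_mb_per_s=1.0,
                     concurrent=False):
    """Estimated crash recovery time.

    cache_mb=0 models recovery without a snapshot (WAL replay only).
    concurrent=True assumes the snapshot load fully overlaps WAL replay,
    which is the optimistic case when replay is CPU bound.
    """
    replay = wal_replay_seconds(wal_mb, replay_mb_per_s)
    load = snapshot_load_seconds(cache_mb, seq_read_mb_per_s)
    return max(replay, load) if concurrent else replay + load


if __name__ == "__main__":
    # Hypothetical figures: 1000 MB of WAL replayed at 50 MB/s,
    # an 8192 MB buffer cache read sequentially at 200 MB/s.
    print("WAL only:           ", recovery_seconds(1000, 50))
    print("snapshot, then WAL: ", recovery_seconds(1000, 50, 8192, 200))
    print("snapshot, overlapped:", recovery_seconds(1000, 50, 8192, 200, concurrent=True))
```

With these made-up numbers the snapshot adds a fixed cost regardless of how much WAL there is to replay, which is the point of contention: it lengthens the worst-case recovery time that checkpoint_timeout is meant to bound, in exchange for a warm cache once the db starts serving requests.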