Re: [HACKERS] Hot standby, recovery infra
Hi,

On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> That whole area was something I was leaving until last, since immediate
> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> this before Christmas, briefly.)

This problem remains in current HEAD. I mean, immediate shutdown may be
unable to kill the startup process, because the system() call that
executes restore_command ignores SIGQUIT while waiting. When I tried an
immediate shutdown during recovery, only the startup process survived.
This is undesirable behavior, I think.

Should the following code be added to RestoreArchivedFile()?

    if (WTERMSIG(rc) == SIGQUIT)
        exit(2);

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Fujii Masao wrote:
> This problem remains in current HEAD. I mean, immediate shutdown may be
> unable to kill the startup process, because the system() call that
> executes restore_command ignores SIGQUIT while waiting. When I tried an
> immediate shutdown during recovery, only the startup process survived.
> This is undesirable behavior, I think.

Yeah, we need to fix that.

> Should the following code be added to RestoreArchivedFile()?
>
>     if (WTERMSIG(rc) == SIGQUIT)
>         exit(2);

I don't see how that helps, as we already have this in there:

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

    ereport(signaled ? FATAL : DEBUG2,
            (errmsg("could not restore file \"%s\" from archive: return code %d",
                    xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with
SIGQUIT.

I think the real problem here is that pg_standby traps SIGQUIT. The
startup process doesn't receive the SIGQUIT because it's in system(), and
pg_standby doesn't propagate it to the startup process either, because it
traps it.

I think we should simply remove the signal handler for SIGQUIT from
pg_standby. Or will that lead to a core dump by default? In that case, we
need pg_standby to exit(128) or similar, so that RestoreArchivedFile()
understands that the command was killed by a signal.

Another approach is to check that the postmaster is still alive, like we
do in walwriter and bgwriter:

    /*
     * Emergency bailout if postmaster has died.  This is to avoid the
     * necessity for manual cleanup of all postmaster children.
     */
    if (!PostmasterIsAlive(true))
        exit(1);

However, I'm afraid there's a race condition with that. If we do that
right after system(), the postmaster might have signaled us but not
exited yet.

We could check that in the main loop, but if we wrongly interpret the
exit of the restore command as "file not found" and go ahead and start
up, the damage might be done by the time we notice that the postmaster is
gone.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
Hi,

On Fri, Feb 27, 2009 at 3:38 AM, Heikki Linnakangas
<heikki.linnakan...@enterprisedb.com> wrote:
> I think the real problem here is that pg_standby traps SIGQUIT. The
> startup process doesn't receive the SIGQUIT because it's in system(),
> and pg_standby doesn't propagate it to the startup process either,
> because it traps it.

Yes, you are right.

> I think we should simply remove the signal handler for SIGQUIT from
> pg_standby.

+1

> I don't see how that helps, as we already have this in there:
>
>     signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
>
>     ereport(signaled ? FATAL : DEBUG2,
>             (errmsg("could not restore file \"%s\" from archive: return code %d",
>                     xlogfname, rc)));
>
> which means we already ereport(FATAL) if the restore command dies with
> SIGQUIT.

SIGQUIT should kill the process immediately, so I think that the startup
process, as well as the other auxiliary processes, should call exit(2)
instead of ereport(FATAL). Right?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-26 at 20:38 +0200, Heikki Linnakangas wrote:
> I think we should simply remove the signal handler for SIGQUIT from
> pg_standby.

If you do this, please make it release-dependent, so pg_standby behaves
correctly for the release it is being used with.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> On Thu, 2009-02-26 at 20:38 +0200, Heikki Linnakangas wrote:
>> I think we should simply remove the signal handler for SIGQUIT from
>> pg_standby.
>
> If you do this, please make it release-dependent, so pg_standby behaves
> correctly for the release it is being used with.

Hmm, I don't think there's a way for pg_standby to know which version of
PostgreSQL is calling it. Assuming there is, how would you want it to
behave? If you want no change in behavior in old releases, can't we just
leave it unfixed in back-branches? In fact, it seems more useful not to
detect the server version, so that if you do want the new behavior, you
can use an 8.4 pg_standby against an 8.3 server.

In back-branches, I think we need to decide between fixing this, at the
risk of breaking someone's script that uses "killall -QUIT pg_standby" or
similar to trigger failover, and leaving it as it is, knowing that
immediate shutdown doesn't work on a standby server. I'm not sure which
is best.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
On Mon, 2009-02-09 at 17:13 +0200, Heikki Linnakangas wrote:
> Attached is an updated patch that does that, and I've fixed all the
> other outstanding issues I listed earlier as well. Now I'm feeling
> again that this is in pretty good shape.

UpdateMinRecoveryPoint() issues a DEBUG2 message even when we have not
updated the control file, leading to log-filling behaviour on an idle
system:

    DEBUG:  updated min recovery point to ...

We should just tuck the message into the if section above it.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> UpdateMinRecoveryPoint() issues a DEBUG2 message even when we have not
> updated the control file, leading to log-filling behaviour on an idle
> system:
>
>     DEBUG:  updated min recovery point to ...
>
> We should just tuck the message into the if section above it.

The outer if should ensure that it isn't printed repeatedly on an idle
system. But I agree it belongs inside the inner if section.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:
> The outer if should ensure that it isn't printed repeatedly on an idle
> system.

Regrettably not.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:
>> The outer if should ensure that it isn't printed repeatedly on an idle
>> system.
>
> Regrettably not.

Ok, committed. I fixed that, along with some comment changes. I also
renamed IsRecoveryProcessingMode() to RecoveryInProgress(), to avoid
confusion with the real processing modes defined in miscadmin.h. That
will probably cause you merge conflicts in the hot standby patch, but it
should be a matter of search-and-replace to fix.

The changes need to be documented. At least the removal of
log_restartpoints is a clear user-visible change.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
On Wed, 2009-02-18 at 18:01 +0200, Heikki Linnakangas wrote:
> Ok, committed.

Cool.

> I fixed that, along with some comment changes. I also renamed
> IsRecoveryProcessingMode() to RecoveryInProgress(), to avoid confusion
> with the real processing modes defined in miscadmin.h. That will
> probably cause you merge conflicts in the hot standby patch, but it
> should be a matter of search-and-replace to fix.

Yep, good change, agree with reasons.

> The changes need to be documented. At least the removal of
> log_restartpoints is a clear user-visible change.

Yep.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
>> - If you perform a fast shutdown while the startup process is waiting
>>   for the restore command, the startup process sometimes throws a
>>   FATAL error, which escalates into an immediate shutdown. That leads
>>   to different messages in the logs, and skipping of the shutdown
>>   restartpoint that we now otherwise perform.
>
> Sometimes?

I think what happens is that if the restore command receives the SIGTERM
and dies before the startup process that's waiting for the restore
command receives the SIGTERM, the startup process throws a FATAL error
because the restore command died unexpectedly.

I put this

    if (shutdown_requested && InRedo)
    {
        /* XXX: Is EndRecPtr always the right value here? */
        UpdateMinRecoveryPoint(EndRecPtr);
        proc_exit(0);
    }

right after the system(xlogRestoreCmd) call, to exit gracefully if we
were requested to shut down while the restore command was running, but it
seems that that's not enough, because of the race condition.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
On Fri, 2009-02-06 at 10:06 +0200, Heikki Linnakangas wrote:
> I think what happens is that if the restore command receives the
> SIGTERM and dies before the startup process that's waiting for the
> restore command receives the SIGTERM, the startup process throws a
> FATAL error because the restore command died unexpectedly.
>
> I put this
>
>     if (shutdown_requested && InRedo)
>     {
>         /* XXX: Is EndRecPtr always the right value here? */
>         UpdateMinRecoveryPoint(EndRecPtr);
>         proc_exit(0);
>     }
>
> right after the system(xlogRestoreCmd) call, to exit gracefully if we
> were requested to shut down while the restore command was running, but
> it seems that that's not enough, because of the race condition.

Can we trap the death of the restore command and handle it differently
from the death of the startup process?

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> I would suggest that at end of recovery we write the last LSN to the
>>> control file, so if we crash recover then we will always end archive
>>> recovery at the same place again should we re-enter it. So we would
>>> have a recovery_target_lsn that overrides recovery_target_xid etc.
>>
>> Hmm, we don't actually want to end recovery at the same point again:
>> if there's some updates right after the database came up, but before
>> the first checkpoint and crash, those actions need to be replayed too.
>
> I think we do need to. Crash recovery is supposed to return us to the
> same state. Where we ended ArchiveRecovery is part of that state. It
> isn't for crash recovery to decide that it saw a few more transactions
> and decided to apply them anyway. The user may have specifically ended
> recovery to avoid those additional changes from taking place.

Those additional changes were made in the standby, after ending recovery.
So the sequence of events is:

1. Standby performs archive recovery.
2. End of archive (or stop point) reached. Open for connections,
   read-write. Request an online checkpoint. Online checkpoint begins.
3. A user connects, and does "INSERT INTO foo VALUES (123)". Commits,
   commit returns.
4. pg_ctl stop -m immediate. The checkpoint started in step 2 hasn't
   finished yet.
5. Restart the server.

The server will now re-enter archive recovery. But this time, we want to
replay the INSERT too.

(let's do the startup checkpoint for now, as discussed elsewhere in this
thread)

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
>> I've changed the way minRecoveryPoint is updated now anyway, so it no
>> longer happens every XLogFileRead().
>
> Care to elucidate?

I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And
it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated
patch soon.

> Why do you think XLogFlush is called less frequently than XLogFileRead?

It's not, but we only need to update the control file when we're flushing
an LSN that's greater than the current minRecoveryPoint. And when we do
update minRecoveryPoint, we can update it to the LSN of the last record
we've read from the archive.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 10:07 +0200, Heikki Linnakangas wrote:
> Those additional changes were made in the standby, after ending
> recovery. So the sequence of events is:
>
> 1. Standby performs archive recovery.
> 2. End of archive (or stop point) reached. Open for connections,
>    read-write. Request an online checkpoint. Online checkpoint begins.
> 3. A user connects, and does "INSERT INTO foo VALUES (123)". Commits,
>    commit returns.
> 4. pg_ctl stop -m immediate. The checkpoint started in step 2 hasn't
>    finished yet.
> 5. Restart the server.
>
> The server will now re-enter archive recovery. But this time, we want
> to replay the INSERT too.

I agree with this, so let me restate both comments together.

When the server starts it begins a new timeline. When recovering we must
switch to that timeline at the same point we did previously, when we
ended archive recovery. We currently don't record when that is, and we
need to. Yes, we must also replay the records in the new timeline once we
have switched to it, as you say. But we must not replay any further in
the older timeline(s).

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 10:31 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Why do you think XLogFlush is called less frequently than
>> XLogFileRead?
>
> It's not, but we only need to update the control file when we're
> flushing an LSN that's greater than the current minRecoveryPoint. And
> when we do update minRecoveryPoint, we can update it to the LSN of the
> last record we've read from the archive.

So we might end up flushing more often *and* we will be doing it
potentially in the code path of other users.

This change seems speculative, and also against what has previously been
agreed with Tom. If he chooses not to comment on your changes, that's up
to him, but I don't think you should quietly remove things that have been
put there through the community process, as if they caused problems. I
feel like I'm in the middle here.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 09:26 +0000, Simon Riggs wrote:
> This change seems speculative, and also against what has previously
> been agreed with Tom. If he chooses not to comment on your changes,
> that's up to him, but I don't think you should quietly remove things
> that have been put there through the community process, as if they
> caused problems. I feel like I'm in the middle here.

This is only a problem because of two independent reviewers.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> So we might end up flushing more often *and* we will be doing it
> potentially in the code path of other users.

For example, imagine a database that fits completely in shared buffers.
If we update at every XLogFileRead, we have to fsync every 16MB of WAL.
If we update in XLogFlush the way I described, you only need to update
when we flush a page from the buffer cache, which will only happen at
restartpoints. That's far fewer updates.

Expanding that example to a database that doesn't fit in cache, you're
still replacing the pages from the buffer cache that have been untouched
for longest. Such pages will have an old LSN, too, so we shouldn't need
to update very often.

I'm sure you can come up with an example of where we end up fsyncing more
often, but it doesn't seem like the common case to me.

> This change seems speculative, and also against what has previously
> been agreed with Tom. If he chooses not to comment on your changes,
> that's up to him, but I don't think you should quietly remove things
> that have been put there through the community process, as if they
> caused problems. I feel like I'm in the middle here.

I'd like to have the extra protection that this approach gives. If we let
safeStartPoint be ahead of the actual WAL we've replayed, we have to just
assume we're fine if we reach end of WAL before reaching that point. That
assumption falls down if, e.g., recovery is stopped and you go and remove
the last few WAL segments from the archive before restarting it, or
signal pg_standby to trigger failover too early. Tracking the real safe
starting point and enforcing it always protects you from that.

(we did discuss this a week ago:
http://archives.postgresql.org/message-id/4981f6e0.2040...@enterprisedb.com)

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
> For example, imagine a database that fits completely in shared buffers.
> If we update at every XLogFileRead, we have to fsync every 16MB of WAL.
> If we update in XLogFlush the way I described, you only need to update
> when we flush a page from the buffer cache, which will only happen at
> restartpoints. That's far fewer updates.

Oh, did you change the bgwriter so it doesn't do normal page cleaning?

General thoughts: the latest HS patch has a CPU profile within 1-2% of
current, and the use of ProcArrayLock is fairly minimal now. The
additional CPU is recoveryStopsHere(), which enables the manual control
of recovery, so the trade-off seems worth it. The major CPU hog remains
RecordIsValid, which is the CRC checks.

Startup is still I/O bound. Specific avoidable I/O hogs are (1)
checkpoints, (2) page cleaning, so I hope we can avoid both of those. I
also hope to minimise the I/O cost of keeping track of the consistency
point. If that was done as infrequently as each restartpoint then I would
certainly be very happy, but that won't happen in the proposed scheme if
we do page cleaning.

> Expanding that example to a database that doesn't fit in cache, you're
> still replacing the pages from the buffer cache that have been
> untouched for longest. Such pages will have an old LSN, too, so we
> shouldn't need to update very often.

They will tend to be written in ascending LSN order, which will mean we
continually update the control file. Anything out of order does skip a
write. The better the cache is at finding LRU blocks, the more writes we
will make.

> I'm sure you can come up with an example of where we end up fsyncing
> more often, but it doesn't seem like the common case to me.

I'm not trying to come up with counterexamples...

> I'd like to have the extra protection that this approach gives. If we
> let safeStartPoint be ahead of the actual WAL we've replayed, we have
> to just assume we're fine if we reach end of WAL before reaching that
> point. That assumption falls down if, e.g., recovery is stopped and you
> go and remove the last few WAL segments from the archive before
> restarting it, or signal pg_standby to trigger failover too early.
> Tracking the real safe starting point and enforcing it always protects
> you from that.

Doing it this way will require you to remove the existing specific error
messages about ending before the end time of the backup, replaced by more
general ones that say "consistency not reached", which makes it harder to
figure out what to do about it.

> (we did discuss this a week ago:
> http://archives.postgresql.org/message-id/4981f6e0.2040...@enterprisedb.com)

Yes, we need to update it in that case. Though that is in no way
agreement to the other changes, nor does it require them.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> Oh, did you change the bgwriter so it doesn't do normal page cleaning?

No. Ok, that wasn't completely accurate. The page cleaning by bgwriter
will perform XLogFlushes, but that should be pretty insignificant. When
there's little page replacement going on, bgwriter will do a small
trickle of page cleaning, which won't matter much. If there's more page
replacement going on, bgwriter is cleaning up pages that will soon be
replaced, so it's just offsetting work from other backends (or the
startup process in this case).

> They will tend to be written in ascending LSN order, which will mean we
> continually update the control file. Anything out of order does skip a
> write. The better the cache is at finding LRU blocks, the more writes
> we will make.

When minRecoveryPoint is updated, it's not updated to just the LSN that's
being flushed. It's updated to the recptr of the most recently read WAL
record. That's an important point to avoid that behavior. Just like
XLogFlush normally always flushes all of the outstanding WAL, not just up
to the requested LSN.

> Doing it this way will require you to remove the existing specific
> error messages about ending before the end time of the backup, replaced
> by more general ones that say "consistency not reached", which makes it
> harder to figure out what to do about it.

Yeah. If that's an important distinction, we could still save the
original backup stop location somewhere, just so that we can give the old
error message when we've not passed that location. But perhaps a message
like "WAL ends before reaching a consistent state", with a hint like
"Make sure you archive all the WAL created during backup", would suffice.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
> No. Ok, that wasn't completely accurate. The page cleaning by bgwriter
> will perform XLogFlushes, but that should be pretty insignificant. When
> there's little page replacement going on, bgwriter will do a small
> trickle of page cleaning, which won't matter much.

Yes, that case is good, but it wasn't the use case we're trying to speed
up by having the bgwriter active during recovery. We're worried about I/O
bound recoveries.

> If there's more page replacement going on, bgwriter is cleaning up
> pages that will soon be replaced, so it's just offsetting work from
> other backends (or the startup process in this case).

Which only needs to be done if we follow this route, so is not an
argument in favour. Using more I/O in I/O-bound cases makes the worst
case even worse.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> Yes, that case is good, but it wasn't the use case we're trying to
> speed up by having the bgwriter active during recovery. We're worried
> about I/O bound recoveries.

Ok, let's do the math.

By updating minRecoveryPoint in XLogFileRead, you're fsyncing the control
file once every 16MB of WAL.

By updating in XLogFlush, the frequency depends on the amount of
shared_buffers available to buffer the modified pages, the average WAL
record size, and the cache hit ratio. Let's determine the worst case: the
smallest WAL record that dirties a page is a heap deletion record. That
contains just enough information to locate the tuple. If I'm reading the
headers right, that record is 48 bytes long (28 bytes of xlog header + 18
bytes of payload + padding). Assuming that the WAL is full of just those
records, and there's no full page images, and that the cache hit ratio is
0%, we will need

    (16 MB / 48 B) * 8 kB = 2730 MB

of shared_buffers to achieve the same once-per-16-MB-of-WAL fsync
frequency.

So if you have a lower shared_buffers setting than 2.7 GB, you can have
more frequent fsyncs this way in the worst case. If you think of the
typical case, you're probably not doing all deletes, and you're having a
non-zero cache hit ratio, so you achieve the same frequency with a much
lower shared_buffers setting. And if you're really that I/O bound, I
doubt the few extra fsyncs matter much.

Also note that when the control file is updated in XLogFlush, it's
typically the bgwriter doing it as it cleans buffers ahead of the clock
hand, not the startup process.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 14:18 +0200, Heikki Linnakangas wrote: when the control file is updated in XLogFlush, it's typically the bgwriter doing it as it cleans buffers ahead of the clock hand, not the startup process That is the key point. Let's do it your way. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote: - If bgwriter is performing a restartpoint when recovery ends, the startup checkpoint will be queued up behind the restartpoint. And since it uses the same smoothing logic as checkpoints, it can take quite some time for that to finish. The original patch had some code to hurry up the restartpoint by signaling the bgwriter if LWLockConditionalAcquire(CheckPointLock) fails, but there's a race condition with that if a restartpoint starts right after that check. We could let the bgwriter do the checkpoint too, and wait for it, but bgwriter might not be running yet, and we'd have to allow bgwriter to write WAL while disallowing it for all other processes, which seems quite complex. Seems like we need something like the LWLockConditionalAcquire approach, but built into CreateCheckPoint to eliminate the race condition. Seems straightforward? Hold the lock longer. - If you perform a fast shutdown while startup process is waiting for the restore command, startup process sometimes throws a FATAL error which then escalates into an immediate shutdown. That leads to different messages in the logs, and skipping of the shutdown restartpoint that we now otherwise perform. Sometimes? - It's not clear to me if the rest of the xlog flushing related functions, XLogBackgroundFlush, XLogNeedsFlush and XLogAsyncCommitFlush, need to work during recovery, and what they should do. XLogNeedsFlush should always return false in InRecoveryProcessingMode(). The WAL is already in the WAL files, not in wal_buffers anymore. XLogAsyncCommitFlush should contain Assert(!InRecoveryProcessingMode()) since it is called during a VACUUM FULL only. XLogBackgroundFlush should never be called during recovery because the WALWriter is never active in recovery. That should just be documented.
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Fujii Masao wrote: On Fri, Jan 30, 2009 at 11:55 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: The startup process now catches SIGTERM, and calls proc_exit() at the next WAL record. That's what will happen in a fast shutdown. Unexpected death of the startup process is treated the same as a backend/auxiliary process crash. If unexpected death of the startup process happens in automatic recovery after a crash, postmaster and bgwriter may get stuck. Because HandleChildCrash() can be called before the FatalError flag is reset. When FatalError is true, HandleChildCrash() doesn't kill any auxiliary processes. So, bgwriter survives the crash and postmaster waits for the death of bgwriter forever with recovery status (which means that new connections cannot be started). Is this a bug? Yes, and in fact I ran into it myself yesterday while testing. It seems that we should reset FatalError earlier, ie. when the recovery starts and bgwriter is launched. I'm not sure why in CVS HEAD we don't reset FatalError until after the startup process is finished. Resetting it as soon as all the processes have been terminated and startup process is launched again would seem like a more obvious place to do it. The only difference that I can see is that if someone tries to connect while the startup process is running, you now get a "the database system is in recovery mode" message instead of "the database system is starting up" if we're reinitializing after crash. We can keep that behavior, just need to add another flag to mean "reinitializing after crash" that isn't reset until the recovery is over. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: * I think we are now renaming the recovery.conf file too early. The comment says "We have already restored all the WAL segments we need from the archive, and we trust that they are not going to go away even if we crash." We have, but the files overwrite each other as they arrive, so if the last restartpoint is not in the last restored WAL file then it will only exist in the archive. The recovery.conf is the only place where we store the information on where the archive is and how to access it, so by renaming it out of the way we will be unable to crash recover until the first checkpoint is complete. So the way this was in the original patch is the correct way to go, AFAICS. I can see what you mean now. In fact we're not safe even when the last restartpoint is in the last restored WAL file, because we always restore segments from the archive to a temporary filename. Yes, renaming recovery.conf at the first checkpoint avoids that problem. However, it means that we'll re-enter archive recovery if we crash before that checkpoint is finished. Won't that cause havoc if more files have appeared in the archive since the crash, and we restore those even though we already started up from an earlier point? How do the timelines work in that case? We could avoid that by performing a good old startup checkpoint, but I quite like the fast failover time we get without it. * my original intention was to deprecate log_restartpoints and would still like to do so. log_checkpoints does just as well for that. Even less code than before... Ok, done. * comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version because it matches the IS_SHUTDOWN naming. Fixed. * In CreateCheckpoint() the if test on TruncateSubtrans() has been removed, but the comment has not been changed (to explain why). Thanks, comment updated.
(we now call CreateCheckPoint() only after recovery is finished) We should continue to measure performance of recovery in the light of these changes. I still feel that fsyncing the control file on each XLogFileRead() will give a noticeable performance penalty, mostly because we know doing exactly the same thing in normal running caused a performance penalty. But that is easily changed and cannot be done with any certainty without wider feedback, so no reason to delay code commit. I've changed the way minRecoveryPoint is updated now anyway, so it no longer happens every XLogFileRead(). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Wed, 2009-02-04 at 19:03 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: * I think we are now renaming the recovery.conf file too early. The comment says We have already restored all the WAL segments we need from the archive, and we trust that they are not going to go away even if we crash. We have, but the files overwrite each other as they arrive, so if the last restartpoint is not in the last restored WAL file then it will only exist in the archive. The recovery.conf is the only place where we store the information on where the archive is and how to access it, so by renaming it out of the way we will be unable to crash recover until the first checkpoint is complete. So the way this was in the original patch is the correct way to go, AFAICS. I can see what you mean now. In fact we're not safe even when the last restartpoint is in the last restored WAL file, because we always restore segments from the archive to a temporary filename. Yes, renaming recovery.conf at the first checkpoint avoids that problem. However, it means that we'll re-enter archive recovery if we crash before that checkpoint is finished. Won't that cause havoc if more files have appeared to the archive since the crash, and we restore those even though we already started up from an earlier point? How do the timelines work in that case? More archive files being added to archive is possible, but unlikely. What *is* a definite problem is that if we ended recovery by manual command (possible with HS patch) or recovery target then we would have no record of which point it was that we finished at. It is then possible that the recovery.conf has been re-edited, causing re-recovery to end at a different place. That seems like a bad thing. We could avoid that by performing a good old startup checkpoint, but I quite like the fast failover time we get without it. ISTM it's either slow failover or (fast failover, but restart archive recovery if crashes). 
I would suggest that at end of recovery we write the last LSN to the control file, so if we crash recover then we will always end archive recovery at the same place again should we re-enter it. So we would have a recovery_target_lsn that overrides recovery_target_xid etc.. Given where we are, I would suggest we go for the slow failover option in this release. Doing otherwise is a risky decision with little gain. BGwriter-in-recovery is a good thing of itself and we have other code to review yet with a higher importance level, and the rest of HS is code I'm actually much happier with. I've spent weeks trying to get the transition between recovery and normal running clean, but it seems like time to say I didn't get it right *enough* to be absolutely certain. (Just the fast failover part of patch!). Thanks for the review. We should continue to measure performance of recovery in the light of these changes. I still feel that fsyncing the control file on each XLogFileRead() will give a noticeable performance penalty, mostly because we know doing exactly the same thing in normal running caused a performance penalty. But that is easily changed and cannot be done with any certainty without wider feedback, so no reason to delay code commit. I've changed the way minRecoveryPoint is updated now anyway, so it no longer happens every XLogFileRead(). Care to elucidate? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Hi, On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Yes, and in fact I ran into it myself yesterday while testing. It seems that we should reset FatalError earlier, ie. when the recovery starts and bgwriter is launched. I'm not sure why in CVS HEAD we don't reset FatalError until after the startup process is finished. Resetting it as soon as all the processes have been terminated and startup process is launched again would seem like a more obvious place to do it. Which may repeat the recovery crash and reinitializing forever. To prevent this problem, unexpected death of the startup process should not cause reinitializing? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Fujii Masao masao.fu...@gmail.com writes: On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: ... I'm not sure why in CVS HEAD we don't reset FatalError until after the startup process is finished. Which may repeat the recovery crash and reinitializing forever. To prevent this problem, unexpected death of the startup process should not cause reinitializing? Fujii-san has it in one. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Tom Lane wrote: Fujii Masao masao.fu...@gmail.com writes: On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: ... I'm not sure why in CVS HEAD we don't reset FatalError until after the startup process is finished. Which may repeat the recovery crash and reinitializing forever. To prevent this problem, unexpected death of the startup process should not cause reinitializing? Fujii-san has it in one. In CVS HEAD, we always do ExitPostmaster(1) if the startup process dies unexpectedly, regardless of FatalError. So it serves no such purpose there. But yeah, we do have that problem with the patch. What do we want to do if the startup process dies with FATAL? It seems reasonable to assume that the problem isn't going to just go away if we restart: we'd do exactly the same sequence of actions after restart. But if it's after reaching consistent recovery, the system should still be in a consistent state, and we could stay open for read-only transactions. That would be useful to recover a corrupted database/WAL; you could let the WAL replay run as far as it can, and then connect and investigate the situation, or run pg_dump. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: We could avoid that by performing a good old startup checkpoint, but I quite like the fast failover time we get without it. ISTM it's either slow failover or (fast failover, but restart archive recovery if crashes). I would suggest that at end of recovery we write the last LSN to the control file, so if we crash recover then we will always end archive recovery at the same place again should we re-enter it. So we would have a recovery_target_lsn that overrides recovery_target_xid etc.. Hmm, we don't actually want to end recovery at the same point again: if there's some updates right after the database came up, but before the first checkpoint and crash, those actions need to be replayed too. Given where we are, I would suggest we go for the slow failover option in this release. Agreed. We could do it for crash recovery, but I'd rather not have two different ways to do it. It's not as important for crash recovery, either. We should continue to measure performance of recovery in the light of these changes. I still feel that fsyncing the control file on each XLogFileRead() will give a noticeable performance penalty, mostly because we know doing exactly the same thing in normal running caused a performance penalty. But that is easily changed and cannot be done with any certainty without wider feedback, so no reason to delay code commit. I've changed the way minRecoveryPoint is updated now anyway, so it no longer happens every XLogFileRead(). Care to elucidate? I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated patch soon. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote: I've changed the way minRecoveryPoint is updated now anyway, so it no longer happens every XLogFileRead(). Care to elucidate? I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated patch soon. Why do you think XLogFlush is called less frequently than XLogFileRead? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: We could avoid that by performing a good old startup checkpoint, but I quite like the fast failover time we get without it. ISTM it's either slow failover or (fast failover, but restart archive recovery if crashes). I would suggest that at end of recovery we write the last LSN to the control file, so if we crash recover then we will always end archive recovery at the same place again should we re-enter it. So we would have a recovery_target_lsn that overrides recovery_target_xid etc.. Hmm, we don't actually want to end recovery at the same point again: if there's some updates right after the database came up, but before the first checkpoint and crash, those actions need to be replayed too. I think we do need to. Crash recovery is supposed to return us to the same state. Where we ended ArchiveRecovery is part of that state. It isn't for crash recovery to decide that it saw a few more transactions and decided to apply them anyway. The user may have specifically ended recovery to avoid those additional changes from taking place. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Hi, On Fri, Jan 30, 2009 at 11:55 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: The startup process now catches SIGTERM, and calls proc_exit() at the next WAL record. That's what will happen in a fast shutdown. Unexpected death of the startup process is treated the same as a backend/auxiliary process crash. If unexpected death of the startup process happens in automatic recovery after a crash, postmaster and bgwriter may get stuck. Because HandleChildCrash() can be called before the FatalError flag is reset. When FatalError is true, HandleChildCrash() doesn't kill any auxiliary processes. So, bgwriter survives the crash and postmaster waits for the death of bgwriter forever with recovery status (which means that new connections cannot be started). Is this a bug? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Sat, 2009-01-31 at 22:32 +0200, Heikki Linnakangas wrote: If you poison your WAL archive with a XLOG_CRASH_RECOVERY record, recovery will never be able to proceed over that point. There would have to be a switch to ignore those records, at the very least. Definitely in assert mode only. I'll do it as a test patch and keep it separate from main line. You don't really need to do it with a new WAL record. You could just add a GUC or recovery.conf option along the lines of recovery_target: crash_target=0/123456, and check for that in ReadRecord or wherever you want the crash to occur. Knowing that LSN is somewhat harder. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Sat, 2009-01-31 at 22:41 +0200, Heikki Linnakangas wrote: I like this way because it means we might in the future get Startup process to perform post-recovery actions also. Yeah, it does. Do you have something in mind already? Yes, but nothing that needs to be discussed yet. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Fri, 2009-01-30 at 13:15 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: I'm thinking to add a new function that will make crash testing easier. pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY, which when replayed will just throw a FATAL error and crash the Startup process. We won't be adding that to the user docs... This will allow us to produce tests that crash the server at specific places, rather than trying to trap those points manually. Heh, talk about a footgun ;-). I don't think including that in CVS is a good idea. Thought you'd like it. I'd have preferred it in a plugin... :-( Not really sure why it's a problem though. We support pg_ctl stop -m immediate, which is the equivalent, yet we don't regard that as a footgun. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Fri, 2009-01-30 at 13:25 +0200, Heikki Linnakangas wrote: That whole area was something I was leaving until last, since immediate shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed this before Christmas, briefly). We must handle shutdown gracefully, can't just leave bgwriter running after postmaster exit. Nobody was suggesting we should. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Fri, 2009-01-30 at 16:55 +0200, Heikki Linnakangas wrote: Ok, here's an attempt to make shutdown work gracefully. Startup process now signals postmaster three times during startup: first when it has done all the initialization, and starts redo. At that point, postmaster launches bgwriter, which starts to perform restartpoints when it deems appropriate. The 2nd signal is sent when we've reached consistent recovery state. As the patch stands, that's not significant, but it will be with all the rest of the hot standby stuff. The 3rd signal is sent when startup process has finished recovery. Postmaster used to wait for the startup process to exit, and check the return code to determine that, but now that we support shutdown, startup process also returns with 0 exit code when it has been requested to terminate. Yeh, seems much cleaner. Slightly bizarre though cos now we're pretty much back to my originally proposed design. C'est la vie. I like this way because it means we might in the future get Startup process to perform post-recovery actions also. The startup process now catches SIGTERM, and calls proc_exit() at the next WAL record. That's what will happen in a fast shutdown. Unexpected death of the startup process is treated the same as a backend/auxiliary process crash. Good. Like your re-arrangement of StartupProcessMain also. Your call to PMSIGNAL_RECOVERY_COMPLETED needs to be inside if (IsUnderPostmaster), or at least a comment to explain why not or perhaps an Assert. Think you need to just throw away this chunk: @@ -5253,7 +5386,7 @@ StartupXLOG(void) * Complain if we did not roll forward far enough to render the backup * dump consistent.
*/ - if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint)) + if (InRecovery && !reachedSafeStartPoint) { if (reachedStopPoint) /* stopped because of stop request */ ereport(FATAL, -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: On Fri, 2009-01-30 at 13:15 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: I'm thinking to add a new function that will make crash testing easier. pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY, which when replayed will just throw a FATAL error and crash the Startup process. We won't be adding that to the user docs... This will allow us to produce tests that crash the server at specific places, rather than trying to trap those points manually. Heh, talk about a footgun ;-). I don't think including that in CVS is a good idea. Thought you'd like it. I'd have preferred it in a plugin... :-( Not really sure why it's a problem though. We support pg_ctl stop -m immediate, which is the equivalent, yet we don't regard that as a footgun. If you poison your WAL archive with a XLOG_CRASH_RECOVERY record, recovery will never be able to proceed over that point. There would have to be a switch to ignore those records, at the very least. pg_ctl stop -m immediate has some use in a production system, while this would be a pure development aid. For that purpose, you might as well use a patched version. Presumably you want to test different kinds of crashes and at different points. FATAL, PANIC, segfault etc. A single crash mechanism like that will only test one path. You don't really need to do it with a new WAL record. You could just add a GUC or recovery.conf option along the lines of recovery_target: crash_target=0/123456, and check for that in ReadRecord or wherever you want the crash to occur. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: On Fri, 2009-01-30 at 16:55 +0200, Heikki Linnakangas wrote: Ok, here's an attempt to make shutdown work gracefully. Startup process now signals postmaster three times during startup: first when it has done all the initialization, and starts redo. At that point, postmaster launches bgwriter, which starts to perform restartpoints when it deems appropriate. The 2nd signal is sent when we've reached consistent recovery state. As the patch stands, that's not significant, but it will be with all the rest of the hot standby stuff. The 3rd signal is sent when startup process has finished recovery. Postmaster used to wait for the startup process to exit, and check the return code to determine that, but now that we support shutdown, startup process also returns with 0 exit code when it has been requested to terminate. Yeh, seems much cleaner. Slightly bizarre though cos now we're pretty much back to my originally proposed design. C'est la vie. Yep. I didn't see any objections to that approach in the archives. There were other problems in the early versions of the patch, but nothing related to this arrangement. I like this way because it means we might in the future get Startup process to perform post-recovery actions also. Yeah, it does. Do you have something in mind already? Your call to PMSIGNAL_RECOVERY_COMPLETED needs to be inside if (IsUnderPostmaster), or at least a comment to explain why not or perhaps an Assert. Nah, StartupProcessMain is only run under postmaster; you don't want to install signal handlers in a stand-alone backend. A stand-alone backend calls StartupXLOG directly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
I just realized that the new minSafeStartPoint is actually exactly the same concept as the existing minRecoveryPoint. As the recovery progresses, we could advance minRecoveryPoint just as well as the new minSafeStartPoint. Perhaps it's a good idea to keep them separate anyway though, the original minRecoveryPoint might be a useful debugging aid. Or what do you think? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 20:35 +0200, Heikki Linnakangas wrote: Hmm, another point of consideration is how this interacts with the pause/continue. In particular, it was suggested earlier that you could put an option into recovery.conf to start in paused mode. If you pause recovery, and then stop and restart the server, and have that option in recovery.conf, I would expect that when you enter consistent recovery you're at the exact same paused location as before stopping the server. The upshot of that is that we need to set minSafeStartPoint to that exact location, at least when you pause and stop in a controlled fashion. OK, makes sense. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote: Heikki Linnakangas wrote: It looks like if you issue a fast shutdown during recovery, postmaster doesn't kill bgwriter. Hmm, seems like we haven't thought through how shutdown during consistent recovery is supposed to behave in general. Right now, smart shutdown doesn't do anything during consistent recovery, because the startup process will just keep going. And fast shutdown will simply ExitPostmaster(1), which is clearly not right. That whole area was something I was leaving until last, since immediate shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed this before Christmas, briefly). I'm thinking that in both smart and fast shutdown, the startup process should exit in a controlled way as soon as it's finished with the current WAL record, and set minSafeStartPoint to the current point in the replay. That makes sense, though isn't required. I wonder if bgwriter should perform a restartpoint before exiting? You'll have to start with recovery on the next startup anyway, but at least we could minimize the amount of WAL that needs to be replayed. That seems like extra work for no additional benefit. I think we're beginning to blur the lines between review and you just adding some additional stuff in this area. There's nothing to stop you doing further changes after this has been committed. We can also commit what we have with some caveats also, i.e. commit in pieces. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Fri, 2009-01-30 at 11:33 +0200, Heikki Linnakangas wrote: I just realized that the new minSafeStartPoint is actually exactly the same concept as the existing minRecoveryPoint. As the recovery progresses, we could advance minRecoveryPoint just as well as the new minSafeStartPoint. Perhaps it's a good idea to keep them separate anyway though, the original minRecoveryPoint might be a useful debugging aid. Or what do you think? I think we've been confusing ourselves substantially. The patch already has everything it needs, but there is a one-line-fixable bug where Fujii-san says. The code comments already explain how this works:

 * There are two points in the log that we must pass. The first
 * is minRecoveryPoint, which is the LSN at the time the
 * base backup was taken that we are about to rollforward from.
 * If recovery has ever crashed or was stopped there is also
 * another point: minSafeStartPoint, which is the
 * latest LSN that recovery could have reached prior to crash.

The later message FATAL: "WAL ends before end time of backup dump" was originally triggered by

 if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))

and I changed that. Now I look at it again, I see that the original if test, shown above, is correct and should not have been changed. Other than that, I don't see the need for further change. Heikki's suggestions to write a new minSafeStartPoint are good ones and fit within the existing mechanisms and meanings of these variables. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 14:21 +0200, Heikki Linnakangas wrote: It looks like if you issue a fast shutdown during recovery, postmaster doesn't kill bgwriter. Thanks for the report. I'm thinking of adding a new function to make crash testing easier. pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY, which when replayed will just throw a FATAL error and crash the Startup process. We won't be adding that to the user docs... This will allow us to produce tests that crash the server at specific places, rather than trying to trap those points manually. Seems that reaper() needs to be taught that bgwriter can be active during consistent recovery. I'll take a look at how to do that. BTW, the message "terminating connection ..." is a bit misleading. It's referring to the startup process, which is hardly a connection. We have that in CVS HEAD too, so it's not something introduced by the patch, but seems worth changing in HS, since we then let real connections in while startup process is still running. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: I'm thinking to add a new function that will allow crash testing easier. pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY, which when replayed will just throw a FATAL error and crash Startup process. We won't be adding that to the user docs... This will allow us to produce tests that crash the server at specific places, rather than trying to trap those points manually. Heh, talk about a footgun ;-). I don't think including that in CVS is a good idea. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote: Hmm, seems like we haven't thought through how shutdown during consistent recovery is supposed to behave in general. Right now, smart shutdown doesn't do anything during consistent recovery, because the startup process will just keep going. And fast shutdown will simply ExitPostmaster(1), which is clearly not right. That whole area was something I was leaving until last, since immediate shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed this before Christmas, briefly). We must handle shutdown gracefully, can't just leave bgwriter running after postmaster exit. Hmm, why does pg_standby catch SIGQUIT? Seems it could just let it kill the process. I wonder if bgwriter should perform a restartpoint before exiting? You'll have to start with recovery on the next startup anyway, but at least we could minimize the amount of WAL that needs to be replayed. That seems like extra work for no additional benefit. I think we're beginning to blur the lines between review and you just adding some additional stuff in this area. There's nothing to stop you doing further changes after this has been committed. Sure. I think the shutdown restartpoint might actually fall out of the way the code is structured anyway: bgwriter normally performs a checkpoint before exiting. We can also commit what we have with some caveats also, i.e. commit in pieces. This late in the release cycle, I don't want to commit anything that we would have to rip out if we run out of time. There is no difference from review or testing point of view whether the code is in CVS or not. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 09:34 +0200, Heikki Linnakangas wrote: It does *during recovery*, before InitXLogAccess is called. Yeah, it's harmless currently. It would be pretty hard to keep it up-to-date in bgwriter and other processes. I think it's better to keep it at 0, which is clearly an invalid value, than try to keep it up-to-date and risk using an old value. TimeLineID is not used in a lot of places, currently. It might be a good idea to add some Assert(TimeLineID != 0) to places where it is used. Agreed. TimeLineID is a normal-running concept used for writing WAL. Perhaps we should even solidify the meaning of TimeLineID == 0 as "cannot write WAL". I see a problem there for any process that exists both before and after recovery ends, which includes bgwriter. In that case we must not flush WAL before recovery ends, yet afterwards we *must* flush WAL. To do that we *must* have a valid TimeLineID set. I would suggest we put InitXLogAccess into IsRecoveryProcessingMode(), so if the mode changes we immediately set everything we need to allow XLogFlush() calls to work correctly. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
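Simon's suggestion above, having IsRecoveryProcessingMode() itself perform InitXLogAccess() on the recovery-to-normal transition, can be sketched roughly as follows. This is a minimal, self-contained C sketch, not the actual patch code: the shared-memory flag is simulated with a plain global, and InitXLogAccess() is reduced to setting a timeline.

```c
#include <assert.h>
#include <stdbool.h>

/* stand-ins for shared memory and process-local state (illustrative only) */
static bool SharedRecoveryProcessingMode = true; /* set false when recovery ends */
static bool LocalRecoveryProcessingMode = true;  /* true = "not known yet" */
static int  ThisTimeLineID = 0;                  /* 0 = cannot write WAL */

static void
InitXLogAccess(void)
{
    /* the real function would read the timeline and latest record pointer */
    ThisTimeLineID = 1;
}

static bool
IsRecoveryProcessingMode(void)
{
    /* once we've seen recovery end, the answer can never change back */
    if (!LocalRecoveryProcessingMode)
        return false;

    /* consult shared state; on the transition out of recovery, immediately
     * set up everything needed for XLogFlush() calls to work correctly */
    if (!SharedRecoveryProcessingMode)
    {
        LocalRecoveryProcessingMode = false;
        InitXLogAccess();
    }
    return LocalRecoveryProcessingMode;
}
```

Because the result is cached locally, a process that calls this frequently (such as bgwriter) avoids touching shared memory on every check, and its timeline becomes valid exactly when it first observes that recovery has ended.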
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote: Hi, On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote: I feel quite good about this patch now. Given the amount of code churn, it requires testing, and I'll read it through one more time after sleeping over it. Simon, do you see anything wrong with this? I also read this patch and found something odd. I apologize if I misread it.. If archive recovery fails after it reaches the last valid record in the last unfilled WAL segment, subsequent recovery might cause the following fatal error. This is because minSafeStartPoint indicates the end of the last unfilled WAL segment which subsequent recovery cannot reach. Is this bug? (I'm not sure how to fix this problem because I don't understand yet why minSafeStartPoint is required.) FATAL: WAL ends before end time of backup dump I think you're right. We need a couple of changes to avoid confusing messages. Hmm, we could update minSafeStartPoint in XLogFlush instead. That was suggested when the idea of minSafeStartPoint was first thought of. Updating minSafeStartPoint is analogous to flushing WAL: minSafeStartPoint must be advanced to X before we can flush a data page with LSN X. To avoid excessive controlfile updates, whenever we update minSafeStartPoint, we can update it to the latest WAL record we've read. Or we could simply ignore that error if we've reached minSafeStartPoint - 1 segment, assuming that even though minSafeStartPoint is higher, we can't have gone past the end of valid WAL records in the last segment in previous recovery either. But that feels more fragile. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
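Heikki's analogy above, that minSafeStartPoint must be advanced before flushing a data page with a given LSN, with control-file writes batched by jumping straight to the latest record read, might look like this minimal sketch. All names here are hypothetical, with plain integers standing in for XLogRecPtr:

```c
#include <assert.h>

typedef unsigned long FakeRecPtr;        /* stand-in for XLogRecPtr */

static FakeRecPtr minSafeStartPoint = 0; /* what pg_control promises */
static FakeRecPtr lastRecordRead = 0;    /* latest WAL record replayed */
static int controlFileUpdates = 0;       /* counts control file writes */

/* Before flushing a data page with LSN flushUpTo, make sure
 * minSafeStartPoint covers it; when an update is needed, advance all the
 * way to the latest record read so nearby flushes need no further writes. */
static void
AdvanceMinSafeStartPoint(FakeRecPtr flushUpTo)
{
    if (flushUpTo > minSafeStartPoint)
    {
        minSafeStartPoint = lastRecordRead;
        controlFileUpdates++;            /* stands in for UpdateControlFile() */
    }
}
```

The over-advancement is what keeps control-file traffic low: one write covers every page whose LSN falls at or below the latest record already read.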
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 11:20 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote: Hi, On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote: I feel quite good about this patch now. Given the amount of code churn, it requires testing, and I'll read it through one more time after sleeping over it. Simon, do you see anything wrong with this? I also read this patch and found something odd. I apologize if I misread it.. If archive recovery fails after it reaches the last valid record in the last unfilled WAL segment, subsequent recovery might cause the following fatal error. This is because minSafeStartPoint indicates the end of the last unfilled WAL segment which subsequent recovery cannot reach. Is this bug? (I'm not sure how to fix this problem because I don't understand yet why minSafeStartPoint is required.) FATAL: WAL ends before end time of backup dump I think you're right. We need a couple of changes to avoid confusing messages. Hmm, we could update minSafeStartPoint in XLogFlush instead. That was suggested when the idea of minSafeStartPoint was first thought of. Updating minSafeStartPoint is analogous to flushing WAL: minSafeStartPoint must be advanced to X before we can flush a data page with LSN X. To avoid excessive controlfile updates, whenever we update minSafeStartPoint, we can update it to the latest WAL record we've read. Or we could simply ignore that error if we've reached minSafeStartPoint - 1 segment, assuming that even though minSafeStartPoint is higher, we can't have gone past the end of valid WAL records in the last segment in previous recovery either. But that feels more fragile. My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show that we are still recovering up to the point of the end of the base backup. 
Once we reach minSafeStartPoint we then switch state to DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then enables writing new minSafeStartPoints when we open new WAL files in the future. We then have messages only when in DB_IN_ARCHIVE_RECOVERY_BASE state:

 if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint) &&
     ControlFile->state == DB_IN_ARCHIVE_RECOVERY_BASE)
 {
     if (reachedStopPoint)
         /* stopped because of stop request */
         ereport(FATAL,
                 (errmsg("requested recovery stop point is before end time of backup dump")));
     else
         /* ran off end of WAL */
         ereport(FATAL,
                 (errmsg("WAL ends before end time of backup dump")));
 }

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
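Simon's proposed state transition could be sketched like this. Names and the comparison are illustrative, not the actual patch; the enum values mirror the control file states proposed above:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    DB_IN_ARCHIVE_RECOVERY_BASE,  /* still replaying the base backup */
    DB_IN_ARCHIVE_RECOVERY        /* past it; base-backup errors no longer apply */
} SketchDBState;

static SketchDBState state = DB_IN_ARCHIVE_RECOVERY_BASE;
static bool baseBackupReached = false;

/* called as records are replayed; plain integers stand in for LSNs */
static void
AdvanceRecoveryState(unsigned long replayedUpTo, unsigned long minSafeStartPoint)
{
    if (state == DB_IN_ARCHIVE_RECOVERY_BASE &&
        replayedUpTo >= minSafeStartPoint)
    {
        state = DB_IN_ARCHIVE_RECOVERY;
        baseBackupReached = true;  /* enables writing new minSafeStartPoints */
    }
}
```

The one-way transition is the point: once the base backup's end has been passed, a later crash and restart begins in DB_IN_ARCHIVE_RECOVERY and the "WAL ends before end time of backup dump" error is suppressed.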
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show that we are still recovering up to the point of the end of the base backup. Once we reach minSafeStartPoint we then switch state to DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then enables writing new minSafeStartPoints when we open new WAL files in the future. I don't see how that helps, the bug has nothing to with base backups. It comes from the fact that we set minSafeStartPoint beyond the actual end of WAL, if the last WAL segment is only partially filled (= fails CRC check at some point). If we crash after setting minSafeStartPoint like that, and then restart recovery, we'll get the error. The last WAL segment could be partially filled for example because the DBA has manually copied the last unarchived WAL segments to pg_xlog, as we recommend in the manual. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
It looks like if you issue a fast shutdown during recovery, postmaster doesn't kill bgwriter.

 ...
 LOG:  restored log file "00010028" from archive
 LOG:  restored log file "00010029" from archive
 LOG:  consistent recovery state reached at 0/295C
 ...
 LOG:  restored log file "0001002F" from archive
 LOG:  restored log file "00010030" from archive
 LOG:  restored log file "00010031" from archive
 LOG:  restored log file "00010032" from archive
 LOG:  restored log file "00010033" from archive
 LOG:  restartpoint starting: time
 LOG:  restored log file "00010034" from archive
 LOG:  received fast shutdown request
 LOG:  restored log file "00010035" from archive
 FATAL:  terminating connection due to administrator command
 LOG:  startup process (PID 14137) exited with exit code 1
 LOG:  aborting startup due to startup process failure
 hlinn...@heikkilaptop:~/pgsql$
 hlinn...@heikkilaptop:~/pgsql$ LOG:  restartpoint complete: wrote 3324 buffers (5.1%); write=13.996 s, sync=2.016 s, total=16.960 s
 LOG:  recovery restart point at 0/3000FA14

Seems that reaper() needs to be taught that bgwriter can be active during consistent recovery. I'll take a look at how to do that. BTW, the message "terminating connection ..." is a bit misleading. It's referring to the startup process, which is hardly a connection. We have that in CVS HEAD too, so it's not something introduced by the patch, but seems worth changing in HS, since we then let real connections in while startup process is still running. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote: Simon Riggs wrote: My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show that we are still recovering up to the point of the end of the base backup. Once we reach minSafeStartPoint we then switch state to DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then enables writing new minSafeStartPoints when we open new WAL files in the future. I don't see how that helps, the bug has nothing to with base backups. Sorry, disagree. It comes from the fact that we set minSafeStartPoint beyond the actual end of WAL, if the last WAL segment is only partially filled (= fails CRC check at some point). If we crash after setting minSafeStartPoint like that, and then restart recovery, we'll get the error. Look again please. My proposal would avoid the error when it is not relevant, yet keep it when it is (while recovering base backups). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote: It comes from the fact that we set minSafeStartPoint beyond the actual end of WAL, if the last WAL segment is only partially filled (= fails CRC check at some point). If we crash after setting minSafeStartPoint like that, and then restart recovery, we'll get the error. Look again please. My proposal would avoid the error when it is not relevant, yet keep it when it is (while recovering base backups). I fail to see what base backups have to do with this. The problem arises in this scenario:

0. A base backup is unzipped. recovery.conf is copied in place, and the remaining unarchived WAL segments are copied from the primary server to pg_xlog. The last WAL segment is only partially filled. Let's say that the redo point is in WAL segment 1. The last, partial, WAL segment is 3, and WAL ends at 0/350
1. postmaster is started, recovery starts.
2. WAL segment 1 is restored from archive.
3. We reach consistent recovery point.
4. We restore WAL segment 2 from archive. minSafeStartPoint is advanced to 0/300
5. WAL segment 2 is completely replayed, we move on to WAL segment 3. It is not in archive, but it's found in pg_xlog. minSafeStartPoint is advanced to 0/400. Note that that's beyond end of WAL.
6. At replay of WAL record 0/320, the recovery is interrupted. For example, by a fast shutdown request, or crash.

Now when we restart the recovery, we will never reach minSafeStartPoint, which is now 0/400, and we'll fail with the error that Fujii-san pointed out. We're already way past the min recovery point of base backup by then. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote: Now when we restart the recovery, we will never reach minSafeStartPoint, which is now 0/400, and we'll fail with the error that Fujii-san pointed out. We're already way past the min recovery point of base backup by then. The problem was that we reported this error: FATAL: "WAL ends before end time of backup dump" and this is inappropriate because, as you say, we are way past the min recovery point of base backup. If you look again at my proposal you will see that the proposal avoids the above error by keeping track of whether we are past the point of base backup or not. If we are still in base backup we get the error and if we are past it we do not. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote: On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote: Now when we restart the recovery, we will never reach minSafeStartPoint, which is now 0/400, and we'll fail with the error that Fujii-san pointed out. We're already way past the min recovery point of base backup by then. The problem was that we reported this error: FATAL: "WAL ends before end time of backup dump" and this is inappropriate because, as you say, we are way past the min recovery point of base backup. If you look again at my proposal you will see that the proposal avoids the above error by keeping track of whether we are past the point of base backup or not. If we are still in base backup we get the error and if we are past it we do not. Oh, we would simply ignore the fact that we haven't reached the minSafeStartPoint at the end of recovery, and start up anyway. Ok, that would avoid the problem Fujii-san described. It's like my suggestion of ignoring the message if we're at minSafeStartPoint - 1 segment, just more lenient. I don't understand why you'd need a new control file state, though. You'd lose the extra protection minSafeStartPoint gives, though. For example, if you interrupt recovery and move recovery point backwards, we could refuse to start up when it's not safe to do so. It's currently a "don't do that!" case, but we could protect against that with minSafeStartPoint. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
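Heikki's leniency idea, accepting end-of-WAL if recovery got to within one segment of minSafeStartPoint, reduces to a simple check. A hedged sketch only, with plain integers standing in for LSNs and the default 16MB segment size assumed:

```c
#include <assert.h>
#include <stdbool.h>

#define XLOG_SEG_SIZE (16UL * 1024 * 1024)  /* default WAL segment size */

/* Returns true if stopping at endOfLog is acceptable: either we reached the
 * promised point, or we fell short by at most one segment, assuming the
 * previous recovery also stopped inside that partial last segment. */
static bool
EndOfWalAcceptable(unsigned long endOfLog, unsigned long minSafeStartPoint)
{
    if (endOfLog >= minSafeStartPoint)
        return true;
    return (minSafeStartPoint - endOfLog) <= XLOG_SEG_SIZE;
}
```

As the thread notes, this is more fragile than Simon's state-based approach: it tolerates any shortfall within a segment, not just one caused by a partially filled last segment.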
Re: [HACKERS] Hot standby, recovery infra
Heikki Linnakangas wrote: It looks like if you issue a fast shutdown during recovery, postmaster doesn't kill bgwriter. Hmm, seems like we haven't thought through how shutdown during consistent recovery is supposed to behave in general. Right now, smart shutdown doesn't do anything during consistent recovery, because the startup process will just keep going. And fast shutdown will simply ExitPostmaster(1), which is clearly not right. I'm thinking that in both smart and fast shutdown, the startup process should exit in a controlled way as soon as it's finished with the current WAL record, and set minSafeStartPoint to the current point in the replay. I wonder if bgwriter should perform a restartpoint before exiting? You'll have to start with recovery on the next startup anyway, but at least we could minimize the amount of WAL that needs to be replayed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Heikki Linnakangas wrote: Simon Riggs wrote: On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote: Now when we restart the recovery, we will never reach minSafeStartPoint, which is now 0/400, and we'll fail with the error that Fujii-san pointed out. We're already way past the min recovery point of base backup by then. The problem was that we reported this error: FATAL: "WAL ends before end time of backup dump" and this is inappropriate because, as you say, we are way past the min recovery point of base backup. If you look again at my proposal you will see that the proposal avoids the above error by keeping track of whether we are past the point of base backup or not. If we are still in base backup we get the error and if we are past it we do not. Oh, we would simply ignore the fact that we haven't reached the minSafeStartPoint at the end of recovery, and start up anyway. Ok, that would avoid the problem Fujii-san described. It's like my suggestion of ignoring the message if we're at minSafeStartPoint - 1 segment, just more lenient. I don't understand why you'd need a new control file state, though. You'd lose the extra protection minSafeStartPoint gives, though. For example, if you interrupt recovery and move recovery point backwards, we could refuse to start up when it's not safe to do so. It's currently a "don't do that!" case, but we could protect against that with minSafeStartPoint. Hmm, another point of consideration is how this interacts with the pause/continue. In particular, it was suggested earlier that you could put an option into recovery.conf to start in paused mode. If you pause recovery, and then stop and restart the server, and have that option in recovery.conf, I would expect that when you enter consistent recovery you're at the exact same paused location as before stopping the server. The upshot of that is that we need to set minSafeStartPoint to that exact location, at least when you pause & stop in a controlled fashion. 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Hot standby, recovery infra
I've been reviewing and massaging the so called recovery infra patch. To recap, the goal is to:

- start background writer during (archive) recovery
- skip the shutdown checkpoint at the end of recovery. Instead, the database is brought up immediately, and the bgwriter performs a normal online checkpoint, while we're already accepting connections.
- keep track of when we reach a consistent point in the recovery, where we could let read-only backends in. Which is obviously required for hot standby

The 1st and 2nd points provide some useful functionality, even without the rest of the hot standby patch. I've refactored the patch quite heavily, making it a lot simpler, and over 1/3 smaller than before: The signaling between the bgwriter and startup process during recovery was quite complicated. The startup process periodically sent checkpoint records to the bgwriter, so that bgwriter could perform restart points. I've replaced that by storing the last seen checkpoint in shared memory in xlog.c. CreateRestartPoint() picks it up from there. This means that bgwriter can decide autonomously when to perform a restart point, it no longer needs to be told to do so by the startup process. Which is nice in a standby. What could happen before is that the standby processes a checkpoint record, and decides not to make it a restartpoint because not enough time has passed since last one. If we then get a long idle period after that, we'd really want to make the previous checkpoint record a restart point after all, after some time has passed. That is what will happen now, which is a usability enhancement, although the real motivation for this refactoring was to make the code simpler. The bgwriter is now always responsible for all checkpoints and restartpoints. (well, except for a stand-alone backend). Which makes it easier to understand what's going on, IMHO. 
There was one pretty fundamental bug in the minSafeStartPoint handling: it was always set when a WAL file was opened for reading. Which means it was also moved backwards when the recovery began by reading the WAL segment containing the last restart/checkpoint, rendering it useless for the purpose it was designed. Fortunately that was easy to fix. Another tiny bug was that log_restartpoints was not respected, because it was stored in a variable in startup process' memory, and wasn't seen by bgwriter. One aspect that troubles me a bit is the changes in XLogFlush. I guess we no longer have the problem that you can't start up the database if we've read in a corrupted page from disk, because we now start up before checkpointing. However, it does mean that if a corrupt page is read into shared buffers, we'll never be able to checkpoint. But then again, I guess that's already true without this patch. I feel quite good about this patch now. Given the amount of code churn, it requires testing, and I'll read it through one more time after sleeping over it. Simon, do you see anything wrong with this? (this patch is also in my git repository at git.postgresql.org, branch recoveryinfra.) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bd6035d..30fea49 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -119,12 +119,26 @@ CheckpointStatsData CheckpointStats;
  */
 TimeLineID ThisTimeLineID = 0;

-/* Are we doing recovery from XLOG? */
+/*
+ * Are we doing recovery from XLOG?
+ *
+ * This is only ever true in the startup process, when it's replaying WAL.
+ * It's used in functions that need to act differently when called from a
+ * redo function (e.g skip WAL logging). To check whether the system is in
+ * recovery regardless of what process you're running in, use
+ * IsRecoveryProcessingMode().
+ */
 bool InRecovery = false;

 /* Are we recovering using offline XLOG archives? */
 static bool InArchiveRecovery = false;

+/*
+ * Local copy of shared RecoveryProcessingMode variable. True actually
+ * means "not known, need to check the shared state".
+ */
+static bool LocalRecoveryProcessingMode = true;
+
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;

@@ -133,16 +147,22 @@ static char *recoveryRestoreCommand = NULL;
 static bool recoveryTarget = false;
 static bool recoveryTargetExact = false;
 static bool recoveryTargetInclusive = true;
-static bool recoveryLogRestartpoints = false;
 static TransactionId recoveryTargetXid;
 static TimestampTz recoveryTargetTime;
 static TimestampTz recoveryLastXTime = 0;

+/*
+ * log_restartpoints is stored in shared memory because it needs to be
+ * accessed by bgwriter when it performs restartpoints
+ */

 /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
 static TransactionId recoveryStopXid;
 static TimestampTz recoveryStopTime;
 static bool recoveryStopAfter;
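The restartpoint handoff described above (the startup process publishes the last checkpoint record in shared memory; the bgwriter decides autonomously whether to act on it) can be sketched as follows. This is a rough, self-contained sketch with illustrative names, not the patch itself:

```c
#include <assert.h>
#include <stdbool.h>

/* stand-ins for the shared-memory slot in xlog.c; plain integers for LSNs */
static unsigned long lastCheckPointRecPtr = 0;  /* 0 = none seen yet */
static unsigned long lastRestartPoint = 0;

/* startup process: remember each checkpoint record as it is replayed */
static void
RecoveryRestartPoint(unsigned long recPtr)
{
    lastCheckPointRecPtr = recPtr;
}

/* bgwriter: called on its own schedule; performs a restartpoint only if a
 * new checkpoint record has been seen since the last restartpoint */
static bool
CreateRestartPoint(void)
{
    unsigned long recPtr = lastCheckPointRecPtr;

    if (recPtr == 0 || recPtr == lastRestartPoint)
        return false;              /* nothing new to restart from */
    lastRestartPoint = recPtr;     /* real code would flush buffers etc. */
    return true;
}
```

This is also how a long idle period after a skipped restartpoint still gets one: the record stays in shared memory, and a later autonomous CreateRestartPoint() call picks it up.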
Re: [HACKERS] Hot standby, recovery infra
On Wed, 2009-01-28 at 12:04 +0200, Heikki Linnakangas wrote: I've been reviewing and massaging the so called recovery infra patch. Thanks. I feel quite good about this patch now. Given the amount of code churn, it requires testing, and I'll read it through one more time after sleeping over it. There's nothing major I feel we should discuss. The way restartpoints happen is a useful improvement, thanks. Simon, do you see anything wrong with this? Few minor points:

* I think we are now renaming the recovery.conf file too early. The comment says "We have already restored all the WAL segments we need from the archive, and we trust that they are not going to go away even if we crash." We have, but the files overwrite each other as they arrive, so if the last restartpoint is not in the last restored WAL file then it will only exist in the archive. The recovery.conf is the only place where we store the information on where the archive is and how to access it, so by renaming it out of the way we will be unable to crash recover until the first checkpoint is complete. So the way this was in the original patch is the correct way to go, AFAICS.

* my original intention was to deprecate log_restartpoints and would still like to do so. log_checkpoints does just as well for that. Even less code than before...

* comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version because it matches the IS_SHUTDOWN naming.

* In CreateCheckpoint() the if test on TruncateSubtrans() has been removed, but the comment has not been changed (to explain why).

* PG_CONTROL_VERSION bump should be just one increment, to 844. I deliberately had it higher to help spot mismatches earlier, and to avoid needless patch conflicts.

So it looks pretty much ready for commit very soon. We should continue to measure performance of recovery in the light of these changes. 
I still feel that fsyncing the control file on each XLogFileRead() will give a noticeable performance penalty, mostly because we know doing exactly the same thing in normal running caused a performance penalty. But that is easily changed and cannot be done with any certainty without wider feedback, so no reason to delay code commit. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hot standby, recovery infra
Hi, On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I've been reviewing and massaging the so called recovery infra patch. Great! I feel quite good about this patch now. Given the amount of code churn, it requires testing, and I'll read it through one more time after sleeping over it. Simon, do you see anything wrong with this? I also read this patch and found something odd. I apologize if I misread it..

 @@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
 	pg_time_t	now;
 	pg_time_t	last_time;

 -	if (XLogArchiveTimeout <= 0)
 +	if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())

The above change destroys archive_timeout because checking the timeout is always skipped after recovery is done.

 @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
 	 */
 	PG_SETMASK(&UnBlockSig);

 +	BgWriterRecoveryMode = IsRecoveryProcessingMode();
 +
 +	if (BgWriterRecoveryMode)
 +		elog(DEBUG1, "bgwriter starting during recovery");
 +	else
 +		InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after recovery? That is already called by AuxiliaryProcessMain().

 @@ -1302,7 +1314,7 @@ ServerLoop(void)
 	 * state that prevents it, start one. It doesn't matter if this
 	 * fails, we'll just try again later.
 	 */
 -	if (BgWriterPID == 0 && pmState == PM_RUN)
 +	if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
 		BgWriterPID = StartBackgroundWriter();

Likewise, we should try to start also the stats collector during recovery?

 @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
 		unlink(tmppath);
 	}

 -	elog(DEBUG2, "done creating and filling new WAL file");
 +	XLogFileName(tmppath, ThisTimeLineID, log, seg);
 +	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file represented as tmppath might have been already created by previous XLogFileInit() via InstallXLogFileSegment(). 
Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Hot standby, recovery infra
On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:
> @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
>  	 */
>  	PG_SETMASK(&UnBlockSig);
>
> +	BgWriterRecoveryMode = IsRecoveryProcessingMode();
> +
> +	if (BgWriterRecoveryMode)
> +		elog(DEBUG1, "bgwriter starting during recovery");
> +	else
> +		InitXLOGAccess();
>
> Why is InitXLOGAccess() also called here when the bgwriter is started
> after recovery? It is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record pointer. If the bgwriter is started in recovery these values need to be reset later. It's easier to call it twice.

> @@ -1302,7 +1314,7 @@ ServerLoop(void)
>  	 * state that prevents it, start one.  It doesn't matter if this
>  	 * fails, we'll just try again later.
>  	 */
> -	if (BgWriterPID == 0 && pmState == PM_RUN)
> +	if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
>  		BgWriterPID = StartBackgroundWriter();
>
> Likewise, should we also try to start the stats collector during
> recovery?

We did in the previous patch...

> @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
>  		unlink(tmppath);
>  	}
>
> -	elog(DEBUG2, "done creating and filling new WAL file");
> +	XLogFileName(tmppath, ThisTimeLineID, log, seg);
> +	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
>
> This debug message is somewhat confusing, because the WAL file named by
> tmppath might already have been created by a previous XLogFileInit() via
> InstallXLogFileSegment().

I think those are just for debugging and can be removed.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Hi,

On Wed, Jan 28, 2009 at 11:47 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:
> > @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
> >  	 */
> >  	PG_SETMASK(&UnBlockSig);
> >
> > +	BgWriterRecoveryMode = IsRecoveryProcessingMode();
> > +
> > +	if (BgWriterRecoveryMode)
> > +		elog(DEBUG1, "bgwriter starting during recovery");
> > +	else
> > +		InitXLOGAccess();
> >
> > Why is InitXLOGAccess() also called here when the bgwriter is started
> > after recovery? It is already called by AuxiliaryProcessMain().
>
> InitXLOGAccess() sets the timeline and also gets the latest record
> pointer. If the bgwriter is started in recovery these values need to be
> reset later. It's easier to call it twice.

Right. But the InitXLOGAccess() call in the main loop is enough for that purpose.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Hot standby, recovery infra
On Wed, 2009-01-28 at 23:54 +0900, Fujii Masao wrote:
> > > Why is InitXLOGAccess() also called here when the bgwriter is started
> > > after recovery? It is already called by AuxiliaryProcessMain().
> >
> > InitXLOGAccess() sets the timeline and also gets the latest record
> > pointer. If the bgwriter is started in recovery these values need to be
> > reset later. It's easier to call it twice.
>
> Right. But the InitXLOGAccess() call in the main loop is enough for that
> purpose.

I think the code is clearer the way it is. Otherwise you'd read AuxiliaryProcessMain() and think the bgwriter didn't need xlog access.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Fujii Masao wrote:
> On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
> heikki.linnakan...@enterprisedb.com wrote:
> > I feel quite good about this patch now. Given the amount of code churn, it
> > requires testing, and I'll read it through one more time after sleeping
> > over it. Simon, do you see anything wrong with this?
>
> I also read this patch and found something odd.

Many thanks for looking into this!

> @@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
>  	pg_time_t	now;
>  	pg_time_t	last_time;
>
> -	if (XLogArchiveTimeout <= 0)
> +	if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())
>
> The above change breaks archive_timeout, because the timeout check is
> always skipped after recovery is done.

Yep, good catch. That obviously needs to be IsRecoveryProcessingMode(), without the exclamation mark.

> @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
>  	 */
>  	PG_SETMASK(&UnBlockSig);
>
> +	BgWriterRecoveryMode = IsRecoveryProcessingMode();
> +
> +	if (BgWriterRecoveryMode)
> +		elog(DEBUG1, "bgwriter starting during recovery");
> +	else
> +		InitXLOGAccess();
>
> Why is InitXLOGAccess() also called here when the bgwriter is started
> after recovery? It is already called by AuxiliaryProcessMain().

Oh, I didn't realize that. Agreed, it's a bit disconcerting that InitXLOGAccess() is called twice (there was a 2nd call within the loop in the original patch as well). Looking at InitXLOGAccess, it's harmless to call it multiple times, but it seems better to remove the InitXLOGAccess call from AuxiliaryProcessMain(). InitXLOGAccess() needs to be called after seeing that IsRecoveryProcessingMode() == false, because it won't get the right timeline ID before that.

> @@ -1302,7 +1314,7 @@ ServerLoop(void)
>  	 * state that prevents it, start one.  It doesn't matter if this
>  	 * fails, we'll just try again later.
>  	 */
> -	if (BgWriterPID == 0 && pmState == PM_RUN)
> +	if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
>  		BgWriterPID = StartBackgroundWriter();
>
> Likewise, should we also try to start the stats collector during
> recovery?
Hmm, there's not much point as this patch stands, but I guess we should in the hot standby patch, where we let backends in.

> @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
>  		unlink(tmppath);
>  	}
>
> -	elog(DEBUG2, "done creating and filling new WAL file");
> +	XLogFileName(tmppath, ThisTimeLineID, log, seg);
> +	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
>
> This debug message is somewhat confusing, because the WAL file named by
> tmppath might already have been created by a previous XLogFileInit() via
> InstallXLogFileSegment().

I don't quite understand what you're saying, but I think I'll just revert that.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Hot standby, recovery infra
Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
> > I feel quite good about this patch now. Given the amount of code churn, it
> > requires testing, and I'll read it through one more time after sleeping
> > over it. Simon, do you see anything wrong with this?
>
> I also read this patch and found something odd. I apologize if I misread it.

If archive recovery fails after it reaches the last valid record in the last unfilled WAL segment, subsequent recovery might raise the following fatal error. This is because minSafeStartPoint indicates the end of the last unfilled WAL segment, which subsequent recovery cannot reach. Is this a bug? (I'm not sure how to fix this problem, because I don't yet understand why minSafeStartPoint is required.)

FATAL: WAL ends before end time of backup dump

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Hot standby, recovery infra
Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
> > I feel quite good about this patch now. Given the amount of code churn, it
> > requires testing, and I'll read it through one more time after sleeping
> > over it. Simon, do you see anything wrong with this?
>
> I also read this patch and found something odd. I apologize if I misread it.

Sorry for my scattered replies.

Though this is a matter of taste, I think it's weird that the bgwriter runs with ThisTimeLineID = 0 during recovery. This is because XLogCtl->ThisTimeLineID is set only at the end of recovery. ISTM this will be a source of bugs in the near future, though it is harmless currently.

+	/*
+	 * XXX: Should we try to perform restartpoints if we're not in consistent
+	 * recovery? The bgwriter isn't doing it for us in that case.
+	 */

I think yes. This is helpful for a system where taking a base backup takes a long time, i.e. it also takes a long time to reach a consistent recovery point.

+CreateRestartPoint(int flags)
<snip>
+	 * We rely on this lock to ensure that the startup process doesn't exit
+	 * Recovery while we are half way through a restartpoint. XXX ?
+	 */
+	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

Is this comment correct? CheckpointLock cannot prevent the startup process from exiting recovery, because the startup process doesn't acquire that lock.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:
> Though this is a matter of taste, I think it's weird that the bgwriter
> runs with ThisTimeLineID = 0 during recovery. This is because
> XLogCtl->ThisTimeLineID is set only at the end of recovery. ISTM this
> will be a source of bugs in the near future, though it is harmless
> currently.

It doesn't. That's exactly what InitXLogAccess() was for.

> +	/*
> +	 * XXX: Should we try to perform restartpoints if we're not in consistent
> +	 * recovery? The bgwriter isn't doing it for us in that case.
> +	 */
>
> I think yes. This is helpful for a system where taking a base backup
> takes a long time, i.e. it also takes a long time to reach a consistent
> recovery point.

The original patch did this.

> +CreateRestartPoint(int flags)
> <snip>
> +	 * We rely on this lock to ensure that the startup process doesn't exit
> +	 * Recovery while we are half way through a restartpoint. XXX ?
> +	 */
> +	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
>
> Is this comment correct? CheckpointLock cannot prevent the startup
> process from exiting recovery, because the startup process doesn't
> acquire that lock.

The original patch acquired CheckpointLock during exitRecovery to prove that a restartpoint was not in progress. It no longer does this, so I'm not sure whether Heikki has found another way and the comment is wrong, or whether removing the lock request is a bug.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
> If archive recovery fails after it reaches the last valid record in the
> last unfilled WAL segment, subsequent recovery might raise the following
> fatal error. This is because minSafeStartPoint indicates the end of the
> last unfilled WAL segment, which subsequent recovery cannot reach. Is
> this a bug? (I'm not sure how to fix this problem, because I don't yet
> understand why minSafeStartPoint is required.)
>
> FATAL: WAL ends before end time of backup dump

I think you're right. We need a couple of changes to avoid confusing messages.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby, recovery infra
Simon Riggs wrote:
> On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:
> > Though this is a matter of taste, I think it's weird that the bgwriter
> > runs with ThisTimeLineID = 0 during recovery. This is because
> > XLogCtl->ThisTimeLineID is set only at the end of recovery. ISTM this
> > will be a source of bugs in the near future, though it is harmless
> > currently.
>
> It doesn't. That's exactly what InitXLogAccess() was for.

It does *during recovery*, before InitXLogAccess is called. Yeah, it's harmless currently. It would be pretty hard to keep it up-to-date in the bgwriter and other processes. I think it's better to keep it at 0, which is clearly an invalid value, than to try to keep it up-to-date and risk using a stale value. TimeLineID is not used in many places, currently. It might be a good idea to add some Assert(TimeLineID != 0) calls to the places where it is used.

> > +	/*
> > +	 * XXX: Should we try to perform restartpoints if we're not in consistent
> > +	 * recovery? The bgwriter isn't doing it for us in that case.
> > +	 */
> >
> > I think yes. This is helpful for a system where taking a base backup
> > takes a long time, i.e. it also takes a long time to reach a consistent
> > recovery point.
>
> The original patch did this.

Yeah, I took it out. ISTM if you restore from a base backup, you'd want to run it until the consistent recovery point anyway. We can put it back in if people feel it's helpful. There are two ways to do it: we can fire up the bgwriter before reaching the consistent recovery point, or we can do the restartpoints in the startup process before that point. I'm inclined to fire up the bgwriter earlier. That way, the bgwriter remains responsible for all checkpoints and restartpoints, which seems clearer. I can't see any particular reason why bgwriter startup and reaching the consistent recovery point need to be linked together.

> > +CreateRestartPoint(int flags)
> > <snip>
> > +	 * We rely on this lock to ensure that the startup process doesn't exit
> > +	 * Recovery while we are half way through a restartpoint. XXX ?
> > +	 */
> > +	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
> >
> > Is this comment correct? CheckpointLock cannot prevent the startup
> > process from exiting recovery, because the startup process doesn't
> > acquire that lock.
>
> The original patch acquired CheckpointLock during exitRecovery to prove
> that a restartpoint was not in progress. It no longer does this, so I'm
> not sure whether Heikki has found another way and the comment is wrong,
> or whether removing the lock request is a bug.

The comment is obsolete. There's no reason for the startup process to wait for a restartpoint to finish. Looking back at the archives and the history of that, I got the impression that in a very early version of the patch, the startup process still created a shutdown checkpoint after recovery. To do that, it had to interrupt an ongoing restartpoint. That was deemed too dangerous/weird, so it was changed to wait for the restartpoint to finish instead. But since the startup process no longer creates a shutdown checkpoint, the waiting became obsolete, right?

If there's a restartpoint in progress when we reach the end of recovery, the startup process will call RequestCheckpoint as usual. That will cause the bgwriter to hurry the ongoing restartpoint, skipping the usual delays, and start the checkpoint as soon as the restartpoint is finished.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com