Re: [HACKERS] Hot standby, recovery infra

2009-02-26 Thread Fujii Masao
Hi,

On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs si...@2ndquadrant.com wrote:
 That whole area was something I was leaving until last, since immediate
 shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
 this before Christmas, briefly).

This problem remains in current HEAD. I mean, immediate shutdown
may be unable to kill the startup process, because system(), which
executes restore_command, ignores SIGQUIT while waiting.
When I tried immediate shutdown during recovery, only the startup
process survived. This is undesirable behavior, I think.

Should the following code be added into RestoreArchivedFile()?


if (WTERMSIG(rc) == SIGQUIT)
   exit(2);
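
For illustration, a self-contained sketch of the proposed check (hedged:
the restore command string and surrounding handling are stand-ins, not
the actual RestoreArchivedFile() code):

#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

int
main(void)
{
    /* stand-in for the real restore_command invocation */
    const char *xlogRestoreCmd = "cp /archive/segment pg_xlog/RECOVERYXLOG";

    int rc = system(xlogRestoreCmd);

    /* WTERMSIG() is only meaningful when WIFSIGNALED() is true */
    if (rc != -1 && WIFSIGNALED(rc) && WTERMSIG(rc) == SIGQUIT)
        exit(2);            /* emulate the immediate-shutdown response */
    return 0;
}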


Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot standby, recovery infra

2009-02-26 Thread Heikki Linnakangas

Fujii Masao wrote:

On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs si...@2ndquadrant.com wrote:

That whole area was something I was leaving until last, since immediate
shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
this before Christmas, briefly).


This problem remains in current HEAD. I mean, immediate shutdown
may be unable to kill the startup process, because system(), which
executes restore_command, ignores SIGQUIT while waiting.
When I tried immediate shutdown during recovery, only the startup
process survived. This is undesirable behavior, I think.


Yeah, we need to fix that.


Should the following code be added into RestoreArchivedFile()?


if (WTERMSIG(rc) == SIGQUIT)
   exit(2);



I don't see how that helps, as we already have this in there:

signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

ereport(signaled ? FATAL : DEBUG2,
        (errmsg("could not restore file \"%s\" from archive: return code %d",
                xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with 
SIGQUIT.


I think the real problem here is that pg_standby traps SIGQUIT. The 
startup process doesn't receive the SIGQUIT because it's in system(), 
and pg_standby doesn't propagate it to the startup process either 
because it traps it.


I think we should simply remove the signal handler for SIGQUIT from 
pg_standby. Or will that lead to core dump by default? In that case, we 
need pg_standby to exit(128) or similar, so that RestoreArchivedFile 
understands that the command was killed by a signal.
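
A minimal sketch of that second option (the handler shape and exit code
are illustrative, not taken from pg_standby's actual source):

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Instead of swallowing SIGQUIT, exit with a status > 125 so that the
 * caller's "WEXITSTATUS(rc) > 125" test treats it like death by signal.
 */
static void
sigquit_handler(int signum)
{
    _exit(128 + signum);    /* 128+sig, following shell conventions */
}

int
main(void)
{
    signal(SIGQUIT, sigquit_handler);   /* or simply: signal(SIGQUIT, SIG_DFL); */
    pause();                            /* stand-in for the wait-for-WAL loop */
    return 0;
}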


Another approach is to check that the postmaster is still alive, like we 
 do in walwriter and bgwriter:


/*
 * Emergency bailout if postmaster has died.  This is to avoid the
 * necessity for manual cleanup of all postmaster children.
 */
if (!PostmasterIsAlive(true))
    exit(1);

However, I'm afraid there's a race condition with that. If we do that 
right after system(), postmaster might've signaled us but not exited 
yet. We could check that in the main loop, but if we wrongly interpret 
the exit of the recovery command as a "file not found - go ahead and 
start up", the damage might be done by the time we notice that the 
postmaster is gone.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-26 Thread Fujii Masao
Hi,

On Fri, Feb 27, 2009 at 3:38 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I think the real problem here is that pg_standby traps SIGQUIT. The startup
 process doesn't receive the SIGQUIT because it's in system(), and pg_standby
 doesn't propagate it to the startup process either because it traps it.

Yes, you are right.

 I think we should simply remove the signal handler for SIGQUIT from
 pg_standby.

+1

 I don't see how that helps, as we already have this in there:

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

    ereport(signaled ? FATAL : DEBUG2,
            (errmsg("could not restore file \"%s\" from archive: return code %d",
                    xlogfname, rc)));

 which means we already ereport(FATAL) if the restore command dies with 
 SIGQUIT.

SIGQUIT should kill the process immediately, so I think that the startup
process, like the other auxiliary processes, should call exit(2) instead of
ereport(FATAL). Right?
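
A sketch of that behavior, assuming a quickdie-style handler (the names
are illustrative, not the actual PostgreSQL handlers):

#include <signal.h>
#include <unistd.h>

static void
startup_quickdie(int signum)
{
    /*
     * On SIGQUIT, die immediately with exit status 2 -- no ereport(),
     * no normal error cleanup -- so an immediate shutdown cannot block.
     */
    (void) signum;
    _exit(2);
}

int
main(void)
{
    signal(SIGQUIT, startup_quickdie);
    /* ... the process main loop would run here ... */
    return 0;
}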

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot standby, recovery infra

2009-02-26 Thread Simon Riggs

On Thu, 2009-02-26 at 20:38 +0200, Heikki Linnakangas wrote:

 I think we should simply remove the signal handler for SIGQUIT from 
 pg_standby.

If you do this, please make it release dependent so pg_standby behaves
correctly for the release it is being used with.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-26 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-02-26 at 20:38 +0200, Heikki Linnakangas wrote:

I think we should simply remove the signal handler for SIGQUIT from 
pg_standby.


If you do this, please make it release dependent so pg_standby behaves
correctly for the release it is being used with.


Hmm, I don't think there's a way for pg_standby to know which version of 
PostgreSQL is calling it. Assuming there is, how would you want it to 
behave? If you want no change in behavior in old releases, can't we just 
leave it unfixed in back-branches? In fact, it seems more useful to not 
detect the server version, so that if you do want the new behavior, you 
can use a 8.4 pg_standby against a 8.3 server.


In back-branches, I think we need to decide between fixing this, at the 
risk of breaking someone's script that is using "killall -QUIT 
pg_standby" or similar to trigger failover, and leaving it as it is 
knowing that immediate shutdown doesn't work on a standby server. I'm 
not sure which is best.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-18 Thread Simon Riggs

On Mon, 2009-02-09 at 17:13 +0200, Heikki Linnakangas wrote:

 Attached is an updated patch that does that, and I've fixed all the 
 other outstanding issues I listed earlier as well. Now I'm feeling
 again that this is in pretty good shape.

UpdateMinRecoveryPoint() issues a DEBUG2 message even when we have not
updated the control file, leading to log filling behaviour on an idle
system.

DEBUG:  updated min recovery point to ...

We should just tuck the message into the if section above it.
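
Something along these lines, assuming the shape of the function from
this thread (not a verbatim excerpt of the patch):

static void
UpdateMinRecoveryPoint(XLogRecPtr lsn)
{
    if (XLByteLT(minRecoveryPoint, lsn))
    {
        XLogRecPtr newMinRecoveryPoint;

        /* ... advance newMinRecoveryPoint to the last record read ... */

        if (XLByteLT(minRecoveryPoint, newMinRecoveryPoint))
        {
            ControlFile->minRecoveryPoint = newMinRecoveryPoint;
            UpdateControlFile();
            minRecoveryPoint = newMinRecoveryPoint;

            /* now emitted only when the control file actually changed */
            elog(DEBUG2, "updated min recovery point to %X/%X",
                 minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff);
        }
    }
}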

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-18 Thread Heikki Linnakangas

Simon Riggs wrote:

On Mon, 2009-02-09 at 17:13 +0200, Heikki Linnakangas wrote:


Attached is an updated patch that does that, and I've fixed all the
other outstanding issues I listed earlier as well. Now I'm feeling
again that this is in pretty good shape.


UpdateMinRecoveryPoint() issues a DEBUG2 message even when we have not
updated the control file, leading to log filling behaviour on an idle
system.

DEBUG:  updated min recovery point to ...

We should just tuck the message into the if section above it.


The outer if should ensure that it isn't printed repeatedly on an idle 
system. But I agree it belongs inside the inner if section.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-18 Thread Simon Riggs

On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:

 The outer if should ensure that it isn't printed repeatedly on an idle 
 system. 

Regrettably not.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-18 Thread Heikki Linnakangas

Simon Riggs wrote:

On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:

The outer if should ensure that it isn't printed repeatedly on an idle 
system. 


Regrettably not.


Ok, committed. I fixed that and made some comment changes. I also renamed 
IsRecoveryProcessingMode() to RecoveryInProgress(), to avoid confusion 
with the real processing modes defined in miscadmin.h. That will 
probably cause you merge conflicts in the hot standby patch, but it 
should be a matter of search-replace to fix.


The changes need to be documented. At least the removal of 
log_restartpoints is a clear user-visible change.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-18 Thread Simon Riggs

On Wed, 2009-02-18 at 18:01 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:
  
  The outer if should ensure that it isn't printed repeatedly on an idle 
  system. 
  
  Regrettably not.
 
 Ok, committed. 

Cool.

 I fixed that and made some comment changes. I also renamed 
 IsRecoveryProcessingMode() to RecoveryInProgress(), to avoid confusion 
 with the real processing modes defined in miscadmin.h. That will 
 probably cause you merge conflicts in the hot standby patch, but it 
 should be a matter of search-replace to fix.

Yep, good change, agree with reasons.

 The changes need to be documented. At least the removal of 
 log_restartpoints is a clear user-visible change.

Yep.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-06 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
- If you perform a fast shutdown while startup process is waiting for 
the restore command, startup process sometimes throws a FATAL error 
which then escalates into an immediate shutdown. That leads to 
different messages in the logs, and skipping of the shutdown 
restartpoint that we now otherwise perform.


Sometimes?


I think what happens is that if the restore command receives the SIGTERM 
and dies before the startup process that's waiting for the restore 
command receives the SIGTERM, the startup process throws a FATAL error 
because the restore command died unexpectedly. I put this



if (shutdown_requested && InRedo)
{
/* XXX: Is EndRecPtr always the right value here? */
UpdateMinRecoveryPoint(EndRecPtr);
proc_exit(0);
}


right after the system(xlogRestoreCmd) call, to exit gracefully if we 
were requested to shut down while restore command was running, but it 
seems that that's not enough because of the race condition.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-06 Thread Simon Riggs

On Fri, 2009-02-06 at 10:06 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
  - If you perform a fast shutdown while startup process is waiting for 
  the restore command, startup process sometimes throws a FATAL error 
  which then escalates into an immediate shutdown. That leads to 
  different messages in the logs, and skipping of the shutdown 
  restartpoint that we now otherwise perform.
  
  Sometimes?
 
 I think what happens is that if the restore command receives the SIGTERM 
 and dies before the startup process that's waiting for the restore 
 command receives the SIGTERM, the startup process throws a FATAL error 
 because the restore command died unexpectedly. I put this
 
 if (shutdown_requested && InRedo)
  {
  /* XXX: Is EndRecPtr always the right value here? */
  UpdateMinRecoveryPoint(EndRecPtr);
  proc_exit(0);
  }
 
 right after the system(xlogRestoreCmd) call, to exit gracefully if we 
 were requested to shut down while restore command was running, but it 
 seems that that's not enough because of the race condition.

Can we trap the death of the restorecmd and handle it differently from
the death of the startup process?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

I would suggest that at end of recovery we write the last LSN to the
control file, so if we crash recover then we will always end archive
recovery at the same place again should we re-enter it. So we would have
a recovery_target_lsn that overrides recovery_target_xid etc..
Hmm, we don't actually want to end recovery at the same point again: if 
there's some updates right after the database came up, but before the 
first checkpoint and crash, those actions need to be replayed too.


I think we do need to. Crash recovery is supposed to return us to the
same state. Where we ended ArchiveRecovery is part of that state. 


It isn't for crash recovery to decide that it saw a few more
transactions and decided to apply them anyway. The user may have
specifically ended recovery to avoid those additional changes from
taking place.


Those additional changes were made in the standby, after ending 
recovery. So the sequence of events is:


1. Standby performs archive recovery
2. End of archive (or stop point) reached. Open for connections, 
read-write. Request an online checkpoint. Online checkpoint begins.
3. A user connects, and does "INSERT INTO foo VALUES (123)". Commits, 
commit returns.
4. pg_ctl stop -m immediate. The checkpoint started in step 2 hasn't 
finished yet.

5. Restart the server.

The server will now re-enter archive recovery. But this time, we want to 
replay the INSERT too.


(let's do the startup checkpoint for now, as discussed elsewhere in this 
thread)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:

I've changed the way minRecoveryPoint is updated now anyway, so it no 
longer happens every XLogFileRead().

Care to elucidate?
I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
patch soon.


Why do you think XLogFlush is called less frequently than XLogFileRead?


It's not, but we only need to update the control file when we're 
flushing an LSN that's greater than current minRecoveryPoint. And when 
we do update minRecoveryPoint, we can update it to the LSN of the last 
record we've read from the archive.
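
In rough pseudocode, the recovery path of XLogFlush() as described here
would look something like this (a sketch of the described behavior, not
the patch itself; the function was named IsRecoveryProcessingMode() at
this point in the thread):

void
XLogFlush(XLogRecPtr record)
{
    if (IsRecoveryProcessingMode())
    {
        /*
         * During recovery the WAL is already on disk, so "flushing"
         * reduces to advancing minRecoveryPoint.  The control file is
         * written only if record is beyond the current minRecoveryPoint,
         * and it is then advanced to the last record read, not just to
         * 'record' -- like a normal flush flushes all pending WAL.
         */
        UpdateMinRecoveryPoint(record);
        return;
    }
    /* ... normal WAL write/fsync path ... */
}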


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Simon Riggs

On Thu, 2009-02-05 at 10:07 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
  Simon Riggs wrote:
  I would suggest that at end of recovery we write the last LSN to the
  control file, so if we crash recover then we will always end archive
  recovery at the same place again should we re-enter it. So we would have
  a recovery_target_lsn that overrides recovery_target_xid etc..
  Hmm, we don't actually want to end recovery at the same point again: if 
  there's some updates right after the database came up, but before the 
  first checkpoint and crash, those actions need to be replayed too.
  
  I think we do need to. Crash recovery is supposed to return us to the
  same state. Where we ended ArchiveRecovery is part of that state. 
  
  It isn't for crash recovery to decide that it saw a few more
  transactions and decided to apply them anyway. The user may have
  specifically ended recovery to avoid those additional changes from
  taking place.
 
 Those additional changes were made in the standby, after ending 
 recovery. So the sequence of events is:
 
 1. Standby performs archive recovery
 2. End of archive (or stop point) reached. Open for connections, 
 read-write. Request an online checkpoint. Online checkpoint begins.
 3. A user connects, and does "INSERT INTO foo VALUES (123)". Commits, 
 commit returns.
 4. pg_ctl stop -m immediate. The checkpoint started in step 2 hasn't 
 finished yet.
 5. Restart the server.
 
 The server will now re-enter archive recovery. But this time, we want to 
 replay the INSERT too.

I agree with this, so let me restate both comments together.

When the server starts it begins a new timeline. 

When recovering we must switch to that timeline at the same point we did
previously when we ended archive recovery. We currently don't record
when that is and we need to.

Yes, we must also replay the records in the new timeline once we have
switched to it, as you say. But we must not replay any further in the
older timeline(s).

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Simon Riggs

On Thu, 2009-02-05 at 10:31 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
  
  I've changed the way minRecoveryPoint is updated now anyway, so it no 
  longer happens every XLogFileRead().
  Care to elucidate?
  I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
  it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
  patch soon.
  
  Why do you think XLogFlush is called less frequently than XLogFileRead?
 
 It's not, but we only need to update the control file when we're 
 flushing an LSN that's greater than current minRecoveryPoint. And when 
 we do update minRecoveryPoint, we can update it to the LSN of the last 
 record we've read from the archive.

So we might end up flushing more often *and* we will be doing it
potentially in the code path of other users.

This change seems speculative and also against what has previously been
agreed with Tom. If he chooses not to comment on your changes, that's up
to him, but I don't think you should remove things quietly that have
been put there through the community process, as if they caused
problems. I feel like I'm in the middle here. 

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Simon Riggs

On Thu, 2009-02-05 at 09:26 +, Simon Riggs wrote:

 This change seems speculative and also against what has previously been
 agreed with Tom. If he chooses not to comment on your changes, that's up
 to him, but I don't think you should remove things quietly that have
 been put there through the community process, as if they caused
 problems. I feel like I'm in the middle here. 

This is only a problem because of two independent reviewers.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-02-05 at 10:31 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
patch soon.

Why do you think XLogFlush is called less frequently than XLogFileRead?
It's not, but we only need to update the control file when we're 
flushing an LSN that's greater than current minRecoveryPoint. And when 
we do update minRecoveryPoint, we can update it to the LSN of the last 
record we've read from the archive.


So we might end up flushing more often *and* we will be doing it
potentially in the code path of other users.


For example, imagine a database that fits completely in shared buffers. 
If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
If we update in XLogFlush the way I described, you only need to update 
when we flush a page from the buffer cache, which will only happen at 
restartpoints. That's far less updates.


Expanding that example to a database that doesn't fit in cache, you're 
still replacing pages from the buffer cache that have been untouched for 
longest. Such pages will have an old LSN, too, so we shouldn't need to 
update very often.


I'm sure you can come up with an example of where we end up fsyncing 
more often, but it doesn't seem like the common case to me.



This change seems speculative and also against what has previously been
agreed with Tom. If he chooses not to comment on your changes, that's up
to him, but I don't think you should remove things quietly that have
been put there through the community process, as if they caused
problems. I feel like I'm in the middle here. 


I'd like to have the extra protection that this approach gives. If we 
let safeStartPoint to be ahead of the actual WAL we've replayed, we have 
to just assume we're fine if we reach end of WAL before reaching that 
point. That assumption falls down if e.g recovery is stopped, and you go 
and remove the last few WAL segments from the archive before restarting 
it, or signal pg_standby to trigger failover too early. Tracking the 
real safe starting point and enforcing it always protects you from that.


(we did discuss this a week ago: 
http://archives.postgresql.org/message-id/4981f6e0.2040...@enterprisedb.com)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Simon Riggs

On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:

  So we might end up flushing more often *and* we will be doing it
  potentially in the code path of other users.
 
 For example, imagine a database that fits completely in shared buffers. 
 If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
 If we update in XLogFlush the way I described, you only need to update 
 when we flush a page from the buffer cache, which will only happen at 
 restartpoints. That's far less updates.

Oh, did you change the bgwriter so it doesn't do normal page cleaning? 

General thoughts: Latest HS patch has a CPU profile within 1-2% of
current and the use of ProcArrayLock is fairly minimal now. The
additional CPU is recoveryStopsHere(), which enables the manual control
of recovery, so the trade off seems worth it. The major CPU hog remains
RecordIsValid, which is the CRC checks. Startup is still I/O bound.
Specific avoidable I/O hogs are (1) checkpoints and (2) page cleaning, so I
hope we can avoid both of those. 

I also hope to minimise the I/O cost of keeping track of the consistency
point. If that was done as infrequently as each restartpoint then I
would certainly be very happy, but that won't happen in the proposed
scheme if we do page cleaning.

 Expanding that example to a database that doesn't fit in cache, you're 
 still replacing pages from the buffer cache that have been untouched for 
 longest. Such pages will have an old LSN, too, so we shouldn't need to 
 update very often.

They will tend to be written in ascending LSN order which will mean we
continually update the control file. Anything out of order does skip a
write. The better the cache is at finding LRU blocks, the more writes
we will make.

 I'm sure you can come up with an example of where we end up fsyncing 
 more often, but it doesn't seem like the common case to me.

I'm not trying to come up with counterexamples...

  This change seems speculative and also against what has previously been
  agreed with Tom. If he chooses not to comment on your changes, that's up
  to him, but I don't think you should remove things quietly that have
  been put there through the community process, as if they caused
  problems. I feel like I'm in the middle here. 
 
 I'd like to have the extra protection that this approach gives. If we 
 let safeStartPoint to be ahead of the actual WAL we've replayed, we have 
 to just assume we're fine if we reach end of WAL before reaching that 
 point. That assumption falls down if e.g recovery is stopped, and you go 
 and remove the last few WAL segments from the archive before restarting 
 it, or signal pg_standby to trigger failover too early. Tracking the 
 real safe starting point and enforcing it always protects you from that.

Doing it this way will require you to remove existing specific error
messages about ending before end time of backup, to be replaced by more
general ones that say "consistency not reached", which makes it harder to
figure out what to do about it.

 (we did discuss this a week ago: 
 http://archives.postgresql.org/message-id/4981f6e0.2040...@enterprisedb.com)

Yes, we need to update it in that case. Though that is no way agreement
to the other changes, nor does it require them.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:



So we might end up flushing more often *and* we will be doing it
potentially in the code path of other users.
For example, imagine a database that fits completely in shared buffers. 
If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
If we update in XLogFlush the way I described, you only need to update 
when we flush a page from the buffer cache, which will only happen at 
restartpoints. That's far less updates.


Oh, did you change the bgwriter so it doesn't do normal page cleaning? 


No. Ok, that wasn't completely accurate. The page cleaning by bgwriter 
will perform XLogFlushes, but that should be pretty insignificant. When 
there's little page replacement going on, bgwriter will do a small 
trickle of page cleaning, which won't matter much. If there's more page 
replacement going on, bgwriter is cleaning up pages that will soon be 
replaced, so it's just offsetting work from other backends (or the 
startup process in this case).


Expanding that example to a database that doesn't fit in cache, you're 
still replacing pages from the buffer cache that have been untouched for 
longest. Such pages will have an old LSN, too, so we shouldn't need to 
update very often.


They will tend to be written in ascending LSN order which will mean we
continually update the control file. Anything out of order does skip a
write. The better the cache is at finding LRU blocks, the more writes
we will make.


When minRecoveryPoint is updated, it's not updated to just the LSN that's 
being flushed. It's updated to the recptr of the most recently read WAL 
record. That's an important point to avoid that behavior. Just like 
XLogFlush normally always flushes all of the outstanding WAL, not just 
up to the requested LSN.


I'd like to have the extra protection that this approach gives. If we 
let safeStartPoint to be ahead of the actual WAL we've replayed, we have 
to just assume we're fine if we reach end of WAL before reaching that 
point. That assumption falls down if e.g recovery is stopped, and you go 
and remove the last few WAL segments from the archive before restarting 
it, or signal pg_standby to trigger failover too early. Tracking the 
real safe starting point and enforcing it always protects you from that.


Doing it this way will require you to remove existing specific error
messages about ending before end time of backup, to be replaced by more
general ones that say "consistency not reached", which makes it harder to
figure out what to do about it.


Yeah. If that's an important distinction, we could still save the 
original backup stop location somewhere, just so that we can give the 
old error message when we've not passed that location. But perhaps a 
message like "WAL ends before reaching a consistent state" with a hint 
"Make sure you archive all the WAL created during backup" or something 
would suffice.
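
In ereport terms, that might look something like this (a sketch using
the standard errmsg/errhint calls; the wording is the suggestion above):

ereport(FATAL,
        (errmsg("WAL ends before reaching a consistent state"),
         errhint("Make sure you archive all the WAL created during backup.")));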


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Simon Riggs

On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
  Simon Riggs wrote:
  
  So we might end up flushing more often *and* we will be doing it
  potentially in the code path of other users.
  For example, imagine a database that fits completely in shared buffers. 
  If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
  If we update in XLogFlush the way I described, you only need to update 
  when we flush a page from the buffer cache, which will only happen at 
  restartpoints. That's far less updates.
  
  Oh, did you change the bgwriter so it doesn't do normal page cleaning? 
 
 No. Ok, that wasn't completely accurate. The page cleaning by bgwriter 
 will perform XLogFlushes, but that should be pretty insignificant. When 
 there's little page replacement going on, bgwriter will do a small 
 trickle of page cleaning, which won't matter much. 

Yes, that case is good, but it wasn't the use case we're trying to speed
up by having the bgwriter active during recovery. We're worried about
I/O bound recoveries.

 If there's more page 
 replacement going on, bgwriter is cleaning up pages that will soon be 
 replaced, so it's just offsetting work from other backends (or the 
 startup process in this case).

Which only needs to be done if we follow this route, so is not an
argument in favour.

Using more I/O in I/O bound cases makes the worst case even worse.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

So we might end up flushing more often *and* we will be doing it
potentially in the code path of other users.
For example, imagine a database that fits completely in shared buffers. 
If we update at every XLogFileRead, we have to fsync every 16MB of WAL. 
If we update in XLogFlush the way I described, you only need to update 
when we flush a page from the buffer cache, which will only happen at 
restartpoints. That's far less updates.
Oh, did you change the bgwriter so it doesn't do normal page cleaning? 
No. Ok, that wasn't completely accurate. The page cleaning by bgwriter 
will perform XLogFlushes, but that should be pretty insignificant. When 
there's little page replacement going on, bgwriter will do a small 
trickle of page cleaning, which won't matter much. 


Yes, that case is good, but it wasn't the use case we're trying to speed
up by having the bgwriter active during recovery. We're worried about
I/O bound recoveries.


Ok, let's do the math:

By updating minRecoveryPoint in XLogFileRead, you're fsyncing the 
control file once every 16MB of WAL.


By updating in XLogFlush, the frequency depends on the amount of 
shared_buffers available to buffer the modified pages, the average WAL 
record size, and the cache hit ratio. Let's determine the worst case:


The smallest WAL record that dirties a page is a heap deletion record. 
That contains just enough information to locate the tuple. If I'm 
reading the headers right, that record is 48 bytes long (28 bytes of 
xlog header + 18 bytes of payload + padding). Assuming that the WAL is 
full of just those records, and there's no full page images, and that 
the cache hit ratio is 0%, we will need (16 MB / 48 B) * 8 kB = 2730 MB 
of shared_buffers to achieve one fsync per 16 MB of WAL.
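
(Spelling the arithmetic out: 16 MB / 48 B is about 349,500 such records per
WAL segment; if each one dirties a distinct 8 kB page, holding them all takes
about 349,500 * 8 kB = 2.7 GB, hence the 2730 MB figure.)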


So if you have a lower shared_buffers setting than 2.7 GB, you can have 
more frequent fsyncs this way in the worst case. If you think of the 
typical case, you're probably not doing all deletes, and you're having a 
non-zero cache hit ratio, so you achieve the same frequency with a much 
lower shared_buffers setting. And if you're really that I/O bound, I 
doubt the few extra fsyncs matter much.


Also note that when the control file is updated in XLogFlush, it's 
typically the bgwriter doing it as it cleans buffers ahead of the clock 
hand, not the startup process.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Simon Riggs

On Thu, 2009-02-05 at 14:18 +0200, Heikki Linnakangas wrote:

 when the control file is updated in XLogFlush, it's 
 typically the bgwriter doing it as it cleans buffers ahead of the
 clock hand, not the startup process

That is the key point. Let's do it your way.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-05 Thread Simon Riggs

On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:

 - If bgwriter is performing a restartpoint when recovery ends, the 
 startup checkpoint will be queued up behind the restartpoint. And since 
 it uses the same smoothing logic as checkpoints, it can take quite some 
 time for that to finish. The original patch had some code to hurry up 
 the restartpoint by signaling the bgwriter if 
 LWLockConditionalAcquire(CheckPointLock) fails, but there's a race 
 condition with that if a restartpoint starts right after that check. We 
 could let the bgwriter do the checkpoint too, and wait for it, but 
 bgwriter might not be running yet, and we'd have to allow bgwriter to 
 write WAL while disallowing it for all other processes, which seems 
 quite complex. Seems like we need something like the 
 LWLockConditionalAcquire approach, but built into CreateCheckPoint to 
 eliminate the race condition

Seems straightforward? Hold the lock longer.

 - If you perform a fast shutdown while startup process is waiting for 
 the restore command, startup process sometimes throws a FATAL error 
 which then escalates into an immediate shutdown. That leads to 
 different messages in the logs, and skipping of the shutdown 
 restartpoint that we now otherwise perform.

Sometimes?

 - It's not clear to me if the rest of the xlog flushing related 
 functions, XLogBackgroundFlush, XLogNeedsFlush and XLogAsyncCommitFlush, 
 need to work during recovery, and what they should do.

XLogNeedsFlush should always return false when InRecoveryProcessingMode()
is true. The WAL is already in the WAL files, not in wal_buffers anymore.

XLogAsyncCommitFlush should contain Assert(!InRecoveryProcessingMode())
since it is called during a VACUUM FULL only.

XLogBackgroundFlush should never be called during recovery because the
WALWriter is never active in recovery. That should just be documented.
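
In code terms, roughly (shapes assumed from this description, not a
verbatim patch excerpt):

bool
XLogNeedsFlush(XLogRecPtr record)
{
    /* During recovery the WAL came from files, never from wal_buffers. */
    if (InRecoveryProcessingMode())
        return false;

    /* ... normal comparison against the flushed-up-to pointer ... */
}

void
XLogAsyncCommitFlush(void)
{
    /* Only reachable via VACUUM FULL, which cannot run in recovery. */
    Assert(!InRecoveryProcessingMode());
    /* ... */
}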

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Heikki Linnakangas

Fujii Masao wrote:

On Fri, Jan 30, 2009 at 11:55 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

The startup process now catches SIGTERM, and calls proc_exit() at the next
WAL record. That's what will happen in a fast shutdown. Unexpected death of
the startup process is treated the same as a backend/auxiliary process
crash.


If the startup process dies unexpectedly during automatic recovery
after a crash, the postmaster and bgwriter may get stuck, because
HandleChildCrash() can be called before the FatalError flag is reset. While
FatalError is still true, HandleChildCrash() doesn't kill any auxiliary
processes, so bgwriter survives the crash and the postmaster waits forever
for the death of bgwriter while still in recovery status (which means that
no new connection can be started). Is this a bug?


Yes, and in fact I ran into it myself yesterday while testing. It seems 
that we should reset FatalError earlier, ie. when the recovery starts 
and bgwriter is launched. I'm not sure why in CVS HEAD we don't reset 
FatalError until after the startup process is finished. Resetting it as 
soon as all the processes have been terminated and the startup process is 
launched again would seem like a more obvious place to do it. The only 
difference that I can see is that if someone tries to connect while the 
startup process is running, you now get a "the database system is in 
recovery mode" message instead of "the database system is starting up" 
if we're reinitializing after crash. We can keep that behavior, just 
need to add another flag to mean "reinitializing after crash" that isn't 
reset until the recovery is over.
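
That is, something along these lines in postmaster.c (the second flag
name is hypothetical):

static bool FatalError = false;          /* fatal error in progress */
static bool ReinitAfterCrash = false;    /* proposed: reinitializing after
                                          * a crash; reset only when the
                                          * recovery is over */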


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Heikki Linnakangas

Simon Riggs wrote:

* I think we are now renaming the recovery.conf file too early. The
comment says "We have already restored all the WAL segments we need from
the archive, and we trust that they are not going to go away even if we
crash". We have, but the files overwrite each other as they arrive, so
if the last restartpoint is not in the last restored WAL file then it
will only exist in the archive. The recovery.conf is the only place
where we store the information on where the archive is and how to access
it, so by renaming it out of the way we will be unable to crash recover
until the first checkpoint is complete. So the way this was in the
original patch is the correct way to go, AFAICS.


I can see what you mean now. In fact we're not safe even when the last 
restartpoint is in the last restored WAL file, because we always restore 
segments from the archive to a temporary filename.


Yes, renaming recovery.conf at the first checkpoint avoids that problem. 
However, it means that we'll re-enter archive recovery if we crash 
before that checkpoint is finished. Won't that cause havoc if more files 
have appeared to the archive since the crash, and we restore those even 
though we already started up from an earlier point? How do the timelines 
work in that case?


We could avoid that by performing a good old startup checkpoint, but I 
quite like the fast failover time we get without it.



* my original intention was to deprecate log_restartpoints and would
still like to do so. log_checkpoints does just as well for that. Even
less code than before...


Ok, done.


* comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but
the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version 
because it matches the IS_SHUTDOWN naming.


Fixed.


* In CreateCheckpoint() the if test on TruncateSubtrans() has been
removed, but the comment has not been changed (to explain why).


Thanks, comment updated. (we now call CreateCheckPoint() only after 
recovery is finished)



We should continue to measure performance of recovery in the light of
these changes. I still feel that fsyncing the control file on each
XLogFileRead() will give a noticeable performance penalty, mostly
because we know doing exactly the same thing in normal running caused a
performance penalty. But that is easily changed and cannot be done with
any certainty without wider feedback, so no reason to delay code commit.


I've changed the way minRecoveryPoint is updated now anyway, so it no 
longer happens every XLogFileRead().


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Simon Riggs

On Wed, 2009-02-04 at 19:03 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  * I think we are now renaming the recovery.conf file too early. The
  comment says "We have already restored all the WAL segments we need from
  the archive, and we trust that they are not going to go away even if we
  crash". We have, but the files overwrite each other as they arrive, so
  if the last restartpoint is not in the last restored WAL file then it
  will only exist in the archive. The recovery.conf is the only place
  where we store the information on where the archive is and how to access
  it, so by renaming it out of the way we will be unable to crash recover
  until the first checkpoint is complete. So the way this was in the
  original patch is the correct way to go, AFAICS.
 
 I can see what you mean now. In fact we're not safe even when the last 
 restartpoint is in the last restored WAL file, because we always restore 
 segments from the archive to a temporary filename.
 
 Yes, renaming recovery.conf at the first checkpoint avoids that problem. 
 However, it means that we'll re-enter archive recovery if we crash 
 before that checkpoint is finished. Won't that cause havoc if more files 
 have appeared in the archive since the crash, and we restore those even 
 though we already started up from an earlier point? How do the timelines 
 work in that case?

More archive files being added to the archive is possible, but unlikely.
What *is* a definite problem is that if we ended recovery by manual
command (possible with HS patch) or recovery target then we would have
no record of which point it was that we finished at. It is then possible
that the recovery.conf has been re-edited, causing re-recovery to end at
a different place. That seems like a bad thing.

 We could avoid that by performing a good old startup checkpoint, but I 
 quite like the fast failover time we get without it.

ISTM it's either slow failover or (fast failover, but restart archive
recovery if crashes).

I would suggest that at end of recovery we write the last LSN to the
control file, so if we crash recover then we will always end archive
recovery at the same place again should we re-enter it. So we would have
a recovery_target_lsn that overrides recovery_target_xid etc..

Given where we are, I would suggest we go for the slow failover option
in this release. Doing otherwise is a risky decision with little gain.
BGwriter-in-recovery is a good thing of itself and we have other code to
review yet with a higher importance level, and the rest of HS is code
I'm actually much happier with. I've spent weeks trying to get the
transition between recovery and normal running clean, but it seems like
time to say I didn't get it right *enough* to be absolutely certain.
(Just the fast failover part of patch!). Thanks for the review.

  We should continue to measure performance of recovery in the light
 of
  these changes. I still feel that fsyncing the control file on each
  XLogFileRead() will give a noticeable performance penalty, mostly
  because we know doing exactly the same thing in normal running
 caused a
  performance penalty. But that is easily changed and cannot be done
 with
  any certainty without wider feedback, so no reason to delay code
 commit.
 
 I've changed the way minRecoveryPoint is updated now anyway, so it no 
 longer happens every XLogFileRead().

Care to elucidate?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Fujii Masao
Hi,

On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Yes, and in fact I ran into it myself yesterday while testing. It seems that
 we should reset FatalError earlier, ie. when the recovery starts and
 bgwriter is launched. I'm not sure why in CVS HEAD we don't reset
 FatalError until after the startup process is finished. Resetting it as soon
 as all the processes have been terminated and startup process is launched again
 would seem like a more obvious place to do it.

Which may repeat the recovery crash and reinitialization forever. To prevent
this problem, unexpected death of the startup process should not cause
reinitialization?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 ... I'm not sure why in CVS HEAD we don't reset
 FatalError until after the startup process is finished.

 Which may repeat the recovery crash and reinitialization forever. To prevent
 this problem, unexpected death of the startup process should not cause
 reinitialization?

Fujii-san has it in one.

regards, tom lane



Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Heikki Linnakangas

Tom Lane wrote:

Fujii Masao masao.fu...@gmail.com writes:

On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

... I'm not sure why in CVS HEAD we don't reset
FatalError until after the startup process is finished.



Which may repeat the recovery crash and reinitialization forever. To prevent
this problem, unexpected death of the startup process should not cause
reinitialization?


Fujii-san has it in one.


In CVS HEAD, we always do ExitPostmaster(1) if the startup process dies 
unexpectedly, regardless of FatalError. So it serves no such purpose there.


But yeah, we do have that problem with the patch. What do we want to do 
if the startup process dies with FATAL? It seems reasonable to assume 
that the problem isn't going to just go away if we restart: we'd do 
exactly the same sequence of actions after restart.


But if it's after reaching consistent recovery, the system should still 
be in consistent state, and we could stay open for read-only 
transactions. That would be useful to recover a corrupted database/WAL; 
you could let the WAL replay run as far as it can, and then connect 
and investigate the situation, or run pg_dump.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Heikki Linnakangas

Simon Riggs wrote:
We could avoid that by performing a good old startup checkpoint, but I 
quite like the fast failover time we get without it.


ISTM it's either slow failover or (fast failover, but restart archive
recovery if crashes).

I would suggest that at end of recovery we write the last LSN to the
control file, so if we crash recover then we will always end archive
recovery at the same place again should we re-enter it. So we would have
a recovery_target_lsn that overrides recovery_target_xid etc..


Hmm, we don't actually want to end recovery at the same point again: if 
there's some updates right after the database came up, but before the 
first checkpoint and crash, those actions need to be replayed too.



Given where we are, I would suggest we go for the slow failover option
in this release.


Agreed. We could do it for crash recovery, but I'd rather not have two 
different ways to do it. It's not as important for crash recovery, either.



We should continue to measure performance of recovery in the light

of

these changes. I still feel that fsyncing the control file on each
XLogFileRead() will give a noticeable performance penalty, mostly
because we know doing exactly the same thing in normal running

caused a

performance penalty. But that is easily changed and cannot be done

with

any certainty without wider feedback, so no reason to delay code

commit.

I've changed the way minRecoveryPoint is updated now anyway, so it no 
longer happens every XLogFileRead().


Care to elucidate?


I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
patch soon.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Simon Riggs

On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:

  I've changed the way minRecoveryPoint is updated now anyway, so it no 
  longer happens every XLogFileRead().
  
  Care to elucidate?
 
 I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And 
 it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated 
 patch soon.

Why do you think XLogFlush is called less frequently than XLogFileRead?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-04 Thread Simon Riggs

On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  We could avoid that by performing a good old startup checkpoint, but I 
  quite like the fast failover time we get without it.
  
  ISTM it's either slow failover or (fast failover, but restart archive
  recovery if crashes).
  
  I would suggest that at end of recovery we write the last LSN to the
  control file, so if we crash recover then we will always end archive
  recovery at the same place again should we re-enter it. So we would have
  a recovery_target_lsn that overrides recovery_target_xid etc..
 
 Hmm, we don't actually want to end recovery at the same point again: if 
 there's some updates right after the database came up, but before the 
 first checkpoint and crash, those actions need to be replayed too.

I think we do need to. Crash recovery is supposed to return us to the
same state. Where we ended ArchiveRecovery is part of that state. 

It isn't for crash recovery to decide that it saw a few more
transactions and decided to apply them anyway. The user may have
specifically ended recovery to avoid those additional changes from
taking place.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-03 Thread Fujii Masao
Hi,

On Fri, Jan 30, 2009 at 11:55 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 The startup process now catches SIGTERM, and calls proc_exit() at the next
 WAL record. That's what will happen in a fast shutdown. Unexpected death of
 the startup process is treated the same as a backend/auxiliary process
 crash.

If the startup process dies unexpectedly during automatic recovery
after a crash, the postmaster and bgwriter may get stuck, because
HandleChildCrash() can be called before the FatalError flag is reset. While
FatalError is still true, HandleChildCrash() doesn't kill any auxiliary
processes, so bgwriter survives the crash and the postmaster waits forever
for the death of bgwriter while still in recovery status (which means that
no new connection can be started). Is this a bug?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot standby, recovery infra

2009-02-01 Thread Simon Riggs

On Sat, 2009-01-31 at 22:32 +0200, Heikki Linnakangas wrote:

 If you poison your WAL archive with an XLOG_CRASH_RECOVERY record, 
 recovery will never be able to proceed over that point. There would have 
 to be a switch to ignore those records, at the very least.

Definitely in assert mode only.

I'll do it as a test patch and keep it separate from main line.

 You don't really need to do it with a new WAL record. You could just add 
 a GUC or recovery.conf option along the lines of recovery_target: 
 "crash_target=0/123456", and check for that in ReadRecord or wherever you 
 want the crash to occur.

Knowing that LSN is somewhat harder.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-02-01 Thread Simon Riggs

On Sat, 2009-01-31 at 22:41 +0200, Heikki Linnakangas wrote:

  I like this way because it means we might in the future get Startup
  process to perform post-recovery actions also.
 
 Yeah, it does. Do you have something in mind already?

Yes, but nothing that needs to be discussed yet.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-31 Thread Simon Riggs

On Fri, 2009-01-30 at 13:15 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  I'm thinking to add a new function that will allow crash testing easier.
  
  pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
  which when replayed will just throw a FATAL error and crash Startup
  process. We won't be adding that to the user docs...
  
  This will allow us to produce tests that crash the server at specific
  places, rather than trying to trap those points manually.
 
 Heh, talk about a footgun ;-). I don't think including that in CVS is a 
 good idea.

Thought you'd like it. I'd have preferred it in a plugin... :-(

Not really sure why it's a problem though. We support 
pg_ctl stop -m immediate, which is the equivalent, yet we don't regard
that as a footgun.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-31 Thread Simon Riggs

On Fri, 2009-01-30 at 13:25 +0200, Heikki Linnakangas wrote:
  That whole area was something I was leaving until last, since
 immediate
  shutdown doesn't work either, even in HEAD. (Fujii-san and I
 discussed
  this before Christmas, briefly).
 
 We must handle shutdown gracefully, can't just leave bgwriter running 
 after postmaster exit.

Nobody was suggesting we should.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-31 Thread Simon Riggs

On Fri, 2009-01-30 at 16:55 +0200, Heikki Linnakangas wrote:
 Ok, here's an attempt to make shutdown work gracefully.
 
 Startup process now signals postmaster three times during startup: first 
 when it has done all the initialization, and starts redo. At that point. 
 postmaster launches bgwriter, which starts to perform restartpoints when 
 it deems appropriate. The 2nd time signals when we've reached consistent 
 recovery state. As the patch stands, that's not significant, but it will 
 be with all the rest of the hot standby stuff. The 3rd signal is sent 
 when startup process has finished recovery. Postmaster used to wait for 
 the startup process to exit, and check the return code to determine 
 that, but now that we support shutdown, startup process also returns 
 with 0 exit code when it has been requested to terminate.

Yeh, seems much cleaner.

Slightly bizarre though cos now we're pretty much back to my originally
proposed design. C'est la vie.

I like this way because it means we might in the future get Startup
process to perform post-recovery actions also.

 The startup process now catches SIGTERM, and calls proc_exit() at the 
 next WAL record. That's what will happen in a fast shutdown. Unexpected 
 death of the startup process is treated the same as a backend/auxiliary 
 process crash.

Good. Like your re-arrangement of StartupProcessMain also.


Your call to PMSIGNAL_RECOVERY_COMPLETED needs to be if
(IsUnderPostmaster), or at least a comment to explain why not or perhaps
an Assert.
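
In sketch form, at the call site (PMSIGNAL_RECOVERY_COMPLETED as named in
the patch; the Assert variant shown):

	/* in StartupProcessMain(), once recovery has finished */
	Assert(IsUnderPostmaster);
	SendPostmasterSignal(PMSIGNAL_RECOVERY_COMPLETED);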

Think you need to just throw away this chunk

@@ -5253,7 +5386,7 @@ StartupXLOG(void)
 * Complain if we did not roll forward far enough to render the
backup
 * dump consistent.
 */
-   if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
+   if (InRecovery && !reachedSafeStartPoint)
{
if (reachedStopPoint)   /* stopped because of stop request */
ereport(FATAL,




-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-31 Thread Heikki Linnakangas

Simon Riggs wrote:

On Fri, 2009-01-30 at 13:15 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

I'm thinking to add a new function that will allow crash testing easier.

pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
which when replayed will just throw a FATAL error and crash Startup
process. We won't be adding that to the user docs...

This will allow us to produce tests that crash the server at specific
places, rather than trying to trap those points manually.
Heh, talk about a footgun ;-). I don't think including that in CVS is a 
good idea.


Thought you'd like it. I'd have preferred it in a plugin... :-(

Not really sure why it's a problem though. We support 
pg_ctl stop -m immediate, which is the equivalent, yet we don't regard

that as a footgun.


If you poison your WAL archive with a XLOG_CRASH_RECOVERY record, 
recovery will never be able to proceed over that point. There would have 
to be a switch to ignore those records, at the very least.


pg_ctl stop -m immediate has some use in a production system, while this 
would be a pure development aid. For that purpose, you might as well use a 
patched version.


Presumably you want to test different kind of crashes and at different 
points. FATAL, PANIC, segfault etc. A single crash mechanism like that 
will only test one path.


You don't really need to do it with a new WAL record. You could just add 
a GUC or recovery.conf option along the lines of recovery_target: 
crash_target=0/123456, and check for that in ReadRecord or wherever you 
want the crash to occur.
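
In sketch form, with an invented crash_target option (all names here are
illustrative, not from the patch):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	unsigned	xlogid;		/* log file # */
	unsigned	xrecoff;	/* byte offset within log file */
} XLogRecPtr;

static bool crash_target_set;		/* e.g. crash_target = '0/123456' */
static XLogRecPtr crash_target;

/* called from ReadRecord() with the LSN of the record just read */
static void
MaybeCrashForTesting(XLogRecPtr recptr)
{
	if (!crash_target_set)
		return;

	if (recptr.xlogid > crash_target.xlogid ||
		(recptr.xlogid == crash_target.xlogid &&
		 recptr.xrecoff >= crash_target.xrecoff))
	{
		fprintf(stderr, "FATAL: crash_target reached, simulating crash\n");
		abort();	/* vary this (exit(), raise(SIGSEGV), ...) to exercise
					 * the FATAL, PANIC and segfault paths */
	}
}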


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-31 Thread Heikki Linnakangas

Simon Riggs wrote:

On Fri, 2009-01-30 at 16:55 +0200, Heikki Linnakangas wrote:

Ok, here's an attempt to make shutdown work gracefully.

Startup process now signals postmaster three times during startup: first 
when it has done all the initialization, and starts redo. At that point. 
postmaster launches bgwriter, which starts to perform restartpoints when 
it deems appropriate. The 2nd time signals when we've reached consistent 
recovery state. As the patch stands, that's not significant, but it will 
be with all the rest of the hot standby stuff. The 3rd signal is sent 
when startup process has finished recovery. Postmaster used to wait for 
the startup process to exit, and check the return code to determine 
that, but now that we support shutdown, startup process also returns 
with 0 exit code when it has been requested to terminate.


Yeh, seems much cleaner.

Slightly bizarre though cos now we're pretty much back to my originally
proposed design. C'est la vie.


Yep. I didn't see any objections to that approach in the archives. There 
were other problems in the early versions of the patch, but nothing 
related to this arrangement.



I like this way because it means we might in the future get Startup
process to perform post-recovery actions also.


Yeah, it does. Do you have something in mind already?


Your call to PMSIGNAL_RECOVERY_COMPLETED needs to be if
(IsUnderPostmaster), or at least a comment to explain why not or perhaps
an Assert.


Nah, StartupProcessMain is only run under postmaster; you don't want to 
install signal handlers in a stand-alone backend. Stand-alone backend 
calls StartupXLOG directly.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-30 Thread Heikki Linnakangas
I just realized that the new minSafeStartPoint is actually exactly the 
same concept as the existing minRecoveryPoint. As the recovery 
progresses, we could advance minRecoveryPoint just as well as the new 
minSafeStartPoint.


Perhaps it's a good idea to keep them separate anyway though, the 
original minRecoveryPoint might be a useful debugging aid. Or what do 
you think?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-30 Thread Simon Riggs

On Thu, 2009-01-29 at 20:35 +0200, Heikki Linnakangas wrote:
 Hmm, another point of consideration is how this interacts with the 
 pause/continue. In particular, it was suggested earlier that you
 could 
 put an option into recovery.conf to start in paused mode. If you
 pause 
 recovery, and then stop and restart the server, and have that option
 in 
 recovery.conf, I would expect that when you enter consistent recovery 
 you're at the exact same paused location as before stopping the
 server. 
 The upshot of that is that we need to set minSafeStartPoint to that 
 exact location, at least when you pause & stop in a controlled
 fashion.

OK, makes sense.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-30 Thread Simon Riggs

On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote:
 Heikki Linnakangas wrote:
  It looks like if you issue a fast shutdown during recovery, postmaster 
  doesn't kill bgwriter.
 
 Hmm, seems like we haven't thought through how shutdown during 
 consistent recovery is supposed to behave in general. Right now, smart 
 shutdown doesn't do anything during consistent recovery, because the 
 startup process will just keep going. And fast shutdown will simply 
 ExitPostmaster(1), which is clearly not right.

That whole area was something I was leaving until last, since immediate
shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
this before Christmas, briefly).

 I'm thinking that in both smart and fast shutdown, the startup process 
 should exit in a controlled way as soon as it's finished with the 
 current WAL record, and set minSafeStartPoint to the current point in 
 the replay.

That makes sense, though it isn't required.

 I wonder if bgwriter should perform a restartpoint before exiting? 
 You'll have to start with recovery on the next startup anyway, but at 
 least we could minimize the amount of WAL that needs to be replayed.

That seems like extra work for no additional benefit.

I think we're beginning to blur the lines between review and you just
adding some additional stuff in this area. There's nothing to stop you
doing further changes after this has been committed. We can also commit
what we have with some caveats also, i.e. commit in pieces.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-30 Thread Simon Riggs

On Fri, 2009-01-30 at 11:33 +0200, Heikki Linnakangas wrote:
 I just realized that the new minSafeStartPoint is actually exactly the 
 same concept as the existing minRecoveryPoint. As the recovery 
 progresses, we could advance minRecoveryPoint just as well as the new 
 minSafeStartPoint.
 
 Perhaps it's a good idea to keep them separate anyway though, the 
 original minRecoveryPoint might be a useful debugging aid. Or what do 
 you think?

I think we've been confusing ourselves substantially. The patch already
has everything it needs, but there is a one-line-fixable bug where
Fujii-san says.

The code comments already explain how this works

* There are two points in the log that we must pass. The first
* is minRecoveryPoint, which is the LSN at the time the
* base backup was taken that we are about to rollforward from.
* If recovery has ever crashed or was stopped there is also
* another point: minSafeStartPoint, which is the latest
* LSN that recovery could have reached prior to the crash.

The later message

FATAL:  WAL ends before end time of backup dump

was originally triggered if

if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))

and I changed that. Now I look at it again, I see that the original if
test, shown above, is correct and should not have been changed.

Other than that, I don't see the need for further change. Heikki's
suggestions to write a new minSafeStartPoint are good ones and fit
within the existing mechanisms and meanings of these variables.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-30 Thread Simon Riggs

On Thu, 2009-01-29 at 14:21 +0200, Heikki Linnakangas wrote:
 It looks like if you issue a fast shutdown during recovery, postmaster 
 doesn't kill bgwriter.

Thanks for the report.

I'm thinking to add a new function that will allow crash testing easier.

pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
which when replayed will just throw a FATAL error and crash Startup
process. We won't be adding that to the user docs...

This will allow us to produce tests that crash the server at specific
places, rather than trying to trap those points manually.

 Seems that reaper() needs to be taught that bgwriter can be active 
 during consistent recovery. I'll take a look at how to do that.
 
 
 BTW, the message "terminating connection ..." is a bit misleading. It's 
 referring to the startup process, which is hardly a connection. We have 
 that in CVS HEAD too, so it's not something introduced by the patch, but 
 seems worth changing in HS, since we then let real connections in while 
 startup process is still running.
 
-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-30 Thread Heikki Linnakangas

Simon Riggs wrote:

I'm thinking to add a new function that will allow crash testing easier.

pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
which when replayed will just throw a FATAL error and crash Startup
process. We won't be adding that to the user docs...

This will allow us to produce tests that crash the server at specific
places, rather than trying to trap those points manually.


Heh, talk about a footgun ;-). I don't think including that in CVS is a 
good idea.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-30 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote:
Hmm, seems like we haven't thought through how shutdown during 
consistent recovery is supposed to behave in general. Right now, smart 
shutdown doesn't do anything during consistent recovery, because the 
startup process will just keep going. And fast shutdown will simply 
ExitPostmaster(1), which is clearly not right.


That whole area was something I was leaving until last, since immediate
shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
this before Christmas, briefly).


We must handle shutdown gracefully, can't just leave bgwriter running 
after postmaster exit.


Hmm, why does pg_standby catch SIGQUIT? Seems it could just let it kill 
the process.
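
If the handler were dropped, the change would amount to leaving SIGQUIT at
its default disposition, along these lines (a sketch, not the actual
pg_standby source):

#include <signal.h>

int
main(void)
{
	/*
	 * With the default disposition, SIGQUIT kills pg_standby outright,
	 * so the startup process's system() sees a child terminated by a
	 * signal instead of pg_standby swallowing it.
	 */
	(void) signal(SIGQUIT, SIG_DFL);

	/* ... wait for and restore WAL segments as usual ... */
	return 0;
}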


I wonder if bgwriter should perform a restartpoint before exiting? 
You'll have to start with recovery on the next startup anyway, but at 
least we could minimize the amount of WAL that needs to be replayed.


That seems like extra work for no additional benefit.

I think we're beginning to blur the lines between review and you just
adding some additional stuff in this area. There's nothing to stop you
doing further changes after this has been committed.


Sure. I think the shutdown restartpoint might actually fall out of the 
way the code is structured anyway: bgwriter normally performs a 
checkpoint before exiting.



We can also commit
what we have with some caveats also, i.e. commit in pieces.


This late in the release cycle, I don't want to commit anything that we 
would have to rip out if we run out of time. There is no difference from 
review or testing point of view whether the code is in CVS or not.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Simon Riggs

On Thu, 2009-01-29 at 09:34 +0200, Heikki Linnakangas wrote:

 It does *during recovery*, before InitXLogAccess is called. Yeah, it's
 harmless currently. It would be pretty hard to keep it up-to-date in 
 bgwriter and other processes. I think it's better to keep it at 0,
 which is clearly an invalid value, than try to keep it up-to-date and
 risk using an old value. TimeLineID is not used in a lot of places, 
 currently. It might be a good idea to add some Assert(TimeLineID !=
 0) to places where it is used.

Agreed. TimeLineID is a normal-running concept used for writing WAL.
Perhaps we should even solidify the meaning of TimeLineID == 0 as
"cannot write WAL".

I see a problem there for any process that exists both before and after
recovery ends, which includes bgwriter. In that case we must not flush
WAL before recovery ends, yet afterwards we *must* flush WAL. To do that
we *must* have a valid TimeLineID set.

I would suggest we put InitXLogAccess into IsRecoveryProcessingMode(),
so if the mode changes we immediately set everything we need to allow
XLogFlush() calls to work correctly.
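
In sketch form (simplified; the shared-state accessor is a stand-in):

#include <stdbool.h>

/* true means "not known yet, check shared memory" */
static bool LocalRecoveryProcessingMode = true;

extern bool SharedRecoveryProcessingMode(void);	/* stand-in: reads XLogCtl */
extern void InitXLOGAccess(void);				/* sets TimeLineID etc. */

bool
IsRecoveryProcessingMode(void)
{
	if (!LocalRecoveryProcessingMode)
		return false;			/* we already saw recovery end */

	LocalRecoveryProcessingMode = SharedRecoveryProcessingMode();
	if (!LocalRecoveryProcessingMode)
		InitXLOGAccess();		/* recovery just ended: set the timeline
								 * so XLogFlush() works correctly */

	return LocalRecoveryProcessingMode;
}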

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

If archive recovery fails after it reaches the last valid record
in the last unfilled WAL segment, subsequent recovery might cause
the following fatal error. This is because minSafeStartPoint indicates
the end of the last unfilled WAL segment which subsequent recovery
cannot reach. Is this bug? (I'm not sure how to fix this problem
because I don't understand yet why minSafeStartPoint is required.)


FATAL:  WAL ends before end time of backup dump


I think you're right. We need a couple of changes to avoid confusing
messages.


Hmm, we could update minSafeStartPoint in XLogFlush instead. That was 
suggested when the idea of minSafeStartPoint was first thought of. 
Updating minSafeStartPoint is analogous to flushing WAL: 
minSafeStartPoint must be advanced to X before we can flush a data page 
with LSN X. To avoid excessive controlfile updates, whenever we update 
minSafeStartPoint, we can update it to the latest WAL record we've read.
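
For illustration, the rule would be roughly this (stand-in types, no
locking shown; not the actual XLogFlush):

#include <stdbool.h>

typedef struct
{
	unsigned	xlogid;		/* log file # */
	unsigned	xrecoff;	/* byte offset within log file */
} XLogRecPtr;

static XLogRecPtr minSafeStartPoint;	/* mirrored in pg_control */
static XLogRecPtr lastRecordRead;		/* latest WAL record replayed */

extern void UpdateControlFile(void);	/* stand-in: write + fsync pg_control */

static bool
XLByteLTSketch(XLogRecPtr a, XLogRecPtr b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff < b.xrecoff);
}

/* called before flushing a data page with LSN 'pageLSN' during recovery */
void
AdvanceMinSafeStartPoint(XLogRecPtr pageLSN)
{
	if (XLByteLTSketch(minSafeStartPoint, pageLSN))
	{
		/* push it to the last record read, not just pageLSN,
		 * to batch control file updates */
		minSafeStartPoint = lastRecordRead;
		UpdateControlFile();
	}
	/* ... the page with pageLSN may now be flushed safely ... */
}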


Or we could simply ignore that error if we've reached minSafeStartPoint 
- 1 segment, assuming that even though minSafeStartPoint is higher, we 
can't have gone past the end of valid WAL records in the last segment in 
previous recovery either. But that feels more fragile.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Simon Riggs

On Thu, 2009-01-29 at 11:20 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
  Hi,
 
  On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com 
  wrote:
  I feel quite good about this patch now. Given the amount of code churn, 
  it
  requires testing, and I'll read it through one more time after sleeping 
  over
  it. Simon, do you see anything wrong with this?
  I also read this patch and found something odd. I apologize if I misread 
  it..
  If archive recovery fails after it reaches the last valid record
  in the last unfilled WAL segment, subsequent recovery might cause
  the following fatal error. This is because minSafeStartPoint indicates
  the end of the last unfilled WAL segment which subsequent recovery
  cannot reach. Is this bug? (I'm not sure how to fix this problem
  because I don't understand yet why minSafeStartPoint is required.)
 
  FATAL:  WAL ends before end time of backup dump
  
  I think you're right. We need a couple of changes to avoid confusing
  messages.
 
 Hmm, we could update minSafeStartPoint in XLogFlush instead. That was 
 suggested when the idea of minSafeStartPoint was first thought of. 
 Updating minSafeStartPoint is analogous to flushing WAL: 
 minSafeStartPoint must be advanced to X before we can flush a data page 
 with LSN X. To avoid excessive controlfile updates, whenever we update 
 minSafeStartPoint, we can update it to the latest WAL record we've read.
 
 Or we could simply ignore that error if we've reached minSafeStartPoint 
 - 1 segment, assuming that even though minSafeStartPoint is higher, we 
 can't have gone past the end of valid WAL records in the last segment in 
 previous recovery either. But that feels more fragile.

My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
that we are still recovering up to the point of the end of the base
backup. Once we reach minSafeStartPoint we then switch state to
DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
enables writing new minSafeStartPoints when we open new WAL files in the
future. 

We then have messages only when in DB_IN_ARCHIVE_RECOVERY_BASE state

  if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint) &&
      ControlFile->state == DB_IN_ARCHIVE_RECOVERY_BASE)
  {
    if (reachedStopPoint) /* stopped because of stop request */
      ereport(FATAL,
          (errmsg("requested recovery stop point is before end time of backup dump")));
    else /* ran off end of WAL */
      ereport(FATAL,
          (errmsg("WAL ends before end time of backup dump")));
  }

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Heikki Linnakangas

Simon Riggs wrote:

My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
that we are still recovering up to the point of the end of the base
backup. Once we reach minSafeStartPoint we then switch state to
DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
enables writing new minSafeStartPoints when we open new WAL files in the
future. 


I don't see how that helps, the bug has nothing to with base backups. It 
comes from the fact that we set minSafeStartPoint beyond the actual end 
of WAL, if the last WAL segment is only partially filled (= fails CRC 
check at some point). If we crash after setting minSafeStartPoint like 
that, and then restart recovery, we'll get the error.


The last WAL segment could be partially filled for example because the 
DBA has manually copied the last unarchived WAL segments to pg_xlog, as 
we recommend in the manual.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Heikki Linnakangas
It looks like if you issue a fast shutdown during recovery, postmaster 
doesn't kill bgwriter.


...
LOG:  restored log file 00010028 from archive
LOG:  restored log file 00010029 from archive
LOG:  consistent recovery state reached at 0/295C
...
LOG:  restored log file 0001002F from archive
LOG:  restored log file 00010030 from archive
LOG:  restored log file 00010031 from archive
LOG:  restored log file 00010032 from archive
LOG:  restored log file 00010033 from archive
LOG:  restartpoint starting: time
LOG:  restored log file 00010034 from archive
LOG:  received fast shutdown request
LOG:  restored log file 00010035 from archive
FATAL:  terminating connection due to administrator command
LOG:  startup process (PID 14137) exited with exit code 1
LOG:  aborting startup due to startup process failure
hlinn...@heikkilaptop:~/pgsql$
hlinn...@heikkilaptop:~/pgsql$ LOG:  restartpoint complete: wrote 3324 
buffers (5.1%); write=13.996 s, sync=2.016 s, total=16.960 s

LOG:  recovery restart point at 0/3000FA14

Seems that reaper() needs to be taught that bgwriter can be active 
during consistent recovery. I'll take a look at how to do that.



BTW, the message "terminating connection ..." is a bit misleading. It's 
referring to the startup process, which is hardly a connection. We have 
that in CVS HEAD too, so it's not something introduced by the patch, but 
seems worth changing in HS, since we then let real connections in while 
startup process is still running.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Simon Riggs

On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
  another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
  that we are still recovering up to the point of the end of the base
  backup. Once we reach minSafeStartPoint we then switch state to
  DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
  enables writing new minSafeStartPoints when we open new WAL files in the
  future. 
 
 I don't see how that helps, the bug has nothing to with base backups. 

Sorry, disagree.

 It 
 comes from the fact that we set minSafeStartPoint beyond the actual end 
 of WAL, if the last WAL segment is only partially filled (= fails CRC 
 check at some point). If we crash after setting minSafeStartPoint like 
 that, and then restart recovery, we'll get the error.

Look again please. My proposal would avoid the error when it is not
relevant, yet keep it when it is (while recovering base backups). 

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote:
It 
comes from the fact that we set minSafeStartPoint beyond the actual end 
of WAL, if the last WAL segment is only partially filled (= fails CRC 
check at some point). If we crash after setting minSafeStartPoint like 
that, and then restart recovery, we'll get the error.


Look again please. My proposal would avoid the error when it is not
relevant, yet keep it when it is (while recovering base backups). 


I fail to see what base backups have to do with this. The problem arises 
in this scenario:


0. A base backup is unzipped. recovery.conf is copied in place, and the 
remaining unarchived WAL segments are copied from the primary server to 
pg_xlog. The last WAL segment is only partially filled. Let's say that 
redo point is in WAL segment 1. The last, partial, WAL segment is 3, and 
WAL ends at 0/350

1. postmaster is started, recovery starts.
2. WAL segment 1 is restored from archive.
3. We reach consistent recovery point
4. We restore WAL segment 2 from archive. minSafeStartPoint is advanced 
to 0/300
5. WAL segment 2 is completely replayed, we move on to WAL segment 3. It 
is not in archive, but it's found in pg_xlog. minSafeStartPoint is 
advanced to 0/400. Note that that's beyond end of WAL.
6. At replay of WAL record 0/320, the recovery is interrupted. For 
example, by a fast shutdown request, or crash.


Now when we restart the recovery, we will never reach minSafeStartPoint, 
which is now 0/400, and we'll fail with the error that Fujii-san 
pointed out. We're already way past the min recovery point of base 
backup by then.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Simon Riggs

On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:

 Now when we restart the recovery, we will never reach
 minSafeStartPoint, which is now 0/400, and we'll fail with the
 error that Fujii-san pointed out. We're already way past the min
 recovery point of base backup by then.

The problem was that we reported this error

FATAL:  WAL ends before end time of backup dump

and this is inappropriate because, as you say, we are way past the min
recovery point of base backup.

If you look again at my proposal you will see that the proposal avoids
the above error by keeping track of whether we are past the point of
base backup or not. If we are still in base backup we get the error and
if we are past it we do not.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:


Now when we restart the recovery, we will never reach
minSafeStartPoint, which is now 0/400, and we'll fail with the
error that Fujii-san pointed out. We're already way past the min
recovery point of base backup by then.


The problem was that we reported this error

FATAL:  WAL ends before end time of backup dump

and this is inappropriate because, as you say, we are way past the min
recovery point of base backup.

If you look again at my proposal you will see that the proposal avoids
the above error by keeping track of whether we are past the point of
base backup or not. If we are still in base backup we get the error and
if we are passed it we do not.


Oh, we would simply ignore the fact that we haven't reached the 
minSafeStartPoint at the end of recovery, and start up anyway. Ok, that 
would avoid the problem Fujii-san described. It's like my suggestion of 
ignoring the message if we're at minSafeStartPoint - 1 segment, just 
more lenient. I don't understand why you'd need a new control file 
state, though.


You'd lose the extra protection minSafeStartPoint gives, though. For 
example, if you interrupt recovery and move recovery point backwards, we 
could refuse to start up when it's not safe to do so. It's currently a 
don't do that! case, but we could protect against that with 
minSafeStartPoint.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
It looks like if you issue a fast shutdown during recovery, postmaster 
doesn't kill bgwriter.


Hmm, seems like we haven't thought through how shutdown during 
consistent recovery is supposed to behave in general. Right now, smart 
shutdown doesn't do anything during consistent recovery, because the 
startup process will just keep going. And fast shutdown will simply 
ExitPostmaster(1), which is clearly not right.


I'm thinking that in both smart and fast shutdown, the startup process 
should exit in a controlled way as soon as it's finished with the 
current WAL record, and set minSafeStartPoint to the current point in 
the replay.


I wonder if bgwriter should perform a restartpoint before exiting? 
You'll have to start with recovery on the next startup anyway, but at 
least we could minimize the amount of WAL that needs to be replayed.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-29 Thread Heikki Linnakangas

Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:


Now when we restart the recovery, we will never reach
minSafeStartPoint, which is now 0/400, and we'll fail with the
error that Fujii-san pointed out. We're already way past the min
recovery point of base backup by then.


The problem was that we reported this error

FATAL:  WAL ends before end time of backup dump

and this is inappropriate because, as you say, we are way past the min
recovery point of base backup.

If you look again at my proposal you will see that the proposal avoids
the above error by keeping track of whether we are past the point of
base backup or not. If we are still in base backup we get the error and
if we are past it we do not.


Oh, we would simply ignore the fact that we haven't reached the 
minSafeStartPoint at the end of recovery, and start up anyway. Ok, that 
would avoid the problem Fujii-san described. It's like my suggestion of 
ignoring the message if we're at minSafeStartPoint - 1 segment, just 
more lenient. I don't understand why you'd need a new control file 
state, though.


You'd lose the extra protection minSafeStartPoint gives, though. For 
example, if you interrupt recovery and move recovery point backwards, we 
could refuse to start up when it's not safe to do so. It's currently a 
don't do that! case, but we could protect against that with 
minSafeStartPoint.


Hmm, another point of consideration is how this interacts with the 
pause/continue. In particular, it was suggested earlier that you could 
put an option into recovery.conf to start in paused mode. If you pause 
recovery, and then stop and restart the server, and have that option in 
recovery.conf, I would expect that when you enter consistent recovery 
you're at the exact same paused location as before stopping the server. 
The upshot of that is that we need to set minSafeStartPoint to that 
exact location, at least when you pause & stop in a controlled fashion.
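
(In terms of the earlier sketch: on such a controlled stop the startup
process would do, before exiting,

	minSafeStartPoint = lastRecordRead;	/* the exact paused location */
	UpdateControlFile();

so that the next startup resumes paused at the same spot.)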


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] Hot standby, recovery infra

2009-01-28 Thread Heikki Linnakangas

I've been reviewing and massaging the so called recovery infra patch.

To recap, the goal is to:
- start background writer during (archive) recovery
- skip the shutdown checkpoint at the end of recovery. Instead, the 
database is brought up immediately, and the bgwriter performs a normal 
online checkpoint, while we're already accepting connections.
- keep track of when we reach a consistent point in the recovery, where 
we could let read-only backends in. Which is obviously required for hot 
standby


The 1st and 2nd points provide some useful functionality, even without 
the rest of the hot standby patch.


I've refactored the patch quite heavily, making it a lot simpler, and 
over 1/3 smaller than before:


The signaling between the bgwriter and startup process during recovery 
was quite complicated. The startup process periodically sent checkpoint 
records to the bgwriter, so that bgwriter could perform restart points. 
I've replaced that by storing the last seen checkpoint in shared 
memory in xlog.c. CreateRestartPoint() picks it up from there. This 
means that bgwriter can decide autonomously when to perform a restart 
point, it no longer needs to be told to do so by the startup process. 
Which is nice in a standby. What could happen before is that the standby 
processes a checkpoint record, and decides not to make it a restartpoint 
because not enough time has passed since last one. If we then get a long 
idle period after that, we'd really want to make the previous checkpoint 
record a restart point after all, after some time has passed. That is 
what will happen now, which is a usability enhancement, although the 
real motivation for this refactoring was to make the code simpler.
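
In outline, the handoff works something like this (heavily simplified, no
locking shown; names only approximate the patch):

typedef struct
{
	unsigned	xlogid;
	unsigned	xrecoff;
} XLogRecPtr;

typedef struct
{
	XLogRecPtr	redo;		/* where replay must restart from */
	/* ... rest of the checkpoint record ... */
} CheckPoint;

/* lives in shared memory with the rest of XLogCtl (attach not shown) */
static struct
{
	XLogRecPtr	lastCheckPointRecPtr;	/* LSN of last replayed checkpoint */
	CheckPoint	lastCheckPoint;			/* its contents */
} *XLogCtlSketch;

/* startup process: remember each checkpoint record as it is replayed */
void
RememberCheckpoint(XLogRecPtr recptr, const CheckPoint *checkPoint)
{
	XLogCtlSketch->lastCheckPointRecPtr = recptr;
	XLogCtlSketch->lastCheckPoint = *checkPoint;
}

/* bgwriter: whenever its own timing logic decides a restartpoint is due */
void
CreateRestartPointSketch(void)
{
	CheckPoint	cp = XLogCtlSketch->lastCheckPoint;

	/* ... flush dirty buffers, then record cp.redo in pg_control ... */
	(void) cp;
}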


The bgwriter is now always responsible for all checkpoints and 
restartpoints. (well, except for a stand-alone backend). Which makes it 
easier to understand what's going on, IMHO.


There was one pretty fundamental bug in the minSafeStartPoint handling: 
it was always set when a WAL file was opened for reading. Which means it 
was also moved backwards when the recovery began by reading the WAL 
segment containing last restart/checkpoint, rendering it useless for the 
purpose it was designed. Fortunately that was easy to fix. Another tiny 
bug was that log_restartpoints was not respected, because it was stored 
in a variable in startup process' memory, and wasn't seen by bgwriter.


One aspect that troubles me a bit is the changes in XLogFlush. I guess 
we no longer have the problem that you can't start up the database if 
we've read in a corrupted page from disk, because we now start up before 
checkpointing. However, it does mean that if a corrupt page is read into 
shared buffers, we'll never be able to checkpoint. But then again, I 
guess that's already true without this patch.



I feel quite good about this patch now. Given the amount of code churn, 
it requires testing, and I'll read it through one more time after 
sleeping over it. Simon, do you see anything wrong with this?


(this patch is also in my git repository at git.postgresql.org, branch 
recoveryinfra.)


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bd6035d..30fea49 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -119,12 +119,26 @@ CheckpointStatsData CheckpointStats;
  */
 TimeLineID	ThisTimeLineID = 0;
 
-/* Are we doing recovery from XLOG? */
+/*
+ * Are we doing recovery from XLOG? 
+ *
+ * This is only ever true in the startup process, when it's replaying WAL.
+ * It's used in functions that need to act differently when called from a
+ * redo function (e.g skip WAL logging).  To check whether the system is in
+ * recovery regardless of what process you're running in, use
+ * IsRecoveryProcessingMode().
+ */
 bool		InRecovery = false;
 
 /* Are we recovering using offline XLOG archives? */
 static bool InArchiveRecovery = false;
 
+/*
+ * Local copy of shared RecoveryProcessingMode variable. True actually
+ * means not known, need to check the shared state
+ */
+static bool LocalRecoveryProcessingMode = true;
+
 /* Was the last xlog file restored from archive, or local? */
 static bool restoredFromArchive = false;
 
@@ -133,16 +147,22 @@ static char *recoveryRestoreCommand = NULL;
 static bool recoveryTarget = false;
 static bool recoveryTargetExact = false;
 static bool recoveryTargetInclusive = true;
-static bool recoveryLogRestartpoints = false;
 static TransactionId recoveryTargetXid;
 static TimestampTz recoveryTargetTime;
 static TimestampTz recoveryLastXTime = 0;
+/*
+ * log_restartpoints is stored in shared memory because it needs to be
+ * accessed by bgwriter when it performs restartpoints
+ */
 
 /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
 static TransactionId recoveryStopXid;
 static TimestampTz recoveryStopTime;
 static bool recoveryStopAfter;
 

Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Simon Riggs

On Wed, 2009-01-28 at 12:04 +0200, Heikki Linnakangas wrote:
 I've been reviewing and massaging the so called recovery infra patch.

Thanks.

 I feel quite good about this patch now. Given the amount of code
 churn, it requires testing, and I'll read it through one more time
 after sleeping over it. 

There's nothing major I feel we should discuss.

The way restartpoints happen is a useful improvement, thanks.

 Simon, do you see anything wrong with this?

Few minor points

* I think we are now renaming the recovery.conf file too early. The
comment says "We have already restored all the WAL segments we need from
the archive, and we trust that they are not going to go away even if we
crash". We have, but the files overwrite each other as they arrive, so
if the last restartpoint is not in the last restored WAL file then it
will only exist in the archive. The recovery.conf is the only place
where we store the information on where the archive is and how to access
it, so by renaming it out of the way we will be unable to crash recover
until the first checkpoint is complete. So the way this was in the
original patch is the correct way to go, AFAICS.

* my original intention was to deprecate log_restartpoints and would
still like to do so. log_checkpoints does just as well for that. Even
less code than before...

* comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but
the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version
because it matches the IS_SHUTDOWN naming.

* In CreateCheckpoint() the if test on TruncateSubtrans() has been
removed, but the comment has not been changed (to explain why).

* PG_CONTROL_VERSION bump should be just one increment, to 844. I
deliberately had it higher to help spot mismatches earlier, and to avoid
needless patch conflicts.

So it looks pretty much ready for commit very soon.

We should continue to measure performance of recovery in the light of
these changes. I still feel that fsyncing the control file on each
XLogFileRead() will give a noticeable performance penalty, mostly
because we know doing exactly the same thing in normal running caused a
performance penalty. But that is easily changed and cannot be done with
any certainty without wider feedback, so no reason to delay code commit.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Fujii Masao
Hi,

On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I've been reviewing and massaging the so called recovery infra patch.

Great!

 I feel quite good about this patch now. Given the amount of code churn, it
 requires testing, and I'll read it through one more time after sleeping over
 it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

 @@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
   pg_time_t   now;
   pg_time_t   last_time;

 - if (XLogArchiveTimeout <= 0)
 + if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())

The above change destroys archive_timeout because checking the timeout
is always skipped after recovery is done.

 @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
*/
   PG_SETMASK(UnBlockSig);

 + BgWriterRecoveryMode = IsRecoveryProcessingMode();
 +
 + if (BgWriterRecoveryMode)
 + elog(DEBUG1, "bgwriter starting during recovery");
 + else
 + InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

 @@ -1302,7 +1314,7 @@ ServerLoop(void)
* state that prevents it, start one.  It doesn't matter if this
* fails, we'll just try again later.
*/
 - if (BgWriterPID == 0 && pmState == PM_RUN)
 + if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == 
 PM_RECOVERY))
   BgWriterPID = StartBackgroundWriter();

Likewise, we should try to start also the stats collector during recovery?

 @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
   unlink(tmppath);
   }

 - elog(DEBUG2, "done creating and filling new WAL file");
 + XLogFileName(tmppath, ThisTimeLineID, log, seg);
 + elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file
represented as tmppath might already have been created by a
previous XLogFileInit() via InstallXLogFileSegment().

regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Simon Riggs

On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:

  @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
   */
  PG_SETMASK(UnBlockSig);
 
  +   BgWriterRecoveryMode = IsRecoveryProcessingMode();
  +
  +   if (BgWriterRecoveryMode)
  +   elog(DEBUG1, "bgwriter starting during recovery");
  +   else
  +   InitXLOGAccess();
 
 Why is InitXLOGAccess() called also here when bgwriter is started after
 recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

  @@ -1302,7 +1314,7 @@ ServerLoop(void)
   * state that prevents it, start one.  It doesn't matter if this
   * fails, we'll just try again later.
   */
  -   if (BgWriterPID == 0 && pmState == PM_RUN)
  +   if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == 
  PM_RECOVERY))
  BgWriterPID = StartBackgroundWriter();
 
 Likewise, we should try to start also the stats collector during recovery?

We did in the previous patch...

  @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
  unlink(tmppath);
  }
 
  -   elog(DEBUG2, "done creating and filling new WAL file");
  +   XLogFileName(tmppath, ThisTimeLineID, log, seg);
  +   elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
 
 This debug message is somewhat confusing, because the WAL file
 represented as tmppath might have been already created by
 previous XLogFileInit() via InstallXLogFileSegment().

I think those are just for debugging and can be removed.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Fujii Masao
Hi,

On Wed, Jan 28, 2009 at 11:47 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:

  @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
   */
  PG_SETMASK(UnBlockSig);
 
  +   BgWriterRecoveryMode = IsRecoveryProcessingMode();
  +
  +   if (BgWriterRecoveryMode)
   +   elog(DEBUG1, "bgwriter starting during recovery");
  +   else
  +   InitXLOGAccess();

 Why is InitXLOGAccess() called also here when bgwriter is started after
 recovery? That is already called by AuxiliaryProcessMain().

 InitXLOGAccess() sets the timeline and also gets the latest record
 pointer. If the bgwriter is started in recovery these values need to be
 reset later. It's easier to call it twice.

Right. But the InitXLOGAccess() called during the main loop is enough for
that purpose.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Simon Riggs

On Wed, 2009-01-28 at 23:54 +0900, Fujii Masao wrote:
  Why is InitXLOGAccess() called also here when bgwriter is started after
  recovery? That is already called by AuxiliaryProcessMain().
 
  InitXLOGAccess() sets the timeline and also gets the latest record
  pointer. If the bgwriter is started in recovery these values need to be
  reset later. It's easier to call it twice.
 
 Right. But, InitXLOGAccess() called during main loop is enough for
 that purpose.

I think the code is clearer the way it is. Otherwise you'd read
AuxiliaryProcessMain() and think the bgwriter didn't need xlog access.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Heikki Linnakangas

Fujii Masao wrote:

On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

I feel quite good about this patch now. Given the amount of code churn, it
requires testing, and I'll read it through one more time after sleeping over
it. Simon, do you see anything wrong with this?


I also read this patch and found something odd. 


Many thanks for looking into this!


@@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
pg_time_t   now;
pg_time_t   last_time;

-   if (XLogArchiveTimeout <= 0)
+   if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())


The above change destroys archive_timeout because checking the timeout
is always skipped after recovery is done.


Yep, good catch. That obviously needs to be 
IsRecoveryProcessingMode(), without the exclamation mark.
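
That is, the corrected condition would read:

	if (XLogArchiveTimeout <= 0 || IsRecoveryProcessingMode())
		return;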



@@ -355,6 +359,27 @@ BackgroundWriterMain(void)
 */
PG_SETMASK(UnBlockSig);

+   BgWriterRecoveryMode = IsRecoveryProcessingMode();
+
+   if (BgWriterRecoveryMode)
+   elog(DEBUG1, "bgwriter starting during recovery");
+   else
+   InitXLOGAccess();


Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().


Oh, I didn't realize that. Agreed, it's a bit disconcerting that 
InitXLOGAccess() is called twice (there was a 2nd call within the loop 
in the original patch as well). Looking at InitXLOGAccess, it's harmless 
to call it multiple times, but it seems better to remove the 
InitXLOGAccess call from AuxiliaryProcessMain().


InitXLOGAccess() needs to be called after seeing that 
IsRecoveryProcessingMode() == false, because it won't get the right 
timeline id before that.



@@ -1302,7 +1314,7 @@ ServerLoop(void)
 * state that prevents it, start one.  It doesn't matter if this
 * fails, we'll just try again later.
 */
-   if (BgWriterPID == 0 && pmState == PM_RUN)
+   if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == 
PM_RECOVERY))
BgWriterPID = StartBackgroundWriter();


Likewise, we should try to start also the stats collector during recovery?


Hmm, there's not much point as this patch stands, but I guess we should 
in the hot standby patch, where we let backends in.



@@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
unlink(tmppath);
}

-   elog(DEBUG2, "done creating and filling new WAL file");
+   XLogFileName(tmppath, ThisTimeLineID, log, seg);
+   elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);


This debug message is somewhat confusing, because the WAL file
represented as tmppath might have been already created by
previous XLogFileInit() via InstallXLogFileSegment().


I don't quite understand what you're saying, but I think I'll just 
revert that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Fujii Masao
Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
 I feel quite good about this patch now. Given the amount of code churn, it
 requires testing, and I'll read it through one more time after sleeping over
 it. Simon, do you see anything wrong with this?

 I also read this patch and found something odd. I apologize if I misread it..

If archive recovery fails after it reaches the last valid record
in the last unfilled WAL segment, subsequent recovery might cause
the following fatal error. This is because minSafeStartPoint indicates
the end of the last unfilled WAL segment which subsequent recovery
cannot reach. Is this bug? (I'm not sure how to fix this problem
because I don't understand yet why minSafeStartPoint is required.)

 FATAL:  WAL ends before end time of backup dump

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Fujii Masao
Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
 I feel quite good about this patch now. Given the amount of code churn, it
 requires testing, and I'll read it through one more time after sleeping over
 it. Simon, do you see anything wrong with this?

 I also read this patch and found something odd. I apologize if I misread it..

Sorry for my random reply.

Though this is a matter of taste, I think that it's weird that bgwriter
runs with ThisTimeLineID = 0 during recovery. This is because
XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
be a cause of bugs in the near future, though this is harmless currently.

 + /*
 +  * XXX: Should we try to perform restartpoints if we're not in 
 consistent
 +  * recovery? The bgwriter isn't doing it for us in that case.
 +  */

I think yes. This is helpful for systems where it takes a long time to get
a base backup, i.e. where it also takes a long time to reach a consistent
recovery point.

 +CreateRestartPoint(int flags)
snip
 +  * We rely on this lock to ensure that the startup process doesn't exit
 +  * Recovery while we are half way through a restartpoint. XXX ?
*/
 + LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

Is this comment correct? CheckpointLock cannot prevent the startup process
from exiting recovery because the startup process doesn't acquire that lock.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Simon Riggs

On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:
 Hi,
 
 On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
  I feel quite good about this patch now. Given the amount of code churn, it
  requires testing, and I'll read it through one more time after sleeping 
  over
  it. Simon, do you see anything wrong with this?
 
  I also read this patch and found something odd. I apologize if I misread 
  it..
 
 Sorry for my random reply.
 
 Though this is a matter of taste, I think that it's weird that the bgwriter
 runs with ThisTimeLineID = 0 during recovery. This is because
 XLogCtl->ThisTimeLineID is only set at the end of recovery. ISTM this will
 become a source of bugs in the near future, though it's harmless currently.

It doesn't. That's exactly what InitXLogAccess() was for.

  +   /*
  +    * XXX: Should we try to perform restartpoints if we're not in consistent
  +    * recovery? The bgwriter isn't doing it for us in that case.
  +    */
 
 I think yes. This is helpful for a system where taking a base backup takes
 a long time, i.e. where it also takes a long time to reach a consistent
 recovery point.

The original patch did this.

  +CreateRestartPoint(int flags)
 snip
  +    * We rely on this lock to ensure that the startup process doesn't exit
  +    * Recovery while we are half way through a restartpoint. XXX ?
  +    */
  +   LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
 
 Is this comment correct? CheckpointLock cannot prevent the startup process
 from exiting recovery because the startup process doesn't acquire that lock.

The original patch acquired CheckpointLock during exitRecovery to prove
that a restartpoint was not in progress. It no longer does this, so I'm
not sure whether Heikki has found another way and the comment is simply
stale, or whether removing the lock request is a bug.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Simon Riggs

On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
 Hi,
 
 On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao masao.fu...@gmail.com wrote:
  I feel quite good about this patch now. Given the amount of code churn, it
  requires testing, and I'll read it through one more time after sleeping over
  it. Simon, do you see anything wrong with this?
 
  I also read this patch and found something odd. I apologize if I misread it.
 
 If archive recovery fails after it reaches the last valid record
 in the last, partially filled WAL segment, subsequent recovery might fail
 with the following fatal error. This is because minSafeStartPoint indicates
 the end of that partially filled segment, which subsequent recovery
 cannot reach. Is this a bug? (I'm not sure how to fix this problem,
 because I don't yet understand why minSafeStartPoint is required.)
 
  FATAL:  WAL ends before end time of backup dump

I think you're right. We need a couple of changes to avoid confusing
messages.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] Hot standby, recovery infra

2009-01-28 Thread Heikki Linnakangas

Simon Riggs wrote:

On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:

Though this is a matter of taste, I think that it's weird that the bgwriter
runs with ThisTimeLineID = 0 during recovery. This is because
XLogCtl->ThisTimeLineID is only set at the end of recovery. ISTM this will
become a source of bugs in the near future, though it's harmless currently.


It doesn't. That's exactly what InitXLogAccess() was for.


It does *during recovery*, before InitXLogAccess is called. Yeah, it's
harmless currently. It would be pretty hard to keep it up-to-date in the
bgwriter and other processes. I think it's better to keep it at 0, which
is clearly an invalid value, than to try to keep it up-to-date and risk
using a stale value. ThisTimeLineID is not used in many places currently.
It might be a good idea to add some Assert(ThisTimeLineID != 0) to the
places where it is used.
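
For instance, something along these lines (the call site is illustrative
only, not a concrete proposal):

    /* Hypothetical call site: anything that builds a WAL file name */
    Assert(ThisTimeLineID != 0);
    XLogFilePath(path, ThisTimeLineID, log, seg);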



+   /*
+    * XXX: Should we try to perform restartpoints if we're not in consistent
+    * recovery? The bgwriter isn't doing it for us in that case.
+    */

I think yes. This is helpful for a system where taking a base backup takes
a long time, i.e. where it also takes a long time to reach a consistent
recovery point.


The original patch did this.


Yeah, I took it out. ISTM if you restore from a base backup, you'd want
to run it until consistent recovery anyway. We can put it back in if
people feel it's helpful. There are two ways to do it: we can fire up the
bgwriter before reaching the consistent recovery point, or we can do the
restartpoints in the startup process before that point. I'm inclined to
fire up the bgwriter earlier. That way, the bgwriter remains responsible
for all checkpoints and restartpoints, which seems clearer. I can't see
any particular reason why bgwriter startup and reaching the consistent
recovery point need to be linked together.
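
To sketch what I mean by firing it up earlier (the names and the exact
condition here are made up for illustration, not taken from the patch):

    /*
     * In postmaster: start the bgwriter as soon as recovery begins,
     * instead of waiting for the consistent recovery point.
     */
    if (BgWriterPID == 0 && pmState == PM_RECOVERY)
        BgWriterPID = StartBackgroundWriter();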



+CreateRestartPoint(int flags)

snip

+    * We rely on this lock to ensure that the startup process doesn't exit
+    * Recovery while we are half way through a restartpoint. XXX ?
+    */
+   LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

Is this comment correct? CheckpointLock cannot prevent the startup process
from exiting recovery because the startup process doesn't acquire that lock.


The original patch acquired CheckpointLock during exitRecovery to prove
that a restartpoint was not in progress. It no longer does this, so I'm
not sure whether Heikki has found another way and the comment is simply
stale, or whether removing the lock request is a bug.


The comment is obsolete. There's no reason for the startup process to wait
for a restartpoint to finish. Looking back at the archives and the history
of that code, I got the impression that in a very early version of the
patch, the startup process still created a shutdown checkpoint after
recovery. To do that, it had to interrupt an ongoing restartpoint. That
was deemed too dangerous/weird, so it was changed to wait for the
restartpoint to finish instead. But since the startup process no longer
creates a shutdown checkpoint, the waiting became obsolete, right?


If there's a restartpoint in progress when we reach the end of recovery,
the startup process will call RequestCheckpoint() as usual. That will
cause the bgwriter to hurry the ongoing restartpoint, skipping the usual
delays, and to start the checkpoint as soon as the restartpoint is
finished.
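
In other words, roughly (the flag combination is illustrative):

    /*
     * End-of-recovery checkpoint request: immediate and forced, so the
     * bgwriter finishes any in-progress restartpoint without the usual
     * delays and then starts the checkpoint right away.
     */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);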


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
