Re: [HACKERS] Keepalive for max_standby_delay

2010-07-03 Thread Heikki Linnakangas

On 02/07/10 23:36, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

I haven't been able to wrap my head around why the delay should be
LESS in the archive case than in the streaming case.  Can you attempt
to hit me with the clue-by-four?


In the archive case, you're presumably trying to catch up, and so it
makes sense to kill queries faster so you can catch up.  The existing
code essentially forces instant kill when reading from archive, for any
reasonable value of max_standby_delay (because the archived timestamps
will probably be older than that).  That's overenthusiastic in my view,
but you can get that behavior if you want it with this patch by setting
max_standby_archive_delay to zero.  If you don't want it, you can use
something larger.  If you don't think that max_standby_archive_delay
should be less than max_standby_streaming_delay, you can set them the
same, too.


It would seem logical to use the same logic for archive recovery as we 
do for streaming replication, and only set XLogReceiptTime when you have 
to wait for a WAL segment to arrive into the archive, ie. when 
restore_command fails.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-07-03 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 It would seem logical to use the same logic for archive recovery as we 
 do for streaming replication, and only set XLogReceiptTime when you have 
 to wait for a WAL segment to arrive into the archive, ie. when 
 restore_command fails.

That would not do what you want at all in the case where you're
recovering from archive --- XLogReceiptTime would never advance
at all for the duration of the recovery.

It might be useful if you knew that it was a standby-with-log-shipping
situation, but we have no way to tell the difference.

The restore_command might know the difference, though.  Is it
worth modifying its API somehow so that the command could tell us
whether to advance XLogReceiptTime?  How would we do that?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-07-03 Thread Heikki Linnakangas

On 03/07/10 18:32, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

It would seem logical to use the same logic for archive recovery as we
do for streaming replication, and only set XLogReceiptTime when you have
to wait for a WAL segment to arrive into the archive, ie. when
restore_command fails.


That would not do what you want at all in the case where you're
recovering from archive --- XLogReceiptTime would never advance
at all for the duration of the recovery.


Do you mean when using something like pg_standby, which does the waiting 
itself?



It might be useful if you knew that it was a standby-with-log-shipping
situation, but we have no way to tell the difference.


With pg_standby etc. you use standby_mode=off. Same with traditional 
archive recovery. In standby mode, it's on.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-07-03 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 03/07/10 18:32, Tom Lane wrote:
 That would not do what you want at all in the case where you're
 recovering from archive --- XLogReceiptTime would never advance
 at all for the duration of the recovery.

 Do you mean when using something like pg_standby, which does the waiting 
 itself?

No, I'm thinking about recovery starting from an old base backup, or any
situation where you're trying to process a significant number of
archived WAL segments as quickly as possible.

 It might be useful if you knew that it was a standby-with-log-shipping
 situation, but we have no way to tell the difference.

 With pg_standby etc. you use standby_mode=off. Same with traditional 
 archive recovery. In standby mode, it's on.

Uh, no, that parameter is not what I'm talking about.  What I tried to
say is that if you're using log shipping for replication instead of
walsender/walreceiver, you might want to treat data as live even though
the database thinks it is being pulled from archive.  But we'd need
a way for the database to tell the cases apart ... right now it cannot.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-07-02 Thread Tom Lane
[ Apologies for the very slow turnaround on this --- I got hit with
another batch of non-postgres security issues this week. ]

Attached is a draft patch for revising the max_standby_delay behavior into
something I think is a bit saner.  There is some unfinished business:

* I haven't touched the documentation yet

* The code in xlog.c needs to be reverted to its former behavior so that
recoveryLastXTime means what it's supposed to mean, ie just the last
commit/abort timestamp.

However neither of these affects the testability of the patch.

Basically the way that it works is that the standby delay is computed with
reference to XLogReceiptTime rather than any timestamp obtained from WAL.
XLogReceiptTime is set to current time whenever we obtain a WAL segment
from the archives or when we begin processing a fresh batch of WAL from
walreceiver.  There's a subtlety in the streaming case: we don't advance
XLogReceiptTime if we are not caught up, that is if the startup process
is more than one flush cycle behind walreceiver.  In the normal case
we'll advance XLogReceiptTime on each flush cycle, but once we start
falling behind, it doesn't move so the grace time alloted to conflicting
queries begins to decrease.

I split max_standby_delay into two GUC variables, as previously
threatened: max_standby_archive_delay and max_standby_streaming_delay.
The former applies when processing data read from archive, the latter
when processing data received from walreceiver.  I think this is really
quite important given the new behavior, because max_standby_archive_delay
ought to be set with reference to the expected time to process one WAL
segment, whereas max_standby_streaming_delay doesn't depend on that value
at all.  I'm not sure what good defaults are for these values, so I left
them at 30 seconds for the moment.  I'm inclined to think
max_standby_archive_delay ought to be quite a bit less though.

It might be worth adding a minimum-grace-time limit as we previously
discussed, but this patch doesn't attempt to do that.

Comments?

regards, tom lane

Index: src/backend/access/transam/xlog.c
===
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.427
diff -c -r1.427 xlog.c
*** src/backend/access/transam/xlog.c	28 Jun 2010 19:46:19 -	1.427
--- src/backend/access/transam/xlog.c	2 Jul 2010 20:01:24 -
***
*** 72,78 
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
  bool		EnableHotStandby = false;
- int			MaxStandbyDelay = 30 * 1000;
  bool		fullPageWrites = true;
  bool		log_checkpoints = false;
  int			sync_method = DEFAULT_SYNC_METHOD;
--- 72,77 
***
*** 487,493 
   * Keeps track of which sources we've tried to read the current WAL
   * record from and failed.
   */
! static int failedSources = 0;
  
  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
--- 487,502 
   * Keeps track of which sources we've tried to read the current WAL
   * record from and failed.
   */
! static int failedSources = 0;	/* OR of XLOG_FROM_* codes */
! 
! /*
!  * These variables track when we last obtained some WAL data to process,
!  * and where we got it from.  (XLogReceiptSource is initially the same as
!  * readSource, but readSource gets reset to zero when we don't have data
!  * to process right now.)
!  */
! static TimestampTz XLogReceiptTime = 0;
! static int XLogReceiptSource = 0;	/* XLOG_FROM_* code */
  
  /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
***
*** 2626,2632 
   * Open a logfile segment for reading (during recovery).
   *
   * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
!  * If source = XLOG_FROM_PG_XLOG, it's read from pg_xlog.
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
--- 2635,2641 
   * Open a logfile segment for reading (during recovery).
   *
   * If source = XLOG_FROM_ARCHIVE, the segment is retrieved from archive.
!  * Otherwise, it's assumed to be already available in pg_xlog.
   */
  static int
  XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
***
*** 2655,2660 
--- 2664,2670 
  			break;
  
  		case XLOG_FROM_PG_XLOG:
+ 		case XLOG_FROM_STREAM:
  			XLogFilePath(path, tli, log, seg);
  			restoredFromArchive = false;
  			break;
***
*** 2674,2680 
--- 2684,2696 
   xlogfname);
  		set_ps_display(activitymsg, false);
  
+ 		/* Track source of data in assorted state variables */
  		readSource = source;
+ 		XLogReceiptSource = source;
+ 		/* In FROM_STREAM case, caller tracks receipt time, not me */
+ 		if (source != XLOG_FROM_STREAM)
+ 			XLogReceiptTime = GetCurrentTimestamp();
+ 
  		return fd;
  	}
  	if (errno != ENOENT || !notfoundOk) /* unexpected failure? */
***
*** 5568,5574 
  /*
 

Re: [HACKERS] Keepalive for max_standby_delay

2010-07-02 Thread Robert Haas
On Fri, Jul 2, 2010 at 4:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ Apologies for the very slow turnaround on this --- I got hit with
 another batch of non-postgres security issues this week. ]

 Attached is a draft patch for revising the max_standby_delay behavior into
 something I think is a bit saner.  There is some unfinished business:

 * I haven't touched the documentation yet

 * The code in xlog.c needs to be reverted to its former behavior so that
 recoveryLastXTime means what it's supposed to mean, ie just the last
 commit/abort timestamp.

 However neither of these affects the testability of the patch.

 Basically the way that it works is that the standby delay is computed with
 reference to XLogReceiptTime rather than any timestamp obtained from WAL.
 XLogReceiptTime is set to current time whenever we obtain a WAL segment
 from the archives or when we begin processing a fresh batch of WAL from
 walreceiver.  There's a subtlety in the streaming case: we don't advance
 XLogReceiptTime if we are not caught up, that is if the startup process
 is more than one flush cycle behind walreceiver.  In the normal case
 we'll advance XLogReceiptTime on each flush cycle, but once we start
 falling behind, it doesn't move so the grace time alloted to conflicting
 queries begins to decrease.

 I split max_standby_delay into two GUC variables, as previously
 threatened: max_standby_archive_delay and max_standby_streaming_delay.
 The former applies when processing data read from archive, the latter
 when processing data received from walreceiver.  I think this is really
 quite important given the new behavior, because max_standby_archive_delay
 ought to be set with reference to the expected time to process one WAL
 segment, whereas max_standby_streaming_delay doesn't depend on that value
 at all.  I'm not sure what good defaults are for these values, so I left
 them at 30 seconds for the moment.  I'm inclined to think
 max_standby_archive_delay ought to be quite a bit less though.

 It might be worth adding a minimum-grace-time limit as we previously
 discussed, but this patch doesn't attempt to do that.

 Comments?

I haven't been able to wrap my head around why the delay should be
LESS in the archive case than in the streaming case.  Can you attempt
to hit me with the clue-by-four?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-07-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I haven't been able to wrap my head around why the delay should be
 LESS in the archive case than in the streaming case.  Can you attempt
 to hit me with the clue-by-four?

In the archive case, you're presumably trying to catch up, and so it
makes sense to kill queries faster so you can catch up.  The existing
code essentially forces instant kill when reading from archive, for any
reasonable value of max_standby_delay (because the archived timestamps
will probably be older than that).  That's overenthusiastic in my view,
but you can get that behavior if you want it with this patch by setting
max_standby_archive_delay to zero.  If you don't want it, you can use
something larger.  If you don't think that max_standby_archive_delay
should be less than max_standby_streaming_delay, you can set them the
same, too.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-07-02 Thread Robert Haas
On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I haven't been able to wrap my head around why the delay should be
 LESS in the archive case than in the streaming case.  Can you attempt
 to hit me with the clue-by-four?

 In the archive case, you're presumably trying to catch up, and so it
 makes sense to kill queries faster so you can catch up.

On the flip side, the timeout for the WAL segment is for 16MB of WAL,
whereas the timeout for SR is normally going to be for a much smaller
chunk (right?).  So even with the same value for both, it seems like
queries will be killed more aggressively during archive recovery.

Even so, it seems useful to have both.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-07-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In the archive case, you're presumably trying to catch up, and so it
 makes sense to kill queries faster so you can catch up.

 On the flip side, the timeout for the WAL segment is for 16MB of WAL,
 whereas the timeout for SR is normally going to be for a much smaller
 chunk (right?).  So even with the same value for both, it seems like
 queries will be killed more aggressively during archive recovery.

True, but I suspect useful settings will be significantly larger than
those times anyway, so it kind of comes out in the wash.  For
walreceiver the expected time to process each new chunk is less than
wal_sender_delay (if it's not, you better get a faster slave machine).
For the archive case, it probably takes more than 200ms to process a
16MB segment, but still only a few seconds.  So if you have both the
max-delay parameters set to 30 seconds, the minimum normal grace
periods are 29.8 sec vs maybe 25 sec, not all that different.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-07-02 Thread Robert Haas
On Fri, Jul 2, 2010 at 4:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In the archive case, you're presumably trying to catch up, and so it
 makes sense to kill queries faster so you can catch up.

 On the flip side, the timeout for the WAL segment is for 16MB of WAL,
 whereas the timeout for SR is normally going to be for a much smaller
 chunk (right?).  So even with the same value for both, it seems like
 queries will be killed more aggressively during archive recovery.

 True, but I suspect useful settings will be significantly larger than
 those times anyway, so it kind of comes out in the wash.  For
 walreceiver the expected time to process each new chunk is less than
 wal_sender_delay (if it's not, you better get a faster slave machine).
 For the archive case, it probably takes more than 200ms to process a
 16MB segment, but still only a few seconds.  So if you have both the
 max-delay parameters set to 30 seconds, the minimum normal grace
 periods are 29.8 sec vs maybe 25 sec, not all that different.

I guess I was thinking more in terms of conflicts-per-WAL-record than
actual replay time.  If we assume that ever 100th WAL record produces
a conflict, SR will pause every once in a while, and then catch up
(either because it canceled some queries or because they finished
within the allowable grace period), whereas archive recovery will be
patient for a bit and then start nuking 'em from orbit.  Maybe my
mental model is all wrong though.

Anyway, I don't object to having two separate settings, and maybe
we'll know more about how to tune them after this gets out in the
field.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-30 Thread Simon Riggs
On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote:
  It will get done.  It is not the very first thing on my to-do list.
 
 ???  What is then?
 
 If it's not the first thing on your priority list, with 9.0 getting 
 later by the day, maybe we should leave it to Robert and Simon, who *do* 
 seem to have it first on *their* list?
 
 I swear, when Simon was keeping his branch to himself in August everyone 
 was on his case.  It sure seems like Tom is doing exactly the same thing.

Hmmm, yes, looks that way. At that time I was actively working on the
code, not just locking it to prevent other activity.

The only urgency on my part here was to fulfil my responsibility to the
project.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-30 Thread Bruce Momjian
Simon Riggs wrote:
 On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote:
   It will get done.  It is not the very first thing on my to-do list.
  
  ???  What is then?
  
  If it's not the first thing on your priority list, with 9.0 getting 
  later by the day, maybe we should leave it to Robert and Simon, who *do* 
  seem to have it first on *their* list?
  
  I swear, when Simon was keeping his branch to himself in August everyone 
  was on his case.  It sure seems like Tom is doing exactly the same thing.
 
 Hmmm, yes, looks that way. At that time I was actively working on the
 code, not just locking it to prevent other activity.
 
 The only urgency on my part here was to fulfil my responsibility to the
 project.

Simon, you have a very legitimate concern.  I phoned Tom and he is
planning to start working on the max_standby_delay tomorrow.  I am
unclear how it is different from your version, but I hope once Tom is
done we can review his work and decide how to proceed.  The fact that we
allowed Tom this huge amount of time to submit an alternative patch is
unusual and hopefully rare.

FYI, Tom and I are hoping to work through all the outstanding issues
before we package up 9.0 beta3 on Thursday, July 8.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-28 Thread Simon Riggs
On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Yes, I'll get with it ...
 
  Any update on this?
 
 Sorry, I've been a bit distracted by other responsibilities (libtiff
 security issues for Red Hat, if you must know).  I'll get on it shortly.

I don't think the PostgreSQL project should wait any longer on this. If
it does we risk loss of quality in final release, assuming no slippage.

From here, I will rework my patch of 31 May to
* use arrival time on standby as base for max_standby_delay
* make delay apply to both streaming and file cases
* min_standby_grace_period - min grace on every query, default 0

Decision time, so thoughts please?

-- 
 Simon Riggs   www.2ndQuadrant.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-28 Thread Robert Haas
On Mon, Jun 28, 2010 at 3:17 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Yes, I'll get with it ...

  Any update on this?

 Sorry, I've been a bit distracted by other responsibilities (libtiff
 security issues for Red Hat, if you must know).  I'll get on it shortly.

 I don't think the PostgreSQL project should wait any longer on this. If
 it does we risk loss of quality in final release, assuming no slippage.

I agree, and had actually been intending to post a similar message in
the next day or two.  We've been waiting for this for nearly a month.

 From here, I will rework my patch of 31 May to
 * use arrival time on standby as base for max_standby_delay

Assuming that by this you mean, in the case of SR, the time of receipt
of the current WAL chunk, and in the case of the archive, the time of
its acquisition from the archive, +1.

 * make delay apply to both streaming and file cases

+1.

 * min_standby_grace_period - min grace on every query, default 0

I could go either way on this.  +0.5, I guess.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-28 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
 Sorry, I've been a bit distracted by other responsibilities (libtiff
 security issues for Red Hat, if you must know).  I'll get on it shortly.

 I don't think the PostgreSQL project should wait any longer on this. If
 it does we risk loss of quality in final release, assuming no slippage.

It will get done.  It is not the very first thing on my to-do list.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-28 Thread Josh Berkus



It will get done.  It is not the very first thing on my to-do list.


???  What is then?

If it's not the first thing on your priority list, with 9.0 getting 
later by the day, maybe we should leave it to Robert and Simon, who *do* 
seem to have it first on *their* list?


I swear, when Simon was keeping his branch to himself in August everyone 
was on his case.  It sure seems like Tom is doing exactly the same thing.


--
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-28 Thread Robert Haas
On Mon, Jun 28, 2010 at 10:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
 Sorry, I've been a bit distracted by other responsibilities (libtiff
 security issues for Red Hat, if you must know).  I'll get on it shortly.

 I don't think the PostgreSQL project should wait any longer on this. If
 it does we risk loss of quality in final release, assuming no slippage.

 It will get done.  It is not the very first thing on my to-do list.

That's apparent.  On June 9th, you wrote Yes, I'll get with it ...;
on June 16th, you wrote I'll get on it shortly.  Two weeks later
you're backing off from shortly to eventually.  It is unfair to
the community to assert a vigorous opinion of how something should be
handled in the code and then refuse to commit to a time frame for
providing a patch.  It is even more unreasonable to commit to
providing a timely patch (twice) and then fail to do so.  We are
trying to finalize a release here, and you've made it clear you think
this code needs revision before then.  I respect your opinion, but not
your right to make the project release timetable dependent on your own
schedule, and not your right to shut other people out of working on
the issues you've raised.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-28 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 ... It is even more unreasonable to commit to
 providing a timely patch (twice) and then fail to do so.  We are
 trying to finalize a release here, and you've made it clear you think
 this code needs revision before then.  I respect your opinion, but not
 your right to make the project release timetable dependent on your own
 schedule, and not your right to shut other people out of working on
 the issues you've raised.

Since nobody has put forward a proposed beta3 release date, I don't feel
that I'm holding anything up.  In the meantime, I have many
responsibilities and am facing Red Hat internal deadlines.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-28 Thread Robert Haas
On Mon, Jun 28, 2010 at 2:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... It is even more unreasonable to commit to
 providing a timely patch (twice) and then fail to do so.  We are
 trying to finalize a release here, and you've made it clear you think
 this code needs revision before then.  I respect your opinion, but not
 your right to make the project release timetable dependent on your own
 schedule, and not your right to shut other people out of working on
 the issues you've raised.

 Since nobody has put forward a proposed beta3 release date, I don't feel
 that I'm holding anything up.  In the meantime, I have many
 responsibilities and am facing Red Hat internal deadlines.

See here, last paragraph:

http://archives.postgresql.org/pgsql-hackers/2010-06/msg01093.php

On a related note, this list is getting pretty darn short:

http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items

...partly because, in my desire to get another beta out, I have been
devoting a lot of time to clearing it out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-21 Thread Robert Haas
On Mon, Jun 21, 2010 at 12:20 AM, Ron Mayer
rm...@cheapcomplexdevices.com wrote:
 Robert Haas wrote:
 On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Sorry, I've been a bit distracted by other responsibilities (libtiff
 security issues for Red Hat, if you must know).  I'll get on it shortly.

 What?  You have other things to do besides hack on PostgreSQL?  Shocking!  
 :-)

 I suspect you're kidding, but in case some on the list didn't realize,
 Tom's probably as famous (if not moreso) in the image compression
 community as he is in the database community:

 http://www.jpeg.org/jpeg/index.html
 Probably the largest and most important contribution however was the work
  of the Independent JPEG Group (IJG), and Tom Lane in particular.

 http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/
 PNG (Portable Network Graphics) Specification
  Version 1.0
  ...
  Contributing Editor
  Tom Lane, t...@sss.pgh.pa.us

 http://www.fileformat.info/format/tiff/egff.htm
 ... by Dr. Tom Lane of the Independent JPEG Group, a member of the
  TIFF Advisory Committee

Yes, I was joking, hence the smiley.  I did know he was involved in
the above, although I confess I didn't know to what degree... or that
he had a doctorate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-20 Thread Ron Mayer
Robert Haas wrote:
 On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Sorry, I've been a bit distracted by other responsibilities (libtiff
 security issues for Red Hat, if you must know).  I'll get on it shortly.
 
 What?  You have other things to do besides hack on PostgreSQL?  Shocking!  :-)

I suspect you're kidding, but in case some on the list didn't realize,
Tom's probably as famous (if not moreso) in the image compression
community as he is in the database community:

http://www.jpeg.org/jpeg/index.html
Probably the largest and most important contribution however was the work
 of the Independent JPEG Group (IJG), and Tom Lane in particular.

http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/
PNG (Portable Network Graphics) Specification
 Version 1.0
 ...
 Contributing Editor
 Tom Lane, t...@sss.pgh.pa.us

http://www.fileformat.info/format/tiff/egff.htm
... by Dr. Tom Lane of the Independent JPEG Group, a member of the
 TIFF Advisory Committee

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-17 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yes, I'll get with it ...

 Any update on this?

Sorry, I've been a bit distracted by other responsibilities (libtiff
security issues for Red Hat, if you must know).  I'll get on it shortly.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-17 Thread Robert Haas
On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yes, I'll get with it ...

 Any update on this?

 Sorry, I've been a bit distracted by other responsibilities (libtiff
 security issues for Red Hat, if you must know).  I'll get on it shortly.

What?  You have other things to do besides hack on PostgreSQL?  Shocking!  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-16 Thread Robert Haas
On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
 I decided there wasn't time to get anything useful done on it before the
 beta2 deadline (which is, more or less, right now).  I will take another
 look over the next few days.

 We all really need you to fix up max_standby_delay, or, let me do it.

 Yes, I'll get with it ...

Tom,

Any update on this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-09 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
 I decided there wasn't time to get anything useful done on it before the
 beta2 deadline (which is, more or less, right now).  I will take another
 look over the next few days.

 We all really need you to fix up max_standby_delay, or, let me do it.

Yes, I'll get with it ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-09 Thread Simon Riggs
On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote:
  Are you planning to work on these things now as you said?
 
  Are you? Or do you want me to?
 
 I decided there wasn't time to get anything useful done on it before the
 beta2 deadline (which is, more or less, right now).  I will take another
 look over the next few days.

Esteemed colleague Tom,

We all really need you to fix up max_standby_delay, or, let me do it.

It appears we both agree that more testing would be beneficial. The only
way we are going to get that is by finishing this off and declaring a
new HS-beta.

Please let me know how you wish to proceed. (No answer means I do it.)

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Simon Riggs
On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote:
 the current situation that query grace periods go to zero

Possibly a better way to handle this concern is to make the second
parameter: min_standby_grace_period - the minimum time a query will be
given in which to execute, even if max_standby_delay has been reached or
exceeded.

Would that more directly address you concerns?

min_standby_grace_period (ms) SIGHUP 

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Fujii Masao
On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I don't understand why you want to use a different delay when you're
 restoring from archive vs. when you're streaming (what about existing WAL
 files found in pg_xlog, BTW?). The source of WAL shouldn't make a
 difference.

Yes. The pace of a recovery has nothing to do with that of log shipping.
So to hurry up a recovery when restoring from archive seems to be useless.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Simon Riggs
On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote:
 On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
  I don't understand why you want to use a different delay when you're
  restoring from archive vs. when you're streaming (what about existing WAL
  files found in pg_xlog, BTW?). The source of WAL shouldn't make a
  difference.
 
 Yes. The pace of a recovery has nothing to do with that of log shipping.
 So to hurry up a recovery when restoring from archive seems to be useless.

When streaming drops for some reason we revert to scanning the archive
for files. There is clearly two modes of operation. So it makes sense
that you might want to set different times for the parameter in each
case.

We might think of those modes as connected and degraded modes rather
than streaming and file shipping.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Fujii Masao
On Thu, Jun 3, 2010 at 6:07 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote:
 On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
  I don't understand why you want to use a different delay when you're
  restoring from archive vs. when you're streaming (what about existing WAL
  files found in pg_xlog, BTW?). The source of WAL shouldn't make a
  difference.

 Yes. The pace of a recovery has nothing to do with that of log shipping.
 So to hurry up a recovery when restoring from archive seems to be useless.

 When streaming drops for some reason we revert to scanning the archive
 for files. There is clearly two modes of operation.

Yes.

 So it makes sense
 that you might want to set different times for the parameter in each
 case.

What purpose would that serve?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Simon Riggs
On Thu, 2010-06-03 at 18:39 +0900, Fujii Masao wrote:

 What purpose would that serve?

Tom has already explained this and it makes sense for me.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Fujii Masao
On Thu, Jun 3, 2010 at 5:00 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I stand by my suggestion from yesterday: Let's define max_standby_delay
 as the difference between a piece of WAL becoming available in the
 standby, and applying it.

 My proposal is essentially the same as yours plus allowing the DBA to
 choose different max delays for the caught-up and not-caught-up cases.
 Maybe everybody will end up setting the two delays the same, but I think
 we don't have enough experience to decide that for them now.

Applying WAL that arrives via SR is not always the sign of the caught-up
or not-caught-up. OTOH, applying WAL restored from archive is not always
the sign of either of them. So isn't it nonsense to separate the delay in
order to control the behavior of a recovery for those cases?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Greg Stark
On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark gsst...@mit.edu writes:
 I was assuming the walreceiver only requests more wal in relatively
 small chunks and only when replay has caught up and needs more data. I
 haven't actually read this code so if that assumption is wrong then
 I'm off-base.

 It is off-base.  The receiver does not request data, the sender is
 what determines how much WAL is sent when.

Hm, so what happens if the slave blocks, doesn't the sender block when
the kernel buffers fill up?

 So I think this isn't necessarily such a blue moon event. As I
 understand it, all it would take is a single long-running report and a
 vacuum or HOT cleanup occurring on the master.

 I think this is mostly FUD too.  How often do you see vacuum blocked for
 an hour now?

No, that's not comparable. On the master vacuum will just ignore
tuples that are still visible to live transactions. On the slave it
doesn't have a choice, it sees the cleanup record and must pause
recovery until those transactions finish.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 Well, if the slave can't keep up, that's a separate problem.  It will
 not fail to keep up as a result of the transmission mechanism.

 No, I mean if the slave is paused due to a conflict. Does it stop
 reading data from the master or does it buffer it up on disk? If it
 stops reading it from the master then the effect is the same as if the
 slave stopped requesting data even if there's no actual request.

The data keeps coming in and getting dumped into the slave's pg_xlog.
walsender/walreceiver are not at all tied to the slave's application
of WAL.  In principle we could have the code around max_standby_delay
notice just how far behind it's gotten and adjust the delay tolerance
based on that; but I think designing a feedback loop for that is 9.1
material.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Greg Stark
On Thu, Jun 3, 2010 at 4:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The data keeps coming in and getting dumped into the slave's pg_xlog.
 walsender/walreceiver are not at all tied to the slave's application
 of WAL.  In principle we could have the code around max_standby_delay
 notice just how far behind it's gotten and adjust the delay tolerance
 based on that; but I think designing a feedback loop for that is 9.1
 material.

Er, no. In that case my first concern was misguided. I'm happy there's
no feedback loop -- my fear was that there was and it would mean the
time received could be decoupled from the time the wal was
generated. But as you describe it then the time received might be
slightly delayed from the time the wal was generated but to some
constant degree -- not in a way that will be influenced by the log
application being blocked on the slave.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote:
 This patch seems to me to be going in fundamentally the wrong direction.
 It's adding complexity and overhead (far more than is needed), and it's
 failing utterly to resolve the objections that I raised to start with.

 Having read your proposal, it seems changing from time-on-sender to
 time-on-receiver is a one line change to the patch.

 What else are you thinking of removing, if anything?

Basically, we need to get rid of everything that feeds timestamps from
the WAL content into the kill-delay logic.

 In particular, Simon seems to be basically refusing to do anything about
 the complaint that the code fails unless master and standby clocks are
 in close sync.  I do not believe that this is acceptable, and since he
 won't fix it, I guess I'll have to.

 Syncing two servers in replication is common practice, as has been
 explained here; I'm still surprised people think otherwise.

Doesn't affect the complaint in the least: I do not find it acceptable
to have that be *mandated* in order for our code to work sensibly.
I would be OK with having something approaching what you want as a
non-default optional behavior (with a clearly-documented dependency
on having synchronized clocks).  But in any case the current behavior is
still quite broken as regards reading stale timestamps from WAL.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote:
 the current situation that query grace periods go to zero

 Possibly a better way to handle this concern is to make the second
 parameter: min_standby_grace_period - the minimum time a query will be
 given in which to execute, even if max_standby_delay has been reached or
 exceeded.

 Would that more directly address you concerns?
 min_standby_grace_period (ms) SIGHUP 

A minimum grace period seems like a good idea to me, but I think it's
somewhat orthogonal to the core problem here.  I think we all
intuitively feel that there should be a way to dial back the grace
period when a slave is far behind on applying WAL.  The problem is
first how to figure out what far behind means, and second how to
adjust the grace period in a way that doesn't have surprising
misbehaviors.  A minimum grace period would prevent some of the very
worst misbehaviors but it's not really addressing the core problem.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Simon Riggs
On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote:
  This patch seems to me to be going in fundamentally the wrong direction.
  It's adding complexity and overhead (far more than is needed), and it's
  failing utterly to resolve the objections that I raised to start with.
 
  Having read your proposal, it seems changing from time-on-sender to
  time-on-receiver is a one line change to the patch.
 
  What else are you thinking of removing, if anything?
 
 Basically, we need to get rid of everything that feeds timestamps from
 the WAL content into the kill-delay logic.

I understand your wish to do this, though it isn't always accurate to do
that in the case where there is a backlog.

The patch already does that *mostly* for the streaming case. The only
time it does use the WAL content timestamp is when the WAL content
timestamp is later than the oldest receive time, so in that case it is
used as a correction. (see code comments and comments below also)

  In particular, Simon seems to be basically refusing to do anything about
  the complaint that the code fails unless master and standby clocks are
  in close sync.  I do not believe that this is acceptable, and since he
  won't fix it, I guess I'll have to.
 
  Syncing two servers in replication is common practice, as has been
  explained here; I'm still surprised people think otherwise.
 
 Doesn't affect the complaint in the least: I do not find it acceptable
 to have that be *mandated* in order for our code to work sensibly.

OK, accepted.

 I would be OK with having something approaching what you want as a
 non-default optional behavior (with a clearly-documented dependency
 on having synchronized clocks).  

Yes, that's what I'd been thinking also. So that gives us a way
forwards.

standby_delay_base = apply (default) | send

determines whether the standby_delay is calculated solely with reference
to the standby server (apply) or whether times from the sending server
are used (send). Use of send implies that the clocks on primary and
standby are synchronised to within a useful accuracy, in which case it
is usual to enforce that with time synchronisation software such as ntp.

 But in any case the current behavior is
 still quite broken as regards reading stale timestamps from WAL.

Agreed. That wasn't the objective of this patch or a priority.

If you're reading from an archive, you need to set max_standby_delay
appropriately, current recommendation is -1.

We can address that concern once the main streaming case is covered, or
we can add that now.

Are you planning to work on these things now as you said? How about I
apply my patch now, then you do another set of changes afterwards to add
the other items you mentioned, since that is now 2 additional parameters
and related code?

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote:
 But in any case the current behavior is
 still quite broken as regards reading stale timestamps from WAL.

 Agreed. That wasn't the objective of this patch or a priority.

 If you're reading from an archive, you need to set max_standby_delay
 appropriately, current recommendation is -1.

That's not a fix; the problem is that there *is no* appropriate setting
for max_standby_delay when you're reading from archive.  It is unlikely
that the values of the timestamps you're reading should be considered to
have any bearing on what grace period to allow; but nonetheless it's
desirable to be able to have a non-infinite grace time.

Basically what I think is that we want what you called apply semantics
always for reading from archive (and I still think the DBA should be
able to set that grace period separately from the one that applies in
SR operation).  Paying attention to the master's timestamps is only
reasonable in the SR context.

And yes, I want the dependency on WAL timestamps to be gone completely,
not just mostly.  I think that driving the delay logic off of SR receipt
time (and/or the timestamp we agreed to add to the SR protocol) is the
appropriate and sufficient way to deal with the problem.  Looking at the
embedded timestamps just confuses the feedback loop behavior.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Simon Riggs
On Thu, 2010-06-03 at 13:32 -0400, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote:
  But in any case the current behavior is
  still quite broken as regards reading stale timestamps from WAL.
 
  Agreed. That wasn't the objective of this patch or a priority.
 
  If you're reading from an archive, you need to set max_standby_delay
  appropriately, current recommendation is -1.
 
 That's not a fix; the problem is that there *is no* appropriate setting
 for max_standby_delay when you're reading from archive.  It is unlikely
 that the values of the timestamps you're reading should be considered to
 have any bearing on what grace period to allow; but nonetheless it's
 desirable to be able to have a non-infinite grace time.

I'm not saying that's a fix; I agree we should change that also.

 Basically what I think is that we want what you called apply semantics
 always for reading from archive (and I still think the DBA should be
 able to set that grace period separately from the one that applies in
 SR operation).  Paying attention to the master's timestamps is only
 reasonable in the SR context.

Agreed.

 And yes, I want the dependency on WAL timestamps to be gone completely,
 not just mostly.  I think that driving the delay logic off of SR receipt
 time (and/or the timestamp we agreed to add to the SR protocol) is the
 appropriate and sufficient way to deal with the problem.  Looking at the
 embedded timestamps just confuses the feedback loop behavior.

I disagree with sufficient, with good reason. Please look at the code
comments and see what will happen if we don't make the correction
suggested. If after that you still disagree, then do it your way and
we'll wait for comments in the beta; I think there will be some, but I
am tired and prone to agreement to allow this to go through. 

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Greg Stark
On Thu, Jun 3, 2010 at 4:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark gsst...@mit.edu writes:
 On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 It is off-base.  The receiver does not request data, the sender is
 what determines how much WAL is sent when.

 Hm, so what happens if the slave blocks, doesn't the sender block when
 the kernel buffers fill up?

 Well, if the slave can't keep up, that's a separate problem.  It will
 not fail to keep up as a result of the transmission mechanism.

No, I mean if the slave is paused due to a conflict. Does it stop
reading data from the master or does it buffer it up on disk? If it
stops reading it from the master then the effect is the same as if the
slave stopped requesting data even if there's no actual request.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 It is off-base.  The receiver does not request data, the sender is
 what determines how much WAL is sent when.

 Hm, so what happens if the slave blocks, doesn't the sender block when
 the kernel buffers fill up?

Well, if the slave can't keep up, that's a separate problem.  It will
not fail to keep up as a result of the transmission mechanism.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Simon Riggs
On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote:

 Are you planning to work on these things now as you said?

Are you? Or do you want me to?

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-03 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote:
 Are you planning to work on these things now as you said?

 Are you? Or do you want me to?

I decided there wasn't time to get anything useful done on it before the
beta2 deadline (which is, more or less, right now).  I will take another
look over the next few days.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Simon Riggs
On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote:

 Uh, we have three days before we package 9.0beta2.  It would be good if
 we could decide on the max_standby_delay issue soon.

I've heard something from Heikki, not from anyone else. Those comments
amount to lets replace max_standby_delay with max_apply_delay.

Got a beta idea?

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote:
 
 Uh, we have three days before we package 9.0beta2.  It would be
 good if we could decide on the max_standby_delay issue soon.
 
 I've heard something from Heikki, not from anyone else. Those
 comments amount to lets replace max_standby_delay with
 max_apply_delay.
 
 Got a beta idea?
 
Given the incessant ticking of the clock, I have a hard time
believing we have any real options besides max_standby_delay or a
boolean which corresponds to the -1 and 0 settings of
max_standby_delay.  I think it's pretty clear that there's a use
case for the positive values, although there are bound to be some
who try it and are surprised by behavior at transition from idle to
active.  The whole debate seems to boil down to how important a
middle ground is versus how damaging the surprise factor is.  (I
don't really buy the argument that we won't be able to remove it
later if we replace it with something better.)
 
I know there were initially some technical problems, too; have
those been resolved?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 OK, here's v4.

I've been trying to stay out of this thread, but with beta2 approaching
and no resolution in sight, I'm afraid I have to get involved.

This patch seems to me to be going in fundamentally the wrong direction.
It's adding complexity and overhead (far more than is needed), and it's
failing utterly to resolve the objections that I raised to start with.

In particular, Simon seems to be basically refusing to do anything about
the complaint that the code fails unless master and standby clocks are
in close sync.  I do not believe that this is acceptable, and since he
won't fix it, I guess I'll have to.

The other basic problem that I had with the current code, which this
does nothing about, is that it believes that timestamps pulled from WAL
archive should be treated on the same footing as timestamps of live
actions.  That might work in certain limited scenarios but it's going to
be a disaster in general.

I believe that the motivation for treating archived timestamps as live
is, essentially, to force rapid catchup if a slave falls behind so far
that it's reading from archive instead of SR.  There are certainly
use-cases where that's appropriate (though, again, not all); but even
when you do want it it's a pretty inflexible implementation.  For
realistic values of max_standby_delay the behavior is going to pretty
much be instant kill whenever we're reading from archive.

I have backed off my original suggestion that we should reduce
max_standby_delay to a boolean: that was based on the idea that delays
would only occur in response to DDL on the master, but since vacuum and
btree page splits can also trigger delays, it seems clear that a kill
queries immediately policy isn't really very useful in practice.  So we
need to make max_standby_delay work rather than just get rid of it.

What I think might be a realistic compromise is this:

1. Separate max_standby_delay into two GUCs, say max_streaming_delay
and max_archive_delay.

2. When applying WAL that came across SR, use max_streaming_delay and
let the time measurement be current time minus time of receipt of the
current WAL send chunk.

3. When applying WAL that came from archive, use max_archive_delay and
let the time measurement be current time minus time of acquisition of
the current WAL segment from the archive.

The current code's behavior in the latter case could effectively be
modeled by setting max_archive_delay to zero, but that isn't the only
plausible setting.  More likely DBAs would set max_archive_delay to
something smaller than max_streaming_delay, but still positive so as to
not kill conflicting queries instantly.

An important property of this design is that all relevant timestamps
are taken on the slave, so clock skew isn't an issue.

I'm still inclined to apply the part of Simon's patch that adds a
transmit timestamp to each SR send chunk.  That would actually be
completely unused by the slave given my proposal above, but I think that
it is an important step to take to future-proof the SR protocol against
possible changes in the slave-side timing logic.  I don't however see
the value of transmitting keepalive records when we'd otherwise not
have anything to send.  The value of adding timestamps to the SR
protocol is really to let the slave determine the current amount of
clock skew between it and the master; which is a number that should not
change so rapidly that it has to be updated every 100ms even in an idle
system.

Comments?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 An important property of this design is that all relevant timestamps
 are taken on the slave, so clock skew isn't an issue.

I agree that this is important, and I do run NTP on all my servers and
even monitor it using Nagios.

It's still not a cure-all for time skew issues.

 Comments?

I'm not really a huge fan of adding another GUC, to be honest.  I'm more
inclined to say we treat 'max_archive_delay' as '0', and turn
max_streaming_delay into what you've described.  If we fall back so far
that we have to go back to reading WALs, then we need to hurry up and
catch-up and damn the torpedos.  I'd also prefer that we only wait the
delay time once until we're fully caught up again (and have gotten
back around to waiting for new data).  

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Andrew Dunstan



Tom Lane wrote:

I'm still inclined to apply the part of Simon's patch that adds a
transmit timestamp to each SR send chunk.  That would actually be
completely unused by the slave given my proposal above, but I think that
it is an important step to take to future-proof the SR protocol against
possible changes in the slave-side timing logic.  
  


+1.

From a radically different perspective, I had to do something similar 
in the buildfarm years ago to protect us from machines reporting with 
grossly inaccurate timestamps. This was part of the solution. The client 
adds its current timestamp setting just before transmitting the data to 
the server.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 Comments?

 I'm not really a huge fan of adding another GUC, to be honest.  I'm more
 inclined to say we treat 'max_archive_delay' as '0', and turn
 max_streaming_delay into what you've described.  If we fall back so far
 that we have to go back to reading WALs, then we need to hurry up and
 catch-up and damn the torpedos.

If I thought that 0 were a generally acceptable value, I'd still be
pushing the simplify it to a boolean agenda ;-).  The problem is that
that will sometimes kill standby queries even when they are quite short
and doing nothing objectionable.

 I'd also prefer that we only wait the
 delay time once until we're fully caught up again (and have gotten
 back around to waiting for new data).  

The delays will be measured from a receipt instant to current time,
which means that the longer it takes to apply a WAL segment or WAL
send chunk, the less grace period there will be.  (Which is the
same as what CVS HEAD does --- I'm just arguing about where we get
the start time from.)  I believe this does what you suggest and more.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Simon Riggs
On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote:
 Stephen Frost sfr...@snowman.net writes:
  * Tom Lane (t...@sss.pgh.pa.us) wrote:
  Comments?
 
  I'm not really a huge fan of adding another GUC, to be honest.  I'm more
  inclined to say we treat 'max_archive_delay' as '0', and turn
  max_streaming_delay into what you've described.  If we fall back so far
  that we have to go back to reading WALs, then we need to hurry up and
  catch-up and damn the torpedos.
 
 If I thought that 0 were a generally acceptable value, I'd still be
 pushing the simplify it to a boolean agenda ;-).  The problem is that
 that will sometimes kill standby queries even when they are quite short
 and doing nothing objectionable.

OK, now I understand. I was just thinking the same as Stephen, but now I
agree we need a second parameter.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 2:03 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote:
 Stephen Frost sfr...@snowman.net writes:
  * Tom Lane (t...@sss.pgh.pa.us) wrote:
  Comments?

  I'm not really a huge fan of adding another GUC, to be honest.  I'm more
  inclined to say we treat 'max_archive_delay' as '0', and turn
  max_streaming_delay into what you've described.  If we fall back so far
  that we have to go back to reading WALs, then we need to hurry up and
  catch-up and damn the torpedos.

 If I thought that 0 were a generally acceptable value, I'd still be
 pushing the simplify it to a boolean agenda ;-).  The problem is that
 that will sometimes kill standby queries even when they are quite short
 and doing nothing objectionable.

 OK, now I understand. I was just thinking the same as Stephen, but now I
 agree we need a second parameter.

I too was thinking the same as Stephen, but now I also agree we need a
second parameter.  I think the whole design Tom proposed seems good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Greg Stark
On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 I believe that the motivation for treating archived timestamps as live
 is, essentially, to force rapid catchup if a slave falls behind so far
 that it's reading from archive instead of SR.


Huh, I think this is the first mention of this that I've seen. I
always assumed the motivation was just that you wanted to control how
much data loss could occur on failover and how long recovery would
take. I think separating the two delays is an interesting idea but I
don't see how it counts as a simplification.

This also still allows a slave to become arbitrarily far behind the
master. The master might take a long time to fill up a log file, and
if the slave is blocked for a few minutes, then requests the next bit
of log file and blocks for a few more minutes, then requests the next
log file and blocks for a few more minutes, etc. As long as the slave
is reading more slowly than the master is filling those segments it
will fall further and further behind but never be more than a few
minutes behind receiving the logs.

I propose an alternate way out of the problem of syncing two clocks.
Instead of comparing timestamps compare time intervals. So as it reads
xlog records it only ever compares the master timestamps with previous
master timestamps to determine how much time has elapsed on the
master. It compares that time elapsed with the time elapsed on the
slave to determine if it's falling behind. I'm assuming it
periodically catches up and can throw out its accumulated time elapsed
and start fresh. If not then the accumulated error might be a problem,
but hopefully one we can find a solution to.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Simon Riggs
On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote:

 This patch seems to me to be going in fundamentally the wrong direction.
 It's adding complexity and overhead (far more than is needed), and it's
 failing utterly to resolve the objections that I raised to start with.

Having read your proposal, it seems changing from time-on-sender to
time-on-receiver is a one line change to the patch.

What else are you thinking of removing, if anything?

Adding an extra parameter adds more obviously and is something I now
agree with.

 In particular, Simon seems to be basically refusing to do anything about
 the complaint that the code fails unless master and standby clocks are
 in close sync.  I do not believe that this is acceptable, and since he
 won't fix it, I guess I'll have to.

Syncing two servers in replication is common practice, as has been
explained here; I'm still surprised people think otherwise. Measuring
the time between two servers is the very purpose of the patch, so the
synchronisation is not a design flaw, it is its raison d'etre. There's
been a few spleens emptied on that topic, not all of them mine, and
certainly no consensus on that. So I'm not refusing to do anything
that's been agreed... 

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Robert Haas
On Wed, Jun 2, 2010 at 2:27 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Syncing two servers in replication is common practice, as has been
 explained here; I'm still surprised people think otherwise. Measuring
 the time between two servers is the very purpose of the patch, so the
 synchronisation is not a design flaw, it is its raison d'etre.

I think the purpose of the patch should not be to measure the time
difference between servers, but rather the replication delay.  While I
don't necessarily agree with Tom's statement that this is must-fix, I
do agree that it would be nicer if we could avoid depending on time
sync.  Yeah, I keep my servers time synced, too.  But, shit happens.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I believe that the motivation for treating archived timestamps as live
 is, essentially, to force rapid catchup if a slave falls behind so far
 that it's reading from archive instead of SR.

 Huh, I think this is the first mention of this that I've seen. I
 always assumed the motivation was just that you wanted to control how
 much data loss could occur on failover and how long recovery would
 take. I think separating the two delays is an interesting idea but I
 don't see how it counts as a simplification.

Well, it isn't a simplification: it's bringing it up to the minimum
complication level where it'll actually work sanely.  The current
implementation doesn't work sanely because it confuses stale timestamps
read from WAL with real live time.

 This also still allows a slave to become arbitrarily far behind the
 master.

Indeed, but nothing we do can prevent that, if the slave is just plain
slower than the master.  You have to assume that the slave is capable of
keeping up in the absence of query-caused delays, or you're hosed.

The real reason this is at issue is the fact that the max_standby_delay
kill mechanism applies to certain buffer-level locking operations.
On the master we just wait, and it's not a problem, because in practice
the conflicting queries almost always release these locks pretty quick.
On the slave, though, instant kill as a result of a buffer-level lock
conflict would result in a very serious degradation in standby query
reliability (while also doing practically nothing for the speed of WAL
application, most of the time).  This morning on the phone Bruce and I
were seriously discussing the idea of ripping the max_standby_delay
mechanism out of the buffer-level locking paths, and just let them work
like they do on the master, ie, wait forever.  If we did that then
simplifying max_standby_delay to a boolean would be reasonable again
(because it really would only come into play for DDL on the master).
The sticky point is that once in a blue moon you do have a conflicting
query sitting on a buffer lock for a long time, or even more likely a
series of queries keeping the WAL replay process from obtaining buffer
cleanup lock.

So it seems that we have to have max_standby_delay-like logic for those
locks, and also that zero grace period before kill isn't a very practical
setting.  However, there isn't a lot of point in obsessing over exactly
how long the grace period ought to be, as long as it's more than a few
milliseconds.  It *isn't* going to have any real effect on whether the
slave can stay caught up.  You could make a fairly decent case for just
measuring the grace period from when the replay process starts to wait,
as I think I proposed awhile back.  The value of measuring delay from a
receipt time is that if you do happen to have a bunch of delays within
a short interval you'll get more willing to kill queries --- but I
really believe that that is a corner case and will have nothing to do
with ordinary performance.

 I propose an alternate way out of the problem of syncing two clocks.
 Instead of comparing timestamps compare time intervals. So as it reads
 xlog records it only ever compares the master timestamps with previous
 master timestamps to determine how much time has elapsed on the
 master. It compares that time elapsed with the time elapsed on the
 slave to determine if it's falling behind.

I think this would just add complexity and uncertainty, to deal with
something that won't be much of a problem in practice.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Heikki Linnakangas

On 02/06/10 20:14, Tom Lane wrote:

 For realistic values of max_standby_delay ...


Hang on right there. What do you consider a realistic value for 
max_standby_delay? Because I'm not sure I have a grip on that myself. 5 
seconds? 5 minutes? 5 hours? I can see use cases for all of those...



What I think might be a realistic compromise is this:

1. Separate max_standby_delay into two GUCs, say max_streaming_delay
and max_archive_delay.

2. When applying WAL that came across SR, use max_streaming_delay and
let the time measurement be current time minus time of receipt of the
current WAL send chunk.

3. When applying WAL that came from archive, use max_archive_delay and
let the time measurement be current time minus time of acquisition of
the current WAL segment from the archive.

The current code's behavior in the latter case could effectively be
modeled by setting max_archive_delay to zero, but that isn't the only
plausible setting.  More likely DBAs would set max_archive_delay to
something smaller than max_streaming_delay, but still positive so as to
not kill conflicting queries instantly.


The problem with defining max_archive_delay that way is again that you 
can fall behind indefinitely. If you set it to 5 minutes, it means that 
you'll wait a maximum of 5 minutes *per WAL segment*, even if WAL is 
being generated faster.


I don't understand why you want to use a different delay when you're 
restoring from archive vs. when you're streaming (what about existing 
WAL files found in pg_xlog, BTW?). The source of WAL shouldn't make a 
difference. If it's because you assume that restoring from archive is a 
sign that you've fallen behind a lot, surely you've exceeded 
max_standby_delay then and I still don't see a need for a separate GUC.


I stand by my suggestion from yesterday: Let's define max_standby_delay 
as the difference between a piece of WAL becoming available in the 
standby, and applying it. To approximate piece of WAL becoming 
available for SR, we can use the mechanism with send/applyChunks from 
Simon's latest patch, or go with the simpler scheme of just resetting a 
last caughtup timestamp to current time whenever we have to wait for 
new WAL to arrive. When restoring from archive, likewise reset last 
caughtup timestamp whenever restore_command returns non-0, i.e we have 
to wait for the next WAL file to arrive.


That works the same for both SR and file-based log shipping, there's 
only one knob to set, is simple to implement and doesn't require 
synchronized clocks.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 The problem with defining max_archive_delay that way is again that you 
 can fall behind indefinitely.

See my response to Greg Stark --- I don't think this is really an
issue.  It's certainly far less of an issue than the current situation
that query grace periods go to zero under not-at-all-extreme conditions.

 I don't understand why you want to use a different delay when you're 
 restoring from archive vs. when you're streaming (what about existing 
 WAL files found in pg_xlog, BTW?).

You're missing the point.  I want the DBA to be able to control what
happens in those two cases.  In the current implementation he has no
control over what happens while restoring from archive: it's going to
effectively act like max_archive_delay = 0 all the time.  If you're of
the opinion that's good, you can set the parameter that way and be
happy.  I'm of the opinion you'll soon find out it isn't so good,
because it'll kill standby queries too easily.

 I stand by my suggestion from yesterday: Let's define max_standby_delay 
 as the difference between a piece of WAL becoming available in the 
 standby, and applying it.

My proposal is essentially the same as yours plus allowing the DBA to
choose different max delays for the caught-up and not-caught-up cases.
Maybe everybody will end up setting the two delays the same, but I think
we don't have enough experience to decide that for them now.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Greg Stark
On Wed, Jun 2, 2010 at 8:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Indeed, but nothing we do can prevent that, if the slave is just plain
 slower than the master.  You have to assume that the slave is capable of
 keeping up in the absence of query-caused delays, or you're hosed.

I was assuming the walreceiver only requests more wal in relatively
small chunks and only when replay has caught up and needs more data. I
haven't actually read this code so if that assumption is wrong then
I'm off-base. But if my assumption is right then it's not merely the
master running faster than the slave that can cause you to fall
arbitrarily far behind. The receive time is delayed by how long the
slave waited on a previous lock, so if we wait for a few minutes, then
proceed and find we need more wal data we'll read data from the master
that it could have generated those few minutes ago.


 The sticky point is that once in a blue moon you do have a conflicting
 query sitting on a buffer lock for a long time, or even more likely a
 series of queries keeping the WAL replay process from obtaining buffer
 cleanup lock.

So I think this isn't necessarily such a blue moon event. As I
understand it, all it would take is a single long-running report and a
vacuum or HOT cleanup occurring on the master. If I want to set
max_standby_delay to 60min to allow reports to run for up to an hour
then any random HOT cleanup on the master will propagate to the slave
and cause a WAL stall until the transactions which have that xid in
their snapshot end.

Even the buffer locks are could potentially be blocked for a long time
if you happen to run the right kind of query (say a nested loop with
the buffer in question on the outside and a large table on the
inside). That's a rarer event though; is that what you were thinking
of?


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 I was assuming the walreceiver only requests more wal in relatively
 small chunks and only when replay has caught up and needs more data. I
 haven't actually read this code so if that assumption is wrong then
 I'm off-base.

It is off-base.  The receiver does not request data, the sender is
what determines how much WAL is sent when.

 So I think this isn't necessarily such a blue moon event. As I
 understand it, all it would take is a single long-running report and a
 vacuum or HOT cleanup occurring on the master.

I think this is mostly FUD too.  How often do you see vacuum blocked for
an hour now?  It probably can happen, which is why we need to be able to
kick queries off the locks with max_standby_delay, but it's far from
common.  What we're concerned about here is how long a buffer lock on a
single page is held, not how long heavyweight locks are held.  The
normal hold times are measured in microseconds.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-02 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Greg Stark gsst...@mit.edu writes:
  So I think this isn't necessarily such a blue moon event. As I
  understand it, all it would take is a single long-running report and a
  vacuum or HOT cleanup occurring on the master.
 
 I think this is mostly FUD too.  How often do you see vacuum blocked for
 an hour now?  It probably can happen, which is why we need to be able to
 kick queries off the locks with max_standby_delay, but it's far from
 common.  What we're concerned about here is how long a buffer lock on a
 single page is held, not how long heavyweight locks are held.  The
 normal hold times are measured in microseconds.

Maybe I'm missing something, but I think Greg's point was that if you
have a long-running query running against the standby/slave/whatever,
which is holding locks on various relations to implement that report,
and then a vacuum or HOT update happens on the master, the long-running
report query will get killed off unless you bump max_streaming_delay up
pretty high (eg: 60 mins).

That being said, I'm not sure that there's really another solution.
Yes, in this case, the slave can end up being an hour behind, but that's
the trade-off you have to make if you want to run an hour-long query on
the slave.  The other answer is to make the master not update those
tuples, etc, which might be possible by starting a transaction on the
master which grabs things enough to prevent the vacuum/hot/etc update
from happening.  That may be possible manually, but it's not fun and it
certainly isn't something we'll have built-in support for in 9.0.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-01 Thread Heikki Linnakangas

On 27/05/10 20:26, Simon Riggs wrote:

On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote:

Just this second posted about that, as it turns out.

I have a v3 *almost* ready of the keepalive patch. It still makes sense
to me after a few days reflection, so is worth discussion and review. In
or out, I want this settled within a week. Definitely need some RR
here.


Does the keepalive fix all the issues with max_standby_delay?  Tom?


OK, here's v4.

Summary

* WALSender adds a timestamp onto the header of every WAL chunk sent.

* Each WAL record now has a conceptual send timestamp that remains
constant while that record is replayed. This is used as the basis from
which max_standby_delay is calculated when required during replay.

* Send timestamp is calculated as the later of the timestamp of chunk in
which WAL record was sent and the latest XLog time.

* WALSender sends an empty message as a keepalive when nothing else to
send. (No longer a special message type for the keepalive).

I think its close, but if there's a gaping hole here somewhere then I'll
punt for this release.


This certainly alleviates some of the problems. You still need to ensure 
that master and standby have synchronized clocks, and you still get zero 
grace time after a long period of inactivity when not using streaming 
replication, however.


Sending a keep-alive message every 100ms seems overly aggressive to me.


If we really want to try to salvage max_standby_delay with a meaning 
similar to what it has now, I think we should go with the idea some 
people bashed around earlier and define the grace period as the 
difference between a WAL record becoming available to the standby for 
replay, and between replaying it. An approximation of that is to do 
lastIdle = gettimeofday() in XLogPageRead() whenever it needs to wait 
for new WAL to arrive, whether that's via streaming replication or by a 
success return code from restore_command, and compare the difference of 
that with current timestamp in WaitExceedsMaxStandbyDelay().


That's very simple, doesn't require synchronized clocks, and works the 
same with file- and stream-based setups.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-06-01 Thread Simon Riggs
Thanks for the review.

On Tue, 2010-06-01 at 13:36 +0300, Heikki Linnakangas wrote:

 If we really want to try to salvage max_standby_delay with a meaning 
 similar to what it has now, I think we should go with the idea some 
 people bashed around earlier and define the grace period as the 
 difference between a WAL record becoming available to the standby for 
 replay, and between replaying it. An approximation of that is to do 
 lastIdle = gettimeofday() in XLogPageRead() whenever it needs to wait 
 for new WAL to arrive, whether that's via streaming replication or by a 
 success return code from restore_command, and compare the difference of 
 that with current timestamp in WaitExceedsMaxStandbyDelay().

That wouldn't cope with a continuous stream of records arriving, unless
you also include the second half of the patch.

 That's very simple, doesn't require synchronized clocks, and works the 
 same with file- and stream-based setups.

Nor does it provide a mechanism for monitoring of SR. standby_delay is
explicitly defined in terms of the gap between two servers, so is a
useful real world concept. apply_delay is somewhat less interesting.

I'm sure most people would rather have monitoring and therefore the
requirement for synchronised-ish clocks, than no monitoring. If you
think no monitoring is OK, I don't, but there are other ways, so its not
a point to fight about.

 This certainly alleviates some of the problems. You still need to ensure
 that master and standby have synchronized clocks, and you still get zero 
 grace time after a long period of inactivity when not using streaming 
 replication, however.

Second issue can be added once we approve the rest of this if you like.

 Sending a keep-alive message every 100ms seems overly aggressive to me.

It's sent every wal_sender_delay. Why is that a negative?

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-31 Thread Bruce Momjian

Uh, we have three days before we package 9.0beta2.  It would be good if
we could decide on the max_standby_delay issue soon.

---

Simon Riggs wrote:
 On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote:
   Just this second posted about that, as it turns out.
   
   I have a v3 *almost* ready of the keepalive patch. It still makes sense
   to me after a few days reflection, so is worth discussion and review. In
   or out, I want this settled within a week. Definitely need some RR
   here.
  
  Does the keepalive fix all the issues with max_standby_delay?  Tom?
 
 OK, here's v4.
 
 Summary
 
 * WALSender adds a timestamp onto the header of every WAL chunk sent.
 
 * Each WAL record now has a conceptual send timestamp that remains
 constant while that record is replayed. This is used as the basis from
 which max_standby_delay is calculated when required during replay.
 
 * Send timestamp is calculated as the later of the timestamp of chunk in
 which WAL record was sent and the latest XLog time.
 
 * WALSender sends an empty message as a keepalive when nothing else to
 send. (No longer a special message type for the keepalive).
 
 I think its close, but if there's a gaping hole here somewhere then I'll
 punt for this release.
 
 -- 
  Simon Riggs   www.2ndQuadrant.com

[ Attachment, skipping... ]

 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-27 Thread Simon Riggs
On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote:
  Just this second posted about that, as it turns out.
  
  I have a v3 *almost* ready of the keepalive patch. It still makes sense
  to me after a few days reflection, so is worth discussion and review. In
  or out, I want this settled within a week. Definitely need some RR
  here.
 
 Does the keepalive fix all the issues with max_standby_delay?  Tom?

OK, here's v4.

Summary

* WALSender adds a timestamp onto the header of every WAL chunk sent.

* Each WAL record now has a conceptual send timestamp that remains
constant while that record is replayed. This is used as the basis from
which max_standby_delay is calculated when required during replay.

* Send timestamp is calculated as the later of the timestamp of chunk in
which WAL record was sent and the latest XLog time.

* WALSender sends an empty message as a keepalive when nothing else to
send. (No longer a special message type for the keepalive).

I think its close, but if there's a gaping hole here somewhere then I'll
punt for this release.

-- 
 Simon Riggs   www.2ndQuadrant.com
*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
***
*** 4222,4247  The commands accepted in walsender mode are:
/varlistentry
varlistentry
term
!   Bytereplaceablen/replaceable
/term
listitem
para
!   Data that forms part of WAL data stream.
/para
/listitem
/varlistentry
!   /variablelist
/para
/listitem
/varlistentry
/variablelist
!  /para
!  para
 A single WAL record is never split across two CopyData messages. When
 a WAL record crosses a WAL page boundary, however, and is therefore
 already split using continuation records, it can be split at the page
 boundary. In other words, the first main WAL record and its
 continuation records can be split across different CopyData messages.
   /para
  /listitem
/varlistentry
--- 4222,4257 
/varlistentry
varlistentry
term
!   Bytereplaceable8/replaceable
/term
listitem
para
!   Message timestamp.
/para
/listitem
/varlistentry
!   varlistentry
!   term
!   Bytereplaceablen/replaceable
!   /term
!   listitem
!   para
!   Data that forms part of WAL data stream. (May be zero length).
/para
/listitem
/varlistentry
/variablelist
!   /para
!   para
 A single WAL record is never split across two CopyData messages. When
 a WAL record crosses a WAL page boundary, however, and is therefore
 already split using continuation records, it can be split at the page
 boundary. In other words, the first main WAL record and its
 continuation records can be split across different CopyData messages.
+   /para
+   /listitem
+   /varlistentry
+   /variablelist
   /para
  /listitem
/varlistentry
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 1938,1944  UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
  			UpdateControlFile();
  			minRecoveryPoint = newMinRecoveryPoint;
  
! 			ereport(DEBUG2,
  	(errmsg(updated min recovery point to %X/%X,
  		minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff)));
  		}
--- 1938,1944 
  			UpdateControlFile();
  			minRecoveryPoint = newMinRecoveryPoint;
  
! 			ereport(DEBUG3,
  	(errmsg(updated min recovery point to %X/%X,
  		minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff)));
  		}
***
*** 9210,9218  retry:
  {
  	/*
  	 * While walreceiver is active, wait for new WAL to arrive
! 	 * from primary.
  	 */
! 	receivedUpto = GetWalRcvWriteRecPtr();
  	if (XLByteLT(*RecPtr, receivedUpto))
  	{
  		/*
--- 9210,9218 
  {
  	/*
  	 * While walreceiver is active, wait for new WAL to arrive
! 	 * from primary. Get next applychunk and do other bookkeeping.
  	 */
! 	receivedUpto = GetWalRcvNextApplyChunk();
  	if (XLByteLT(*RecPtr, receivedUpto))
  	{
  		/*
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***
*** 394,410  XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
  		case 'w':/* WAL records */
  			{
  XLogRecPtr	recptr;
  
! if (len  sizeof(XLogRecPtr))
  	ereport(ERROR,
  			(errcode(ERRCODE_PROTOCOL_VIOLATION),
  			 errmsg_internal(invalid WAL message received from primary)));
  
  memcpy(recptr, buf, sizeof(XLogRecPtr));
  buf += sizeof(XLogRecPtr);
  len -= sizeof(XLogRecPtr);
  
! XLogWalRcvWrite(buf, len, recptr);
  break;
  			}
  		default:
--- 394,427 
  		case 'w':/* WAL records */
  			{
  XLogRecPtr	recptr;

Re: [HACKERS] Keepalive for max_standby_delay

2010-05-26 Thread Heikki Linnakangas

On 19/05/10 00:37, Simon Riggs wrote:

On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote:

On 18/05/10 17:17, Simon Riggs wrote:

There's no reason that the buffer size we use for XLogRead() should be
the same as the send buffer, if you're worried about that. My point is
that pq_putmessage contains internal flushes so at the libpq level you
gain nothing by big batches. The read() will be buffered anyway with
readahead so not sure what the issue is. We'll have to do this for sync
rep anyway, so what's the big deal? Just do it now, once. Do we really
want 9.1 code to differ here?


Do what? What exactly is it that you want instead, then?


Read and write smaller messages, so the latency is minimised. Libpq will
send in 8192 byte packets, so writing anything larger gains nothing when
the WAL data is also chunked at exactly the same size.


Committed with chunk size of 128 kB. I hope that's a reasonable 
compromise, in the absence of any performance test data either way.


I'm weary of setting it as low as 8k, as there is some per-message 
overhead. Some of that could be avoided by rearranging the loops so that 
the ps display is not updated at every message as you suggested, but I 
don't feel doing any extra rearrangements at this point. It would not be 
hard, but it also certainly wouldn't make the code simpler.


I believe in practice 128kB is just as good as 8k from the 
responsiveness point of view. If a standby is not responding, walsender 
will be stuck trying to send no matter what the block size is. If it 
responding, no matter how slowly, 128kB should get transferred pretty 
quickly.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-26 Thread Josh Berkus

 Committed with chunk size of 128 kB. I hope that's a reasonable
 compromise, in the absence of any performance test data either way.

So where are we with max_standby_delay?  Status check?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-26 Thread Simon Riggs
On Thu, 2010-05-27 at 01:34 +0300, Heikki Linnakangas wrote:

 Committed with chunk size of 128 kB.

Thanks. I'm sure that's fine.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-26 Thread Simon Riggs
On Sun, 2010-05-16 at 17:11 +0100, Simon Riggs wrote:

 New version, with some other cleanup of wait processing.
 
 New logic is that when Startup asks for next applychunk of WAL it saves
 the lastChunkTimestamp. That is then the base time used by
 WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later.
 Since multiple receivechunks can arrive from primary before Startup asks
 for next applychunk we use the oldest receivechunk timestamp, not the
 latest. Doing it this way means the lastChunkTimestamp doesn't change
 when new keepalives arrive, so we have a stable and reasonably accurate
 recordSendTimestamp for each WAL record.
 
 The keepalive is sent as the first part of a new message, if any. So
 partial chunks of data always have an accurate timestamp, even if that
 is slightly older as a result. Doesn't make much difference except with
 very large chunks.
 
 I think that addresses the points raised on this thread and others.

Was about to post v3 after your last commit, but just found a minor bug
in my v2-v3 changes, even though they were fairly light. Too tired to
fix now. The general thinking underlying this patch is still the same
though and is worth discussing over next few days.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-26 Thread Simon Riggs
On Wed, 2010-05-26 at 15:45 -0700, Josh Berkus wrote:
  Committed with chunk size of 128 kB. I hope that's a reasonable
  compromise, in the absence of any performance test data either way.
 
 So where are we with max_standby_delay?  Status check?

Just this second posted about that, as it turns out.

I have a v3 *almost* ready of the keepalive patch. It still makes sense
to me after a few days reflection, so is worth discussion and review. In
or out, I want this settled within a week. Definitely need some RR
here.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-26 Thread Josh Berkus

 Just this second posted about that, as it turns out.
 
 I have a v3 *almost* ready of the keepalive patch. It still makes sense
 to me after a few days reflection, so is worth discussion and review. In
 or out, I want this settled within a week. Definitely need some RR
 here.

Does the keepalive fix all the issues with max_standby_delay?  Tom?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-18 Thread Heikki Linnakangas

On 17/05/10 04:40, Simon Riggs wrote:

On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:


Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.

Comments


8MB at a time still seems like a large batch to me.

libpq is going to send it in smaller chunks anyway, so I don't see the
importance of trying to keep the batch too large. It just introduces
delay into the sending process. We should be sending chunks that matches
libpq better.


More to the point the logic will fail if XLOG_BLCKSZ  PQ_BUFFER_SIZE
because it will send partial pages.


I don't see a failure. We rely on not splitting WAL records across 
messages, but we're talking about libpq-level CopyData messages, not TCP 
messages.



Having MAX_SEND_SIZE  PQ_BUFFER_SIZE is pointless, as libpq currently
stands.


Well, it does affect the size of the read() in walsender, and I'm sure 
there's some overhead in setting the ps display and the other small 
stuff we do once per message. But you're probably right that we could 
easily make MAX_SEND_SIZE much smaller with no noticeable affect on 
performance, while making walsender more responsive to signals. I'll 
decrease it to, say, 512 kB.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-18 Thread Heikki Linnakangas

On 17/05/10 12:36, Jim Nasby wrote:

On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:

What exactly is the user trying to monitor? If it's how far behind is
the standby, the difference between pg_current_xlog_insert_location()
in the master and pg_last_xlog_replay_location() in the standby seems
more robust and well-defined to me. It's a measure of XLOG location (ie.
bytes) instead of time, but time is a complicated concept.


I can tell you that end users *will* want a time-based indication of how far behind we 
are. DBAs will understand we're this many transactions behind, but managers 
and end users won't. Unless it's unreasonable to provide that info, we should do so.


No doubt about that, the problem is that it's hard to provide a reliable 
time-based indication.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-18 Thread Simon Riggs
On Tue, 2010-05-18 at 17:06 -0400, Heikki Linnakangas wrote:
 On 17/05/10 04:40, Simon Riggs wrote:
  On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
 
  Attached patch rearranges the walsender loops slightly to fix the above.
  XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
  2) in one round and returns to the main loop after that even if there's
  unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
  That way the main loop gets to respond to signals quickly, and we also
  get to update the shared memory status and PS display more often when
  there's a lot of catching up to do.
 
  Comments
 
  8MB at a time still seems like a large batch to me.
 
  libpq is going to send it in smaller chunks anyway, so I don't see the
  importance of trying to keep the batch too large. It just introduces
  delay into the sending process. We should be sending chunks that matches
  libpq better.
 
  More to the point the logic will fail if XLOG_BLCKSZ  PQ_BUFFER_SIZE
  because it will send partial pages.
 
 I don't see a failure. We rely on not splitting WAL records across 
 messages, but we're talking about libpq-level CopyData messages, not TCP 
 messages.

OK

  Having MAX_SEND_SIZE  PQ_BUFFER_SIZE is pointless, as libpq currently
  stands.
 
 Well, it does affect the size of the read() in walsender, and I'm sure 
 there's some overhead in setting the ps display and the other small 
 stuff we do once per message. But you're probably right that we could 
 easily make MAX_SEND_SIZE much smaller with no noticeable affect on 
 performance, while making walsender more responsive to signals. I'll 
 decrease it to, say, 512 kB.

I'm pretty certain we don't need to set the ps display once per message.
ps doesn't need an update 5 times per second on average.

There's no reason that the buffer size we use for XLogRead() should be
the same as the send buffer, if you're worried about that. My point is
that pq_putmessage contains internal flushes so at the libpq level you
gain nothing by big batches. The read() will be buffered anyway with
readahead so not sure what the issue is. We'll have to do this for sync
rep anyway, so what's the big deal? Just do it now, once. Do we really
want 9.1 code to differ here?

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-18 Thread Simon Riggs
On Tue, 2010-05-18 at 17:08 -0400, Heikki Linnakangas wrote:
 On 17/05/10 12:36, Jim Nasby wrote:
  On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
  What exactly is the user trying to monitor? If it's how far behind is
  the standby, the difference between pg_current_xlog_insert_location()
  in the master and pg_last_xlog_replay_location() in the standby seems
  more robust and well-defined to me. It's a measure of XLOG location (ie.
  bytes) instead of time, but time is a complicated concept.
 
  I can tell you that end users *will* want a time-based indication of how 
  far behind we are. DBAs will understand we're this many transactions 
  behind, but managers and end users won't. Unless it's unreasonable to 
  provide that info, we should do so.
 
 No doubt about that, the problem is that it's hard to provide a reliable 
 time-based indication.

I think I have one now.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-18 Thread Heikki Linnakangas

On 18/05/10 17:17, Simon Riggs wrote:

There's no reason that the buffer size we use for XLogRead() should be
the same as the send buffer, if you're worried about that. My point is
that pq_putmessage contains internal flushes so at the libpq level you
gain nothing by big batches. The read() will be buffered anyway with
readahead so not sure what the issue is. We'll have to do this for sync
rep anyway, so what's the big deal? Just do it now, once. Do we really
want 9.1 code to differ here?


Do what? What exactly is it that you want instead, then?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-18 Thread Simon Riggs
On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote:
 On 18/05/10 17:17, Simon Riggs wrote:
  There's no reason that the buffer size we use for XLogRead() should be
  the same as the send buffer, if you're worried about that. My point is
  that pq_putmessage contains internal flushes so at the libpq level you
  gain nothing by big batches. The read() will be buffered anyway with
  readahead so not sure what the issue is. We'll have to do this for sync
  rep anyway, so what's the big deal? Just do it now, once. Do we really
  want 9.1 code to differ here?
 
 Do what? What exactly is it that you want instead, then?

Read and write smaller messages, so the latency is minimised. Libpq will
send in 8192 byte packets, so writing anything larger gains nothing when
the WAL data is also chunked at exactly the same size.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-17 Thread Simon Riggs
On Mon, 2010-05-17 at 11:51 +0900, Fujii Masao wrote:

 Is it OK that this keepalive message cannot be used by HS in file-based
 log-shipping? Even in SR, the startup process cannot use the keepalive
 until walreceiver has been started up.

Yes, I see those things. 

We already have archive_timeout to handle the keepalive case in
file-based. 

When starting up the delay is high anyway, so doesn't really matter
about accuracy - though we do use latestXLogTime in that cases.

 WalSndKeepAlive() always calls initStringInfo(), which seems to cause
 memory-leak.

Thanks.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-17 Thread Simon Riggs
On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
  
  Attached patch rearranges the walsender loops slightly to fix the above.
  XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
  2) in one round and returns to the main loop after that even if there's
  unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
  That way the main loop gets to respond to signals quickly, and we also
  get to update the shared memory status and PS display more often when
  there's a lot of catching up to do.
  
  Comments
 
 8MB at a time still seems like a large batch to me.
 
 libpq is going to send it in smaller chunks anyway, so I don't see the
 importance of trying to keep the batch too large. It just introduces
 delay into the sending process. We should be sending chunks that matches
 libpq better.

More to the point the logic will fail if XLOG_BLCKSZ  PQ_BUFFER_SIZE
because it will send partial pages.

Having MAX_SEND_SIZE  PQ_BUFFER_SIZE is pointless, as libpq currently
stands.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-17 Thread Jim Nasby
On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
 What exactly is the user trying to monitor? If it's how far behind is
 the standby, the difference between pg_current_xlog_insert_location()
 in the master and pg_last_xlog_replay_location() in the standby seems
 more robust and well-defined to me. It's a measure of XLOG location (ie.
 bytes) instead of time, but time is a complicated concept.

I can tell you that end users *will* want a time-based indication of how far 
behind we are. DBAs will understand we're this many transactions behind, but 
managers and end users won't. Unless it's unreasonable to provide that info, we 
should do so.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-16 Thread Simon Riggs
On Sun, 2010-05-16 at 00:05 +0300, Heikki Linnakangas wrote:
 Heikki Linnakangas wrote:
  Simon Riggs wrote:
  WALSender sleeps even when it might have more WAL to send, it doesn't
  check it just unconditionally sleeps. At least WALReceiver loops until
  it has no more to receive. I just can't imagine why that's useful
  behaviour.
  
  Good catch. That should be fixed.
  
  I also note that walsender doesn't respond to signals, while it's
  sending a large batch. That's analogous to the issue that was addressed
  recently in the archiver process.
 
 Attached patch rearranges the walsender loops slightly to fix the above.
 XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
 2) in one round and returns to the main loop after that even if there's
 unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
 That way the main loop gets to respond to signals quickly, and we also
 get to update the shared memory status and PS display more often when
 there's a lot of catching up to do.
 
 Comments

8MB at a time still seems like a large batch to me.

libpq is going to send it in smaller chunks anyway, so I don't see the
importance of trying to keep the batch too large. It just introduces
delay into the sending process. We should be sending chunks that matches
libpq better.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-16 Thread Simon Riggs
On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote:
 On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:
 
  I will recode using that concept.

 Startup gets new pointer when it runs out of data to replay. That might
 or might not include an updated keepalive timestamp, since there's no
 exact relationship between chunks sent and chunks received. Startup
 might ask for a new chunk when half a chunk has been received, or when
 multiple chunks have been received.

New version, with some other cleanup of wait processing.

New logic is that when Startup asks for next applychunk of WAL it saves
the lastChunkTimestamp. That is then the base time used by
WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later.
Since multiple receivechunks can arrive from primary before Startup asks
for next applychunk we use the oldest receivechunk timestamp, not the
latest. Doing it this way means the lastChunkTimestamp doesn't change
when new keepalives arrive, so we have a stable and reasonably accurate
recordSendTimestamp for each WAL record.

The keepalive is sent as the first part of a new message, if any. So
partial chunks of data always have an accurate timestamp, even if that
is slightly older as a result. Doesn't make much difference except with
very large chunks.

I think that addresses the points raised on this thread and others.

-- 
 Simon Riggs   www.2ndQuadrant.com
*** a/doc/src/sgml/protocol.sgml
--- b/doc/src/sgml/protocol.sgml
***
*** 4232,4247  The commands accepted in walsender mode are:
/varlistentry
/variablelist
/para
!   /listitem
!   /varlistentry
!   /variablelist
!  /para
!  para
 A single WAL record is never split across two CopyData messages. When
 a WAL record crosses a WAL page boundary, however, and is therefore
 already split using continuation records, it can be split at the page
 boundary. In other words, the first main WAL record and its
 continuation records can be split across different CopyData messages.
   /para
  /listitem
/varlistentry
--- 4232,4283 
/varlistentry
/variablelist
/para
!   para
 A single WAL record is never split across two CopyData messages. When
 a WAL record crosses a WAL page boundary, however, and is therefore
 already split using continuation records, it can be split at the page
 boundary. In other words, the first main WAL record and its
 continuation records can be split across different CopyData messages.
+   /para
+   /listitem
+   /varlistentry
+   varlistentry
+   term
+   Keepalive (B)
+   /term
+   listitem
+   para
+   variablelist
+   varlistentry
+   term
+   Byte1('k')
+   /term
+   listitem
+   para
+   Identifies the message as a keepalive.
+   /para
+   /listitem
+   /varlistentry
+   varlistentry
+   term
+   TimestampTz
+   /term
+   listitem
+   para
+   The current timestamp on the primary server when the keepalive was sent.
+   /para
+   /listitem
+   /varlistentry
+   /variablelist
+   /para
+   para
+If varnamewal_level/ is set to literalhot_standby/ then a keepalive
+is sent once per varnamewal_sender_delay/. The keepalive is sent after
+WAL data has been sent, if any.
+   /para
+   /listitem
+   /varlistentry
+   /variablelist
   /para
  /listitem
/varlistentry
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 1938,1944  UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
  			UpdateControlFile();
  			minRecoveryPoint = newMinRecoveryPoint;
  
! 			ereport(DEBUG2,
  	(errmsg(updated min recovery point to %X/%X,
  		minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff)));
  		}
--- 1938,1944 
  			UpdateControlFile();
  			minRecoveryPoint = newMinRecoveryPoint;
  
! 			ereport(DEBUG3,
  	(errmsg(updated min recovery point to %X/%X,
  		minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff)));
  		}
***
*** 9212,9218  retry:
  	 * While walreceiver is active, wait for new WAL to arrive
  	 * from primary.
  	 */
! 	receivedUpto = GetWalRcvWriteRecPtr();
  	if (XLByteLT(*RecPtr, receivedUpto))
  	{
  		/*
--- 9212,9218 
  	 * While walreceiver is active, wait for new WAL to arrive
  	 * from primary.
  	 */
! 	receivedUpto = GetWalRcvNextChunk();
  	if (XLByteLT(*RecPtr, receivedUpto))
  	{
  		/*
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***
*** 407,412  XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
--- 407,428 
  XLogWalRcvWrite(buf, len, recptr);
  break;
  			}
+ 		case 'k':/* keepalive */
+ 			{
+ 

Re: [HACKERS] Keepalive for max_standby_delay

2010-05-16 Thread Fujii Masao
On Sun, May 16, 2010 at 6:05 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Heikki Linnakangas wrote:
 Simon Riggs wrote:
 WALSender sleeps even when it might have more WAL to send, it doesn't
 check it just unconditionally sleeps. At least WALReceiver loops until
 it has no more to receive. I just can't imagine why that's useful
 behaviour.

 Good catch. That should be fixed.

 I also note that walsender doesn't respond to signals, while it's
 sending a large batch. That's analogous to the issue that was addressed
 recently in the archiver process.

 Attached patch rearranges the walsender loops slightly to fix the above.
 XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
 2) in one round and returns to the main loop after that even if there's
 unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
 That way the main loop gets to respond to signals quickly, and we also
 get to update the shared memory status and PS display more often when
 there's a lot of catching up to do.

 Comments

The main loop needs to respond to also the 'X' message from the standby
quickly?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-16 Thread Fujii Masao
On Mon, May 17, 2010 at 1:11 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote:
 On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:

  I will recode using that concept.

 Startup gets new pointer when it runs out of data to replay. That might
 or might not include an updated keepalive timestamp, since there's no
 exact relationship between chunks sent and chunks received. Startup
 might ask for a new chunk when half a chunk has been received, or when
 multiple chunks have been received.

 New version, with some other cleanup of wait processing.

 New logic is that when Startup asks for next applychunk of WAL it saves
 the lastChunkTimestamp. That is then the base time used by
 WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later.
 Since multiple receivechunks can arrive from primary before Startup asks
 for next applychunk we use the oldest receivechunk timestamp, not the
 latest. Doing it this way means the lastChunkTimestamp doesn't change
 when new keepalives arrive, so we have a stable and reasonably accurate
 recordSendTimestamp for each WAL record.

 The keepalive is sent as the first part of a new message, if any. So
 partial chunks of data always have an accurate timestamp, even if that
 is slightly older as a result. Doesn't make much difference except with
 very large chunks.

 I think that addresses the points raised on this thread and others.

Is it OK that this keepalive message cannot be used by HS in file-based
log-shipping? Even in SR, the startup process cannot use the keepalive
until walreceiver has been started up.

WalSndKeepAlive() always calls initStringInfo(), which seems to cause
memory-leak.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Simon Riggs

Patch adds a keepalive message to ensure max_standby_delay is useful.

No WAL format changes, no libpq changes. Just an additional message type
for the streaming replication protocol, sent once per main loop in
WALsender. Plus docs.

Comments?

-- 
 Simon Riggs   www.2ndQuadrant.com
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index c63d003..391d990 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -4232,16 +4232,52 @@ The commands accepted in walsender mode are:
   /varlistentry
   /variablelist
   /para
-  /listitem
-  /varlistentry
-  /variablelist
- /para
- para
+  para
A single WAL record is never split across two CopyData messages. When
a WAL record crosses a WAL page boundary, however, and is therefore
already split using continuation records, it can be split at the page
boundary. In other words, the first main WAL record and its
continuation records can be split across different CopyData messages.
+  /para
+  /listitem
+  /varlistentry
+  varlistentry
+  term
+  Keepalive (B)
+  /term
+  listitem
+  para
+  variablelist
+  varlistentry
+  term
+  Byte1('k')
+  /term
+  listitem
+  para
+  Identifies the message as a keepalive.
+  /para
+  /listitem
+  /varlistentry
+  varlistentry
+  term
+  TimestampTz
+  /term
+  listitem
+  para
+  The current timestamp on the primary server when the keepalive was sent.
+  /para
+  /listitem
+  /varlistentry
+  /variablelist
+  /para
+  para
+   If varnamewal_level/ is set to literalhot_standby/ then a keepalive
+   is sent once per varnamewal_sender_delay/. The keepalive is sent after
+   WAL data has been sent, if any.
+  /para
+  /listitem
+  /varlistentry
+  /variablelist
  /para
 /listitem
   /varlistentry
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 607d57e..ee383af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5566,6 +5566,17 @@ GetLatestXLogTime(void)
 	return recoveryLastXTime;
 }
 
+void
+SetLatestXLogTime(TimestampTz newLastXTime)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	SpinLockAcquire(xlogctl-info_lck);
+	xlogctl-recoveryLastXTime = newLastXTime;
+	SpinLockRelease(xlogctl-info_lck);
+}
+
 /*
  * Note that text field supplied is a parameter name and does not require
  * translation
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index bb87a06..8d52c3f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -407,6 +407,22 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 XLogWalRcvWrite(buf, len, recptr);
 break;
 			}
+		case 'k':/* keepalive */
+			{
+TimestampTz keepalive;
+
+if (len != sizeof(TimestampTz))
+	ereport(ERROR,
+			(errcode(ERRCODE_PROTOCOL_VIOLATION),
+			 errmsg_internal(invalid keepalive message received from primary)));
+
+memcpy(keepalive, buf, sizeof(TimestampTz));
+buf += sizeof(TimestampTz);
+len -= sizeof(TimestampTz);
+
+SetLatestXLogTime(keepalive);
+break;
+			}
 		default:
 			ereport(ERROR,
 	(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1c04fc3..f2f8750 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -98,6 +98,7 @@ static void WalSndQuickDieHandler(SIGNAL_ARGS);
 static int	WalSndLoop(void);
 static void InitWalSnd(void);
 static void WalSndHandshake(void);
+static bool WalSndKeepAlive(void);
 static void WalSndKill(int code, Datum arg);
 static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
 static bool XLogSend(StringInfo outMsg);
@@ -314,6 +315,30 @@ WalSndHandshake(void)
 	}
 }
 
+static bool
+WalSndKeepAlive(void)
+{
+	StringInfoData outMsg;
+	TimestampTz ts;
+
+	if (!XLogStandbyInfoActive())
+		return true;
+
+	initStringInfo(outMsg);
+	ts = GetCurrentTimestamp();
+
+	/* format the keepalive message */
+	pq_sendbyte(outMsg, 'k');
+	pq_sendbytes(outMsg, (char *) ts, sizeof(TimestampTz));
+
+	/* send the CopyData message */
+	pq_putmessage('d', outMsg.data, outMsg.len);
+	if (pq_flush())
+		return false;
+
+	return true;
+}
+
 /*
  * Check if the remote end has closed the connection.
  */
@@ -428,6 +453,9 @@ WalSndLoop(void)
 		/* Attempt to send the log once every loop */
 		if (!XLogSend(output_message))
 			goto eof;
+
+		if (!WalSndKeepAlive())
+			goto eof;
 	}
 
 	/* can't get here because the above loop never exits */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8ff68a6..2d01670 100644
--- a/src/include/access/xlog.h

Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Patch adds a keepalive message to ensure max_standby_delay is useful.

The proposed placement of the keepalive-send is about the worst it could
possibly be.  It needs to be done right before pq_flush to ensure
minimum transfer delay.  Otherwise any attempt to measure clock skew
using the timestamp will be seriously off.  Where you've placed it
guarantees maximum delay not minimum.

I'm also extremely dubious that it's a good idea to set
recoveryLastXTime from this.  Using both that and the timestamps from
the wal log flies in the face of everything I remember about control
theory.  We should be doing only one or only the other, I think.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Simon Riggs
On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  Patch adds a keepalive message to ensure max_standby_delay is useful.
 
 The proposed placement of the keepalive-send is about the worst it could
 possibly be.  It needs to be done right before pq_flush to ensure
 minimum transfer delay.  Otherwise any attempt to measure clock skew
 using the timestamp will be seriously off.  Where you've placed it
 guarantees maximum delay not minimum.

I don't understand. WalSndKeepAlive() contains a pq_flush() immediately
after the timestamp is set. I did that way for exactly the same reasons
you've said.

Do you mean you only want to see one pq_flush()? I used two so that the
actual data is never delayed by a keepalive. WAL Sender was going to
sleep anyway, so shouldn't be a problem.

 I'm also extremely dubious that it's a good idea to set
 recoveryLastXTime from this.  Using both that and the timestamps from
 the wal log flies in the face of everything I remember about control
 theory.  We should be doing only one or only the other, I think.

I can change it so that the recoveryLastXTime will not be updated if we
are using the value from the keepalives. So we have one, or the other.
Remember that replication can switch backwards and forwards between
modes, so it seems sensible to have a common timing value whichever mode
we're in.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Heikki Linnakangas
Simon Riggs wrote:
 On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
 I'm also extremely dubious that it's a good idea to set
 recoveryLastXTime from this.  Using both that and the timestamps from
 the wal log flies in the face of everything I remember about control
 theory.  We should be doing only one or only the other, I think.
 
 I can change it so that the recoveryLastXTime will not be updated if we
 are using the value from the keepalives. So we have one, or the other.
 Remember that replication can switch backwards and forwards between
 modes, so it seems sensible to have a common timing value whichever mode
 we're in.

That means that recoveryLastXTime can jump forwards and backwards.
Doesn't feel right to me either. If you want to expose the
keepalive-time to queries, it should be a separate field, something like
lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
read it.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Simon Riggs
On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
  I'm also extremely dubious that it's a good idea to set
  recoveryLastXTime from this.  Using both that and the timestamps from
  the wal log flies in the face of everything I remember about control
  theory.  We should be doing only one or only the other, I think.
  
  I can change it so that the recoveryLastXTime will not be updated if we
  are using the value from the keepalives. So we have one, or the other.
  Remember that replication can switch backwards and forwards between
  modes, so it seems sensible to have a common timing value whichever mode
  we're in.
 
 That means that recoveryLastXTime can jump forwards and backwards.

That behaviour would be bad, so why not just prevent that from
happening?

 Doesn't feel right to me either. If you want to expose the
 keepalive-time to queries, it should be a separate field, something like
 lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
 read it.

That wouldn't be good because then you couldn't easily monitor the
delay? You'd have to run two different functions depending on the state
of replication (for which we would need yet another function). Users
would just wrap that back up into a single function.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Heikki Linnakangas
Simon Riggs wrote:
 On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
 Doesn't feel right to me either. If you want to expose the
 keepalive-time to queries, it should be a separate field, something like
 lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
 read it.
 
 That wouldn't be good because then you couldn't easily monitor the
 delay? You'd have to run two different functions depending on the state
 of replication (for which we would need yet another function). Users
 would just wrap that back up into a single function.

What exactly is the user trying to monitor? If it's how far behind is
the standby, the difference between pg_current_xlog_insert_location()
in the master and pg_last_xlog_replay_location() in the standby seems
more robust and well-defined to me. It's a measure of XLOG location (ie.
bytes) instead of time, but time is a complicated concept.

Also note that as the patch stands, if you receive a keep-alive from the
master at point X, it doesn't mean that the standby is fully up-to-date.
It's possible that the walsender just finished sending a huge batch of
accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent.
During that time, a lot more WAL has accumulated, yet walsender sends a
keep-alive with the current timestamp.

In general, the purpose of a keep-alive is to keep the connection alive,
but you're trying to accomplish something else too, and I don't fully
understand what it is.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Simon Riggs
On Sat, 2010-05-15 at 20:05 +0300, Heikki Linnakangas wrote:
 Simon Riggs wrote:
  On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
  Doesn't feel right to me either. If you want to expose the
  keepalive-time to queries, it should be a separate field, something like
  lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
  read it.
  
  That wouldn't be good because then you couldn't easily monitor the
  delay? You'd have to run two different functions depending on the state
  of replication (for which we would need yet another function). Users
  would just wrap that back up into a single function.
 
 What exactly is the user trying to monitor? If it's how far behind is
 the standby, the difference between pg_current_xlog_insert_location()
 in the master and pg_last_xlog_replay_location() in the standby seems
 more robust and well-defined to me. It's a measure of XLOG location (ie.
 bytes) instead of time, but time is a complicated concept.

Maybe, but its meaningful to users and that is the point.

 Also note that as the patch stands, if you receive a keep-alive from the
 master at point X, it doesn't mean that the standby is fully up-to-date.
 It's possible that the walsender just finished sending a huge batch of
 accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent.
 During that time, a lot more WAL has accumulated, yet walsender sends a
 keep-alive with the current timestamp.

Not at all. The timestamp for the keepalive is calculated after the
pq_flush for the main WAL data. So it takes 10 years to send the next
blob of WAL data the timestamp will be current.

However, a point you made in an earlier thread is still true here. It
sounds like it would be best if startup process didn't re-access shared
memory for the next location until it had fully replayed all the WAL up
to the point it last accessed. That way the changing value of the shared
timestamp would have no effect on the calculated value at any point in
time. I will recode using that concept.

 In general, the purpose of a keep-alive is to keep the connection alive,
 but you're trying to accomplish something else too, and I don't fully
 understand what it is.

That surprises me. If nothing else, its in the title of the thread,
though since you personally added this to the Hot Standby todo more than
6 months ago I'd hope you of all people would understand the purpose.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Simon Riggs
On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:

 I will recode using that concept.

There's some behaviours that aren't helpful here:

Startup gets new pointer when it runs out of data to replay. That might
or might not include an updated keepalive timestamp, since there's no
exact relationship between chunks sent and chunks received. Startup
might ask for a new chunk when half a chunk has been received, or when
multiple chunks have been received.

WALSender doesn't chunk up what it sends, though libpq does at a lower
level. So there's no way to make a keepalive happen every X bytes
without doing this from within libpq.

WALSender sleeps even when it might have more WAL to send, it doesn't
check it just unconditionally sleeps. At least WALReceiver loops until
it has no more to receive. I just can't imagine why that's useful
behaviour.

All in all, I think we should be calling this burst replication not
streaming replication. That does cause an issue in trying to monitor
what's going on cos there's so much jitter.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Heikki Linnakangas
Simon Riggs wrote:
 WALSender sleeps even when it might have more WAL to send, it doesn't
 check it just unconditionally sleeps. At least WALReceiver loops until
 it has no more to receive. I just can't imagine why that's useful
 behaviour.

Good catch. That should be fixed.

I also note that walsender doesn't respond to signals, while it's
sending a large batch. That's analogous to the issue that was addressed
recently in the archiver process.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Keepalive for max_standby_delay

2010-05-15 Thread Heikki Linnakangas
Heikki Linnakangas wrote:
 Simon Riggs wrote:
 WALSender sleeps even when it might have more WAL to send, it doesn't
 check it just unconditionally sleeps. At least WALReceiver loops until
 it has no more to receive. I just can't imagine why that's useful
 behaviour.
 
 Good catch. That should be fixed.
 
 I also note that walsender doesn't respond to signals, while it's
 sending a large batch. That's analogous to the issue that was addressed
 recently in the archiver process.

Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.

Comments, have I screwed up anything?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***
*** 100,106  static void InitWalSnd(void);
  static void WalSndHandshake(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
! static bool XLogSend(StringInfo outMsg);
  static void CheckClosedConnection(void);
  
  /*
--- 100,106 
  static void WalSndHandshake(void);
  static void WalSndKill(int code, Datum arg);
  static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
! static bool XLogSend(StringInfo outMsg, bool *caughtup);
  static void CheckClosedConnection(void);
  
  /*
***
*** 360,365  static int
--- 360,366 
  WalSndLoop(void)
  {
  	StringInfoData output_message;
+ 	bool		caughtup = false;
  
  	initStringInfo(output_message);
  
***
*** 387,393  WalSndLoop(void)
  		 */
  		if (ready_to_stop)
  		{
! 			XLogSend(output_message);
  			shutdown_requested = true;
  		}
  
--- 388,394 
  		 */
  		if (ready_to_stop)
  		{
! 			XLogSend(output_message, caughtup);
  			shutdown_requested = true;
  		}
  
***
*** 402,432  WalSndLoop(void)
  		}
  
  		/*
! 		 * Nap for the configured time or until a message arrives.
  		 *
  		 * On some platforms, signals won't interrupt the sleep.  To ensure we
  		 * respond reasonably promptly when someone signals us, break down the
  		 * sleep into NAPTIME_PER_CYCLE increments, and check for
  		 * interrupts after each nap.
  		 */
! 		remain = WalSndDelay * 1000L;
! 		while (remain  0)
  		{
! 			if (got_SIGHUP || shutdown_requested || ready_to_stop)
! break;
  
! 			/*
! 			 * Check to see whether a message from the standby or an interrupt
! 			 * from other processes has arrived.
! 			 */
! 			pg_usleep(remain  NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
! 			CheckClosedConnection();
  
! 			remain -= NAPTIME_PER_CYCLE;
  		}
- 
  		/* Attempt to send the log once every loop */
! 		if (!XLogSend(output_message))
  			goto eof;
  	}
  
--- 403,434 
  		}
  
  		/*
! 		 * If we had sent all accumulated WAL in last round, nap for the
! 		 * configured time before retrying.
  		 *
  		 * On some platforms, signals won't interrupt the sleep.  To ensure we
  		 * respond reasonably promptly when someone signals us, break down the
  		 * sleep into NAPTIME_PER_CYCLE increments, and check for
  		 * interrupts after each nap.
  		 */
! 		if (caughtup)
  		{
! 			remain = WalSndDelay * 1000L;
! 			while (remain  0)
! 			{
! /* Check for interrupts */
! if (got_SIGHUP || shutdown_requested || ready_to_stop)
! 	break;
  
! /* Sleep and check that the connection is still alive */
! pg_usleep(remain  NAPTIME_PER_CYCLE ? NAPTIME_PER_CYCLE : remain);
! CheckClosedConnection();
  
! remain -= NAPTIME_PER_CYCLE;
! 			}
  		}
  		/* Attempt to send the log once every loop */
! 		if (!XLogSend(output_message, caughtup))
  			goto eof;
  	}
  
***
*** 623,637  XLogRead(char *buf, XLogRecPtr recptr, Size nbytes)
  }
  
  /*
!  * Read all WAL that's been written (and flushed) since last cycle, and send
!  * it to client.
   *
   * Returns true if OK, false if trouble.
   */
  static bool
! XLogSend(StringInfo outMsg)
  {
  	XLogRecPtr	SendRqstPtr;
  	char		activitymsg[50];
  
  	/* use volatile pointer to prevent code rearrangement */
--- 625,644 
  }
  
  /*
!  * Read up to MAX_SEND_SIZE bytes of WAL that's been written (and flushed),
!  * but not yet sent to the client, and send it. If there is no unsent WAL,
!  * *caughtup is set to true and nothing is sent, otherwise *caughtup is set
!  * to false.
   *
   * Returns true if OK, false if trouble.
   */
  static bool
! XLogSend(StringInfo outMsg, bool *caughtup)
  {
  	XLogRecPtr	SendRqstPtr;
+ 	XLogRecPtr	startptr;
+ 	XLogRecPtr	endptr;
+ 	Size		nbytes;
  	char		activitymsg[50];
  
  	/* use