Re: [HACKERS] 9.1.2 ?

2011-11-12 Thread Chris Redekop
On Wed, Nov 9, 2011 at 6:22 PM, Florian Pflug f...@phlo.org wrote:

 On Nov9, 2011, at 23:53 , Daniel Farina wrote:
  I think a novice user would be scared half to death: I know I was the
  first time.  That's not a great impression for the project to leave
  for what is not, at its root, a vast defect, and the fact it's
  occurring for people when they use rsync rather than my very sensitive
  backup routines is indication that it's not very corner-ey.

 Just to emphasize the non-cornerish-ness of this problem, it should be
 mentioned that the HS issue was observed even with backups taken with
 pg_basebackup, if memory serves correctly.

Yes, I personally can reliably reproduce both the clog and subtrans problems
using pg_basebackup, and can confirm that the
oldestActiveXid_fixed.v2.patch does resolve both issues.


Re: [HACKERS] Hot Standby startup with overflowed snapshots

2011-11-02 Thread Chris Redekop
oops...reply-to-all

-- Forwarded message --
From: Chris Redekop ch...@replicon.com
Date: Wed, Nov 2, 2011 at 8:41 AM
Subject: Re: [HACKERS] Hot Standby startup with overflowed snapshots
To: Simon Riggs si...@2ndquadrant.com


Sure, I've got quite a few logs lying around - I've attached 3 of 'em...let
me know if there are any specific things you'd like me to do or look for
next time it happens


On Wed, Nov 2, 2011 at 2:59 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Fri, Oct 28, 2011 at 3:42 AM, Chris Redekop ch...@replicon.com wrote:

  On a side note, I am sporadically seeing another error on hot standby
  startup.  I'm not terribly concerned about it as it is pretty rare and it
  will work on a retry, so it's not a big deal.  The error is "FATAL:
  out-of-order XID insertion in KnownAssignedXids".  If you think it might
  be a bug and are interested in hunting it down let me know and I'll help
  any way I can...but if you're not too worried about it then neither am I :)

 I'd be interested to see further details of this if you see it again,
 or have access to previous logs.

 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



postgresql-2011-10-27_202007.log
Description: Binary data


postgresql-2011-10-31_152925.log
Description: Binary data


postgresql-2011-11-01_094501.log
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-11-02 Thread Chris Redekop
Okay, sorry, I'm a little confused then.  Should I be able to apply both the
v2 patch as well as the v3 patch?  Or is it expected that I'd have to
manually do the merge?


On Wed, Nov 2, 2011 at 1:34 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Wed, Nov 2, 2011 at 2:40 AM, Chris Redekop ch...@replicon.com wrote:

  looks like the v3 patch re-introduces the pg_subtrans issue...

 No, I just separated the patches to be clearer about the individual
 changes.

 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-11-01 Thread Chris Redekop
looks like the v3 patch re-introduces the pg_subtrans issue...


On Tue, Nov 1, 2011 at 9:33 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Thu, Oct 27, 2011 at 4:25 PM, Simon Riggs si...@2ndquadrant.com
 wrote:

  StartupMultiXact() didn't need changing, I thought, but I will review
 further.

 Good suggestion.

 On review, StartupMultiXact() could also suffer a similar error to the
 clog failure. This was caused *because* MultiXact is not maintained by
 recovery, which I had thought meant it was protected from such
 failure.

 Revised patch attached.

 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Hot Standby startup with overflowed snapshots

2011-10-27 Thread Chris Redekop
Thanks for the patch, Simon, but unfortunately it does not resolve the issue
I am seeing.  The standby still refuses to finish starting up until long
after all clients have disconnected from the primary (10 minutes).  I do
see your new log statement on startup, but only once - it does not repeat.
Is there any way for me to see what the oldest xid on the standby is via
controldata or something like that?  The standby does stream to keep up with
the primary while the primary has load, and then it becomes idle when the
primary becomes idle (when I kill all the connections)...so it appears to
be current...but it just doesn't finish starting up

I'm not sure if it's relevant, but after it has sat idle for a couple
minutes I start seeing these statements in the log (with the same offset
every time):
DEBUG:  skipping restartpoint, already performed at 9/9520



On Thu, Oct 27, 2011 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Chris Redekop's recent report of slow startup for Hot Standby has made
 me revisit the code there.

 Although there isn't a bug, there is a missed opportunity for starting
 up faster which could be the source of Chris' annoyance.

 The following patch allows a faster startup in some circumstances.

 The patch also alters the log levels for messages and gives a single
 simple message for this situation. The log will now say

  LOG:  recovery snapshot waiting for non-overflowed snapshot or until
 oldest active xid on standby is at least %u (now %u)
  ...multiple times until snapshot non-overflowed or xid reached...

 whereas before the first LOG message shown was

  LOG:  consistent state delayed because recovery snapshot incomplete
  and only later, at DEBUG2 do you see
  LOG:  recovery snapshot waiting for %u oldest active xid on standby is %u
  ...multiple times until xid reached...

 Comments please.

 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services






Re: [HACKERS] Hot Standby startup with overflowed snapshots

2011-10-27 Thread Chris Redekop
hrmz, still basically the same behaviour.  I think it might be a *little*
better with this patch.  Before, when under load, it would start up quickly
maybe 2 or 3 times out of 10 attempts...with this patch it might be up to 4
or 5 times out of 10...ish...or maybe it was just fluke *shrug*.  I'm still
only seeing your log statement a single time (I'm running at debug2).  I
have discovered something though - when the standby is in this state, if I
force a checkpoint on the primary then the standby comes right up.  Is there
anything I can check or try for you to help figure this out?...or is it
actually as designed that it could take 10-ish minutes to start up even
after all clients have disconnected from the primary?


On Thu, Oct 27, 2011 at 11:27 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Thu, Oct 27, 2011 at 5:26 PM, Chris Redekop ch...@replicon.com wrote:

  Thanks for the patch, Simon, but unfortunately it does not resolve the
  issue I am seeing.  The standby still refuses to finish starting up until
  long after all clients have disconnected from the primary (10 minutes).
  I do see your new log statement on startup, but only once - it does not
  repeat.  Is there any way for me to see what the oldest xid on the
  standby is via controldata or something like that?  The standby does
  stream to keep up with the primary while the primary has load, and then
  it becomes idle when the primary becomes idle (when I kill all the
  connections)...so it appears to be current...but it just doesn't finish
  starting up
  I'm not sure if it's relevant, but after it has sat idle for a couple
  minutes I start seeing these statements in the log (with the same offset
  every time):
  DEBUG:  skipping restartpoint, already performed at 9/9520

 OK, so it looks like there are 2 opportunities to improve, not just one.

 Try this.

 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Hot Standby startup with overflowed snapshots

2011-10-27 Thread Chris Redekop
Sorry..."designed" was a poor choice of words; I meant "not unexpected".
Doing the checkpoint right after pg_stop_backup() looks like it will work
perfectly for me, so thanks for all your help!
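The workaround described here - forcing a checkpoint on the primary immediately after pg_stop_backup() - can be sketched as a tiny shell helper. The helper only emits the SQL, so the sketch is dry-runnable; the hostnames and psql invocation in the usage comment are assumptions, not details from this thread.

```shell
# Emit the SQL to run on the primary when finishing a base backup:
# leave backup mode, then force a checkpoint so a standby built from
# this backup can reach consistency without waiting for the next
# timed checkpoint.
finish_base_backup_sql() {
    echo "SELECT pg_stop_backup();"
    echo "CHECKPOINT;"
}

# Usage sketch (primary host and user are assumptions):
#   finish_base_backup_sql | psql -h primary -U postgres
finish_base_backup_sql
```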

On a side note, I am sporadically seeing another error on hot standby startup.
I'm not terribly concerned about it as it is pretty rare and it will work
on a retry, so it's not a big deal.  The error is "FATAL:  out-of-order XID
insertion in KnownAssignedXids".  If you think it might be a bug and are
interested in hunting it down let me know and I'll help any way I can...but
if you're not too worried about it then neither am I :)


On Thu, Oct 27, 2011 at 4:55 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On Thu, Oct 27, 2011 at 10:09 PM, Chris Redekop ch...@replicon.com
 wrote:

  hrmz, still basically the same behaviour.  I think it might be a *little*
  better with this patch.  Before, when under load, it would start up
  quickly maybe 2 or 3 times out of 10 attempts...with this patch it might
  be up to 4 or 5 times out of 10...ish...or maybe it was just fluke
  *shrug*.  I'm still only seeing your log statement a single time (I'm
  running at debug2).  I have discovered something though - when the
  standby is in this state, if I force a checkpoint on the primary then the
  standby comes right up.  Is there anything I can check or try for you to
  help figure this out?...or is it actually as designed that it could take
  10-ish minutes to start up even after all clients have disconnected from
  the primary?

 Thanks for testing. The improvements cover specific cases, so it's not
 subject to chance; it's not a performance patch.

 It's not designed to act the way you describe, but it does.

 The reason this occurs is that you have a transaction-heavy workload
 with occasional periods of complete quiet and a base backup time that
 is much less than checkpoint_timeout. If your base backup was slower
 the checkpoint would have hit naturally before recovery had reached a
 consistent state. Which seems fairly atypical. I guess you're doing
 this on a test system.

 It seems cheap to add in a call to LogStandbySnapshot() after each
 call to pg_stop_backup().

 Does anyone think this case is worth adding code for? Seems like one
 more thing to break.

 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-26 Thread Chris Redekop
 And I think they also reported that if they didn't run hot standby,
 but just normal recovery into a new master, it didn't have the problem
 either, i.e. without hot standby, recovery ran, properly extended the
 clog, and then ran as a new master fine.

Yes, this is correct...attempting to start as a hot standby will produce the
pg_clog error repeatedly, and then without changing anything else, just
turning hot standby off, it will start up successfully.

 This fits the OP's observation of the
 problem vanishing when pg_start_backup() does an immediate checkpoint.

Note that this is *not* the behaviour I'm seeing...it's possible it happens
more frequently without the immediate checkpoint, but I am seeing it happen
even with the immediate checkpoint.

 This is a different problem and has already been reported by one of
 your colleagues in a separate thread, and answered in detail by me
 there. There is no bug related to this error message.

Excellent...I will continue this discussion in that thread.


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-26 Thread Chris Redekop
FYI I have given this patch a good test and can no longer reproduce
either the subtrans or the clog error.  Thanks guys!


On Wed, Oct 26, 2011 at 11:09 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Wed, Oct 26, 2011 at 5:16 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
  On Wed, Oct 26, 2011 at 5:08 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
 
  Brewing a patch now.
 
  Latest thinking... confirmations or other error reports please.
 
  This fixes both the subtrans and clog bugs in one patch.

 I'll be looking to commit that tomorrow afternoon as two separate
 patches with appropriate credits.

 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Chris Redekop
 Chris, can you rearrange the backup so you copy the pg_control file as
 the first act after the pg_start_backup?

I tried this and it doesn't seem to make any difference.  I also tried the
patch and I can no longer reproduce the subtrans error; however, instead it
now starts up, but never gets to the point where it'll accept
connections.  It starts up, but if I try to do anything I always get "FATAL:
the database system is starting up"...even if the load is removed from the
primary, the standby still never finishes starting up.  Attached below is
a log of one of these startup attempts.  In my testing with the patch
applied, approx 3 in 10 attempts start up successfully and 7 in 10 attempts
go into the "db is starting up" state...the pg_clog error is still there, but
seems much harder to reproduce now...I've seen it only once since applying
the patch (out of probably 50 or 60 under-load startup attempts).  It does
seem to be moody like that tho...it will be very difficult to reproduce
for a while, and then it will happen damn-near every time for a
while...weirdness

On a bit of a side note, I've been thinking of changing my scripts so that
they perform an initial rsync prior to doing the
startbackup-rsync-stopbackup, just so that the second rsync will be
faster...so that the backup is in progress for a shorter period of time, as
while it is running it will stop other standbys from starting up...this
shouldn't cause any issues, eh?
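The two-pass idea above can be sketched like this; RUN=echo makes it a dry run that only prints the commands, and every host, path, and backup label is an illustrative assumption rather than something taken from this thread.

```shell
RUN=echo   # print commands instead of executing them; set RUN= to run for real

two_pass_base_backup() {
    src="postgres@primary:/var/lib/postgresql/9.1/main/"   # assumed path
    dst="/var/lib/postgresql/9.1/main/"                    # assumed path

    # Pass 1: bulk copy outside backup mode, purely to pre-seed the target.
    $RUN rsync -a --delete "$src" "$dst"

    # Pass 2: the real base backup; only files changed since pass 1 are
    # re-sent, so the pg_start_backup/pg_stop_backup window stays short.
    $RUN psql -h primary -c "SELECT pg_start_backup('standby', true)"
    $RUN rsync -a --delete --exclude=pg_xlog "$src" "$dst"
    $RUN psql -h primary -c "SELECT pg_stop_backup()"
}

two_pass_base_backup
```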


2011-10-25 13:43:24.035 MDT [15072]: [1-1] LOG:  database system was
interrupted; last known up at 2011-10-25 13:43:11 MDT
2011-10-25 13:43:24.035 MDT [15072]: [2-1] LOG:  creating missing WAL
directory pg_xlog/archive_status
2011-10-25 13:43:24.037 MDT [15072]: [3-1] LOG:  entering standby mode
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 2/CF00
2011-10-25 13:43:24.041 MDT [15073]: [1-1] LOG:  streaming replication
successfully connected to primary
2011-10-25 13:43:24.177 MDT [15092]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:24.781 MDT [15072]: [4-1] DEBUG:  checkpoint record is at
2/CF81A478
2011-10-25 13:43:24.781 MDT [15072]: [5-1] DEBUG:  redo record is at
2/CF20; shutdown FALSE
2011-10-25 13:43:24.781 MDT [15072]: [6-1] DEBUG:  next transaction ID:
0/4634700; next OID: 1188228
2011-10-25 13:43:24.781 MDT [15072]: [7-1] DEBUG:  next MultiXactId: 839;
next MultiXactOffset: 1686
2011-10-25 13:43:24.781 MDT [15072]: [8-1] DEBUG:  oldest unfrozen
transaction ID: 1669, in database 1
2011-10-25 13:43:24.781 MDT [15072]: [9-1] DEBUG:  transaction ID wrap limit
is 2147485316, limited by database with OID 1
2011-10-25 13:43:24.783 MDT [15072]: [10-1] DEBUG:  resetting unlogged
relations: cleanup 1 init 0
2011-10-25 13:43:24.791 MDT [15072]: [11-1] DEBUG:  initializing for hot
standby
2011-10-25 13:43:24.791 MDT [15072]: [12-1] LOG:  consistent recovery state
reached at 2/CF81A4D0
2011-10-25 13:43:24.791 MDT [15072]: [13-1] LOG:  redo starts at 2/CF20
2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG:  consistent state delayed
because recovery snapshot incomplete
2011-10-25 13:43:25.019 MDT [15072]: [15-1] CONTEXT:  xlog redo  running
xacts:
nextXid 4634700 latestCompletedXid 4634698 oldestRunningXid 4634336; 130
xacts:
4634336 4634337 4634338 4634339 4634340 4634341 4634342 4634343 4634344
4634345
4634346 4634347 4634348 4634349 4634350 4634351 4634352 4634353 4634354
4634355
4634356 4634357 4634358 4634359 4634360 4634361 4634362 4634363 4634364
4634365
4634366 4634367 4634368 4634369 4634370 4634371 4634515 4634516 4634517
4634518
4634519 4634520 4634521 4634522 4634523 4634524 4634525 4634526 4634527
4634528
4634529 4634530 4634531 4634532 4634533 4634534 4634535 4634536 4634537
4634538
4634539 4634540 4634541 4634542 4634543 4634385 4634386 4634387 4634388
4634389
4634390 4634391 4634392 4634393 4634394 4634395 4634396 4634397 4634398
4634399
4634400 4634401 4634402 4634403 4634404 4634405 4634406 4634407 4634408
4634409
4634410 4634411 4634412 4634413 4634414 4634415 4634416 4634417 4634418
4634419
4634420 4634579 4634580 4634581 4634582 4634583 4634584 4634585 4634586
4634587
4634588 4634589 4634590 4634591 4634592 4634593 4634594 4634595 4634596
4634597
4634598 4634599 4634600 4634601 4634602 4634603 4634604 4634605 4634606
4634607;
 subxid ovf
2011-10-25 13:43:25.240 MDT [15130]: [1-1] FATAL:  the database system is
starting up
DEBUG:  standby sync_rep_test has now caught up with primary
2011-10-25 13:43:26.304 MDT [15167]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:27.366 MDT [15204]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:28.426 MDT [15241]: [1-1] FATAL:  the database system is
starting up
2011-10-25 13:43:29.461 MDT [15275]: [1-1] FATAL:  the database system is
starting up
and so on...


On Tue, Oct 25, 2011 at 6:51 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug f...@phlo.org 

Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-25 Thread Chris Redekop

 That isn't a Hot Standby problem, a recovery problem, nor is it certain
 it's a PostgreSQL problem.

Do you have any theories on this that I could help investigate?  It happens
even when using pg_basebackup, and it persists until another sync is
performed, so the files must be in some state that it can't recover
from...without understanding the internals, just viewing from an
outside perspective, I don't really see how this could not be a PostgreSQL
problem


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-17 Thread Chris Redekop
I can confirm that both the pg_clog and pg_subtrans errors do occur when
using pg_basebackup instead of rsync.  The data itself seems to be fine
because using the exact same data I can start up a warm standby no problem,
it is just the hot standby that will not start up.


On Sat, Oct 15, 2011 at 7:33 PM, Chris Redekop ch...@replicon.com wrote:

   Linas, could you capture the output of pg_controldata *and* increase
 the
   log level to DEBUG1 on the standby? We should then see nextXid value of
   the checkpoint the recovery is starting from.
 
  I'll try to do that whenever I'm in that territory again... Incidentally,
  recently there was a lot of unrelated-to-this-post work to polish things
 up
  for a talk being given at PGWest 2011 Today :)
 
   I also checked what rsync does when a file vanishes after rsync
 computed the
   file list, but before it is sent. rsync 3.0.7 on OSX, at least,
 complains
   loudly, and doesn't sync the file. It BTW also exits non-zero, with a
 special
   exit code for precisely that failure case.
 
  To be precise, my script has logic to accept the exit code 24, just as
  stated in the PG manual:
 
  Docs For example, some versions of rsync return a separate exit code for
  Docs vanished source files, and you can write a driver script to
 accept
  Docs this exit code as a non-error case.

 I also am running into this issue and can reproduce it very reliably.  For
 me, however, it happens even when doing the fast backup like so:
 pg_start_backup('whatever', true)...my traffic is more write-heavy than
 linas's tho, so that might have something to do with it.  Yesterday it
 reliably errored out on pg_clog every time, but today it is
 failing sporadically on pg_subtrans (which seems to be past where the
 pg_clog error was)...the only thing that has changed is that I've changed
 the log level to debug1...I wouldn't think that could be related though.
  I've linked the requested pg_controldata and debug1 logs for both errors.
  Both links contain the output from pg_start_backup, rsync, pg_stop_backup,
 pg_controldata, and then the postgres debug1 log produced from a subsequent
 startup attempt.

 pg_clog: http://pastebin.com/mTfdcjwH
 pg_subtrans: http://pastebin.com/qAXEHAQt

 Any workarounds would be very appreciated...would copying clog+subtrans
 before or after the rest of the data directory (or something like that) make
 any difference?

 Thanks!



Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-17 Thread Chris Redekop
Well, on the other hand, maybe there is something wrong with the data.
Here are the tests/steps I just did -
1. I do the pg_basebackup when the master is under load, hot slave now will
not start up but warm slave will.
2. I start a warm slave and let it catch up to current
3. On the slave I change 'hot_standby=on' and do a 'service postgresql
restart'
4. Postgres fails to restart with the same error.
5. I turn hot_standby back off and postgres starts back up fine as a warm
slave
6. I then turn off the load, the slave is all caught up, master and slave
are both sitting idle
7. I, again, change 'hot_standby=on' and do a service restart
8. Again it fails, with the same error, even though there is no longer any
load.
9. I repeat this warmstart/hotstart cycle a couple more times until, to my
surprise, instead of failing, it successfully starts up as a hot standby
(this is after maybe 5 minutes or so of sitting idle)
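Steps 3-7 above boil down to toggling hot_standby and restarting the slave; a dry-run sketch follows (RUN=echo prints the commands instead of executing them, and the config path and service name are assumptions for a Debian-style 9.1 install, not details from this thread).

```shell
RUN=echo   # print commands instead of executing; set RUN= to apply for real
CONF=/etc/postgresql/9.1/main/postgresql.conf   # assumed location

set_hot_standby() {
    # $1 = on|off; rewrite the hot_standby line, then restart the slave.
    $RUN sed -i "s/^#*hot_standby *=.*/hot_standby = $1/" "$CONF"
    $RUN service postgresql restart
}

set_hot_standby on
```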

So...given that it continued to fail even after the load had been turned off,
that makes me believe that the data which was copied over was invalid in
some way.  And when a checkpoint/logrotation/something else occurred when not
under load, it cleared itself up...I'm shooting in the dark here

Anyone have any suggestions/ideas/things to try?


On Mon, Oct 17, 2011 at 2:13 PM, Chris Redekop ch...@replicon.com wrote:

 I can confirm that both the pg_clog and pg_subtrans errors do occur when
 using pg_basebackup instead of rsync.  The data itself seems to be fine
 because using the exact same data I can start up a warm standby no problem,
 it is just the hot standby that will not start up.


 On Sat, Oct 15, 2011 at 7:33 PM, Chris Redekop ch...@replicon.com wrote:

   Linas, could you capture the output of pg_controldata *and* increase
 the
   log level to DEBUG1 on the standby? We should then see nextXid value
 of
   the checkpoint the recovery is starting from.
 
  I'll try to do that whenever I'm in that territory again...
 Incidentally,
  recently there was a lot of unrelated-to-this-post work to polish things
 up
  for a talk being given at PGWest 2011 Today :)
 
   I also checked what rsync does when a file vanishes after rsync
 computed the
   file list, but before it is sent. rsync 3.0.7 on OSX, at least,
 complains
   loudly, and doesn't sync the file. It BTW also exits non-zero, with a
 special
   exit code for precisely that failure case.
 
  To be precise, my script has logic to accept the exit code 24, just as
  stated in the PG manual:
 
  Docs For example, some versions of rsync return a separate exit code
 for
  Docs vanished source files, and you can write a driver script to
 accept
  Docs this exit code as a non-error case.

 I also am running into this issue and can reproduce it very reliably.  For
 me, however, it happens even when doing the fast backup like so:
 pg_start_backup('whatever', true)...my traffic is more write-heavy than
 linas's tho, so that might have something to do with it.  Yesterday it
 reliably errored out on pg_clog every time, but today it is
 failing sporadically on pg_subtrans (which seems to be past where the
 pg_clog error was)...the only thing that has changed is that I've changed
 the log level to debug1...I wouldn't think that could be related though.
  I've linked the requested pg_controldata and debug1 logs for both errors.
  Both links contain the output from pg_start_backup, rsync, pg_stop_backup,
 pg_controldata, and then the postgres debug1 log produced from a subsequent
 startup attempt.

 pg_clog: http://pastebin.com/mTfdcjwH
 pg_subtrans: http://pastebin.com/qAXEHAQt

 Any workarounds would be very appreciated...would copying clog+subtrans
 before or after the rest of the data directory (or something like that) make
 any difference?

 Thanks!





Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-15 Thread Chris Redekop
  Linas, could you capture the output of pg_controldata *and* increase the
  log level to DEBUG1 on the standby? We should then see nextXid value of
  the checkpoint the recovery is starting from.

 I'll try to do that whenever I'm in that territory again... Incidentally,
 recently there was a lot of unrelated-to-this-post work to polish things
up
 for a talk being given at PGWest 2011 Today :)

  I also checked what rsync does when a file vanishes after rsync computed
  the file list, but before it is sent. rsync 3.0.7 on OSX, at least,
  complains loudly, and doesn't sync the file. It BTW also exits non-zero,
  with a special exit code for precisely that failure case.

 To be precise, my script has logic to accept the exit code 24, just as
 stated in the PG manual:

 Docs For example, some versions of rsync return a separate exit code for
 Docs vanished source files, and you can write a driver script to accept
 Docs this exit code as a non-error case.
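The "driver script" handling the quoted docs describe can be reduced to an exit-status check; a minimal sketch, assuming rsync's documented exit code 24 ("partial transfer due to vanished source files"). Only the status mapping is shown, and the rsync invocation in the usage comment is illustrative.

```shell
# Treat rsync exit status 24 (source files vanished during transfer)
# as success: files deleted mid-backup are expected on a live cluster.
rsync_status_ok() {
    # $1 = rsync exit status
    [ "$1" -eq 0 ] || [ "$1" -eq 24 ]
}

# Usage sketch:
#   rsync -a "$src" "$dst"
#   rsync_status_ok $? || exit 1
```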

I also am running into this issue and can reproduce it very reliably.  For
me, however, it happens even when doing the fast backup like so:
pg_start_backup('whatever', true)...my traffic is more write-heavy than
linas's tho, so that might have something to do with it.  Yesterday it
reliably errored out on pg_clog every time, but today it is
failing sporadically on pg_subtrans (which seems to be past where the
pg_clog error was)...the only thing that has changed is that I've changed
the log level to debug1...I wouldn't think that could be related though.
 I've linked the requested pg_controldata and debug1 logs for both errors.
 Both links contain the output from pg_start_backup, rsync, pg_stop_backup,
pg_controldata, and then the postgres debug1 log produced from a subsequent
startup attempt.

pg_clog: http://pastebin.com/mTfdcjwH
pg_subtrans: http://pastebin.com/qAXEHAQt

Any workarounds would be very appreciated...would copying clog+subtrans
before or after the rest of the data directory (or something like that) make
any difference?

Thanks!


Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Chris Redekop
Thanks for all the feedback guys.  Just to throw another monkey wrench in
here - I've been playing with Simon's proposed solution of returning 0 when
the WAL positions match, and I've come to the realization that even if
using pg_last_xact_insert_timestamp, although it would help, we still
wouldn't be able to get a 100% accurate "how far behind?" counter...not
that this is a big deal, but I know my ops team is going to bitch to me
about it :)...take this situation: there's a lull of 30 seconds where
there are no transactions committed on the server...the slave is totally
caught up, WAL positions match, I'm reporting 0, everything is happy.  Then
a transaction is committed on the master...before the slave gets it, my
query hits it and sees that we're 30 seconds behind (when in reality we're
1 sec behind).  Because of this effect my graph is a little spikey...I
mean, it's not a huge deal or anything - I can put some sanity checking in
my number reporting (if 1 second ago you were 0 seconds behind, you can't
be more than 1 second behind now, sorta thing).  But if we wanted to go for
the super-ideal solution, there would be a way to get the timestamp of
pg_stat_replication.replay_location+1 (the first transaction that the slave
does not have).
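The sanity check described in the mail (the lag now cannot exceed the lag a moment ago plus the time elapsed since) is just a clamp; here is a sketch in shell, with made-up function and variable names and integer seconds throughout.

```shell
# Clamp a raw "seconds behind" reading: the standby cannot have fallen
# further behind than (previous reading + seconds elapsed since it).
clamp_lag() {
    raw=$1 prev=$2 elapsed=$3
    max=$((prev + elapsed))
    if [ "$raw" -gt "$max" ]; then echo "$max"; else echo "$raw"; fi
}

# The scenario from the mail: caught up (0s) one second ago, then a
# single commit lands before the slave replays it; the raw reading
# looks like 30s behind, but the clamp reports 1s.
clamp_lag 30 0 1
```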


On Thu, Sep 8, 2011 at 7:03 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
  OTOH, new function enables users to monitor the delay as a timestamp.
  For users, a timestamp is obviously easier to handle than LSN, and the
 delay
  as a timestamp is more intuitive. So, I think that it's worth adding
  something like pg_last_xact_insert_timestamp into core for improvement
  of user-friendliness.

 It seems very nice from a usability point of view, but I have to agree
 with Simon's concern about performance.  Actually, as of today,
 WALInsertLock is such a gigantic bottleneck that I suspect the
 overhead of this additional bookkeeping would be completely
 unnoticeable.  But I'm still reluctant to add more centralized
 spinlocks that everyone has to fight over, having recently put a lot
 of effort into getting rid of some of the ones we've traditionally
 had.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company