Re: [ADMIN] does wal archiving block the current client connection?

2006-05-23 Thread Simon Riggs
On Fri, 2006-05-19 at 08:53 -0700, Jeff Frost wrote:
 On Fri, 19 May 2006, Tom Lane wrote:
 
  Jeff Frost [EMAIL PROTECTED] writes:
  Do you think the postmaster on 5432 is trying to archive the other
  postmaster's WAL files somehow?
 
  Not as long as they aren't in the same data directory ;-).  What Simon
  was wondering about was whether an archiver process had somehow been
  left over from a previous incarnation of the test postmaster.  The thing
  to do is look through ps auxww (or local equivalent) and see if you
  see more than one thing calling itself an archiver process.
 
  (Whether or not this explains Jeff's problem, it definitely seems like
  a failure mode that we need to guard against.  We go to great lengths
  to prevent a new postmaster from starting when there are still live
  backends from a previous postmaster, but I don't think that interlock
  is effective for the archiver.)
 
 Well now, will you look at this:
 
 postgres 20228 1  0 May17 ?00:00:00 postgres: archiver process
 postgres 20573 1  0 May17 ?00:00:00 postgres: archiver process
 postgres 23817 23810  0 May17 pts/11   00:00:00 postgres: archiver process
 
 23810 is the running postmaster:
 
 postgres 23810 1  0 May17 pts/11   00:03:01 
 /usr/local/pgsql-8.1.3/bin/postm
 
 do you think that got left around the last time I did a pg_ctl restart?

OK, I'm on it.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-23 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 Do you think the postmaster on 5432 is trying to archive the other 
 postmaster's WAL files somehow?

Not as long as they aren't in the same data directory ;-).  What Simon
was wondering about was whether an archiver process had somehow been
left over from a previous incarnation of the test postmaster.  The thing
to do is look through ps auxww (or local equivalent) and see if you
see more than one thing calling itself an archiver process.

(Whether or not this explains Jeff's problem, it definitely seems like
a failure mode that we need to guard against.  We go to great lengths
to prevent a new postmaster from starting when there are still live
backends from a previous postmaster, but I don't think that interlock
is effective for the archiver.)

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-23 Thread Simon Riggs
On Fri, 2006-05-19 at 08:23 -0700, Jeff Frost wrote:
 On Fri, 19 May 2006, Simon Riggs wrote:
 
  On Thu, 2006-05-18 at 10:08 -0700, Jeff Frost wrote:
 
  May 18 08:00:18 discord postgres[20228]: [129-1] LOG:  archived 
  transaction log file 0001007F
  May 18 08:00:41 discord postgres[20573]: [254-1] LOG:  archived 
  transaction log file 0001007F
  May 18 08:00:41 discord postgres[20573]: [255-1] WARNING:  could not 
  rename file pg_xlog/archive_status/0001007F.ready to
  May 18 08:00:41 discord postgres[20573]: [255-2]  
  pg_xlog/archive_status/0001007F.done: No such file or 
  directory
 
  I'm not clear how you can have two processes both issuing this message,
  since it is only the archiver process that ever says this.
 
  Do you have two archivers running on this system?
  pid = 20229 and pid = 20573
 
  Presumably pg_xlog/archive_status/0001007F.done already
  exists?
 
  Are there two postmasters running (at all)?
  Is there somehow an archiver process running from a previously shutdown
  postmaster (somehow)?
 
 Ah!!  Good catch Simon!  There are in fact two postmasters running on this 
 system.  The normal postmaster running in /var/lib/pgsql on port 5432 and 
 this 
 test postmaster running on port 55432.  The 8.0.7 postmaster on port 5432 
 shows the following for it's archive_command though:
 
 template1=# show archive_command;
   archive_command
 -
   unset
 (1 row)
 
 and the port 55432 postmaster shows:
 
 template1=# show archive_command;
 archive_command
 --
   /usr/local/pgsql-8.1.3/bin/archive_test.sh %p %f
 (1 row)
 
 Do you think the postmaster on 5432 is trying to archive the other 
 postmaster's WAL files somehow?

Some wierd interaction is occurring. The log showed two attempts to
archive the file, the second one each time failing, as you'd expect. But
two archivers should not be accessing the same data directory.

What were those two pids? Which postmasters are they children of?

Give me some more info and I can probably track down a bug...

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-23 Thread Jeff Frost

On Fri, 19 May 2006, Simon Riggs wrote:


On Thu, 2006-05-18 at 10:08 -0700, Jeff Frost wrote:


May 18 08:00:18 discord postgres[20228]: [129-1] LOG:  archived transaction log file 
0001007F
May 18 08:00:41 discord postgres[20573]: [254-1] LOG:  archived transaction log file 
0001007F
May 18 08:00:41 discord postgres[20573]: [255-1] WARNING:  could not rename file 
pg_xlog/archive_status/0001007F.ready to
May 18 08:00:41 discord postgres[20573]: [255-2]  
pg_xlog/archive_status/0001007F.done: No such file or 
directory


I'm not clear how you can have two processes both issuing this message,
since it is only the archiver process that ever says this.

Do you have two archivers running on this system?
pid = 20229 and pid = 20573

Presumably pg_xlog/archive_status/0001007F.done already
exists?

Are there two postmasters running (at all)?
Is there somehow an archiver process running from a previously shutdown
postmaster (somehow)?


Ah!!  Good catch Simon!  There are in fact two postmasters running on this 
system.  The normal postmaster running in /var/lib/pgsql on port 5432 and this 
test postmaster running on port 55432.  The 8.0.7 postmaster on port 5432 
shows the following for it's archive_command though:


template1=# show archive_command;
 archive_command
-
 unset
(1 row)

and the port 55432 postmaster shows:

template1=# show archive_command;
   archive_command
--
 /usr/local/pgsql-8.1.3/bin/archive_test.sh %p %f
(1 row)

Do you think the postmaster on 5432 is trying to archive the other 
postmaster's WAL files somehow?


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-23 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 Well now, will you look at this:

 postgres 20228 1  0 May17 ?00:00:00 postgres: archiver process
 postgres 20573 1  0 May17 ?00:00:00 postgres: archiver process
 postgres 23817 23810  0 May17 pts/11   00:00:00 postgres: archiver process

 23810 is the running postmaster:

 postgres 23810 1  0 May17 pts/11   00:03:01 
 /usr/local/pgsql-8.1.3/bin/postm

Well, there's our smoking gun.  IIRC, all the failures you showed us are
consistent with race conditions caused by multiple archiver processes
all trying to do the same tasks concurrently.

Do you frequently stop and restart the postmaster?  Because I don't see
how you could get into this state without having done so.

I've just been looking at the code, and the archiver does commit
hara-kiri when it notices its parent postmaster is dead; but it only
checks that in the outer loop.  Given sufficiently long delays in the
archive_command, that could be a long time after the postmaster died;
and in the meantime, successive executions of the archive_command could
be conflicting with those launched by a later archiver incarnation.

Seems we need an interlock to ensure there's not more than one archiver
active.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-22 Thread Jeff Frost

On Sun, 21 May 2006, Jeff Frost wrote:


So the chances of the original problem being archiver related are
receding...


This is possible, but I guess I should try and reproduce the actual problem 
with the same archive_command script and a CIFS mount just to see what 
happens.  Perhaps the real root of the problem is elsewhere, it just seems 
strange since the archive_command is the only postgres related process that 
accesses the CIFS share.  More later.


I tried both pulling the plug on the CIFS server and unsharing the CIFS share, 
but pgbench continued completely unconcerned.  I guess the failure mode of the 
NAS device in the customer colo must be something different that I don't yet 
know how to simulate.  I suspect I'll have to wait till it happens again and 
try to gather some more data before restarting the NAS device.  Thanks for all 
your suggestions guys!


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-22 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 I tried both pulling the plug on the CIFS server and unsharing the CIFS 
 share, 
 but pgbench continued completely unconcerned.  I guess the failure mode of 
 the 
 NAS device in the customer colo must be something different that I don't yet 
 know how to simulate.  I suspect I'll have to wait till it happens again and 
 try to gather some more data before restarting the NAS device.  Thanks for 
 all 
 your suggestions guys!

I'm still thinking that the simplest explanation is that $PGDATA/pg_clog/
is on the NAS device.  Please double-check the file locations.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-22 Thread Jeff Frost

On Tue, 23 May 2006, Tom Lane wrote:


I'm still thinking that the simplest explanation is that $PGDATA/pg_clog/
is on the NAS device.  Please double-check the file locations.


I know that seems like an excellent candidate, but it really isn't, I swear. 
In fact, you almost had me convinced the last time you asked me to check.. I 
thought some helpful admin had moved something, but no:


postgres  9194  0.0  0.4 486568 16464 ?  SMay16   0:11 
/usr/local/pgsql/bin/postmaster -p 5432 -D /usr/local/pgsql/data


db3:~/data $ pwd
/usr/local/pgsql/data

db3:~/data $ ls -l
total 64
-rw---  1 postgres postgres 4 Feb 13 20:13 PG_VERSION
drwx--  6 postgres postgres  4096 Feb 13 21:00 base
drwx--  2 postgres postgres  4096 May 22 21:03 global
drwx--  2 postgres postgres  4096 May 22 17:45 pg_clog
-rw---  1 postgres postgres  3575 Feb 13 20:13 pg_hba.conf
-rw---  1 postgres postgres  1460 Feb 13 20:13 pg_ident.conf
drwx--  4 postgres postgres  4096 Feb 13 20:13 pg_multixact
drwx--  2 postgres postgres  4096 May 22 20:45 pg_subtrans
drwx--  2 postgres postgres  4096 Feb 13 20:13 pg_tblspc
drwx--  2 postgres postgres  4096 Feb 13 20:13 pg_twophase
lrwxrwxrwx  1 postgres postgres 9 Feb 13 22:10 pg_xlog - /pg_xlog/
-rw---  1 postgres postgres 13688 May 16 17:50 postgresql.conf
-rw---  1 postgres postgres63 May 16 17:54 postmaster.opts
-rw---  1 postgres postgres47 May 16 17:54 postmaster.pid

db3:~/data $ mount
/dev/sda2 on / type ext3 (rw)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
usbfs on /proc/bus/usb type usbfs (rw)
/dev/sda1 on /boot type ext3 (rw)
none on /dev/shm type tmpfs (rw)
/dev/sdb1 on /usr/local/pgsql type ext3 (rw,data=writeback)
/dev/sda5 on /var type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
//10.1.1.28/pgbackup on /mnt/pgbackup type cifs (rw,mand,noexec,nosuid,nodev)

So, no..I wish it was that easy. :-/

If you have any other suggestions or inspirations, I'm all ears!

--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-21 Thread Jeff Frost

On Fri, 19 May 2006, Simon Riggs wrote:


Now I can run my same pg_bench, or do you guys
have any other suggestions on attempting to reproduce the problem?


No. We're back on track to try to reproduce the original error.


I've been futzing with trying to reproduce the original problem for a few days 
and so far postgres seems to be just fine with a long delay on archiving, so 
now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in 
pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps 
I'll try a long delay followed by a failure to see if that could be it.


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-21 Thread Simon Riggs
On Sun, 2006-05-21 at 14:16 -0700, Jeff Frost wrote:
 On Fri, 19 May 2006, Simon Riggs wrote:
 
  Now I can run my same pg_bench, or do you guys
  have any other suggestions on attempting to reproduce the problem?
 
  No. We're back on track to try to reproduce the original error.
 
 I've been futzing with trying to reproduce the original problem for a few 
 days 
 and so far postgres seems to be just fine with a long delay on archiving, so 
 now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in 
 pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps 
 I'll try a long delay followed by a failure to see if that could be it.

So the chances of the original problem being archiver related are
receding...

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-21 Thread Jeff Frost

On Sun, 21 May 2006, Simon Riggs wrote:


I've been futzing with trying to reproduce the original problem for a few days
and so far postgres seems to be just fine with a long delay on archiving, so
now I'm rather at a loss.  In fact, I currently have 1,234 xlog files in
pg_xlog, but the archiver is happily archiving one every 5 minutes.  Perhaps
I'll try a long delay followed by a failure to see if that could be it.


So the chances of the original problem being archiver related are
receding...


This is possible, but I guess I should try and reproduce the actual problem 
with the same archive_command script and a CIFS mount just to see what 
happens.  Perhaps the real root of the problem is elsewhere, it just seems 
strange since the archive_command is the only postgres related process that 
accesses the CIFS share.  More later.


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Simon Riggs
On Thu, 2006-05-18 at 10:08 -0700, Jeff Frost wrote:

 May 18 08:00:18 discord postgres[20228]: [129-1] LOG:  archived transaction 
 log file 0001007F
 May 18 08:00:41 discord postgres[20573]: [254-1] LOG:  archived transaction 
 log file 0001007F
 May 18 08:00:41 discord postgres[20573]: [255-1] WARNING:  could not rename 
 file pg_xlog/archive_status/0001007F.ready to
 May 18 08:00:41 discord postgres[20573]: [255-2]  
 pg_xlog/archive_status/0001007F.done: No such file or 
 directory

I'm not clear how you can have two processes both issuing this message,
since it is only the archiver process that ever says this.

Do you have two archivers running on this system?
pid = 20229 and pid = 20573

Presumably pg_xlog/archive_status/0001007F.done already
exists?

Are there two postmasters running (at all)?
Is there somehow an archiver process running from a previously shutdown
postmaster (somehow)?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Jeff Frost

On Fri, 19 May 2006, Tom Lane wrote:


Jeff Frost [EMAIL PROTECTED] writes:

Do you think the postmaster on 5432 is trying to archive the other
postmaster's WAL files somehow?


Not as long as they aren't in the same data directory ;-).  What Simon
was wondering about was whether an archiver process had somehow been
left over from a previous incarnation of the test postmaster.  The thing
to do is look through ps auxww (or local equivalent) and see if you
see more than one thing calling itself an archiver process.

(Whether or not this explains Jeff's problem, it definitely seems like
a failure mode that we need to guard against.  We go to great lengths
to prevent a new postmaster from starting when there are still live
backends from a previous postmaster, but I don't think that interlock
is effective for the archiver.)


Well now, will you look at this:

postgres 20228 1  0 May17 ?00:00:00 postgres: archiver process
postgres 20573 1  0 May17 ?00:00:00 postgres: archiver process
postgres 23817 23810  0 May17 pts/11   00:00:00 postgres: archiver process

23810 is the running postmaster:

postgres 23810 1  0 May17 pts/11   00:03:01 /usr/local/pgsql-8.1.3/bin/postm

do you think that got left around the last time I did a pg_ctl restart?

I guess I can stop my re-run of the test after stopping the /var/lib/pgsql 
postmaster.


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Jeff Frost

On Fri, 19 May 2006, Tom Lane wrote:


Well, there's our smoking gun.  IIRC, all the failures you showed us are
consistent with race conditions caused by multiple archiver processes
all trying to do the same tasks concurrently.

Do you frequently stop and restart the postmaster?  Because I don't see
how you could get into this state without having done so.

I've just been looking at the code, and the archiver does commit
hara-kiri when it notices its parent postmaster is dead; but it only
checks that in the outer loop.  Given sufficiently long delays in the
archive_command, that could be a long time after the postmaster died;
and in the meantime, successive executions of the archive_command could
be conflicting with those launched by a later archiver incarnation.


Hurray!  Unfortunately, the postmaster on the original troubled server almost 
never gets restarted, and in fact only has only one archiver process running 
right now.  Drat!


I guess I'll have to try and catch it in the act again the next time the NAS 
gets wedged so I can debug a little more (it was caught by one of the windows 
folks last time) and gather some useful data.


Let me know if you want me to test a patch since I've already got this test 
case setup.


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 Hurray!  Unfortunately, the postmaster on the original troubled server almost
 never gets restarted, and in fact only has only one archiver process running 
 right now.  Drat!

Well, the fact that there's only one archiver *now* doesn't mean there
wasn't more than one when the problem happened.  The orphaned archiver
would eventually quit.

Do you have logs that would let you check when the production postmaster
was restarted?

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Tom Lane
I wrote:
 Well, the fact that there's only one archiver *now* doesn't mean there
 wasn't more than one when the problem happened.  The orphaned archiver
 would eventually quit.

But, actually, nevermind: we have explained the failures you were seeing
in the test setup, but a multiple-active-archiver situation still
doesn't explain the original situation of incoming connections getting
blocked.

What I'd suggest is resuming the test after making sure you've killed
off any old archivers, and seeing if you can make any progress on
reproducing the original problem.  We definitely need a
multiple-archiver interlock, but I think that must be unrelated to your
real problem.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Simon Riggs
On Fri, 2006-05-19 at 12:20 -0400, Tom Lane wrote:
 I wrote:
  Well, the fact that there's only one archiver *now* doesn't mean there
  wasn't more than one when the problem happened.  The orphaned archiver
  would eventually quit.
 
 But, actually, nevermind: we have explained the failures you were seeing
 in the test setup, but a multiple-active-archiver situation still
 doesn't explain the original situation of incoming connections getting
 blocked.

Agreed.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Jeff Frost

On Fri, 19 May 2006, Tom Lane wrote:


Well, the fact that there's only one archiver *now* doesn't mean there
wasn't more than one when the problem happened.  The orphaned archiver
would eventually quit.

Do you have logs that would let you check when the production postmaster
was restarted?


I looked through /var/log/messages* and there wasn't a restart prior to the 
problem in the logs.  They go back to April 16.  The postmaster was restarted 
on May 15th (this Monday), but that was after the reported problem.



--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Jeff Frost

On Fri, 19 May 2006, Tom Lane wrote:


What I'd suggest is resuming the test after making sure you've killed
off any old archivers, and seeing if you can make any progress on
reproducing the original problem.  We definitely need a
multiple-archiver interlock, but I think that must be unrelated to your
real problem.


Ok, so I've got the old archivers gone (and btw, after a restart I ended up 
with 3 of them - so I stopped postmaster, and killed them all individually and 
started postmaster again).  Now I can run my same pg_bench, or do you guys 
have any other suggestions on attempting to reproduce the problem?


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-19 Thread Simon Riggs
On Fri, 2006-05-19 at 09:36 -0700, Jeff Frost wrote:
 On Fri, 19 May 2006, Tom Lane wrote:
 
  What I'd suggest is resuming the test after making sure you've killed
  off any old archivers, and seeing if you can make any progress on
  reproducing the original problem.  We definitely need a
  multiple-archiver interlock, but I think that must be unrelated to your
  real problem.
 
 Ok, so I've got the old archivers gone (and btw, after a restart I ended up 
 with 3 of them - so I stopped postmaster, and killed them all individually 
 and 
 started postmaster again).  

Thats good.

 Now I can run my same pg_bench, or do you guys 
 have any other suggestions on attempting to reproduce the problem?

No. We're back on track to try to reproduce the original error.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-18 Thread Jeff Frost

On Thu, 18 May 2006, Tom Lane wrote:


Jeff Frost [EMAIL PROTECTED] writes:

I seem to get alot of these:



May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename file
pg_xlog/archive_status/00010001.ready to
May 17 21:34:04 discord postgres[20573]: [5-2]
pg_xlog/archive_status/00010001.done: No such file or
directory


That seems odd ...


Further interesting items:

May 17 22:41:18 discord postgres[23817]: [3-1] LOG:  archive command 
/usr/local/pgsql-8.1.3/bin/archive_test.sh pg_xlog/0001000E 
0001000E

May 17 22:41:18 discord postgres[23817]: [3-2]  failed: return code 256

And in the window where I started postgres via pg_ctl, I had this:

cat: pg_xlog/0001000E: No such file or directory
cat: pg_xlog/0001000E: No such file or directory

Seems bad.





Currently I'm using pgbench to generate WAL rollover, do you guys have any
other handy testing tools for this sort of job or is this the best tool?


pgbench seems like an OK load for this, although it doesn't start/end
any new connections while running.  I *think* that that's not critical
--- my guess is that your observation of new connections hanging is just
because each new connection has to run one startup transaction, and
transactions in general are at risk of hanging --- but that could be
wrong.

Also, you could increase the rate of WAL generation by decreasing the
checkpoint segments/timeout parameters, if you need to.

regards, tom lane




--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-18 Thread Jeff Frost

On Wed, 17 May 2006, Jeff Frost wrote:


And in the window where I started postgres via pg_ctl, I had this:

cat: pg_xlog/0001000E: No such file or directory
cat: pg_xlog/0001000E: No such file or directory


Hrmmm...my pgbench died with an integer out of range error:

Client 56 aborted in state 8: ERROR:  integer out of range
client 81 receiving
client 81 sending UPDATE branches SET bbalance = bbalance + 4512 WHERE bid = 
1;

client 86 receiving
Client 86 aborted in state 8: ERROR:  integer out of range
client 85 receiving
Client 85 aborted in state 8: ERROR:  integer out of range
client 81 receiving
Client 81 aborted in state 8: ERROR:  integer out of range
client 78 receiving
client 78 sending UPDATE branches SET bbalance = bbalance + 2868 WHERE bid = 
1;

client 78 receiving
Client 78 aborted in state 8: ERROR:  integer out of range
Client 56 aborted in state 8: ERROR:  integer out of range
client 81 receiving
client 81 sending UPDATE branches SET bbalance = bbalance + 4512 WHERE bid = 
1;

client 86 receiving
Client 86 aborted in state 8: ERROR:  integer out of range
client 85 receiving
Client 85 aborted in state 8: ERROR:  integer out of range
client 81 receiving
Client 81 aborted in state 8: ERROR:  integer out of range
client 78 receiving
client 78 sending UPDATE branches SET bbalance = bbalance + 2868 WHERE bid = 
1;

client 78 receiving
Client 78 aborted in state 8: ERROR:  integer out of range

I'm guessing those bbalance + x updates ran bbalance up too high.

Anyway, I keep getting these on occassion:

May 17 23:01:22 discord postgres[23817]: [8-1] LOG:  archive command 
/usr/local/pgsql-8.1.3/bin/archive_test.sh pg_xlog/00010011 
00010011

May 17 23:01:22 discord postgres[23817]: [8-2]  failed: return code 256
May 17 23:01:22 discord postgres[23817]: [9-1] WARNING:  transaction log file 
00010011 could not be archived: too many failures
May 17 23:04:06 discord postgres[20573]: [40-1] LOG:  archived transaction log 
file 00010013
May 17 23:04:06 discord postgres[20573]: [41-1] WARNING:  could not rename 
file pg_xlog/archive_status/00010013.ready to
May 17 23:04:06 discord postgres[20573]: [41-2] 
pg_xlog/archive_status/00010013.done: No such file or 
directory
May 17 23:04:27 discord postgres[20228]: [22-1] LOG:  archived transaction log 
file 00010014


and in the other window:

cat: pg_xlog/00010011: No such file or directory
cat: pg_xlog/00010011: No such file or directory
cat: pg_xlog/00010011: No such file or directory

How on earth can this happen?







Currently I'm using pgbench to generate WAL rollover, do you guys have any
other handy testing tools for this sort of job or is this the best tool?


pgbench seems like an OK load for this, although it doesn't start/end
any new connections while running.  I *think* that that's not critical
--- my guess is that your observation of new connections hanging is just
because each new connection has to run one startup transaction, and
transactions in general are at risk of hanging --- but that could be
wrong.

Also, you could increase the rate of WAL generation by decreasing the
checkpoint segments/timeout parameters, if you need to.

regards, tom lane







--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-18 Thread Simon Riggs
On Wed, 2006-05-17 at 22:45 -0700, Jeff Frost wrote:
 On Thu, 18 May 2006, Tom Lane wrote:
 
  Jeff Frost [EMAIL PROTECTED] writes:
  I seem to get alot of these:
 
  May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename 
  file
  pg_xlog/archive_status/00010001.ready to
  May 17 21:34:04 discord postgres[20573]: [5-2]
  pg_xlog/archive_status/00010001.done: No such file or
  directory
 
  That seems odd ...
 
 Further interesting items:
 
 May 17 22:41:18 discord postgres[23817]: [3-1] LOG:  archive command 
 /usr/local/pgsql-8.1.3/bin/archive_test.sh 
 pg_xlog/0001000E 
 0001000E
 May 17 22:41:18 discord postgres[23817]: [3-2]  failed: return code 256
 And in the window where I started postgres via pg_ctl, I had this:
 
 cat: pg_xlog/0001000E: No such file or directory
 cat: pg_xlog/0001000E: No such file or directory
 
 Seems bad.

Seems so.

Can you post the full test, plus full execution log.

[You don't need to cat you could just do ls instead FWIW]

Are you doing *anything* with pg_xlog directory or below? I understand
your saying No to that question and pg_xlog has not been moved, its just
underneath data directory, which is on normal disk?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-18 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 Hrmmm...my pgbench died with an integer out of range error:

That's normal, if you run it long enough without re-creating the tables.
It keeps adding small values to the balances, and eventually they
overflow.  (Possibly someone should fix it so that the deltas are
uniformly distributed around zero, instead of being always positive.)

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-18 Thread Jeff Frost

On Thu, 18 May 2006, Simon Riggs wrote:


Seems so.

Can you post the full test, plus full execution log.

[You don't need to cat you could just do ls instead FWIW]

Are you doing *anything* with pg_xlog directory or below? I understand
your saying No to that question and pg_xlog has not been moved, its just
underneath data directory, which is on normal disk?


I did a fresh compile of 8.1.3 with --prefix=/usr/local/pgsql-8.1.3.  On this 
particular system, this is actually located on / which is /dev/md1 (a mirror).
I'll try and start a brand new version of the test so I can capture the 
logging to syslog for you guys as well as the stdout on the pg_ctl console.


More later.

Unfortunately, I'm not really sure this is related to the problems we saw 
before.  (i.e. the system doesn't get unusually slow or anything, nor do the 
connections seems to block)


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-17 Thread Simon Riggs
On Wed, 2006-05-17 at 00:36 -0400, Tom Lane wrote:
 Jeff Frost [EMAIL PROTECTED] writes:
  On Tue, 16 May 2006, Simon Riggs wrote:
  Whatever happened between 02:08 and 02:14 seems important.
 
  I have the logs and after reviewing /var/log/messages for that time period, 
  there is no other activity besides postgres.
 
 I have a lurking feeling that the still-hypothetical connection between
 archiver and foreground operations might come into operation at pg_clog
 page boundaries (which require emitting XLOG events) --- that is, every
 32K transactions something special happens.  The time delay between
 archiver wedging and foreground wedging would then correspond to how
 long it took the XID counter to reach the next 32K multiple.  (Jeff,
 what transaction rate do you see on that server --- is that a plausible
 delay for some thousands of transactions to pass?)
 
 This is just a guess, but if you check the archives for Chris K-L's
 out-of-disk-space server meltdown a year or three ago, you'll see
 something similar.

You'll have to explain a little more. I checked the archives...

archiver looks for archive_status files that end with .ready and that
has got nothing at all to do with transactions, LWlocks etc. If there's
a file ready, it will archive it, if there's not - it won't. There is
very deliberately a very low amount of synchronization there: archiver
holds no locks, LWLocks or spinlocks at any time.

The lurking feeling scenario above might or might nor be an issue
here, but I can't see how the archiver could be involved at all. I see
no evidence for the archiver to be the source of a problem here and that
the only reason we're checking that is as a result of Jeff's original
conjecture that there was a connection. There *was* a problem, yes, but
I think we're looking in the wrong place for the murder weapon.

pg_clog page extension does look like it can offer problems, generally.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-17 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 You'll have to explain a little more. I checked the archives...

I was thinking of
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
full explanation here:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php

 The lurking feeling scenario above might or might nor be an issue
 here, but I can't see how the archiver could be involved at all.

Well, I don't see it either; at this point we're waiting on Jeff to
provide some harder evidence ...

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-17 Thread Simon Riggs
On Wed, 2006-05-17 at 10:01 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  You'll have to explain a little more. I checked the archives...
 
 I was thinking of
 http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
 full explanation here:
 http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php

Sorry, I see my note looks like I couldn't find it. I meant I've read
the archives. Thanks though.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-17 Thread Jeff Frost

On Wed, 17 May 2006, Tom Lane wrote:


The lurking feeling scenario above might or might nor be an issue
here, but I can't see how the archiver could be involved at all.


Well, I don't see it either; at this point we're waiting on Jeff to
provide some harder evidence ...


Was the 3,000 transactions per minute helpful?  What other evidence should I 
be looking for?  Really the only evidence I have at this point is that when 
the NAS gets rebooted and comes back up, the postmaster resumes normal 
operations and the client count goes back down; however, that could just be 
coincidental, but we've seen it happen 3 times now.  This might have nothing 
to do with archiving, it just seems like the only likely candidate since there 
are only two things which happen on this volmume: PITR and rsyncing of 
pg_dumps (we dump locally first for speed reasons).  Maybe this is just a 
linux kernel thing?  Or maybe something else is blocking because it tries to 
stat all the filesystems and /mnt/pgbackup blocks?  I can't imagine postgres 
doing that though.


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-17 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 Was the 3,000 transactions per minute helpful?  What other evidence should I 
 be looking for?

Did you try generating a test case using a long sleep() as a replacement
for the archive_command script?  If there is a PG bug here it shouldn't
be that hard to expose it in a simple test case.  I'm up to my armpits
in other stuff and don't have time to try it myself...

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-17 Thread Jeff Frost

On Wed, 17 May 2006, Tom Lane wrote:


Did you try generating a test case using a long sleep() as a replacement
for the archive_command script?  If there is a PG bug here it shouldn't
be that hard to expose it in a simple test case.  I'm up to my armpits
in other stuff and don't have time to try it myself...


Interesting that you should say that as I had just started setting up a test. 
I seem to get alot of these:


May 17 21:34:04 discord postgres[20573]: [4-1] LOG:  archived transaction log 
file 00010001
May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename file 
pg_xlog/archive_status/00010001.ready to
May 17 21:34:04 discord postgres[20573]: [5-2] 
pg_xlog/archive_status/00010001.done: No such file or 
directory


I'm guessing that the archiver moves the WAL file to wal filename.ready, and 
successful completion, it renames it to wal filename.done?  My 
archive_command is the utmost in simplicity...looks like this:



#!/bin/sh

FULLPATH=$1
FILENAME=$2

sleep 300

cat $FULLPATH  /dev/null


So far I haven't been able to reproduce, but I'm just getting started. 
Currently I'm using pgbench to generate WAL rollover, do you guys have any 
other handy testing tools for this sort of job or is this the best tool?



--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-17 Thread Jeff Frost

On Wed, 17 May 2006, Jeff Frost wrote:


On Wed, 17 May 2006, Tom Lane wrote:


Did you try generating a test case using a long sleep() as a replacement
for the archive_command script?  If there is a PG bug here it shouldn't
be that hard to expose it in a simple test case.  I'm up to my armpits
in other stuff and don't have time to try it myself...


Interesting that you should say that as I had just started setting up a test. 
I seem to get alot of these:


May 17 21:34:04 discord postgres[20573]: [4-1] LOG:  archived transaction log 
file 00010001
May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename 
file pg_xlog/archive_status/00010001.ready to
May 17 21:34:04 discord postgres[20573]: [5-2] 
pg_xlog/archive_status/00010001.done: No such file or 
directory


I'm guessing that the archiver moves the WAL file to wal filename.ready, 
and successful completion, it renames it to wal filename.done?


Oh, except that they are 0 bytes, so they must be just status files. 
Nevermind on that question. :-)



--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-17 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 I seem to get alot of these:

 May 17 21:34:04 discord postgres[20573]: [5-1] WARNING:  could not rename 
 file 
 pg_xlog/archive_status/00010001.ready to
 May 17 21:34:04 discord postgres[20573]: [5-2] 
 pg_xlog/archive_status/00010001.done: No such file or 
 directory

That seems odd ...

 Currently I'm using pgbench to generate WAL rollover, do you guys have any 
 other handy testing tools for this sort of job or is this the best tool?

pgbench seems like an OK load for this, although it doesn't start/end
any new connections while running.  I *think* that that's not critical
--- my guess is that your observation of new connections hanging is just
because each new connection has to run one startup transaction, and
transactions in general are at risk of hanging --- but that could be
wrong.

Also, you could increase the rate of WAL generation by decreasing the
checkpoint segments/timeout parameters, if you need to.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-16 Thread Simon Riggs
On Mon, 2006-05-15 at 16:58 -0700, Jeff Frost wrote:

 The log is below.  Note that the problem began around 2a.m. around the time 
 the complaint about checkpoint segments happens.  After a bit of research it 
 appears that the checkpoint complaint happens when our db maintenance job 
 kicks off at 2:00a.m. The crisis ended around 3:45a.m. with the restart of 
 the 
 NAS device and normal operations resumed just before the end of this log. 
 There were also lots of CIFS errors of the following type in dmesg and in the 
 log:
 
   CIFS VFS: Send error in read = -9
   CIFS VFS: Send error in read = -9

I don't see much evidence for a connection between archiver and these
issues. The problems start after autovacuum of vb_web at 02:08. That
seems much more likely to have something to do with client connections
than the archiver - which is really nothing too special. However,
archiver doesn't start processing files again after 02:06. When is the
next archived transaction log file message?

Whatever happened between 02:08 and 02:14 seems important. Do you have
any other logs to cross-correlate for different events. What do the
clients think happened? What did they do during that time? 

 May 15 01:57:31 db3 postgres[12723]: [1-1] LOG:  autovacuum: processing 
 database vb_web
 May 15 01:58:03 db3 crond(pam_unix)[12755]: session closed for user postgres
 May 15 01:58:31 db3 postgres[12824]: [1-1] LOG:  autovacuum: processing 
 database template1
 May 15 01:59:01 db3 crond(pam_unix)[12858]: session closed for user postgres
 May 15 01:59:31 db3 postgres[12928]: [1-1] LOG:  autovacuum: processing 
 database postgres
 May 15 02:00:10 db3 crond(pam_unix)[12964]: session closed for user postgres
 May 15 02:00:11 db3 postgres[22008]: [1090-1] LOG:  checkpoints are occurring 
 too frequently (10 seconds apart)
 May 15 02:00:11 db3 postgres[22008]: [1090-2] HINT:  Consider increasing the 
 configuration parameter checkpoint_segments.
 May 15 02:00:14 db3 postgres[22009]: [4886-1] LOG:  archived transaction log 
 file 00010016007F
 May 15 02:00:23 db3 postgres[22009]: [4887-1] LOG:  archived transaction log 
 file 000100160080
 May 15 02:00:23 db3 postgres[22008]: [1091-1] LOG:  checkpoints are occurring 
 too frequently (9 seconds apart)
 May 15 02:00:23 db3 postgres[22008]: [1091-2] HINT:  Consider increasing the 
 configuration parameter checkpoint_segments.
 May 15 02:00:31 db3 postgres[22009]: [4888-1] LOG:  archived transaction log 
 file 000100160081
 May 15 02:00:36 db3 postgres[13747]: [1-1] LOG:  autovacuum: processing 
 database vb_web
 May 15 02:00:36 db3 postgres[22008]: [1092-1] LOG:  checkpoints are occurring 
 too frequently (13 seconds apart)
 May 15 02:00:36 db3 postgres[22008]: [1092-2] HINT:  Consider increasing the 
 configuration parameter checkpoint_segments.
 May 15 02:00:38 db3 postgres[22009]: [4889-1] LOG:  archived transaction log 
 file 000100160082
 May 15 02:00:45 db3 postgres[22009]: [4890-1] LOG:  archived transaction log 
 file 000100160083
 May 15 02:00:53 db3 postgres[22009]: [4891-1] LOG:  archived transaction log 
 file 000100160084
 May 15 02:00:54 db3 postgres[22008]: [1093-1] LOG:  checkpoints are occurring 
 too frequently (18 seconds apart)
 May 15 02:00:54 db3 postgres[22008]: [1093-2] HINT:  Consider increasing the 
 configuration parameter checkpoint_segments.
 May 15 02:01:00 db3 postgres[22009]: [4892-1] LOG:  archived transaction log 
 file 000100160085
 May 15 02:01:05 db3 crond(pam_unix)[13815]: session closed for user postgres
 May 15 02:01:07 db3 postgres[22009]: [4893-1] LOG:  archived transaction log 
 file 000100160086
 May 15 02:01:13 db3 postgres[22008]: [1094-1] LOG:  checkpoints are occurring 
 too frequently (19 seconds apart)
 May 15 02:01:13 db3 postgres[22008]: [1094-2] HINT:  Consider increasing the 
 configuration parameter checkpoint_segments.
 May 15 02:01:14 db3 postgres[22009]: [4894-1] LOG:  archived transaction log 
 file 000100160087
 May 15 02:01:23 db3 postgres[22009]: [4895-1] LOG:  archived transaction log 
 file 000100160088
 May 15 02:01:25 db3 postgres[22008]: [1095-1] LOG:  checkpoints are occurring 
 too frequently (12 seconds apart)
 May 15 02:01:25 db3 postgres[22008]: [1095-2] HINT:  Consider increasing the 
 configuration parameter checkpoint_segments.
 May 15 02:01:31 db3 postgres[22009]: [4896-1] LOG:  archived transaction log 
 file 000100160089
 May 15 02:01:35 db3 postgres[22008]: [1096-1] LOG:  checkpoints are occurring 
 too frequently (10 seconds apart)
 May 15 02:01:35 db3 postgres[22008]: [1096-2] HINT:  Consider increasing the 
 configuration parameter checkpoint_segments.
 May 15 02:01:39 db3 postgres[22009]: [4897-1] LOG:  archived transaction log 
 file 00010016008A
 May 15 02:01:45 db3 postgres[22008]: [1097-1] LOG:  checkpoints are occurring 
 too frequently (10 seconds apart)
 

Re: [ADMIN] does wal archiving block the current client connection?

2006-05-16 Thread Jeff Frost

On Tue, 16 May 2006, Simon Riggs wrote:


I don't see much evidence for a connection between archiver and these
issues. The problems start after autovacuum of vb_web at 02:08. That
seems much more likely to have something to do with client connections
than the archiver - which is really nothing too special. However,
archiver doesn't start processing files again after 02:06. When is the
next archived transaction log file message?

Whatever happened between 02:08 and 02:14 seems important. Do you have
any other logs to cross-correlate for different events. What do the
clients think happened? What did they do during that time?


I have the logs and after reviewing /var/log/messages for that time period, 
there is no other activity besides postgres.  The next archived transaction 
log file messages are:


May 14 02:24:54 db3 postgres[22009]: [4876-1] LOG:  archived transaction log 
fil

e 000100160078
May 14 02:34:05 db3 postgres[22009]: [4877-1] LOG:  archived transaction log 
fil

e 000100160079
May 14 02:35:49 db3 postgres[22009]: [4878-1] LOG:  archived transaction log 
fil

e 00010016007A
May 14 02:41:58 db3 postgres[22009]: [4879-1] LOG:  archived transaction log 
fil

e 00010016007B
May 14 02:57:35 db3 postgres[22009]: [4880-1] LOG:  archived transaction log 
fil

e 00010016007C

The clients are all on two windows web servers, anybody know where to look for 
errors with that?






May 15 01:57:31 db3 postgres[12723]: [1-1] LOG:  autovacuum: processing database 
vb_web
May 15 01:58:03 db3 crond(pam_unix)[12755]: session closed for user postgres
May 15 01:58:31 db3 postgres[12824]: [1-1] LOG:  autovacuum: processing database 
template1
May 15 01:59:01 db3 crond(pam_unix)[12858]: session closed for user postgres
May 15 01:59:31 db3 postgres[12928]: [1-1] LOG:  autovacuum: processing database 
postgres
May 15 02:00:10 db3 crond(pam_unix)[12964]: session closed for user postgres
May 15 02:00:11 db3 postgres[22008]: [1090-1] LOG:  checkpoints are occurring 
too frequently (10 seconds apart)
May 15 02:00:11 db3 postgres[22008]: [1090-2] HINT:  Consider increasing the 
configuration parameter checkpoint_segments.
May 15 02:00:14 db3 postgres[22009]: [4886-1] LOG:  archived transaction log file 
00010016007F
May 15 02:00:23 db3 postgres[22009]: [4887-1] LOG:  archived transaction log file 
000100160080
May 15 02:00:23 db3 postgres[22008]: [1091-1] LOG:  checkpoints are occurring 
too frequently (9 seconds apart)
May 15 02:00:23 db3 postgres[22008]: [1091-2] HINT:  Consider increasing the 
configuration parameter checkpoint_segments.
May 15 02:00:31 db3 postgres[22009]: [4888-1] LOG:  archived transaction log file 
000100160081
May 15 02:00:36 db3 postgres[13747]: [1-1] LOG:  autovacuum: processing database 
vb_web
May 15 02:00:36 db3 postgres[22008]: [1092-1] LOG:  checkpoints are occurring 
too frequently (13 seconds apart)
May 15 02:00:36 db3 postgres[22008]: [1092-2] HINT:  Consider increasing the 
configuration parameter checkpoint_segments.
May 15 02:00:38 db3 postgres[22009]: [4889-1] LOG:  archived transaction log file 
000100160082
May 15 02:00:45 db3 postgres[22009]: [4890-1] LOG:  archived transaction log file 
000100160083
May 15 02:00:53 db3 postgres[22009]: [4891-1] LOG:  archived transaction log file 
000100160084
May 15 02:00:54 db3 postgres[22008]: [1093-1] LOG:  checkpoints are occurring 
too frequently (18 seconds apart)
May 15 02:00:54 db3 postgres[22008]: [1093-2] HINT:  Consider increasing the 
configuration parameter checkpoint_segments.
May 15 02:01:00 db3 postgres[22009]: [4892-1] LOG:  archived transaction log file 
000100160085
May 15 02:01:05 db3 crond(pam_unix)[13815]: session closed for user postgres
May 15 02:01:07 db3 postgres[22009]: [4893-1] LOG:  archived transaction log file 
000100160086
May 15 02:01:13 db3 postgres[22008]: [1094-1] LOG:  checkpoints are occurring 
too frequently (19 seconds apart)
May 15 02:01:13 db3 postgres[22008]: [1094-2] HINT:  Consider increasing the 
configuration parameter checkpoint_segments.
May 15 02:01:14 db3 postgres[22009]: [4894-1] LOG:  archived transaction log file 
000100160087
May 15 02:01:23 db3 postgres[22009]: [4895-1] LOG:  archived transaction log file 
000100160088
May 15 02:01:25 db3 postgres[22008]: [1095-1] LOG:  checkpoints are occurring 
too frequently (12 seconds apart)
May 15 02:01:25 db3 postgres[22008]: [1095-2] HINT:  Consider increasing the 
configuration parameter checkpoint_segments.
May 15 02:01:31 db3 postgres[22009]: [4896-1] LOG:  archived transaction log file 
000100160089
May 15 02:01:35 db3 postgres[22008]: [1096-1] LOG:  checkpoints are occurring 
too frequently (10 seconds apart)
May 15 02:01:35 db3 postgres[22008]: [1096-2] HINT:  Consider increasing the 
configuration parameter checkpoint_segments.
May 15 02:01:39 db3 

Re: [ADMIN] does wal archiving block the current client connection?

2006-05-16 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 On Tue, 16 May 2006, Simon Riggs wrote:
 Whatever happened between 02:08 and 02:14 seems important.

 I have the logs and after reviewing /var/log/messages for that time period, 
 there is no other activity besides postgres.

I have a lurking feeling that the still-hypothetical connection between
archiver and foreground operations might come into operation at pg_clog
page boundaries (which require emitting XLOG events) --- that is, every
32K transactions something special happens.  The time delay between
archiver wedging and foreground wedging would then correspond to how
long it took the XID counter to reach the next 32K multiple.  (Jeff,
what transaction rate do you see on that server --- is that a plausible
delay for some thousands of transactions to pass?)

This is just a guess, but if you check the archives for Chris K-L's
out-of-disk-space server meltdown a year or three ago, you'll see
something similar.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-16 Thread Jeff Frost

On Wed, 17 May 2006, Tom Lane wrote:



I have a lurking feeling that the still-hypothetical connection between
archiver and foreground operations might come into operation at pg_clog
page boundaries (which require emitting XLOG events) --- that is, every
32K transactions something special happens.  The time delay between
archiver wedging and foreground wedging would then correspond to how
long it took the XID counter to reach the next 32K multiple.  (Jeff,
what transaction rate do you see on that server --- is that a plausible
delay for some thousands of transactions to pass?)

This is just a guess, but if you check the archives for Chris K-L's
out-of-disk-space server meltdown a year or three ago, you'll see
something similar.


This sounds interesting and I'll go have a look for this thread in the 
archives.


It looks like it's running about 3,000 transactions per minute right now 
(around 10pm).  I got that by doing the following...tell me if this is 
reasonable:


db3:~ $ echo select xact_commit from pg_stat_database where datname = 
\'vb_web\'\; | psql vb_web ; sleep 60 ; echo select xact_commit from 
pg_stat_database where datname = \'vb_web\'\; | psql vb_web

 xact_commit
-
  806325
(1 row)

 xact_commit
-
  809414
(1 row)

So, doing a little subtraction, I got 3,089 transactions for the 60 second 
period I ran it.  I would guess it's quite a bit more during the day.  That 
means it would hit 32k transactions around 10-11 minutes...does that fit in 
with our timing from above?  (Tom, I'm not sure which timing you're 
using.)



--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Tom Arthurs
What might be more bullet proof would be to make the archive command 
copy the file to an intermediate local directory, then have a 
daemon/cron job that wakes up once a minute or so, check for new files, 
then copy them to the network mount.  You may want to use something like 
lofs to make sure the archive command has finished and closed the file 
before moving it to the network drive.


This is what I do, and I've never had a failure of the archive command. 
-- Had a few network errors on the network drive (I use nfs) which I 
fixed at my leisure, with no problems for the postgresql server.


Jeff Frost wrote:
I've run into a problem with a PITR setup at a client.  The problem is 
that whenever the CIFS NAS device that we're mounting at /mnt/pgbackup 
has problems, it seems that the current client connection gets blocked 
and this eventually builds up to a sorry, too many clients already 
error.  I'm wondering if this is expected behavior with the archive 
command and if I should build in some more smarts to my archive 
script.  Maybe I should fork and waitpid such that I can use a manual 
timeout shorter than whatever the CIFS timeout is so that I can return 
an error in a reasonable amount of time?


Has anyone else seen this problem?  Restarting the NAS device fixes 
the problem but it would be much preferable if postges could soldier 
along without the NAS for a little while before we resuscitate it.  We 
don't have an NFS or rsync server available in this environment 
currently, though I suppose setting up an rsync server for windows on 
the NAS would be a possibility.


Any suggestions much appreciated.

Currently the script is fairly simple and just does a 'cp' and then a 
'gzip' although we do use cp -f to copy over a possible previosly 
failed 'cp'. Script is below:


. /usr/local/lib/includes.sh

FULLPATH=$1
FILENAME=$2

#
# Make sure we have pgbackup dir mounted
#
checkpgbackupmount

/bin/cp -f $FULLPATH $PITRDESTDIR/$FILENAME
if [ $? -ne 0 ]; then
die Could not cp $FULLPATH to $PITRDESTDIR/$FILENAME
fi

/usr/bin/gzip -f $PITRDESTDIR/$FILENAME
#
# Make sure it worked, otherwise roll back
#
if [ $? -ne 0 ]; then
/bin/rm -f $PITRDESTDIR/$FILENAME
die Could not /usr/bin/gzip $PITRDESTDIR/$FILENAME
fi

exit 0




---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 I've run into a problem with a PITR setup at a client.  The problem is that 
 whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has 
 problems, it seems that the current client connection gets blocked and this 
 eventually builds up to a sorry, too many clients already error.  I'm 
 wondering if this is expected behavior with the archive command

No, I can't see what the connection should be there.  It's supposed to
be designed so that the archive command can take its sweet old time and
nothing happens except that a backlog of WAL files builds up in pg_xlog.

What PG version is this exactly?  Are you sure that the only connection
of the NAS device to PG operations is through the archive script, ie,
you don't have any part of the normal PG data directory mounted on it?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Jeff Frost

On Mon, 15 May 2006, Tom Lane wrote:


No, I can't see what the connection should be there.  It's supposed to
be designed so that the archive command can take its sweet old time and
nothing happens except that a backlog of WAL files builds up in pg_xlog.


That's what I thought, but that doesn't seem to be what I'm observing.  Of 
course the NAS device only gets wedged about once every month or two, so it's 
difficult to reproduce.  Probably would need to build another system similar 
and hit it with pg_bench or something similar, then pull the plug on the NAS 
device to reproduce it and see if it acts similarly.




What PG version is this exactly?  Are you sure that the only connection
of the NAS device to PG operations is through the archive script, ie,
you don't have any part of the normal PG data directory mounted on it?


It's postgresql-8.1.3 compiled from source.

PGDATA is in /usr/local/pgsql/data which is a local RAID10 array, and /pg_xlog 
is mounted on a local RAID1.


Things that touch the /mnt/pgbackup directory are the pitr base backup scripts 
which run twice monthly and the archive_command script.




--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Tom Lane
Jeff Frost [EMAIL PROTECTED] writes:
 On Mon, 15 May 2006, Tom Lane wrote:
 No, I can't see what the connection should be there.  It's supposed to
 be designed so that the archive command can take its sweet old time and
 nothing happens except that a backlog of WAL files builds up in pg_xlog.

 That's what I thought, but that doesn't seem to be what I'm observing.  Of 
 course the NAS device only gets wedged about once every month or two, so it's
 difficult to reproduce.

If it's really a PG bug, it should be trivial to reproduce: put a long
sleep in the archive-command script.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Jeff Frost

On Mon, 15 May 2006, Tom Lane wrote:


That's what I thought, but that doesn't seem to be what I'm observing.  Of
course the NAS device only gets wedged about once every month or two, so it's
difficult to reproduce.


If it's really a PG bug, it should be trivial to reproduce: put a long
sleep in the archive-command script.


Thanks Tom.  I'll try and reproduce on one of the other PG servers I have 
available when I have a chance and get back to you if it looks like a possible 
bug.



--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Simon Riggs
On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
 I've run into a problem with a PITR setup at a client.  The problem is that 
 whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has 
 problems

What kind of problems?

 , it seems that the current client connection gets blocked and this 
 eventually builds up to a sorry, too many clients already error.  

This sounds like the archiver keeps waking up and trying the command,
but it fails, yet that request is causing a resource leak on the NAS.
Eventually, archiver retrying the command eventually fails. Or am I
misunderstanding your issues?

 I'm 
 wondering if this is expected behavior with the archive command and if I 
 should build in some more smarts to my archive script.  Maybe I should fork 
 and waitpid such that I can use a manual timeout shorter than whatever the 
 CIFS timeout is so that I can return an error in a reasonable amount of time?

The archiver is designed around the thought that *attempting* to archive
is a task that it can do indefinitely without a problem; its up to you
to spot that the link is down.

We can put something in to make the retry period elongate, but you'd
need to put a reasonable case for how that would increase robustness. 

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Jeff Frost

On Mon, 15 May 2006, Simon Riggs wrote:


On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:

I've run into a problem with a PITR setup at a client.  The problem is that
whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
problems


What kind of problems?


It becomes unwritable for whatever reason CIFS shares become unwritable.  It's 
a windows 2003 NAS device and a reboot solves the problem, but it leaves no 
event logs on the windows side of things, so difficult to determine the root 
cause.





, it seems that the current client connection gets blocked and this
eventually builds up to a sorry, too many clients already error.


This sounds like the archiver keeps waking up and trying the command,
but it fails, yet that request is causing a resource leak on the NAS.
Eventually, archiver retrying the command eventually fails. Or am I
misunderstanding your issues?


that's possible.  Does the archiver use a DB connection whenever it tries to 
run archive_command?  If so, then that's almost certainly the problem.  I 
suspect a faster timeout on the CIFS mount would fix the issue as well, but I 
didn't see any such options in the mount.cifs manpage.



The archiver is designed around the thought that *attempting* to archive
is a task that it can do indefinitely without a problem; its up to you
to spot that the link is down.

We can put something in to make the retry period elongate, but you'd
need to put a reasonable case for how that would increase robustness.


That all sounds perfectly reasonable.  If the archiver is using up a 
connection for each archive_command issued, then I suspect that's our problem, 
as there were also lots of debug logs showing that the db was trying to 
archive several WAL files at near the same time, likely pushing us over our 
100 connection limit.  If the archiver does not use up a connection, then I 
suppose I don't know what's actually going on unless postgres blocks the 
commit of the transaction which triggered the archive_command until the 
archive command finishes (or fails).


--
Jeff Frost, Owner   [EMAIL PROTECTED]
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908 FAX: 650-649-1954

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Scott Marlowe
On Mon, 2006-05-15 at 16:29, Jeff Frost wrote:
 On Mon, 15 May 2006, Simon Riggs wrote:
 
  On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
  I've run into a problem with a PITR setup at a client.  The problem is that
  whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
  problems
 
  What kind of problems?
 
 It becomes unwritable for whatever reason CIFS shares become unwritable.  
 It's 
 a windows 2003 NAS device and a reboot solves the problem, but it leaves no 
 event logs on the windows side of things, so difficult to determine the root 
 cause.

I don't think you need logs to figure out that the problem is that it's
a Windows based NAS device.  :)

Seriously, we had BSD based NAS devices one place I worked, and the
company making them switched to Win2k3 and after we had one or two do
exactly what you're talking about, we just started building our own
based on RedHat or BSD.  Same hardware (we just formatted the drives on
one of the ones that came with w2k3) and we never had another moment's
problem with them.

In other news... Does the archiver actually connect to the database?  

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [ADMIN] does wal archiving block the current client connection?

2006-05-15 Thread Simon Riggs
On Mon, 2006-05-15 at 14:29 -0700, Jeff Frost wrote:
 On Mon, 15 May 2006, Simon Riggs wrote:
 
  On Mon, 2006-05-15 at 09:28 -0700, Jeff Frost wrote:
  I've run into a problem with a PITR setup at a client.  The problem is that
  whenever the CIFS NAS device that we're mounting at /mnt/pgbackup has
  problems
 
  What kind of problems?
 
 It becomes unwritable for whatever reason CIFS shares become unwritable.  
 It's 
 a windows 2003 NAS device and a reboot solves the problem, but it leaves no 
 event logs on the windows side of things, so difficult to determine the root 
 cause.

You should be able to re-create this problem without the database being
involved. Just set up a driver program over the top of the archive
script so it flies in a tighter loop than the archiver would make it. If
you still get the Windows NAS error... well, I'll leave that to you.

  , it seems that the current client connection gets blocked and this
  eventually builds up to a sorry, too many clients already error.

Tell us more about what the blockage looks like. We may yet thank
Windows for finding a bug, but I'm not sure yet.

  This sounds like the archiver keeps waking up and trying the command,
  but it fails, yet that request is causing a resource leak on the NAS.
  Eventually, archiver retrying the command eventually fails. Or am I
  misunderstanding your issues?
 
 that's possible.  Does the archiver use a DB connection whenever it tries to 
 run archive_command? 

Not at all.

  If so, then that's almost certainly the problem.  I 
 suspect a faster timeout on the CIFS mount would fix the issue as well, but I 
 didn't see any such options in the mount.cifs manpage.
 
  The archiver is designed around the thought that *attempting* to archive
  is a task that it can do indefinitely without a problem; its up to you
  to spot that the link is down.
 
  We can put something in to make the retry period elongate, but you'd
  need to put a reasonable case for how that would increase robustness.
 
 That all sounds perfectly reasonable.  If the archiver is using up a 
 connection for each archive_command issued, then I suspect that's our 
 problem, 
 as there were also lots of debug logs showing that the db was trying to 
 archive several WAL files at near the same time, likely pushing us over our 
 100 connection limit. 

Oh, you mean database clients cannot connect. I thought you meant you
were getting a CIFS client connection error from the archiver. That's
wierd.

  If the archiver does not use up a connection, then I 
 suppose I don't know what's actually going on unless postgres blocks the 
 commit of the transaction which triggered the archive_command until the 
 archive command finishes (or fails).

I think you need to show the database log covering the period in error.

Are you running out of disk space in the database directory? Can you
check again that pg_xlog and pg_xlog/archive_status is definitely not on
the NAS?

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly