Re: [Pgpool-general] Second stage online recovery with PITR problems on pgpool 3.0.3 / postgresql 9.0.4

2011-09-14 Thread Nikola Ivačič
Yep ... I found the same post on the web and upgraded to PostgreSQL
9.1 (and to pgpool-II 3.1 while I was at it).
Everything works now.

Thanks

On Wed, Sep 14, 2011 at 06:33, Toshihiro Kitagawa kitag...@sraoss.co.jp wrote:
 Hi,

 [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  invalid record length at 
 1/2120

 Was your PostgreSQL 9.0.4 built with gcc 4.6.0?

 gcc 4.6.0 has a bug which causes this error.
 See the following thread for more details:
 http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php

 --
 Toshihiro Kitagawa
 SRA OSS, Inc. Japan

 On Tue, 13 Sep 2011 13:12:32 +0200
 Nikola Ivačič nikola.iva...@gmail.com wrote:

 I have a problem with the 2nd-stage online PITR recovery procedure.
 Data received in the second stage, after the base backup and prior to
 the WAL switch, gets lost.

 I've managed to isolate the problem down to postgresql without the
 pgpool-II running:
 - stop failed node
 //1st stage
 - start backup
 - rsync files to failed node
 - stop backup
 - do intentional insert in master node
 //2nd stage
 - do pg_switch_xlog (also tested pgpool_xlog_switch, with the same results)
 - rsync archive WAL files to failed node
 - start failed node
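The steps above can be sketched as a short shell script. This is only an illustrative reconstruction of the reported procedure: the host names, data/archive paths, and the recovery_probe table are hypothetical placeholders, not part of the original report.

```shell
# Sketch of the manual recovery test described above. All hosts, paths and
# the probe table are hypothetical placeholders; adjust for your setup.

recover_failed_node() {
    failed_host=$1; pgdata=$2; archive_dir=$3
    if [ -z "$failed_host" ] || [ -z "$pgdata" ] || [ -z "$archive_dir" ]; then
        echo "usage: recover_failed_node <host> <pgdata> <archive-dir>" >&2
        return 2
    fi

    ssh "$failed_host" pg_ctl -D "$pgdata" stop -m fast      # stop failed node
    # --- 1st stage ---
    psql -c "SELECT pg_start_backup('online_recovery')"      # start backup
    rsync -a --delete --exclude=pg_xlog \
        "$pgdata/" "$failed_host:$pgdata/"                   # rsync files to failed node
    psql -c "SELECT pg_stop_backup()"                        # stop backup
    psql -c "INSERT INTO recovery_probe VALUES (now())"      # intentional insert (hypothetical table)
    # --- 2nd stage ---
    psql -c "SELECT pg_switch_xlog()"                        # force a WAL segment switch
    rsync -a "$archive_dir/" "$failed_host:$archive_dir/"    # rsync archived WAL to failed node
    ssh "$failed_host" pg_ctl -D "$pgdata" start             # start failed node
}
```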

 The failed node starts fine and performs recovery, but for the last WAL
 file it always reports an invalid record length error and falls back
 to the last known good WAL file (the one created in the backup step).

 Log from the failed node during restore (increasing verbosity reveals
 no more information):
 [2011-09-13 12:42:58 CEST]-[]-[31877|] LOG:  database system was
 interrupted; last known up at 2011-09-13 12:40:46 CEST
 [2011-09-13 12:42:58 CEST]-[]-[31877|] LOG:  creating missing WAL
 directory pg_xlog/archive_status
 [2011-09-13 12:42:58 CEST]-[]-[31877|] LOG:  starting archive recovery
 [2011-09-13 12:42:58 CEST]-[postgres]-[31882|] FATAL:  the database
 system is starting up
 [2011-09-13 12:42:59 CEST]-[]-[31877|] LOG:  restored log file
 000200010020 from archive
 [2011-09-13 12:42:59 CEST]-[]-[31877|] LOG:  redo starts at 1/2078
 [2011-09-13 12:42:59 CEST]-[]-[31877|] LOG:  consistent recovery state
 reached at 1/2100
 [2011-09-13 12:42:59 CEST]-[postgres]-[31886|] FATAL:  the database
 system is starting up
 [2011-09-13 12:43:00 CEST]-[postgres]-[31887|] FATAL:  the database
 system is starting up
 [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  restored log file
 000200010021 from archive
 [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  invalid record length at 
 1/2120
 [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  redo done at 1/20A0
 [2011-09-13 12:43:01 CEST]-[postgres]-[31890|] FATAL:  the database
 system is starting up
 [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  restored log file
 000200010020 from archive
 [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  selected new timeline ID: 3
 [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  archive recovery complete
 [2011-09-13 12:43:01 CEST]-[]-[31883|] LOG:  checkpoint starting:
 end-of-recovery immediate wait
 [2011-09-13 12:43:02 CEST]-[]-[31883|] LOG:  checkpoint complete:
 wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0
 recycled; write=0.000 s, sync=0.000 s, total=0.659 s
 [2011-09-13 12:43:02 CEST]-[]-[31876|] LOG:  database system is ready
 to accept connections
 [2011-09-13 12:43:02 CEST]-[]-[31896|] LOG:  autovacuum launcher started

 I've compared md5sums of the 000200010021 WAL file in the
 archive dir on the master and target nodes, and the file is identical on both.

 So my questions are:
 Did I miss something, or did I get the procedure wrong?
 Is the online recovery with PITR procedure still valid as presented
 in the manual?
 Can I replace the pg_switch_xlog call with another pg_start_backup /
 pg_stop_backup pair, and what are the performance implications in that
 case?

 Software versions:
 I'm using PostgreSQL 9.0.4 on both nodes, with the same OS.
 Restore master:
 Linux miho 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 21:55:57 CEST 2011
 x86_64 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz GenuineIntel GNU/Linux
 Restore target:
 Linux alice 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 21:55:57 CEST 2011
 x86_64 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz GenuineIntel GNU/Linux

 Thanks for help.
 Nikola
 ___
 Pgpool-general mailing list
 Pgpool-general@pgfoundry.org
 http://pgfoundry.org/mailman/listinfo/pgpool-general





Re: [Pgpool-general] confirm 2b4736d3dbf2f7ccea62d713d3d64985a93c4c1a

2011-09-14 Thread Imre Facchin
I am looking for failover and load balancing for PostgreSQL.

My choice is very likely to be pgpool, but I have concerns about it
being a SPOF, so I found pgpool-HA. However, there is no description
anywhere of what it actually does. I would like to keep all my VMs the
same (not have a dedicated DB load balancer), so there would be a pgpool
server on every database server, knowing about all the other databases.
My goal is to be able to talk to any of the pgpool instances and get the
same result. The question is: will pgpool-HA keep the information about
which servers are available/disconnected synchronised across all pgpool
instances, or is it just a hot-standby solution where a new pgpool
server takes the place of the old one if it fails?

tl;dr: is an active:active configuration for pgpool instances possible
with pgpool-HA?


Re: [Pgpool-general] seemingly hung pgpool process consuming 100% CPU

2011-09-14 Thread Lonni J Friedman
This problem has returned yet again:
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
29191 postgres  20   0 80192  14m 1544 R 89.8  0.2  51:15.91 pgpool

postgres 29191  3.4  0.1  80192 14728 ?   R    Sep13  51:40
pgpool: lfriedman nightly 10.31.96.84(61698) idle


I'd really appreciate some input on how to debug this.


On Fri, Sep 9, 2011 at 8:11 AM, Lonni J Friedman netll...@gmail.com wrote:
 No one else has experienced this or has suggestions how to debug it?

 On Wed, Sep 7, 2011 at 12:49 PM, Lonni J Friedman netll...@gmail.com wrote:
 Greetings,
 I'm running pgpool-3.0.4 on a Linux x86_64 server acting as a load
 balancer for a three-server PostgreSQL 9.0.4 cluster (1 master, 2
 standbys).  I'm seeing strange behavior where a single pgpool process
 seems to hang after some period of time and then consumes 100% of the
 CPU.  I've seen this happen twice since last Friday (when pgpool was
 brought online in my production environment).  At the moment the
 current hung process looks like this in 'ps auxww' output:

 postgres 19838 98.7  0.0  68856  2904 ?        R    Sep06 1027:36
 pgpool: lfriedman nightly 10.31.45.20(58277) idle


 In top, I see:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 19838 postgres  20   0 68856 2904 1072 R 100.0  0.0   1027:29 pgpool


 When I connect to the process with strace there is no output, so I'm
 guessing the process is stuck spinning somewhere:
 # strace -p 19838
 Process 19838 attached - interrupt to quit
 ...
 ^CProcess 19838 detached

 One thing that I'm certain of is that the client IP (10.31.45.20)
 associated with the hung process has rebooted at least once since the
 process was spawned.  So pgpool seems to be in some confused state, as
 the client has definitely severed the connection already.  I checked the
 pgpool log and there are no explicit references to PID 19838.  I'm at
 a loss as to how to debug this further, but clearly something is wrong
 somewhere, and this isn't normal/expected behavior.


Re: [Pgpool-general] seemingly hung pgpool process consuming 100% CPU

2011-09-14 Thread Lonni J Friedman
Thanks for your reply.  I'll do this the next time this happens (which
will likely be within a few days based on history).

On Wed, Sep 14, 2011 at 3:57 PM, Tatsuo Ishii is...@sraoss.co.jp wrote:
 Please use gdb. For example,

 become postgres user (or root user)
 gdb pgpool 29191
 bt
 cont
 bt
 cont
 (repeat bt / cont as needed)

 This will give us an idea where it's looping.
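For unattended capture, the interactive session above can be approximated with gdb's batch mode. This is a sketch, not part of the original advice; the PID comes from the ps output earlier in the thread:

```shell
# Collect several backtraces from a spinning pgpool process without an
# interactive gdb session, to see where it is looping.
backtrace_loop() {
    pid=$1; count=${2:-5}
    if [ -z "$pid" ]; then
        echo "usage: backtrace_loop <pid> [count]" >&2
        return 2
    fi
    i=0
    while [ "$i" -lt "$count" ]; do
        gdb -p "$pid" -batch -ex bt 2>/dev/null   # one-shot attach, print stack, detach
        sleep 1
        i=$((i + 1))
    done
}

# Example (as postgres or root): backtrace_loop 29191 3 > pgpool-backtraces.txt
```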
 --
 Tatsuo Ishii
 SRA OSS, Inc. Japan
 English: http://www.sraoss.co.jp/index_en.php
 Japanese: http://www.sraoss.co.jp



Re: [Pgpool-general] unexpected EOF on client connection

2011-09-14 Thread Tatsuo Ishii
 I'm pretty sure that's not the case, as the messages stop whenever
 pgpool isn't running, they were not present prior to using pgpool, and
 pg_hba.conf is set up such that the database servers only accept
 connections from each other and from the server running pgpool.  None of
 these servers have normal users connected directly to them (such as
 with ssh), nor are they running anything that would connect to the
 database as a client.  Also, the volume of these messages is such
 that something significant has to be causing them.  Last night, in the
 span of 5 minutes, there were 117 of these messages.

 Ok. I would like to narrow down the reason why we see the unexpected
 EOF on client connection message so frequently. I think there are
 currently two possibilities:

 1) a pgpool child was killed for some unknown reason (we can omit the
   segfault case because you don't see it in the pgpool log)

 2) a pgpool child disconnects from PostgreSQL in an ungraceful manner

 For 1) I would like to know whether the pgpool child processes have
 been fine since they were spawned. Have you seen any pgpool child
 process disappear since pgpool started?
 
 I assume this should be determined by num_init_children (which I've
 set to 195 in pgpool.conf)?  If so, then I currently have 195
 processes in either the wait for connection request state or the
 actively connected state.

No. The pgpool parent process automatically respawns a child process if
it dies, so simply having num_init_children child processes doesn't show
anything useful. Record the 195 process IDs and later compare them with
the current process IDs. If some of them have changed, we can assume
that child processes are dying.
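One way to script this comparison is sketched below. The pgrep pattern and snapshot file names are assumptions; adjust them to your setup:

```shell
# Record pgpool child PIDs now, then compare against a later snapshot to
# detect children that died and were respawned with new PIDs.
snapshot_pids() {
    # -x matches the process name exactly; adjust if your binary is named differently
    pgrep -x pgpool | sort > "$1"
}

pids_changed() {
    # succeeds (exit 0) if the two snapshots differ, i.e. children were respawned
    ! diff -u "$1" "$2" > /dev/null
}

# Usage:
#   snapshot_pids /tmp/pgpool.before
#   ... wait for another "unexpected EOF" message in the log ...
#   snapshot_pids /tmp/pgpool.after
#   pids_changed /tmp/pgpool.before /tmp/pgpool.after \
#       && diff /tmp/pgpool.before /tmp/pgpool.after
```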
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [Pgpool-general] unexpected EOF on client connection

2011-09-14 Thread Tatsuo Ishii
 
 Ah, good point.  I just diffed the list of PIDs associated with pgpool
 processes before and after another EOF message in the log, and there
 were no differences.  So I think that rules out any processes dying?

Right.

 One other thing that I just noticed from comparing logs between all of
 the database servers is that the time stamps for every one of the
 'unexpected EOF on client connection' instances are identical.  In
 other words, they are happening at the same time on each server.  I
 think this further suggests that pgpool has to be doing it?

Yes, I think so unless you set connection_life_time to other than 0 or
the network connection between PostgreSQL and pgpool is unstable.
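For reference, the pgpool.conf parameters relevant to this exchange, with the values reported earlier in the thread (the comments are my gloss, not quoted from the pgpool documentation):

```
# pgpool.conf values reported in this thread
num_init_children    = 195   # number of preforked pgpool child processes
connection_life_time = 0     # 0 means cached backend connections never expire
```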

Let me think how we can make further investigation...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [Pgpool-general] unexpected EOF on client connection

2011-09-14 Thread Lonni J Friedman
On Wed, Sep 14, 2011 at 6:00 PM, Tatsuo Ishii is...@sraoss.co.jp wrote:
 Yes, I think so unless you set connection_life_time to other than 0 or
 the network connection between PostgreSQL and pgpool is unstable.

connection_life_time is currently 0 (since you recommended I change
it earlier).  I don't have any evidence to suggest that the network
connection is unstable; there are zero errors of any kind in the
ifconfig output.


 Let me think how we can make further investigation...

ok, thanks.