Re: [Pgpool-general] Second stage online recovery with PITR problems on pgpool 3.0.3 / postgresql 9.0.4
Yep ... I found the same post on the web and upgraded to PostgreSQL 9.1 (and to pgpool-II 3.1 while I was at it). Everything works now. Thanks.

On Wed, Sep 14, 2011 at 06:33, Toshihiro Kitagawa kitag...@sraoss.co.jp wrote:
> Hi,
>
> > [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: invalid record length at 1/2120
>
> Was your PostgreSQL 9.0.4 built by gcc 4.6.0? gcc 4.6.0 has a bug which
> causes this error. See the following thread for more details:
>
> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00661.php
>
> --
> Toshihiro Kitagawa
> SRA OSS, Inc. Japan
>
> On Tue, 13 Sep 2011 13:12:32 +0200, Nikola Ivačič nikola.iva...@gmail.com wrote:
> > I have a problem with the second-stage online PITR recovery procedure.
> > The data received in the second stage, after the base backup and prior
> > to the WAL switch, gets lost. I've managed to isolate the problem down
> > to PostgreSQL, without pgpool-II running:
> >
> > - stop failed node                // 1st stage
> > - start backup
> > - rsync files to failed node
> > - stop backup
> > - do intentional insert on master node   // 2nd stage
> > - do pg_switch_xlog (also tested pgpool_xlog_switch, with the same results)
> > - rsync archived WAL files to failed node
> > - start failed node
> >
> > The failed node starts fine and performs recovery, but for the last WAL
> > file it always reports an "invalid record length" error and falls back
> > to the last known good WAL file (the one created in the backup step).
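The isolated procedure described above roughly corresponds to the following shell sketch, run on the master. The host name, data directory, archive path, and the `probe` table are illustrative assumptions, not the poster's actual commands:

```shell
#!/bin/sh
# Sketch of the isolated two-stage recovery test; FAILED, PGDATA and
# ARCHIVE are placeholder values, not the poster's real configuration.
FAILED=alice
PGDATA=/var/lib/postgres/data
ARCHIVE=/var/lib/postgres/archive

first_stage() {
    ssh "$FAILED" "pg_ctl -D $PGDATA stop -m fast"    # stop the failed node
    psql -c "SELECT pg_start_backup('online-recovery')"
    # base backup: copy the master's data directory to the failed node
    rsync -a --delete --exclude=pg_xlog "$PGDATA/" "$FAILED:$PGDATA/"
    psql -c "SELECT pg_stop_backup()"
}

second_stage() {
    psql -c "INSERT INTO probe VALUES (1)"            # intentional insert
    psql -c "SELECT pg_switch_xlog()"                 # force a WAL segment switch
    rsync -a "$ARCHIVE/" "$FAILED:$ARCHIVE/"          # ship archived WAL files
    ssh "$FAILED" "pg_ctl -D $PGDATA start"           # node replays from archive
}
```

In the failure reported here, the insert made in `second_stage` is exactly what disappears after recovery; per the follow-up, the cause turned out to be a gcc 4.6.0 miscompilation of PostgreSQL 9.0.4 rather than the procedure itself.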
> > Log from the failed node when I do the restore (increasing verbosity
> > reveals no more information):
> >
> > [2011-09-13 12:42:58 CEST]-[]-[31877|] LOG: database system was interrupted; last known up at 2011-09-13 12:40:46 CEST
> > [2011-09-13 12:42:58 CEST]-[]-[31877|] LOG: creating missing WAL directory pg_xlog/archive_status
> > [2011-09-13 12:42:58 CEST]-[]-[31877|] LOG: starting archive recovery
> > [2011-09-13 12:42:58 CEST]-[postgres]-[31882|] FATAL: the database system is starting up
> > [2011-09-13 12:42:59 CEST]-[]-[31877|] LOG: restored log file 000200010020 from archive
> > [2011-09-13 12:42:59 CEST]-[]-[31877|] LOG: redo starts at 1/2078
> > [2011-09-13 12:42:59 CEST]-[]-[31877|] LOG: consistent recovery state reached at 1/2100
> > [2011-09-13 12:42:59 CEST]-[postgres]-[31886|] FATAL: the database system is starting up
> > [2011-09-13 12:43:00 CEST]-[postgres]-[31887|] FATAL: the database system is starting up
> > [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: restored log file 000200010021 from archive
> > [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: invalid record length at 1/2120
> > [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: redo done at 1/20A0
> > [2011-09-13 12:43:01 CEST]-[postgres]-[31890|] FATAL: the database system is starting up
> > [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: restored log file 000200010020 from archive
> > [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: selected new timeline ID: 3
> > [2011-09-13 12:43:01 CEST]-[]-[31877|] LOG: archive recovery complete
> > [2011-09-13 12:43:01 CEST]-[]-[31883|] LOG: checkpoint starting: end-of-recovery immediate wait
> > [2011-09-13 12:43:02 CEST]-[]-[31883|] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.659 s
> > [2011-09-13 12:43:02 CEST]-[]-[31876|] LOG: database system is ready to accept connections
> > [2011-09-13 12:43:02 CEST]-[]-[31896|] LOG: autovacuum launcher started
> >
> > I've run md5sum on the 000200010021 WAL file in the archive dir on the
> > master and on the target node, and the file is the same on both nodes.
> >
> > So my question is: Did I miss something, or did I get the procedure
> > wrong? Is the online recovery with PITR procedure still valid as
> > presented in the manual? Can I replace the pg_switch_xlog call with
> > another pg_start_backup/pg_stop_backup pair, and what are the
> > performance implications in that case?
> >
> > Software versions: I'm using PostgreSQL 9.0.4 on both nodes, with the same OS.
> >
> > Restore master: Linux miho 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 21:55:57 CEST 2011 x86_64 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz GenuineIntel GNU/Linux
> > Restore target: Linux alice 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 21:55:57 CEST 2011 x86_64 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz GenuineIntel GNU/Linux
> >
> > Thanks for help.
> > Nikola

___
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general
Re: [Pgpool-general] confirm 2b4736d3dbf2f7ccea62d713d3d64985a93c4c1a
I am looking for failover and load balancing in PostgreSQL. My choice is very likely to be pgpool, but I have concerns about it being a SPOF, so I found pgpool-HA. However, nowhere is there a description of what it actually does.

I would like to keep all my VMs the same (not have a dedicated DB load balancer), so there would be a pgpool server on every database server, knowing about all the other databases. My goal is to be able to talk to any of the pgpool instances and get the same result. The question is: will pgpool-HA keep the information about which servers are available/disconnected synchronised across all pgpool instances, or is it just a hot-standby solution where a new pgpool server takes the place of the old one if it fails?

tl;dr: is an active:active configuration for pgpool instances possible with pgpool-HA?
Re: [Pgpool-general] seemingly hung pgpool process consuming 100% CPU
This problem has returned yet again:

PID   USER     PR NI VIRT  RES SHR  S %CPU %MEM  TIME+    COMMAND
29191 postgres 20  0 80192 14m 1544 R 89.8  0.2  51:15.91 pgpool

postgres 29191 3.4 0.1 80192 14728 ? R Sep13 51:40 pgpool: lfriedman nightly 10.31.96.84(61698) idle

I'd really appreciate some input on how to debug this.

On Fri, Sep 9, 2011 at 8:11 AM, Lonni J Friedman netll...@gmail.com wrote:
> No one else has experienced this or has suggestions for how to debug it?
>
> On Wed, Sep 7, 2011 at 12:49 PM, Lonni J Friedman netll...@gmail.com wrote:
> > Greetings,
> > I'm running pgpool-3.0.4 on a Linux-x86_64 server serving as a load
> > balancer for a three-server postgresql-9.0.4 cluster (1 master, 2
> > standby). I'm seeing strange behavior where a single pgpool process
> > seems to hang after some period of time and then consumes 100% of the
> > CPU. I've seen this happen twice since last Friday (when pgpool was
> > brought online in my production environment). At the moment the current
> > hung process looks like this in 'ps auxww' output:
> >
> > postgres 19838 98.7 0.0 68856 2904 ? R Sep06 1027:36 pgpool: lfriedman nightly 10.31.45.20(58277) idle
> >
> > In top, I see:
> >
> > PID   USER     PR NI VIRT  RES  SHR  S %CPU  %MEM TIME+   COMMAND
> > 19838 postgres 20  0 68856 2904 1072 R 100.0 0.0  1027:29 pgpool
> >
> > When I connect to the process with strace, there is no output, so I'm
> > guessing the process is stuck spinning somewhere:
> >
> > # strace -p 19838
> > Process 19838 attached - interrupt to quit
> > ...
> > ^CProcess 19838 detached
> >
> > One thing that I'm certain of is that the client IP (10.31.45.20)
> > associated with the hung process has rebooted at least once since that
> > process was spawned. So pgpool seems to be in some confused state, as
> > the client definitely severed the connection already. I checked the
> > pgpool log and there are no explicit references to PID 19838. I'm at a
> > loss for how to debug this further, but clearly something is wrong
> > somewhere, and this isn't normal/expected behavior.
Re: [Pgpool-general] seemingly hung pgpool process consuming 100% CPU
Thanks for your reply. I'll do this the next time it happens (which, based on history, will likely be within a few days).

On Wed, Sep 14, 2011 at 3:57 PM, Tatsuo Ishii is...@sraoss.co.jp wrote:
> Please use gdb. For example, become the postgres user (or root):
>
> gdb pgpool 29191
> bt
> cont
> bt
> cont
> :
> :
> :
>
> This will give us an idea of where it's looping.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
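The gdb session suggested above can also be driven non-interactively, which helps when a spinning process has to be sampled several times: the frames that appear in every sample are the ones the process is looping in. A sketch (the PID/count are placeholders; gdb's -batch, -p and -ex options handle the attach/backtrace/detach cycle):

```shell
#!/bin/sh
# Print COUNT backtraces of a spinning process, one per second.
# Usage: sample_backtraces PID [COUNT]; run as the process owner or root.
sample_backtraces() {
    pid=$1
    count=${2:-3}
    i=0
    while [ "$i" -lt "$count" ]; do
        gdb -batch -p "$pid" -ex bt 2>/dev/null  # attach, dump stack, detach
        sleep 1                                  # let the loop advance a bit
        i=$((i + 1))
    done
}
# e.g.: sample_backtraces 29191 5
```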
Re: [Pgpool-general] unexpected EOF on client connection
> > > I'm pretty sure that's not the case, as the messages stop whenever
> > > pgpool isn't running, they were not present prior to using pgpool, and
> > > pg_hba.conf is set up so that the database servers only accept
> > > connections from each other and from the server running pgpool. None
> > > of these servers have normal users connected directly to them (such as
> > > with ssh), nor are they running anything that would connect to the
> > > database as a client. Also, the volume of these messages is such that
> > > something significant has to be causing them. Last night, in the span
> > > of 5 minutes, there were 117 of these messages.
> >
> > Ok. I would like to narrow down the reason why you get the "unexpected
> > EOF on client connection" message so frequently. I think there are
> > currently two possibilities:
> >
> > 1) a pgpool child was killed for some unknown reason (we can omit the
> >    segfault case because you don't see it in the pgpool log)
> > 2) a pgpool child disconnects from PostgreSQL in an ungraceful manner
> >
> > For 1) I would like to know whether the pgpool child processes have
> > been fine since they were spawned. Have you seen any pgpool child
> > process disappear since pgpool started?
>
> I assume this should be determined by num_init_children (which I've set
> to 195 in pgpool.conf)? If so, then I currently have 195 processes in
> either the "wait for connection request" state or the "actively
> connected" state.

No. The pgpool parent process automatically respawns a child process if it dies, so having num_init_children child processes doesn't show anything useful. Record the 195 process ids and compare them with the current process ids later. If some of them have changed, we can assume that child processes are dying.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
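The bookkeeping suggested here can be scripted; a sketch, assuming the pgpool children show up in the process table with a "pgpool:" command prefix (as in the ps output earlier in this thread) and with placeholder file names:

```shell
#!/bin/sh
# Record the current pgpool child PIDs, and later report any that have
# vanished, i.e. children that died and were respawned under new PIDs.
snapshot_pids() {
    # $1 = output file; '|| true' because pgrep fails when nothing matches
    pgrep -f '^pgpool:' | sort > "$1" || true
}

dead_pids() {
    # $1 = earlier snapshot, $2 = later snapshot; prints PIDs gone since $1
    comm -23 "$1" "$2"
}
# e.g.: snapshot_pids /tmp/pids.before
#       ... wait for an "unexpected EOF" entry in the log ...
#       snapshot_pids /tmp/pids.after
#       dead_pids /tmp/pids.before /tmp/pids.after
```

An empty result from dead_pids means no child died between the two snapshots, which is exactly the check performed by hand later in this thread.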
Re: [Pgpool-general] unexpected EOF on client connection
> Ah, good point. I just diffed the list of PIDs associated with pgpool
> processes before and after another EOF message in the log, and there
> were no differences. So I think that rules out any processes dying?

Right.

> One other thing I just noticed from comparing the logs of all the
> database servers: the timestamps of every one of the 'unexpected EOF on
> client connection' instances are identical. In other words, they are
> happening at the same time on each server. I think this further suggests
> that pgpool has to be doing it?

Yes, I think so, unless you have set connection_life_time to something other than 0, or the network connection between PostgreSQL and pgpool is unstable.

Let me think about how we can investigate this further...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
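The "identical timestamps" observation can be verified mechanically by extracting just the timestamps of the EOF messages on each node and diffing the results. A sketch, assuming the [timestamp]-prefixed log format shown earlier in the thread (log file names are placeholders):

```shell
#!/bin/sh
# Print only the timestamp of each "unexpected EOF" log entry, so the
# per-server lists can be compared with diff; identical output on two
# nodes means the disconnects really are simultaneous.
eof_times() {
    # $1 = PostgreSQL log file
    grep 'unexpected EOF on client connection' "$1" |
        sed 's/^\[\([^]]*\)\].*/\1/'   # keep the leading [timestamp] only
}
# e.g.: eof_times node1.log > a.txt ; eof_times node2.log > b.txt ; diff a.txt b.txt
```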
Re: [Pgpool-general] unexpected EOF on client connection
On Wed, Sep 14, 2011 at 6:00 PM, Tatsuo Ishii is...@sraoss.co.jp wrote:
> Yes, I think so, unless you have set connection_life_time to something
> other than 0, or the network connection between PostgreSQL and pgpool is
> unstable.

connection_life_time is currently 0 (since you recommended I change it earlier). I don't have any evidence to suggest that the network connection is unstable; there are 0 errors of any kind in ifconfig output.

> Let me think about how we can investigate this further...

OK, thanks.