I just raised health_check_timeout and authentication_timeout in pgpool.conf
I'm really concerned that pgpool stopped working without logging a failover message for the backend. Aleksey On Wed, Aug 24, 2011 at 11:45 AM, Aleksey Tsalolikhin <[email protected]> wrote: > Situation: pgpool stops working. stops accepting requests. but no > failover message! (even though configured to log failover events) > > 09:11:38.325349 IP pgpool.44937 > database.5432: P > 2447077764:2447077849(85) ack 240034773 win 501 <nop,nop,timestamp > 2748616563 2731686531> > 09:11:38.325819 IP database.5432 > pgpool.44937: P 1:58(57) ack 85 win > 501 <nop,nop,timestamp 2731751942 2748616563> > 09:11:38.325941 IP pgpool.44937 > database.5432: . ack 58 win 501 > <nop,nop,timestamp 2748616564 2731751942> > 09:11:38.328699 IP pgpool.44937 > database.5432: P 85:418(333) ack 58 > win 501 <nop,nop,timestamp 2748616564 2731751942> > 09:11:38.329098 IP database.5432 > pgpool.44937: P 58:78(20) ack 418 > win 501 <nop,nop,timestamp 2731751945 2748616564> > 09:11:38.329666 IP pgpool.44937 > database.5432: P 418:430(12) ack 78 > win 501 <nop,nop,timestamp 2748616564 2731751945> > 09:11:38.330222 IP database.5432 > pgpool.44937: P 78:96(18) ack 430 > win 501 <nop,nop,timestamp 2731751946 2748616564> > 09:11:38.361913 IP pgpool.44937 > database.5432: P 430:489(59) ack 96 > win 501 <nop,nop,timestamp 2748616573 2731751946> > 09:11:38.362110 IP database.5432 > pgpool.44937: P 96:122(26) ack 489 > win 501 <nop,nop,timestamp 2731751978 2748616573> > 09:11:38.362692 IP pgpool.44937 > database.5432: P 489:642(153) ack > 122 win 501 <nop,nop,timestamp 2748616573 2731751978> > 09:11:38.363169 IP database.5432 > pgpool.44937: P 122:179(57) ack 642 > win 501 <nop,nop,timestamp 2731751979 2748616573> > 09:11:38.365298 IP pgpool.44937 > database.5432: P 642:1049(407) ack > 179 win 501 <nop,nop,timestamp 2748616573 2731751979> > 09:11:38.365950 IP database.5432 > pgpool.44937: P 179:650(471) ack > 1049 win 501 <nop,nop,timestamp 2731751982 2748616573> > 09:11:38.405640 IP pgpool.44937 > database.5432: . ack 650 win 501 > <nop,nop,timestamp 2748616584 2731751982> > > > # lsof -i:44937 > COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME > pgpool 20830 root 7u IPv4 1025702434 TCP > ddc-gate-dev02:44937->ddc-db-dev02:postgres (ESTABLISHED) > # strace -f -p 20830 > > > tcpdump shows SQL traffic between pgpool and database servers, and > lsof and strace confirm it is a pgpool process exchanging SQL with the > database > > > I can run psql on the pgpool server, targeting the database server, > and it connects just fine, so the backend is up. > > > pgpool log contains messages like this: > > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed > (Connection reset by peer) > 2011-08-24 05:31:23 ERROR: pid 13731: pool_read: read failed > (Connection reset by peer) > 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed > (Connection reset by peer) > 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed > (Connection reset by peer) > 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed > (Connection reset by peer) > 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed > (Connection reset by peer) > 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed > (Connection reset by peer) > 2011-08-24 07:49:48 ERROR: pid 27235: pool_read: read failed > (Connection reset by peer) > 2011-08-24 07:49:48 ERROR: pid 27235: pool_read: read failed > (Connection reset by peer) > 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed > (Connection reset by peer) > 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed > (Connection reset by peer) > 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed > (Connection reset by peer) > 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed > (Connection reset by peer) > 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed > (Connection reset by peer) > 2011-08-24 09:09:20 LOG: pid 19949: authentication is timeout > > I stopped pgpool (/etc/init.d/pgpool stop). > > I tried to start it but got this: > > # /etc/init.d/pgpool start > Starting pgpool service: pid file found but it seems bogus. Trying to > start pgpool anyway... > 2011-08-24 09:20:16 LOG: pid 16492: read_status_file: 0 th backend > is set to down status > 2011-08-24 09:20:16 ERROR: pid 16492: bind(/tmp/.s.PGSQL.9999) failed. > reason: Address already in use > 2011-08-24 09:20:16 ERROR: pid 16492: unlink() failed: No such file or > directory > [FAILED] > # > > I tried again and got this: > > Starting pgpool service: 2011-08-24 09:21:29 LOG: pid 16658: > pgpool-II successfully started. version 3.0.4 (umiyameboshi) > 2011-08-24 09:21:29 ERROR: pid 16658: bind(/tmp/.s.PGSQL.9898) failed. > reason: Address already in use > > No pgpool process running... > > # lsof -i:9898 > # lsof -i:9999 > # > > > I tried again a third time and it started. > > > The first sign of trouble was at 2:19 AM, when some queries through > pgpool failed. I came in at 9 AM and restarted pgpool. > > Then a couple of hours later, queries through pgpool started failing > again, and I restarted it again, and it worked fine afterwards. > > I do monitor the database server. The load is higher than usual now. > Could a query taking a longer time to return cause this kind of > instability in pgpool? If so, is there any way to compensate for it? > > Please let me know what other information I could gather if this > happens again -- I did grab a core dump of the pgpool parent at 9 AM, > and an lsof on that process. > > We are running pgpool on CentOS 5.5, 64 bit. > > Best, > Aleksey > _______________________________________________ Pgpool-general mailing list [email protected] http://pgfoundry.org/mailman/listinfo/pgpool-general
