> Thanks again Tatsuo, You are welcome! -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
> I've been giving it a try, and, it seems, as you say, that with exit(0) > something goes wrong with the recovery, but, agains as you say, it seems > solved with child_exit(0), maybe it lasts some more seconds to continue > after unplug the wire, but I cannot assure that, anyway, it seems to work > fine now. I will continue testing it, but, for now, pgpool is working fine > in the "unplugged wire" situation!! > > Thanks! > > 2010/3/15 Tatsuo Ishii <[email protected]> > > > This will miss some house keeping jobs such as count down connection > > counter. Thus you will be in trouble when you do online recovery. You > > would want to use child_exit(0) instead exit(0). > > -- > > Tatsuo Ishii > > SRA OSS, Inc. Japan > > English: http://www.sraoss.co.jp/index_en.php > > Japanese: http://www.sraoss.co.jp > > > > > Hi Tatsuo, > > > > > > We have been doing some more debugging, and we found what the problem is. > > As > > > I commented in my previous message, the problem was that a proccess > > (where > > > the client where connecting), didn't die. > > > > > > We have found a solution,... it is clearly not an "elegant" solution, but > > it > > > works for us (I have to try it a little more), and sure you will be able > > to > > > find a better solution. > > > > > > Anyway. We edited: child.c > > > > > > Arrount the line 1204. > > > Inside "case SIGQUIT" we added : > > > > > > pool_debug("Received SIGQUIT"); > > > exit(0); > > > > > > Before the break: > > > case SIGQUIT: /* immediate shutdown */ > > > pool_debug("Received SIGQUIT"); > > > exit(0); > > > break; > > > > > > > > > I have to finish testing/debugging it to see if it works and it is not > > > affecting negatively in some other place, but, for now it works,... > > almost > > > inmediately after loosing the wire, it continues working in the alive > > nodes. > > > > > > Hope it can be helpful ! > > > > > > Thanks! > > > > > > 2010/3/6 Tatsuo Ishii <[email protected]> > > > > > > > Daniel, > > > > > > > > I have committed fix for this. Please check it out. > > > > -- > > > > Tatsuo Ishii > > > > SRA OSS, Inc. Japan > > > > English: http://www.sraoss.co.jp/index_en.php > > > > Japanese: http://www.sraoss.co.jp > > > > > > > > > Daniel, > > > > > > > > > > Looks like interrupted connect() ought to check if SIGQUIT has been > > > > > delivered. Will check. I have no time until next week though. > > > > > -- > > > > > Tatsuo Ishii > > > > > SRA OSS, Inc. Japan > > > > > English: http://www.sraoss.co.jp/index_en.php > > > > > Japanese: http://www.sraoss.co.jp > > > > > > > > > > > Hi Tatsuo, > > > > > > > > > > > > I did some more debug, and it seems the problem I had is solved. > > But > > > > there > > > > > > is still a problem. > > > > > > > > > > > > The test I did is: > > > > > > 1) Start postgres node2 > > > > > > 2) Start postgres node1 > > > > > > 3) start pgpool (node1) > > > > > > 4) From a remote terminal (client1): > > > > > > for i in $(seq 0 2000); do { echo $(date +%H:%M:%S); psql -d > > database > > > > -p > > > > > > 9999 -h 192.168.1.5 -U postgres -c "insert into pri values ($i, > > > > 'nom$i');"; > > > > > > echo $i; sleep 1; }; done > > > > > > 5) Unplug node2 > > > > > > Here the inserts on client terminal stops. It restarts more or less > > 5 > > > > min > > > > > > and a half later. > > > > > > 6)After 10 second (more or les) after unplug the wire, From a > > remote > > > > > > terminal (client2): > > > > > > for i in $(seq 0 2000); do { echo $(date +%H:%M:%S); psql -d > > database > > > > -p > > > > > > 9999 -h 192.168.1.5 -U postgres -c "insert into pri values ($i, > > > > 'nom$i');"; > > > > > > echo $i; sleep 1; }; done > > > > > > Works perfectly!! I mean, pgpool is not blocked, it works,... there > > is > > > > just > > > > > > some problem in the first session. > > > > > > > > > > > > > > > > > > The lines bothering me are: > > > > > > [...] > > > > > > 11:33:42.414222 select(6, 0xbfa01ae8, 0, 0, 0xbfa01b8c2010-03-04 > > > > 11:33:42 > > > > > > DEBUG: pid 4744: pool_read_message_length: slot: 0 length: 8 > > > > > > [...] > > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: pool_read_message_length: > > slot: 0 > > > > > > length: 5 > > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: pool_read_message_length: > > slot: 1 > > > > > > length: 5 > > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: ReadyForQuery: message length: > > 5 > > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: ReadyForQuery: transaction > > state: > > > > I > > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: read kind from frontend X(58) > > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: I am 4744 accept fd 5 > > > > > > 2010-03-04 11:33:43 LOG: pid 4744: connection received: > > > > host=192.168.1.13 > > > > > > port=38751 > > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: Protocol Major: 1234 Minor: > > 5679 > > > > > > database: user: > > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: SSLRequest from client > > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: pool_ssl: SSL requested but > > SSL > > > > support > > > > > > is not available > > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: Protocol Major: 3 Minor: 0 > > > > database: > > > > > > database user: postgres > > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: new_connection: connecting 0 > > > > backend > > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: new_connection: connecting 1 > > > > backend > > > > > > <unfinished ...> > > > > > > 11:33:43.357089 --- SIGALRM (Alarm clock) --- > > > > > > 11:33:44.416652 <... select resumed> ) = 0 > > > > > > 11:33:44.416860 sigprocmask(2, 0x8180380, NULL) = 0 > > > > > > 11:33:44.417547 sigprocmask(2, 0x8180280, NULL) = 0 > > > > > > 11:33:44.418019 gettimeofday(0xbfa01b94, NULL) = 0 > > > > > > 11:33:44.418723 sigprocmask(2, 0x8180380, NULL) = 0 > > > > > > 11:33:44.419311 sigprocmask(2, 0x8180380, 0xbfa01b14) = 0 > > > > > > 11:33:44.419866 getpid() = > > 4743 > > > > > > 11:33:44.420353 time(NULL) = > > > > 1267698824 > > > > > > 11:33:44.420878 localtime(0xbfa01ae4) = > > > > 0xcbb300 > > > > > > [...] > > > > > > > > > > > > > > > > > > When pgpool is started I can see 3 processes: (sorry if times an > > PID > > > > does > > > > > > not match, because I am takeing data from different debug sessions) > > > > > > 5348 pts/1 S+ 0:00 pgpool -d -n > > > > > > 5349 pts/1 S+ 0:00 pgpool: wait for connection request > > > > > > 5350 pts/1 S+ 0:00 pgpool: PCP: wait for connection request > > > > > > > > > > > > I monitored de process with name pgpool: wait for connection > > request > > > > > > And it stops at: > > > > > > [...] > > > > > > 11:33:43.107622 time(NULL) = 1267698823 > > > > > > 11:33:43.107773 stat64("/etc/localtime", {st_mode=S_IFREG|0644, > > > > > > st_size=2593, ...}) = 0 > > > > > > 11:33:43.107961 write(2, "2010-03-04 11:33:43 DEBUG: pid 4"..., 74) > > = > > > > 74 > > > > > > 11:33:43.108359 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > > > > > > 11:33:43.108510 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7 > > > > > > 11:33:43.108670 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0 > > > > > > 11:33:43.108823 connect(7, {sa_family=AF_INET, > > sin_port=htons(5432), > > > > > > sin_addr=inet_addr("192.168.1.10")}, 16) = ? ERESTARTSYS (To be > > > > restarted) > > > > > > 11:33:45.460457 --- SIGQUIT (Quit) @ 0 (0) --- > > > > > > 11:33:45.460997 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE > > > > SEGV > > > > > > CONT SYS RTMIN RT_1], [QUIT], 8) = 0 > > > > > > 11:33:45.461156 time(NULL) = 1267698825 > > > > > > 11:33:45.461305 stat64("/etc/localtime", {st_mode=S_IFREG|0644, > > > > > > st_size=2593, ...}) = 0 > > > > > > 11:33:45.461537 write(2, "2010-03-04 11:33:45 DEBUG: pid 4"..., 78) > > = > > > > 78 > > > > > > 11:33:45.461720 rt_sigprocmask(SIG_SETMASK, [QUIT], NULL, 8) = 0 > > > > > > 11:33:45.475194 sigreturn() = ? (mask now []) > > > > > > 11:33:45.475320 connect(7, {sa_family=AF_INET, > > sin_port=htons(5432), > > > > > > sin_addr=inet_addr("192.168.1.10")}, 16) = -1 EHOSTUNREACH (No > > route to > > > > > > host) > > > > > > > > > > > > > > > > > > After some time it continues with: > > > > > > [...] > > > > > > 11:35:19.066376 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE > > > > SEGV > > > > > > CONT SYS RTMIN RT_1], [], 8) = 0 > > > > > > 11:35:19.066888 time(NULL) = 1267698919 > > > > > > 11:35:19.067042 stat64("/etc/localtime", {st_mode=S_IFREG|0644, > > > > > > st_size=2593, ...}) = 0 > > > > > > 11:35:19.067408 write(2, "2010-03-04 11:35:19 ERROR: pid 4"..., > > 100) = > > > > 100 > > > > > > 11:35:19.067848 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 > > > > > > [...] > > > > > > > > > > > > > > > > > > I observed, in that time that now there is one more process, and > > the > > > > process > > > > > > I am monitoring changed: > > > > > > 5348 pts/1 S+ 0:00 pgpool -d -n > > > > > > 5349 pts/1 S+ 0:00 pgpool: accept connection > > > > > > 5350 pts/1 S+ 0:00 pgpool: PCP: wait for connection request > > > > > > 5467 pts/1 S+ 0:00 pgpool: wait for connection request > > > > > > > > > > > > > > > > > > I would say that there are two "problems". > > > > > > One of them is that pgpool takes so long to see node2 failed (i > > cannot > > > > see > > > > > > how long,.. it seems to me that is not allways the same). I have > > > > > > connection_life_time=1 . Is there any other timeout that I would > > have > > > > to > > > > > > change? (the health check is every second). > > > > > > I noticed too that, after the node is down, some healths checks > > fail to > > > > > > detect it, don't know if maybe it could be because of the > > > > > > virtualitzation,... > > > > > > > > > > > > I see that, everytime the node falls, the process(wait for > > connection > > > > > > request) is killed and anotherone is started. That is not a problem > > if > > > > in > > > > > > the meantime there is not a query. If a query is done before the > > old > > > > process > > > > > > dies and the nwe is create,, the "old process" will take it and > > won't > > > > die, > > > > > > and will wait the 5 minutes (I don't know why this time) as > > "pgpool: > > > > accept > > > > > > connection" still if the new process is alredy started, my for loop > > > > will > > > > > > wait there. All the queryies that comes after the new process is > > > > started > > > > > > will work perfectly. > > > > > > The problem is that, once the new process is started (and works > > ok), > > > > the old > > > > > > process is still there, retaining the clients(wich started the > > > > connection > > > > > > there) connection (and stoping them of working), and, maybe it > > would be > > > > > > better if, once the new process exist, the oldone would die and > > would > > > > free > > > > > > the "retained" query. > > > > > > > > > > > > Hope this can be helpfull for you. > > > > > > > > > > > > Thanks! > > > > > > > > > > > > 2010/3/3 Tatsuo Ishii <[email protected]> > > > > > > > > > > > > > Daniel, > > > > > > > > > > > > > > I have committed fix into CVS HEAD. Could you try it out? > > > > > > > -- > > > > > > > Tatsuo Ishii > > > > > > > SRA OSS, Inc. Japan > > > > > > > English: http://www.sraoss.co.jp/index_en.php > > > > > > > Japanese: http://www.sraoss.co.jp > > > > > > > > > > > > > > > Thanks to you Tatsuo,... > > > > > > > > > > > > > > > > I am glad my report helped you! As soon as you tell me it is > > done > > > > I will > > > > > > > > start testing/debugging it again. > > > > > > > > > > > > > > > > 2010/3/2 Tatsuo Ishii <[email protected]> > > > > > > > > > > > > > > > > > Daniel, > > > > > > > > > > > > > > > > > > Thanks for the report! > > > > > > > > > > > > > > > > > > > I spoke in another message about this problem, yet, I > > debugged > > > > deeper > > > > > > > and > > > > > > > > > I > > > > > > > > > > have more specific information, that, maybe, can be > > usefull. > > > > > > > > > > (The thread I spoke something about was: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > http://lists.pgfoundry.org/pipermail/pgpool-general/2010-February/002565.html > > > > > > > > > > ) > > > > > > > > > > > > > > > > > > > > I am working with two VB Virtual machines with CentOS 5 > > (i386). > > > > > > > Running > > > > > > > > > > PostgreSQL 8.3.9 and pgpool 2.3.2.1. > > > > > > > > > > > > > > > > > > > > The test was simple. While I was inserting values every > > second, > > > > I > > > > > > > > > unplugged > > > > > > > > > > one of the nodes. > > > > > > > > > > health check is every second and it's timeout is 2 seconds. > > > > > > > > > > > > > > > > > > > > In that moment all inserts stops, and pgpool waits. > > > > > > > > > > The point where it stops is: > > > > > > > > > > > > > > > > > > > > [...] > > > > > > > > > > [pid 29444] 10:47:55.537470 rt_sigprocmask(SIG_SETMASK, [], > > > > NULL, 8) > > > > > > > = 0 > > > > > > > > > > [pid 29444] 10:47:55.537591 socket(PF_INET, SOCK_STREAM, > > > > IPPROTO_IP) > > > > > > > = 9 > > > > > > > > > > [pid 29444] 10:47:55.537726 setsockopt(9, SOL_TCP, > > TCP_NODELAY, > > > > [1], > > > > > > > 4) = > > > > > > > > > 0 > > > > > > > > > > [pid 29444] 10:47:55.537886 connect(9, {sa_family=AF_INET, > > > > > > > > > > sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, > > 16) > > > > = ? > > > > > > > > > > ERESTARTSYS (To be restarted) > > > > > > > > > > [pid 29444] 10:47:56.529113 --- SIGALRM (Alarm clock) @ 0 > > (0) > > > > --- > > > > > > > > > > [pid 29444] 10:47:56.529235 rt_sigprocmask(SIG_SETMASK, > > ~[ILL > > > > TRAP > > > > > > > ABRT > > > > > > > > > BUS > > > > > > > > > > FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0 > > > > > > > > > > [pid 29444] 10:47:56.529428 rt_sigprocmask(SIG_SETMASK, [], > > > > NULL, 8) > > > > > > > = 0 > > > > > > > > > > [pid 29444] 10:47:56.529602 sigreturn() = ? (mask now []) > > > > > > > > > > [pid 29444] 10:47:56.529894 connect(9, {sa_family=AF_INET, > > > > > > > > > > sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, > > 16 > > > > > > > <unfinished > > > > > > > > > > ...> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > First it does a connect() wich receives de SIGALARM, and > > > > continues. > > > > > > > But > > > > > > > > > then > > > > > > > > > > it does another connect(), and this time it does not > > receive > > > > any > > > > > > > > > SIGALARM, > > > > > > > > > > so, it waits (I think) till the system closes the > > connection. > > > > > > > > > > > > > > > > > > > > After waiting (too long) it starts working again (now with > > the > > > > node > > > > > > > > > down): > > > > > > > > > > > > > > > > > > > > [...] > > > > > > > > > > [pid 29445] 10:49:30.273727 <... connect resumed> ) = -1 > > > > EHOSTUNREACH > > > > > > > (No > > > > > > > > > > route to host) > > > > > > > > > > [pid 29444] 10:49:30.274739 <... connect resumed> ) = -1 > > > > EHOSTUNREACH > > > > > > > (No > > > > > > > > > > route to host) > > > > > > > > > > [pid 29445] 10:49:30.274809 rt_sigprocmask(SIG_SETMASK, > > ~[ILL > > > > TRAP > > > > > > > ABRT > > > > > > > > > BUS > > > > > > > > > > FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0 > > > > > > > > > > [pid 29445] 10:49:30.275057 time(NULL) = 1267436970 > > > > > > > > > > [pid 29445] 10:49:30.275202 stat64("/etc/localtime", > > > > > > > > > {st_mode=S_IFREG|0644, > > > > > > > > > > st_size=2593, ...}) = 0 > > > > > > > > > > [pid 29445] 10:49:30.275485 write(2, "2010-03-01 10:49:30 > > > > ERROR: pid > > > > > > > > > 2"..., > > > > > > > > > > 1012010-03-01 10:49:30 ERROR: pid 29445: > > > > connect_inet_domain_socket: > > > > > > > > > > connect() failed: No route to host > > > > > > > > > > ) = 101 > > > > > > > > > > [pid 29445] 10:49:30.275911 rt_sigprocmask(SIG_SETMASK, [], > > > > NULL, 8) > > > > > > > = 0 > > > > > > > > > > [pid 29445] 10:49:30.276062 close(7) = 0 > > > > > > > > > > [pid 29445] 10:49:30.276221 rt_sigprocmask(SIG_SETMASK, > > ~[ILL > > > > TRAP > > > > > > > ABRT > > > > > > > > > BUS > > > > > > > > > > FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0 > > > > > > > > > > [pid 29445] 10:49:30.276389 time(NULL) = 1267436970 > > > > > > > > > > [pid 29445] 10:49:30.276715 stat64("/etc/localtime", > > > > > > > > > {st_mode=S_IFREG|0644, > > > > > > > > > > st_size=2593, ...}) = 0 > > > > > > > > > > [pid 29445] 10:49:30.276895 write(2, "2010-03-01 10:49:30 > > > > ERROR: pid > > > > > > > > > 2"..., > > > > > > > > > > 782010-03-01 10:49:30 ERROR: pid 29445: connection to > > > > > > > 192.168.1.10(5432) > > > > > > > > > > failed > > > > > > > > > > ) = 78 > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > As you can see it restarts after 1 min and a half (wich is > > too > > > > much). > > > > > > > It > > > > > > > > > is > > > > > > > > > > always the same (without changeing any system values) > > > > > > > > > > > > > > > > > > > > If it is necessary I can show more debug lines. > > > > > > > > > > > > > > > > > > > > Looking trough the source, we think, maybe it could be a > > > > problem with > > > > > > > the > > > > > > > > > > connection being blocked. Maybe, it would be possible not > > to > > > > block it > > > > > > > > > > (speaking about the socket). > > > > > > > > > > We suppose something is happening in pool_connection_pool.c > > > > arround > > > > > > > line > > > > > > > > > 473 > > > > > > > > > > ("connect_inet_domain_socket_by_port"). > > > > > > > > > > > > > > > > > > > > Or maybe I am doing something wrong,... does anybody else > > > > tested the > > > > > > > > > > "unpluged wire" ? Is it working? > > > > > > > > > > > > > > > > > > What health_check() does here is: > > > > > > > > > > > > > > > > > > start alarm (done by caller of health_check) > > > > > > > > > connect() > > > > > > > > > write() > > > > > > > > > read() > > > > > > > > > : > > > > > > > > > : > > > > > > > > > > > > > > > > > > If the wire is unplugged, one of system calls will be blocked > > and > > > > > > > > > eventually alarm interrupt any of connect/write/read and > > > > health_check > > > > > > > > > returns with error code. Write() and read() are fine. Problem > > is, > > > > > > > > > connect is done by connect_inet_domain_socket_by_port, which > > does > > > > > > > > > retry if connect() is interrupted by a system call. > > > > > > > > > > > > > > > > > > I belive what you saw was that. > > > > > > > > > > First it does a connect() wich receives de SIGALARM, and > > > > continues. > > > > > > > But > > > > > > > > > then > > > > > > > > > > it does another connect(), and this time it does not > > receive > > > > any > > > > > > > > > SIGALARM, > > > > > > > > > > so, it waits (I think) till the system closes the > > connection. > > > > > > > > > > > > > > > > > > The retry should be turn off if it's called from > > health_check(). > > > > Will > > > > > > > > > fix. > > > > > > > > > > > > > > > > > > Thanks again for good testing and analysis. > > > > > > > > > -- > > > > > > > > > Tatsuo Ishii > > > > > > > > > SRA OSS, Inc. Japan > > > > > > > > > English: http://www.sraoss.co.jp/index_en.php > > > > > > > > > Japanese: http://www.sraoss.co.jp > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > Pgpool-general mailing list > > > > > [email protected] > > > > > http://pgfoundry.org/mailman/listinfo/pgpool-general > > > > > > _______________________________________________ Pgpool-general mailing list [email protected] http://pgfoundry.org/mailman/listinfo/pgpool-general
