Re: [Pgpool-general] The Unplugged wire

Tatsuo Ishii Thu, 18 Mar 2010 08:02:54 -0700

> Thanks again Tatsuo,

You are welcome!
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


> I've been giving it a try, and, it seems, as you say, that with exit(0)
> something goes wrong with the recovery, but, agains as you say, it seems
> solved with child_exit(0), maybe it lasts some more seconds to continue
> after unplug the wire, but I cannot assure that, anyway, it seems to work
> fine now. I will continue testing it, but, for now, pgpool is working fine
> in the "unplugged wire" situation!!
> 
> Thanks!
> 
> 2010/3/15 Tatsuo Ishii <[email protected]>
> 
> > This will miss some house keeping jobs such as count down connection
> > counter. Thus you will be in trouble when you do online recovery. You
> > would want to use child_exit(0) instead exit(0).
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> > English: http://www.sraoss.co.jp/index_en.php
> > Japanese: http://www.sraoss.co.jp
> >
> > > Hi Tatsuo,
> > >
> > > We have been doing some more debugging, and we found what the problem is.
> > As
> > > I commented in my previous message, the problem was that a proccess
> > (where
> > > the client where connecting), didn't die.
> > >
> > > We have found a solution,... it is clearly not an "elegant" solution, but
> > it
> > > works for us (I have to try it a little more), and sure you will be able
> > to
> > > find a better solution.
> > >
> > > Anyway. We edited: child.c
> > >
> > > Arrount the line 1204.
> > > Inside "case SIGQUIT" we added :
> > >
> > > pool_debug("Received SIGQUIT");
> > >  exit(0);
> > >
> > > Before the break:
> > >         case SIGQUIT:    /* immediate shutdown */
> > >                         pool_debug("Received SIGQUIT");
> > >                         exit(0);
> > >                         break;
> > >
> > >
> > > I have to finish testing/debugging it to see if it works and it is not
> > > affecting negatively in some other place, but, for now it works,...
> > almost
> > > inmediately after loosing the wire, it continues working in the alive
> > nodes.
> > >
> > > Hope it can be helpful !
> > >
> > > Thanks!
> > >
> > > 2010/3/6 Tatsuo Ishii <[email protected]>
> > >
> > > > Daniel,
> > > >
> > > > I have committed fix for this. Please check it out.
> > > > --
> > > > Tatsuo Ishii
> > > > SRA OSS, Inc. Japan
> > > > English: http://www.sraoss.co.jp/index_en.php
> > > > Japanese: http://www.sraoss.co.jp
> > > >
> > > > > Daniel,
> > > > >
> > > > > Looks like interrupted connect() ought to check if SIGQUIT has been
> > > > > delivered. Will check. I have no time until next week though.
> > > > > --
> > > > > Tatsuo Ishii
> > > > > SRA OSS, Inc. Japan
> > > > > English: http://www.sraoss.co.jp/index_en.php
> > > > > Japanese: http://www.sraoss.co.jp
> > > > >
> > > > > > Hi Tatsuo,
> > > > > >
> > > > > > I did some more debug, and it seems the problem I had is solved.
> > But
> > > > there
> > > > > > is still a problem.
> > > > > >
> > > > > > The test I did is:
> > > > > > 1) Start postgres node2
> > > > > > 2) Start postgres node1
> > > > > > 3) start pgpool (node1)
> > > > > > 4) From a remote terminal (client1):
> > > > > > for i in $(seq 0 2000); do { echo $(date +%H:%M:%S); psql -d
> > database
> > > > -p
> > > > > > 9999 -h 192.168.1.5 -U postgres -c "insert into pri values ($i,
> > > > 'nom$i');";
> > > > > > echo $i; sleep 1; }; done
> > > > > > 5) Unplug node2
> > > > > > Here the inserts on client terminal stops. It restarts more or less
> > 5
> > > > min
> > > > > > and a half later.
> > > > > > 6)After 10 second (more or les) after unplug the wire,  From a
> > remote
> > > > > > terminal (client2):
> > > > > > for i in $(seq 0 2000); do { echo $(date +%H:%M:%S); psql -d
> > database
> > > > -p
> > > > > > 9999 -h 192.168.1.5 -U postgres -c "insert into pri values ($i,
> > > > 'nom$i');";
> > > > > > echo $i; sleep 1; }; done
> > > > > > Works perfectly!! I mean, pgpool is not blocked, it works,... there
> > is
> > > > just
> > > > > > some problem in the first session.
> > > > > >
> > > > > >
> > > > > > The lines bothering me are:
> > > > > > [...]
> > > > > > 11:33:42.414222 select(6, 0xbfa01ae8, 0, 0, 0xbfa01b8c2010-03-04
> > > > 11:33:42
> > > > > > DEBUG: pid 4744: pool_read_message_length: slot: 0 length: 8
> > > > > > [...]
> > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: pool_read_message_length:
> > slot: 0
> > > > > > length: 5
> > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: pool_read_message_length:
> > slot: 1
> > > > > > length: 5
> > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: ReadyForQuery: message length:
> > 5
> > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: ReadyForQuery: transaction
> > state:
> > > > I
> > > > > > 2010-03-04 11:33:42 DEBUG: pid 4744: read kind from frontend X(58)
> > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: I am 4744 accept fd 5
> > > > > > 2010-03-04 11:33:43 LOG:   pid 4744: connection received:
> > > > host=192.168.1.13
> > > > > > port=38751
> > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: Protocol Major: 1234 Minor:
> > 5679
> > > > > > database:  user:
> > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: SSLRequest from client
> > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: pool_ssl: SSL requested but
> > SSL
> > > > support
> > > > > > is not available
> > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: Protocol Major: 3 Minor: 0
> > > > database:
> > > > > > database user: postgres
> > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: new_connection: connecting 0
> > > > backend
> > > > > > 2010-03-04 11:33:43 DEBUG: pid 4744: new_connection: connecting 1
> > > > backend
> > > > > >  <unfinished ...>
> > > > > > 11:33:43.357089 --- SIGALRM (Alarm clock) ---
> > > > > > 11:33:44.416652 <... select resumed> )                        = 0
> > > > > > 11:33:44.416860 sigprocmask(2, 0x8180380, NULL)               = 0
> > > > > > 11:33:44.417547 sigprocmask(2, 0x8180280, NULL)               = 0
> > > > > > 11:33:44.418019 gettimeofday(0xbfa01b94, NULL)                = 0
> > > > > > 11:33:44.418723 sigprocmask(2, 0x8180380, NULL)               = 0
> > > > > > 11:33:44.419311 sigprocmask(2, 0x8180380, 0xbfa01b14)         = 0
> > > > > > 11:33:44.419866 getpid()                                      =
> > 4743
> > > > > > 11:33:44.420353 time(NULL)                                    =
> > > > 1267698824
> > > > > > 11:33:44.420878 localtime(0xbfa01ae4)                         =
> > > > 0xcbb300
> > > > > > [...]
> > > > > >
> > > > > >
> > > > > > When pgpool is started I can see 3 processes: (sorry if times an
> > PID
> > > > does
> > > > > > not match, because I am takeing data from different debug sessions)
> > > > > >  5348 pts/1    S+     0:00 pgpool -d -n
> > > > > >  5349 pts/1    S+     0:00 pgpool: wait for connection request
> > > > > >  5350 pts/1    S+     0:00 pgpool: PCP: wait for connection request
> > > > > >
> > > > > >  I monitored de process with name pgpool: wait for connection
> > request
> > > > > >  And it stops at:
> > > > > >  [...]
> > > > > >  11:33:43.107622 time(NULL)              = 1267698823
> > > > > > 11:33:43.107773 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > > > > > st_size=2593, ...}) = 0
> > > > > > 11:33:43.107961 write(2, "2010-03-04 11:33:43 DEBUG: pid 4"..., 74)
> > =
> > > > 74
> > > > > > 11:33:43.108359 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > > > > > 11:33:43.108510 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> > > > > > 11:33:43.108670 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > > > > > 11:33:43.108823 connect(7, {sa_family=AF_INET,
> > sin_port=htons(5432),
> > > > > > sin_addr=inet_addr("192.168.1.10")}, 16) = ? ERESTARTSYS (To be
> > > > restarted)
> > > > > > 11:33:45.460457 --- SIGQUIT (Quit) @ 0 (0) ---
> > > > > > 11:33:45.460997 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE
> > > > SEGV
> > > > > > CONT SYS RTMIN RT_1], [QUIT], 8) = 0
> > > > > > 11:33:45.461156 time(NULL)              = 1267698825
> > > > > > 11:33:45.461305 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > > > > > st_size=2593, ...}) = 0
> > > > > > 11:33:45.461537 write(2, "2010-03-04 11:33:45 DEBUG: pid 4"..., 78)
> > =
> > > > 78
> > > > > > 11:33:45.461720 rt_sigprocmask(SIG_SETMASK, [QUIT], NULL, 8) = 0
> > > > > > 11:33:45.475194 sigreturn()             = ? (mask now [])
> > > > > > 11:33:45.475320 connect(7, {sa_family=AF_INET,
> > sin_port=htons(5432),
> > > > > > sin_addr=inet_addr("192.168.1.10")}, 16) = -1 EHOSTUNREACH (No
> > route to
> > > > > > host)
> > > > > >
> > > > > >
> > > > > > After some time it continues with:
> > > > > > [...]
> > > > > > 11:35:19.066376 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE
> > > > SEGV
> > > > > > CONT SYS RTMIN RT_1], [], 8) = 0
> > > > > > 11:35:19.066888 time(NULL)              = 1267698919
> > > > > > 11:35:19.067042 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > > > > > st_size=2593, ...}) = 0
> > > > > > 11:35:19.067408 write(2, "2010-03-04 11:35:19 ERROR: pid 4"...,
> > 100) =
> > > > 100
> > > > > > 11:35:19.067848 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > > > > > [...]
> > > > > >
> > > > > >
> > > > > > I observed, in that time that now there is one more process, and
> > the
> > > > process
> > > > > > I am monitoring changed:
> > > > > >  5348 pts/1    S+     0:00 pgpool -d -n
> > > > > >  5349 pts/1    S+     0:00 pgpool: accept connection
> > > > > >  5350 pts/1    S+     0:00 pgpool: PCP: wait for connection request
> > > > > >  5467 pts/1    S+     0:00 pgpool: wait for connection request
> > > > > >
> > > > > >
> > > > > >  I would say that there are two "problems".
> > > > > >  One of them is that pgpool takes so long to see node2 failed (i
> > cannot
> > > > see
> > > > > > how long,.. it seems to me that is not allways the same). I have
> > > > > > connection_life_time=1 . Is there any other timeout that I would
> > have
> > > > to
> > > > > > change? (the health check is every second).
> > > > > > I noticed too that, after the node is down, some healths checks
> > fail to
> > > > > > detect it, don't know if maybe it could be because of the
> > > > > > virtualitzation,...
> > > > > >
> > > > > >  I see that, everytime the node falls, the process(wait for
> > connection
> > > > > > request) is killed and anotherone is started. That is not a problem
> > if
> > > > in
> > > > > > the meantime there is not a query. If a query is done before the
> > old
> > > > process
> > > > > > dies and the nwe is create,, the "old process" will take it and
> > won't
> > > > die,
> > > > > > and will wait the 5 minutes (I don't know why this time) as
> > "pgpool:
> > > > accept
> > > > > > connection" still if the new process is alredy started, my for loop
> > > > will
> > > > > > wait there. All the queryies that comes after the new process is
> > > > started
> > > > > > will work perfectly.
> > > > > > The problem is that, once the new process is started (and works
> > ok),
> > > > the old
> > > > > > process is still there, retaining the clients(wich started the
> > > > connection
> > > > > > there) connection (and stoping them of working), and, maybe it
> > would be
> > > > > > better if, once the new process exist, the oldone would die and
> > would
> > > > free
> > > > > > the "retained" query.
> > > > > >
> > > > > > Hope this can be helpfull for you.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > 2010/3/3 Tatsuo Ishii <[email protected]>
> > > > > >
> > > > > > > Daniel,
> > > > > > >
> > > > > > > I have committed fix into CVS HEAD. Could you try it out?
> > > > > > > --
> > > > > > > Tatsuo Ishii
> > > > > > > SRA OSS, Inc. Japan
> > > > > > > English: http://www.sraoss.co.jp/index_en.php
> > > > > > > Japanese: http://www.sraoss.co.jp
> > > > > > >
> > > > > > > > Thanks to you Tatsuo,...
> > > > > > > >
> > > > > > > > I am glad  my report helped you! As soon as you tell me it is
> > done
> > > > I will
> > > > > > > > start testing/debugging it again.
> > > > > > > >
> > > > > > > > 2010/3/2 Tatsuo Ishii <[email protected]>
> > > > > > > >
> > > > > > > > > Daniel,
> > > > > > > > >
> > > > > > > > > Thanks for the report!
> > > > > > > > >
> > > > > > > > > > I spoke in another message about this problem, yet, I
> > debugged
> > > > deeper
> > > > > > > and
> > > > > > > > > I
> > > > > > > > > > have more specific information, that, maybe, can be
> > usefull.
> > > > > > > > > > (The thread I spoke something about was:
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > >
> > http://lists.pgfoundry.org/pipermail/pgpool-general/2010-February/002565.html
> > > > > > > > > > )
> > > > > > > > > >
> > > > > > > > > > I am working with two VB Virtual machines with CentOS 5
> > (i386).
> > > > > > > Running
> > > > > > > > > > PostgreSQL 8.3.9 and pgpool 2.3.2.1.
> > > > > > > > > >
> > > > > > > > > > The test was simple. While I was inserting values every
> > second,
> > > > I
> > > > > > > > > unplugged
> > > > > > > > > > one of the nodes.
> > > > > > > > > > health check is every second and it's timeout is 2 seconds.
> > > > > > > > > >
> > > > > > > > > > In that moment all inserts stops, and pgpool waits.
> > > > > > > > > > The point where it stops is:
> > > > > > > > > >
> > > > > > > > > > [...]
> > > > > > > > > > [pid 29444] 10:47:55.537470 rt_sigprocmask(SIG_SETMASK, [],
> > > > NULL, 8)
> > > > > > > = 0
> > > > > > > > > > [pid 29444] 10:47:55.537591 socket(PF_INET, SOCK_STREAM,
> > > > IPPROTO_IP)
> > > > > > > = 9
> > > > > > > > > > [pid 29444] 10:47:55.537726 setsockopt(9, SOL_TCP,
> > TCP_NODELAY,
> > > > [1],
> > > > > > > 4) =
> > > > > > > > > 0
> > > > > > > > > > [pid 29444] 10:47:55.537886 connect(9, {sa_family=AF_INET,
> > > > > > > > > > sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")},
> > 16)
> > > > = ?
> > > > > > > > > > ERESTARTSYS (To be restarted)
> > > > > > > > > > [pid 29444] 10:47:56.529113 --- SIGALRM (Alarm clock) @ 0
> > (0)
> > > > ---
> > > > > > > > > > [pid 29444] 10:47:56.529235 rt_sigprocmask(SIG_SETMASK,
> > ~[ILL
> > > > TRAP
> > > > > > > ABRT
> > > > > > > > > BUS
> > > > > > > > > > FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
> > > > > > > > > > [pid 29444] 10:47:56.529428 rt_sigprocmask(SIG_SETMASK, [],
> > > > NULL, 8)
> > > > > > > = 0
> > > > > > > > > > [pid 29444] 10:47:56.529602 sigreturn() = ? (mask now [])
> > > > > > > > > > [pid 29444] 10:47:56.529894 connect(9, {sa_family=AF_INET,
> > > > > > > > > > sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")},
> > 16
> > > > > > > <unfinished
> > > > > > > > > > ...>
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > First it does a connect() wich receives de SIGALARM, and
> > > > continues.
> > > > > > > But
> > > > > > > > > then
> > > > > > > > > > it does another connect(), and this time it does not
> > receive
> > > > any
> > > > > > > > > SIGALARM,
> > > > > > > > > > so, it waits (I think) till the system closes the
> > connection.
> > > > > > > > > >
> > > > > > > > > > After waiting (too long) it starts working again (now with
> > the
> > > > node
> > > > > > > > > down):
> > > > > > > > > >
> > > > > > > > > > [...]
> > > > > > > > > > [pid 29445] 10:49:30.273727 <... connect resumed> ) = -1
> > > > EHOSTUNREACH
> > > > > > > (No
> > > > > > > > > > route to host)
> > > > > > > > > > [pid 29444] 10:49:30.274739 <... connect resumed> ) = -1
> > > > EHOSTUNREACH
> > > > > > > (No
> > > > > > > > > > route to host)
> > > > > > > > > > [pid 29445] 10:49:30.274809 rt_sigprocmask(SIG_SETMASK,
> > ~[ILL
> > > > TRAP
> > > > > > > ABRT
> > > > > > > > > BUS
> > > > > > > > > > FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > > > > > > > > > [pid 29445] 10:49:30.275057 time(NULL)  = 1267436970
> > > > > > > > > > [pid 29445] 10:49:30.275202 stat64("/etc/localtime",
> > > > > > > > > {st_mode=S_IFREG|0644,
> > > > > > > > > > st_size=2593, ...}) = 0
> > > > > > > > > > [pid 29445] 10:49:30.275485 write(2, "2010-03-01 10:49:30
> > > > ERROR: pid
> > > > > > > > > 2"...,
> > > > > > > > > > 1012010-03-01 10:49:30 ERROR: pid 29445:
> > > > connect_inet_domain_socket:
> > > > > > > > > > connect() failed: No route to host
> > > > > > > > > > ) = 101
> > > > > > > > > > [pid 29445] 10:49:30.275911 rt_sigprocmask(SIG_SETMASK, [],
> > > > NULL, 8)
> > > > > > > = 0
> > > > > > > > > > [pid 29445] 10:49:30.276062 close(7)    = 0
> > > > > > > > > > [pid 29445] 10:49:30.276221 rt_sigprocmask(SIG_SETMASK,
> > ~[ILL
> > > > TRAP
> > > > > > > ABRT
> > > > > > > > > BUS
> > > > > > > > > > FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > > > > > > > > > [pid 29445] 10:49:30.276389 time(NULL)  = 1267436970
> > > > > > > > > > [pid 29445] 10:49:30.276715 stat64("/etc/localtime",
> > > > > > > > > {st_mode=S_IFREG|0644,
> > > > > > > > > > st_size=2593, ...}) = 0
> > > > > > > > > > [pid 29445] 10:49:30.276895 write(2, "2010-03-01 10:49:30
> > > > ERROR: pid
> > > > > > > > > 2"...,
> > > > > > > > > > 782010-03-01 10:49:30 ERROR: pid 29445: connection to
> > > > > > > 192.168.1.10(5432)
> > > > > > > > > > failed
> > > > > > > > > > ) = 78
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > As you can see it restarts after 1 min and a half (wich is
> > too
> > > > much).
> > > > > > > It
> > > > > > > > > is
> > > > > > > > > > always the same (without changeing any system values)
> > > > > > > > > >
> > > > > > > > > > If it is necessary I can show more debug lines.
> > > > > > > > > >
> > > > > > > > > > Looking trough the source, we think, maybe it could be a
> > > > problem with
> > > > > > > the
> > > > > > > > > > connection being blocked. Maybe, it would be possible not
> > to
> > > > block it
> > > > > > > > > > (speaking about the socket).
> > > > > > > > > > We suppose something is happening in pool_connection_pool.c
> > > > arround
> > > > > > > line
> > > > > > > > > 473
> > > > > > > > > > ("connect_inet_domain_socket_by_port").
> > > > > > > > > >
> > > > > > > > > > Or maybe I am doing something wrong,... does anybody else
> > > > tested the
> > > > > > > > > > "unpluged wire" ? Is it working?
> > > > > > > > >
> > > > > > > > > What health_check() does here is:
> > > > > > > > >
> > > > > > > > > start alarm (done by caller of health_check)
> > > > > > > > > connect()
> > > > > > > > > write()
> > > > > > > > > read()
> > > > > > > > > :
> > > > > > > > > :
> > > > > > > > >
> > > > > > > > > If the wire is unplugged, one of system calls will be blocked
> > and
> > > > > > > > > eventually alarm interrupt any of connect/write/read and
> > > > health_check
> > > > > > > > > returns with error code. Write() and read() are fine. Problem
> > is,
> > > > > > > > > connect is done by connect_inet_domain_socket_by_port, which
> > does
> > > > > > > > > retry if connect() is interrupted by a system call.
> > > > > > > > >
> > > > > > > > > I belive what you saw was that.
> > > > > > > > > > First it does a connect() wich receives de SIGALARM, and
> > > > continues.
> > > > > > > But
> > > > > > > > > then
> > > > > > > > > > it does another connect(), and this time it does not
> > receive
> > > > any
> > > > > > > > > SIGALARM,
> > > > > > > > > > so, it waits (I think) till the system closes the
> > connection.
> > > > > > > > >
> > > > > > > > > The retry should be turn off if it's called from
> > health_check().
> > > > Will
> > > > > > > > > fix.
> > > > > > > > >
> > > > > > > > > Thanks again for good testing and analysis.
> > > > > > > > > --
> > > > > > > > > Tatsuo Ishii
> > > > > > > > > SRA OSS, Inc. Japan
> > > > > > > > > English: http://www.sraoss.co.jp/index_en.php
> > > > > > > > > Japanese: http://www.sraoss.co.jp
> > > > > > > > >
> > > > > > >
> > > > > _______________________________________________
> > > > > Pgpool-general mailing list
> > > > > [email protected]
> > > > > http://pgfoundry.org/mailman/listinfo/pgpool-general
> > > >
> >
_______________________________________________
Pgpool-general mailing list
[email protected]
http://pgfoundry.org/mailman/listinfo/pgpool-general

Re: [Pgpool-general] The Unplugged wire

Reply via email to