Re: [HACKERS] pg_terminate_backend() idea
"Bruce Momjian" <[EMAIL PROTECTED]> writes: > Gregory Stark wrote: > >> It seems to me we could replace all of the above with either SIGINT or USR1 >> and have a bunch of boolean flags in MyProc. I'm not sure of the implication >> for sinval processing of having to get a whole bunch of LWLocks though. > > Keep in mind PGPROC->terminate was easier because once it was set it was > never cleared. If you need to clear the flag and keep going the code is > going to need to be a little more sophisticated to avoid lost interrupts. That's kind of the point. I don't think we care about lost interrupts for any of these things. As long as we know a sinval flush happened I think we don't care how many more happened. Or a cancel request or terminate request. For sinval of course it would be important to clear the flags (with MyProc locked) prior to processing the queue in case one arrives as soon as we're finished. But I think that's the only detail. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training! -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_terminate_backend idea
On Tue, 2005-06-21 at 23:34 -0400, Tom Lane wrote: > Bruce Momjian writes: > > Tom Lane wrote: > >> In any case the correct way to solve the problem is to find out what's > >> being left corrupt by SIGTERM, rather than install more messiness in > >> order to avoid facing the real issue ... > > > I am confused. Are you talking about the client SIGTERM or the server? > > I am talking about Rod Taylor's reports that SIGTERM'ing individual > backends tends to lead to "lock table corrupted" crashes awhile later. > Now, I've been playing the part of Chicken Little on this for awhile, > but seeing an actual report of problems from the field certainly > strengthens my feelings about it. Bringing this thread back to life. I have not seen a lock table corruption issue with SIGTERM in 8.1 on Solaris/Sun IV, Linux/AMD64, or Linux/Intel. I don't recall seeing one on 8.0.3 either though I'm pretty sure there were several on 8.0.1. There are times when locks for a process hang around for a few minutes before getting cleared. I don't recall whether they were ungranted table locks or entries waiting on a transaction ID lock, but the source was Slony and a large pg_listener structure with more than 2 pages (yes, pages not tuples). I have also seen processes refusing to acknowledge the signal and exit during btree index builds, but that's not a data corruption issue. -- ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pg_terminate_backend idea
Tom Lane wrote: > Bruce Momjian writes: > > I have been running some tests to try to see the lock table corruption > > but I have been unable to reproduce the problem. > > It's possible that what Rod was complaining of is fixed in CVS tip --- > see discussion about RemoveFromWaitQueue() bug. If so, it would have > been a bug that could be seen in other circumstances too, but maybe > SIGTERM made it more probable for some reason. Was that backpatched to 8.0.X? If not, I can test that branch of CVS for the problem. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] pg_terminate_backend idea
Bruce Momjian writes: > I have been running some tests to try to see the lock table corruption > but I have been unable to reproduce the problem. It's possible that what Rod was complaining of is fixed in CVS tip --- see discussion about RemoveFromWaitQueue() bug. If so, it would have been a bug that could be seen in other circumstances too, but maybe SIGTERM made it more probable for some reason. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] pg_terminate_backend idea
Tom Lane wrote: > "Magnus Hagander" <[EMAIL PROTECTED]> writes: > > Assuming we don't get such a case, and a chance to fix it, before 8.1 > > (while still hoping we will get it fixed properly, we can't be sure, can > > we? If we were, it'd be fixed already). In this case, will you consider > > such a kludgy solution as a temporary fix to resolve a problem that a > > lot of users are having? And then plan to have it removed once sending > > SIGTERM directly to a backend can be considered safe? > > Kluges tend to become institutionalized, so my reaction is "no". It's > also worth pointing out that with so little understanding of the problem > Rod is reporting, it's tough to make a convincing case that this kluge > will avoid it. SIGTERM exit *shouldn't* be leaving any corrupted > locktable entries behind; it's not that much different from the normal > case. Until we find out what's going on, introducing still another exit > path isn't really going to make me feel more comfortable, no matter how > close it's alleged to be to the normal path. I have been running some tests to try to see the lock table corruption but I have been unable to reproduce the problem. I have attached my crude test scripts. I would run the scripts and then open another session as a different user and do UPDATE and LOCK to cause the psql session to block. The only functional difference I can see between a SIGTERM exit and a cancel followed by a normal exit is the call to AbortCurrentTransaction(). Could that be significant? Because I can't reproduce the failure I can't know for sure. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 #!/usr/contrib/bin/expect -- set timeout -1 eval spawn sql test expect -nocase "test=>" send "begin;\r" expect -nocase "test=>" send "lock pg_class;\r" expect -nocase "test=>" send "select * from pg_locks;\r" expect -nocase "test=>" send "update test set x=3;\r" expect -nocase "test=>" expect eof exit while : do XPID=`/letc/ps_sysv -ef | grep 'postgres test'|grep -v grep|awk '{print $2}'` if [ "$XPID" != "" ] thenkill $XPID echo $XPID XPID=`/letc/ps_sysv -ef | grep 'psql test'|grep -v execargs|awk '{print $2}'` kill $XPID fi sleep 1 done ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] pg_terminate_backend idea
"Magnus Hagander" <[EMAIL PROTECTED]> writes: > Assuming we don't get such a case, and a chance to fix it, before 8.1 > (while still hoping we will get it fixed properly, we can't be sure, can > we? If we were, it'd be fixed already). In this case, will you consider > such a kludgy solution as a temporary fix to resolve a problem that a > lot of users are having? And then plan to have it removed once sending > SIGTERM directly to a backend can be considered safe? Kluges tend to become institutionalized, so my reaction is "no". It's also worth pointing out that with so little understanding of the problem Rod is reporting, it's tough to make a convincing case that this kluge will avoid it. SIGTERM exit *shouldn't* be leaving any corrupted locktable entries behind; it's not that much different from the normal case. Until we find out what's going on, introducing still another exit path isn't really going to make me feel more comfortable, no matter how close it's alleged to be to the normal path. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] pg_terminate_backend idea
> >> In any case the correct way to solve the problem is to find out > >> what's being left corrupt by SIGTERM, rather than install more > >> messiness in order to avoid facing the real issue ... > > > That is unfortunatly way over my head. And it doesn't seem like > > anybody who actually has what it takes to do the "proper > solution" is > > interested in doing it. > > A test case --- even one that fails only a small percentage > of the time > --- would make things far easier. So far all I've seen are > very vague reports, and it's impossible to do anything about > it without more info. Very well. Let me try putting it like this, then: Assuming we don't get such a case, and a chance to fix it, before 8.1 (while still hoping we will get it fixed properly, we can't be sure, can we? If we were, it'd be fixed already). In this case, will you consider such a kludgy solution as a temporary fix to resolve a problem that a lot of users are having? And then plan to have it removed once sending SIGTERM directly to a backend can be considered safe? //Magnus ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] pg_terminate_backend idea
On 2005-06-22, Tom Lane <[EMAIL PROTECTED]> wrote: > Andrew - Supernews <[EMAIL PROTECTED]> writes: >> On 2005-06-22, Tom Lane <[EMAIL PROTECTED]> wrote: >>> Andreas Pflug <[EMAIL PROTECTED]> writes: I've seen cancel *not* working. >>> >>> Even a moment's perusal of the code will prove that there is no >>> situation in which a backend will respond to SIGTERM but not SIGINT > >> "idle in transaction". (or "idle" for that matter, but that's usually less >> significant.) > > In that case there's no query to cancel, so I would dispute the claim > that that constitutes "not working". You are totally missing the point. A backend that is "idle in transaction" is holding locks and an open xid that cannot be cleared by anything short of SIGTERM. Whether the fact that it ignores SIGINT is intentional or not is irrelevent, the fact is that this is the classic scenario where SIGTERM is effective and SIGINT is not. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] pg_terminate_backend idea
Andrew - Supernews <[EMAIL PROTECTED]> writes: > On 2005-06-22, Tom Lane <[EMAIL PROTECTED]> wrote: >> Andreas Pflug <[EMAIL PROTECTED]> writes: >>> I've seen cancel *not* working. >> >> Even a moment's perusal of the code will prove that there is no >> situation in which a backend will respond to SIGTERM but not SIGINT > "idle in transaction". (or "idle" for that matter, but that's usually less > significant.) In that case there's no query to cancel, so I would dispute the claim that that constitutes "not working". QueryCancel is defined to cancel the current query, not necessarily to abort your whole transaction. (Before 8.0 there wasn't much of a difference, but now there is: QueryCancel is an ordinary error that can be trapped by a savepoint. Are you arguing it should not be so trappable?) regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pg_terminate_backend idea
On 2005-06-22, Tom Lane <[EMAIL PROTECTED]> wrote: > Andreas Pflug <[EMAIL PROTECTED]> writes: >>> I thought we agreed that using the cancel functionality, which we know >>> works and is tested, > >> I've seen cancel *not* working. In 80 % this was the reason to use >> terminate. > > Even a moment's perusal of the code will prove that there is no > situation in which a backend will respond to SIGTERM but not SIGINT "idle in transaction". (or "idle" for that matter, but that's usually less significant.) -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] pg_terminate_backend idea
Andreas Pflug <[EMAIL PROTECTED]> writes: >> I thought we agreed that using the cancel functionality, which we know >> works and is tested, > I've seen cancel *not* working. In 80 % this was the reason to use > terminate. Even a moment's perusal of the code will prove that there is no situation in which a backend will respond to SIGTERM but not SIGINT --- there is only one InterruptPending flag and both cases are checked in ProcessInterrupts(). So I don't believe the above argument for using terminate in the slightest. I can easily believe that we have missed some places that need a CHECK_FOR_INTERRUPTS() call added, to ensure the backend can't go too long without making these checks. I added one in the planner main loop just a couple weeks ago, for instance. If you can identify what a backend that's ignoring a cancel request is doing, please let us know. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] pg_terminate_backend idea
"Magnus Hagander" <[EMAIL PROTECTED]> writes: >> In any case the correct way to solve the problem is to find >> out what's being left corrupt by SIGTERM, rather than install >> more messiness in order to avoid facing the real issue ... > That is unfortunatly way over my head. And it doesn't seem like anybody > who actually has what it takes to do the "proper solution" is interested > in doing it. A test case --- even one that fails only a small percentage of the time --- would make things far easier. So far all I've seen are very vague reports, and it's impossible to do anything about it without more info. regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] pg_terminate_backend idea
Bruce Momjian wrote: Tom Lane wrote: "Magnus Hagander" <[EMAIL PROTECTED]> writes: But it still requires me to send some data (such as a dummy query) to the backend before it exits. This is because server side libpq blocks when reading and ignores signals at this time. I believe the fix for this would be to pass a flag down to the libpq routines that we want to be abort in case of signal+flag, set only when doing the "main call" to recv, so we can kill idle process. Yech! That code is messy enough already, lets not pile another kluge atop it in order to handle something that's not even being requested AFAIR. In any case the correct way to solve the problem is to find out what's being left corrupt by SIGTERM, rather than install more messiness in order to avoid facing the real issue ... I am confused. Are you talking about the client SIGTERM or the server? I thought we agreed that using the cancel functionality, which we know works and is tested, I've seen cancel *not* working. In 80 % this was the reason to use terminate. Regards, Andreas ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pg_terminate_backend idea
> > But it still requires me to send some data (such as a dummy > query) to > > the backend before it exits. This is because server side > libpq blocks > > when reading and ignores signals at this time. I believe > the fix for > > this would be to pass a flag down to the libpq routines > that we want > > to be abort in case of signal+flag, set only when doing the "main > > call" to recv, so we can kill idle process. > > Yech! That code is messy enough already, lets not pile > another kluge atop it in order to handle something that's not > even being requested AFAIR. While I agree it'sa bit of a cludge, saying that it's not requested is absolutely and totally untrue. It has been requested *a lot*. People even use a method that is now *known* to be unsafe, simply because we do not provide another alternative. Therefor, I prefer a kludge than nothing at all. But a "proper solution" is of course better. > In any case the correct way to solve the problem is to find > out what's being left corrupt by SIGTERM, rather than install > more messiness in order to avoid facing the real issue ... That is unfortunatly way over my head. And it doesn't seem like anybody who actually has what it takes to do the "proper solution" is interested in doing it. //Magnus ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pg_terminate_backend idea
On Tue, 2005-06-21 at 23:34 -0400, Tom Lane wrote: > Bruce Momjian writes: > > Tom Lane wrote: > >> In any case the correct way to solve the problem is to find out what's > >> being left corrupt by SIGTERM, rather than install more messiness in > >> order to avoid facing the real issue ... > > > I am confused. Are you talking about the client SIGTERM or the server? > > I am talking about Rod Taylor's reports that SIGTERM'ing individual > backends tends to lead to "lock table corrupted" crashes awhile later. > Now, I've been playing the part of Chicken Little on this for awhile, > but seeing an actual report of problems from the field certainly > strengthens my feelings about it. > > What I think we need to do is find a way to isolate and fix the behavior > Rod is seeing. It may be that the bug occurs only for SIGTERM, or it > may be that it's a general problem that a SIGTERM just increases the > probability of seeing. In any case I think we have to solve it, not > create new mechanisms to try to ignore it. If it helps, it seems to occur primarily (perhaps always) when there are schema changes being performed when the SIGTERM is issued. I don't remember seeing them on Intel or on v7.2 (we didn't stay on 7.4 very long), but on a fairly well loaded Solaris machine (v880 with between 100 and 200 connections) it happens enough that we automatically schedule a server restart during the first opportunity when we need to kill connections in this way This is generally when the server doesn't recognize the client has dropped -- pgpool can be clumsy with connections). > > TODO has: > > > * Allow administrators to safely terminate individual sessions > > > Right now, SIGTERM will terminate a session, but it is treated as > > though the postmaster has paniced and shared memory might not be > > cleaned up properly. > > That statement is entirely incorrect. > > regards, tom lane > > ---(end of broadcast)--- > TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED] > -- ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] pg_terminate_backend idea
Bruce Momjian writes: > Tom Lane wrote: >> In any case the correct way to solve the problem is to find out what's >> being left corrupt by SIGTERM, rather than install more messiness in >> order to avoid facing the real issue ... > I am confused. Are you talking about the client SIGTERM or the server? I am talking about Rod Taylor's reports that SIGTERM'ing individual backends tends to lead to "lock table corrupted" crashes awhile later. Now, I've been playing the part of Chicken Little on this for awhile, but seeing an actual report of problems from the field certainly strengthens my feelings about it. What I think we need to do is find a way to isolate and fix the behavior Rod is seeing. It may be that the bug occurs only for SIGTERM, or it may be that it's a general problem that a SIGTERM just increases the probability of seeing. In any case I think we have to solve it, not create new mechanisms to try to ignore it. > TODO has: > * Allow administrators to safely terminate individual sessions > Right now, SIGTERM will terminate a session, but it is treated as > though the postmaster has paniced and shared memory might not be > cleaned up properly. That statement is entirely incorrect. regards, tom lane ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] pg_terminate_backend idea
Tom Lane wrote: > "Magnus Hagander" <[EMAIL PROTECTED]> writes: > > But it still requires me to send some data (such as a dummy query) to > > the backend before it exits. This is because server side libpq blocks > > when reading and ignores signals at this time. I believe the fix for > > this would be to pass a flag down to the libpq routines that we want to > > be abort in case of signal+flag, set only when doing the "main call" to > > recv, so we can kill idle process. > > Yech! That code is messy enough already, lets not pile another kluge > atop it in order to handle something that's not even being requested > AFAIR. > > In any case the correct way to solve the problem is to find out what's > being left corrupt by SIGTERM, rather than install more messiness in > order to avoid facing the real issue ... I am confused. Are you talking about the client SIGTERM or the server? I thought we agreed that using the cancel functionality, which we know works and is tested, to do backend terminate was the right approach. TODO has: * Allow administrators to safely terminate individual sessions Right now, SIGTERM will terminate a session, but it is treated as though the postmaster has paniced and shared memory might not be cleaned up properly. A new signal is needed for safe termination because backends must first do a query cancel, then exit once they have run the query cancel cleanup routine. I don't see us ever able to handle backend terminate in any other way. Are you complaining about the issue with terminating the client? I had not considered that. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [HACKERS] pg_terminate_backend idea
Tom Lane wrote: > "Magnus Hagander" <[EMAIL PROTECTED]> writes: > >>But it still requires me to send some data (such as a dummy query) to >>the backend before it exits. This is because server side libpq blocks >>when reading and ignores signals at this time. I believe the fix for >>this would be to pass a flag down to the libpq routines that we want to >>be abort in case of signal+flag, set only when doing the "main call" to >>recv, so we can kill idle process. > > > Yech! That code is messy enough already, lets not pile another kluge > atop it in order to handle something that's not even being requested > AFAIR. I ran into the same problem back when I was trying to implement an idle-in-transaction timeout, so solving this might be useful in more than one place.. -O ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] pg_terminate_backend idea
"Magnus Hagander" <[EMAIL PROTECTED]> writes: > But it still requires me to send some data (such as a dummy query) to > the backend before it exits. This is because server side libpq blocks > when reading and ignores signals at this time. I believe the fix for > this would be to pass a flag down to the libpq routines that we want to > be abort in case of signal+flag, set only when doing the "main call" to > recv, so we can kill idle process. Yech! That code is messy enough already, lets not pile another kluge atop it in order to handle something that's not even being requested AFAIR. In any case the correct way to solve the problem is to find out what's being left corrupt by SIGTERM, rather than install more messiness in order to avoid facing the real issue ... regards, tom lane ---(end of broadcast)--- TIP 8: explain analyze is your friend