Re: [HACKERS] pg_terminate_backend() idea

2008-04-15 Thread Gregory Stark
"Bruce Momjian" <[EMAIL PROTECTED]> writes:

> Gregory Stark wrote:
>
>> It seems to me we could replace all of the above with either SIGINT or USR1
>> and have a bunch of boolean flags in MyProc. I'm not sure of the implication
>> for sinval processing of having to get a whole bunch of LWLocks though.
>
> Keep in mind PGPROC->terminate was easier because once it was set it was
> never cleared.  If you need to clear the flag and keep going the code is
> going to need to be a little more sophisticated to avoid lost interrupts.

That's kind of the point. I don't think we care about lost interrupts for any
of these things. As long as we know a sinval flush happened I think we don't
care how many more happened. Or a cancel request or terminate request.

For sinval of course it would be important to clear the flags (with MyProc
locked) prior to processing the queue in case one arrives as soon as we're
finished. But I think that's the only detail.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL 
training!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_terminate_backend idea

2006-07-10 Thread Rod Taylor
On Tue, 2005-06-21 at 23:34 -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Tom Lane wrote:
> >> In any case the correct way to solve the problem is to find out what's
> >> being left corrupt by SIGTERM, rather than install more messiness in
> >> order to avoid facing the real issue ...
> 
> > I am confused.  Are you talking about the client SIGTERM or the server? 
> 
> I am talking about Rod Taylor's reports that SIGTERM'ing individual
> backends tends to lead to "lock table corrupted" crashes awhile later.
> Now, I've been playing the part of Chicken Little on this for awhile,
> but seeing an actual report of problems from the field certainly
> strengthens my feelings about it.

Bringing this thread back to life.

I have not seen a lock table corruption issue with SIGTERM in 8.1 on
Solaris/Sun IV, Linux/AMD64, or Linux/Intel. I don't recall seeing one
on 8.0.3 either though I'm pretty sure there were several on 8.0.1.

There are times when locks for a process hang around for a few minutes
before getting cleared. I don't recall whether they were ungranted table
locks or entries waiting on a transaction ID lock, but the source was
Slony and a large pg_listener structure with more than 2 pages (yes,
pages not tuples).

I have also seen processes refusing to acknowledge the signal and exit
during btree index builds, but that's not a data corruption issue.
-- 


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] pg_terminate_backend idea

2005-06-24 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian  writes:
> > I have been running some tests to try to see the lock table corruption
> > but I have been unable to reproduce the problem.
> 
> It's possible that what Rod was complaining of is fixed in CVS tip ---
> see discussion about RemoveFromWaitQueue() bug.  If so, it would have
> been a bug that could be seen in other circumstances too, but maybe
> SIGTERM made it more probable for some reason.

Was that backpatched to 8.0.X?  If not, I can test that branch of CVS
for the problem.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] pg_terminate_backend idea

2005-06-24 Thread Tom Lane
Bruce Momjian  writes:
> I have been running some tests to try to see the lock table corruption
> but I have been unable to reproduce the problem.

It's possible that what Rod was complaining of is fixed in CVS tip ---
see discussion about RemoveFromWaitQueue() bug.  If so, it would have
been a bug that could be seen in other circumstances too, but maybe
SIGTERM made it more probable for some reason.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] pg_terminate_backend idea

2005-06-24 Thread Bruce Momjian
Tom Lane wrote:
> "Magnus Hagander" <[EMAIL PROTECTED]> writes:
> > Assuming we don't get such a case, and a chance to fix it, before 8.1
> > (while still hoping we will get it fixed properly, we can't be sure, can
> > we? If we were, it'd be fixed already). In this case, will you consider
> > such a kludgy solution as a temporary fix to resolve a problem that a
> > lot of users are having? And then plan to have it removed once sending
> > SIGTERM directly to a backend can be considered safe?
> 
> Kluges tend to become institutionalized, so my reaction is "no".  It's
> also worth pointing out that with so little understanding of the problem
> Rod is reporting, it's tough to make a convincing case that this kluge
> will avoid it.  SIGTERM exit *shouldn't* be leaving any corrupted
> locktable entries behind; it's not that much different from the normal
> case.  Until we find out what's going on, introducing still another exit
> path isn't really going to make me feel more comfortable, no matter how
> close it's alleged to be to the normal path.

I have been running some tests to try to see the lock table corruption
but I have been unable to reproduce the problem.  I have attached my
crude test scripts.  I would run the scripts and then open another
session as a different user and do UPDATE and LOCK to cause the psql
session to block.

The only functional difference I can see between a SIGTERM exit and a
cancel followed by a normal exit is the call to
AbortCurrentTransaction().  Could that be significant?  Because I can't
reproduce the failure I can't know for sure.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
#!/usr/contrib/bin/expect --
set timeout -1
eval spawn sql test
expect -nocase "test=>"
send "begin;\r"
expect -nocase "test=>"
send "lock pg_class;\r"
expect -nocase "test=>"
send "select * from pg_locks;\r"
expect -nocase "test=>"
send "update test set x=3;\r"
expect -nocase "test=>"
expect eof
exit
while :
do
XPID=`/letc/ps_sysv -ef | grep 'postgres test'|grep -v grep|awk '{print 
$2}'`
if [ "$XPID" != "" ]
thenkill $XPID
echo $XPID
XPID=`/letc/ps_sysv -ef | grep 'psql test'|grep -v execargs|awk 
'{print $2}'`
kill $XPID
fi
sleep 1
done

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Tom Lane
"Magnus Hagander" <[EMAIL PROTECTED]> writes:
> Assuming we don't get such a case, and a chance to fix it, before 8.1
> (while still hoping we will get it fixed properly, we can't be sure, can
> we? If we were, it'd be fixed already). In this case, will you consider
> such a kludgy solution as a temporary fix to resolve a problem that a
> lot of users are having? And then plan to have it removed once sending
> SIGTERM directly to a backend can be considered safe?

Kluges tend to become institutionalized, so my reaction is "no".  It's
also worth pointing out that with so little understanding of the problem
Rod is reporting, it's tough to make a convincing case that this kluge
will avoid it.  SIGTERM exit *shouldn't* be leaving any corrupted
locktable entries behind; it's not that much different from the normal
case.  Until we find out what's going on, introducing still another exit
path isn't really going to make me feel more comfortable, no matter how
close it's alleged to be to the normal path.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Magnus Hagander
> >> In any case the correct way to solve the problem is to find out 
> >> what's being left corrupt by SIGTERM, rather than install more 
> >> messiness in order to avoid facing the real issue ...
> 
> > That is unfortunatly way over my head. And it doesn't seem like 
> > anybody who actually has what it takes to do the "proper 
> solution" is 
> > interested in doing it.
> 
> A test case --- even one that fails only a small percentage 
> of the time
> --- would make things far easier.  So far all I've seen are 
> very vague reports, and it's impossible to do anything about 
> it without more info.

Very well. Let me try putting it like this, then:

Assuming we don't get such a case, and a chance to fix it, before 8.1
(while still hoping we will get it fixed properly, we can't be sure, can
we? If we were, it'd be fixed already). In this case, will you consider
such a kludgy solution as a temporary fix to resolve a problem that a
lot of users are having? And then plan to have it removed once sending
SIGTERM directly to a backend can be considered safe?


//Magnus

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Andrew - Supernews
On 2005-06-22, Tom Lane <[EMAIL PROTECTED]> wrote:
> Andrew - Supernews <[EMAIL PROTECTED]> writes:
>> On 2005-06-22, Tom Lane <[EMAIL PROTECTED]> wrote:
>>> Andreas Pflug <[EMAIL PROTECTED]> writes:
 I've seen cancel *not* working.
>>> 
>>> Even a moment's perusal of the code will prove that there is no
>>> situation in which a backend will respond to SIGTERM but not SIGINT
>
>> "idle in transaction". (or "idle" for that matter, but that's usually less
>> significant.)
>
> In that case there's no query to cancel, so I would dispute the claim
> that that constitutes "not working".

You are totally missing the point.

A backend that is "idle in transaction" is holding locks and an open xid
that cannot be cleared by anything short of SIGTERM.

Whether the fact that it ignores SIGINT is intentional or not is irrelevent,
the fact is that this is the classic scenario where SIGTERM is effective and
SIGINT is not.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Tom Lane
Andrew - Supernews <[EMAIL PROTECTED]> writes:
> On 2005-06-22, Tom Lane <[EMAIL PROTECTED]> wrote:
>> Andreas Pflug <[EMAIL PROTECTED]> writes:
>>> I've seen cancel *not* working.
>> 
>> Even a moment's perusal of the code will prove that there is no
>> situation in which a backend will respond to SIGTERM but not SIGINT

> "idle in transaction". (or "idle" for that matter, but that's usually less
> significant.)

In that case there's no query to cancel, so I would dispute the claim
that that constitutes "not working".  QueryCancel is defined to cancel
the current query, not necessarily to abort your whole transaction.
(Before 8.0 there wasn't much of a difference, but now there is:
QueryCancel is an ordinary error that can be trapped by a savepoint.
Are you arguing it should not be so trappable?)

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Andrew - Supernews
On 2005-06-22, Tom Lane <[EMAIL PROTECTED]> wrote:
> Andreas Pflug <[EMAIL PROTECTED]> writes:
>>> I thought we agreed that using the cancel functionality, which we know
>>> works and is tested,
>
>> I've seen cancel *not* working. In 80 % this was the reason to use 
>> terminate.
>
> Even a moment's perusal of the code will prove that there is no
> situation in which a backend will respond to SIGTERM but not SIGINT

"idle in transaction". (or "idle" for that matter, but that's usually less
significant.)

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Tom Lane
Andreas Pflug <[EMAIL PROTECTED]> writes:
>> I thought we agreed that using the cancel functionality, which we know
>> works and is tested,

> I've seen cancel *not* working. In 80 % this was the reason to use 
> terminate.

Even a moment's perusal of the code will prove that there is no
situation in which a backend will respond to SIGTERM but not SIGINT
--- there is only one InterruptPending flag and both cases are checked
in ProcessInterrupts().  So I don't believe the above argument for
using terminate in the slightest.

I can easily believe that we have missed some places that need a
CHECK_FOR_INTERRUPTS() call added, to ensure the backend can't go too
long without making these checks.  I added one in the planner main
loop just a couple weeks ago, for instance.  If you can identify what
a backend that's ignoring a cancel request is doing, please let us know.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Tom Lane
"Magnus Hagander" <[EMAIL PROTECTED]> writes:
>> In any case the correct way to solve the problem is to find 
>> out what's being left corrupt by SIGTERM, rather than install 
>> more messiness in order to avoid facing the real issue ...

> That is unfortunatly way over my head. And it doesn't seem like anybody
> who actually has what it takes to do the "proper solution" is interested
> in doing it.

A test case --- even one that fails only a small percentage of the time
--- would make things far easier.  So far all I've seen are very vague
reports, and it's impossible to do anything about it without more info.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Andreas Pflug

Bruce Momjian wrote:

Tom Lane wrote:


"Magnus Hagander" <[EMAIL PROTECTED]> writes:


But it still requires me to send some data (such as a dummy query) to
the backend before it exits. This is because server side libpq blocks
when reading and ignores signals at this time. I believe the fix for
this would be to pass a flag down to the libpq routines that we want to
be abort in case of signal+flag, set only when doing the "main call" to
recv, so we can kill idle process.


Yech!  That code is messy enough already, lets not pile another kluge
atop it in order to handle something that's not even being requested
AFAIR.

In any case the correct way to solve the problem is to find out what's
being left corrupt by SIGTERM, rather than install more messiness in
order to avoid facing the real issue ...



I am confused.  Are you talking about the client SIGTERM or the server? 
I thought we agreed that using the cancel functionality, which we know

works and is tested,


I've seen cancel *not* working. In 80 % this was the reason to use 
terminate.


Regards,
Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pg_terminate_backend idea

2005-06-22 Thread Magnus Hagander
> > But it still requires me to send some data (such as a dummy 
> query) to 
> > the backend before it exits. This is because server side 
> libpq blocks 
> > when reading and ignores signals at this time. I believe 
> the fix for 
> > this would be to pass a flag down to the libpq routines 
> that we want 
> > to be abort in case of signal+flag, set only when doing the "main 
> > call" to recv, so we can kill idle process.
> 
> Yech!  That code is messy enough already, lets not pile 
> another kluge atop it in order to handle something that's not 
> even being requested AFAIR.

While I agree it'sa  bit of a cludge, saying that it's not requested is
absolutely and totally untrue. It has been requested *a lot*. People
even use a method that is now *known* to be unsafe, simply because we do
not provide another alternative.

Therefor, I prefer a kludge than nothing at all. But a "proper solution"
is of course better.


> In any case the correct way to solve the problem is to find 
> out what's being left corrupt by SIGTERM, rather than install 
> more messiness in order to avoid facing the real issue ...

That is unfortunatly way over my head. And it doesn't seem like anybody
who actually has what it takes to do the "proper solution" is interested
in doing it.


//Magnus

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pg_terminate_backend idea

2005-06-21 Thread Rod Taylor
On Tue, 2005-06-21 at 23:34 -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Tom Lane wrote:
> >> In any case the correct way to solve the problem is to find out what's
> >> being left corrupt by SIGTERM, rather than install more messiness in
> >> order to avoid facing the real issue ...
> 
> > I am confused.  Are you talking about the client SIGTERM or the server? 
> 
> I am talking about Rod Taylor's reports that SIGTERM'ing individual
> backends tends to lead to "lock table corrupted" crashes awhile later.
> Now, I've been playing the part of Chicken Little on this for awhile,
> but seeing an actual report of problems from the field certainly
> strengthens my feelings about it.
> 
> What I think we need to do is find a way to isolate and fix the behavior
> Rod is seeing.  It may be that the bug occurs only for SIGTERM, or it
> may be that it's a general problem that a SIGTERM just increases the
> probability of seeing.  In any case I think we have to solve it, not
> create new mechanisms to try to ignore it.

If it helps, it seems to occur primarily (perhaps always) when there are
schema changes being performed when the SIGTERM is issued.

I don't remember seeing them on Intel or on v7.2 (we didn't stay on 7.4
very long), but on a fairly well loaded Solaris machine (v880 with
between 100 and 200 connections) it happens enough that we automatically
schedule a server restart during the first opportunity when we need to
kill connections in this way This is generally when the server doesn't
recognize the client has dropped -- pgpool can be clumsy with
connections).

> > TODO has:
> 
> > * Allow administrators to safely terminate individual sessions
>   
> >   Right now, SIGTERM will terminate a session, but it is treated as
> >   though the postmaster has paniced and shared memory might not be
> >   cleaned up properly.
> 
> That statement is entirely incorrect.
> 
>   regards, tom lane
> 
> ---(end of broadcast)---
> TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
> 
-- 


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] pg_terminate_backend idea

2005-06-21 Thread Tom Lane
Bruce Momjian  writes:
> Tom Lane wrote:
>> In any case the correct way to solve the problem is to find out what's
>> being left corrupt by SIGTERM, rather than install more messiness in
>> order to avoid facing the real issue ...

> I am confused.  Are you talking about the client SIGTERM or the server? 

I am talking about Rod Taylor's reports that SIGTERM'ing individual
backends tends to lead to "lock table corrupted" crashes awhile later.
Now, I've been playing the part of Chicken Little on this for awhile,
but seeing an actual report of problems from the field certainly
strengthens my feelings about it.

What I think we need to do is find a way to isolate and fix the behavior
Rod is seeing.  It may be that the bug occurs only for SIGTERM, or it
may be that it's a general problem that a SIGTERM just increases the
probability of seeing.  In any case I think we have to solve it, not
create new mechanisms to try to ignore it.

> TODO has:

>   * Allow administrators to safely terminate individual sessions

> Right now, SIGTERM will terminate a session, but it is treated as
> though the postmaster has paniced and shared memory might not be
> cleaned up properly.

That statement is entirely incorrect.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] pg_terminate_backend idea

2005-06-21 Thread Bruce Momjian
Tom Lane wrote:
> "Magnus Hagander" <[EMAIL PROTECTED]> writes:
> > But it still requires me to send some data (such as a dummy query) to
> > the backend before it exits. This is because server side libpq blocks
> > when reading and ignores signals at this time. I believe the fix for
> > this would be to pass a flag down to the libpq routines that we want to
> > be abort in case of signal+flag, set only when doing the "main call" to
> > recv, so we can kill idle process.
> 
> Yech!  That code is messy enough already, lets not pile another kluge
> atop it in order to handle something that's not even being requested
> AFAIR.
> 
> In any case the correct way to solve the problem is to find out what's
> being left corrupt by SIGTERM, rather than install more messiness in
> order to avoid facing the real issue ...

I am confused.  Are you talking about the client SIGTERM or the server? 
I thought we agreed that using the cancel functionality, which we know
works and is tested, to do backend terminate was the right approach. 
TODO has:

* Allow administrators to safely terminate individual sessions

  Right now, SIGTERM will terminate a session, but it is treated as
  though the postmaster has paniced and shared memory might not be
  cleaned up properly.  A new signal is needed for safe termination
  because backends must first do a query cancel, then exit once they
  have run the query cancel cleanup routine.

I don't see us ever able to handle backend terminate in any other way. 
Are you complaining about the issue with terminating the client?  I had
not considered that.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] pg_terminate_backend idea

2005-06-21 Thread Oliver Jowett
Tom Lane wrote:
> "Magnus Hagander" <[EMAIL PROTECTED]> writes:
> 
>>But it still requires me to send some data (such as a dummy query) to
>>the backend before it exits. This is because server side libpq blocks
>>when reading and ignores signals at this time. I believe the fix for
>>this would be to pass a flag down to the libpq routines that we want to
>>be abort in case of signal+flag, set only when doing the "main call" to
>>recv, so we can kill idle process.
> 
> 
> Yech!  That code is messy enough already, lets not pile another kluge
> atop it in order to handle something that's not even being requested
> AFAIR.

I ran into the same problem back when I was trying to implement an
idle-in-transaction timeout, so solving this might be useful in more
than one place..

-O

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] pg_terminate_backend idea

2005-06-21 Thread Tom Lane
"Magnus Hagander" <[EMAIL PROTECTED]> writes:
> But it still requires me to send some data (such as a dummy query) to
> the backend before it exits. This is because server side libpq blocks
> when reading and ignores signals at this time. I believe the fix for
> this would be to pass a flag down to the libpq routines that we want to
> be abort in case of signal+flag, set only when doing the "main call" to
> recv, so we can kill idle process.

Yech!  That code is messy enough already, lets not pile another kluge
atop it in order to handle something that's not even being requested
AFAIR.

In any case the correct way to solve the problem is to find out what's
being left corrupt by SIGTERM, rather than install more messiness in
order to avoid facing the real issue ...

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend