Re: [HACKERS] Quite strange crash

2001-01-09 Thread Vadim Mikheev

  Well, it's not a good idea because SIGTERM is used for ABORT + EXIT
  (pg_ctl -m fast stop), but shouldn't ABORT clean up everything?
 
 Er, shouldn't ABORT leave the system in the exact state that it's
 in so that one can get a crashdump/traceback on a wedged process
 without it trying to clean up after itself?

Sorry, I meant "transaction abort"...

Vadim





Re: [HACKERS] Quite strange crash

2001-01-09 Thread Tom Lane

[EMAIL PROTECTED] (Nathan Myers) writes:
 The relevance to the issue at hand is that processes dying during 
 heavy memory load is a documented feature of our supported platforms.

Ugh.  Do you know anything about *how* they get killed --- ie, with
what signal?

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-09 Thread Tom Lane

Denis Perchine [EMAIL PROTECTED] writes:
 Didn't you get my mail with a piece of Linux kernel code? I think all is 
 clear there.

That was implementing CPU-time-exceeded kill, which is a different
issue.

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-09 Thread Denis Perchine

  Didn't you get my mail with a piece of Linux kernel code? I think all is
  clear there.

 That was implementing CPU-time-exceeded kill, which is a different
 issue.

Oops... You are talking about the OOM killer.

/* This process has hardware access, be more careful. */
if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO)) {
  force_sig(SIGTERM, p);
} else {
  force_sig(SIGKILL, p);
}

You will get SIGKILL in most cases.

-- 
Sincerely Yours,
Denis Perchine

--
E-Mail: [EMAIL PROTECTED]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
--



RE: [HACKERS] Quite strange crash

2001-01-09 Thread Mikheev, Vadim

  START_/END_CRIT_SECTION is mostly CritSectionCount++/--.
  Recording could be made as 
  LockedSpinLocks[LockedSpinCounter++] = spinlock
  in pre-allocated array.
 
 Yeah, I suppose.  We already do record locking of all the fixed
 spinlocks (BufMgrLock etc), it's just the per-buffer spinlocks that
 are missing from that (and CRIT_SECTION calls). Would it be 
 reasonable to assume that only one buffer spinlock could be held
 at a time?

No. UPDATE holds two spins, btree split even more.
But stop - afair bufmgr remembers locked buffers, probably
we could just add XXX_CRIT_SECTION to LockBuffer..?

Vadim



Re: [HACKERS] Quite strange crash

2001-01-09 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 Yeah, I suppose.  We already do record locking of all the fixed
 spinlocks (BufMgrLock etc), it's just the per-buffer spinlocks that
 are missing from that (and CRIT_SECTION calls). Would it be 
 reasonable to assume that only one buffer spinlock could be held
 at a time?

 No. UPDATE holds two spins, btree split even more.
 But stop - afair bufmgr remembers locked buffers, probably
 we could just add XXX_CRIT_SECTION to LockBuffer..?

Right.  A buffer lock isn't a spinlock, ie, we don't hold the spinlock
except within LockBuffer.  So a quick CRIT_SECTION should deal with
that.  Actually, with careful placement of CRIT_SECTION calls in
LockBuffer, there's no need to record holding the buffer's cntxt
spinlock at all, I think.  Will work on it.

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-09 Thread Tom Lane

Denis Perchine [EMAIL PROTECTED] writes:
 You will get SIGKILL in most cases.

Well, a SIGKILL will cause the postmaster to shut down and restart the
other backends, so we should be safe if that happens.  (Annoyed as heck,
maybe, but safe.)

Anyway, this is looking more and more like the SIGTERM that caused your
vacuum to die must have been done manually.

The CRIT_SECTION code that I'm about to go off and add to spinlocking
should prevent similar problems from happening in 7.1, but I don't think
it's reasonable to try to retrofit that into 7.0.*.

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-09 Thread Nathan Myers

On Wed, Jan 10, 2001 at 12:46:50AM +0600, Denis Perchine wrote:
   Didn't you get my mail with a piece of Linux kernel code? I think all is
   clear there.
 
  That was implementing CPU-time-exceeded kill, which is a different
  issue.
 
 Oops... You are talking about the OOM killer.
 
 /* This process has hardware access, be more careful. */
 if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO)) {
   force_sig(SIGTERM, p);
 } else {
   force_sig(SIGKILL, p);
 }
 
 You will get SIGKILL in most cases.

... on Linux, anyhow.  There's no standard for this behavior.
Probably others try a SIGTERM first (on several processes) and 
then a SIGKILL if none die.

If a backend dies while holding a lock, doesn't that imply that
the shared memory may be in an inconsistent state?  Surely a death
while holding a lock should shut down the whole database, without
writing anything to disk.

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] Quite strange crash

2001-01-09 Thread Tom Lane

[EMAIL PROTECTED] (Nathan Myers) writes:
 If a backend dies while holding a lock, doesn't that imply that
 the shared memory may be in an inconsistent state?

Yup.  I had just come to the realization that we'd be best off to treat
the *entire* period from SpinAcquire to SpinRelease as a critical
section for the purposes of die().  That is, response to SIGTERM will be
held off until we have released the spinlock.  Most of the places where
we grab spinlocks would have to make such a critical section anyway, at
least for large parts of the time that they are holding the spinlock,
because they are manipulating shared data structures and the
instantaneous intermediate states aren't always self-consistent.  So we
might as well follow the KISS principle and just do START_CRIT_SECTION
in SpinAcquire and END_CRIT_SECTION in SpinRelease.
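
For concreteness, here is a minimal, self-contained sketch of that scheme.
The names (die, SpinAcquire, SpinRelease, CritSectionCount, BufMgrLock)
mirror the ones under discussion, but this is only an illustration of the
idea, not the actual backend code:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t CritSectionCount = 0;
static volatile sig_atomic_t DieRequested = 0;
static volatile int BufMgrLock = 0;     /* stand-in for a shared spinlock */

#define START_CRIT_SECTION()  (CritSectionCount++)
#define END_CRIT_SECTION() \
    do { \
        if (--CritSectionCount == 0 && DieRequested) \
            _exit(1); \
    } while (0)

/* SIGTERM handler: exit immediately only if no spinlock is held */
static void
die(int sig)
{
    if (CritSectionCount > 0)
        DieRequested = 1;       /* defer until END_CRIT_SECTION */
    else
        _exit(1);
}

static void
SpinAcquire(volatile int *lock)
{
    START_CRIT_SECTION();                   /* hold off die() first */
    while (__sync_lock_test_and_set(lock, 1))
        ;                                   /* busy-wait for the lock */
}

static void
SpinRelease(volatile int *lock)
{
    __sync_lock_release(lock);
    END_CRIT_SECTION();         /* a deferred SIGTERM is honored here */
}

int
main(void)
{
    signal(SIGTERM, die);

    SpinAcquire(&BufMgrLock);
    sleep(5);                   /* a SIGTERM arriving now is deferred... */
    SpinRelease(&BufMgrLock);   /* ...and acted on only after the release */

    printf("no SIGTERM arrived; exiting normally\n");
    return 0;
}

The point is simply that die() never exits while CritSectionCount is
nonzero; the deferred request is honored at the matching SpinRelease.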

Vadim, any objection?

regards, tom lane



RE: [HACKERS] Quite strange crash

2001-01-09 Thread Mikheev, Vadim

 Yup.  I had just come to the realization that we'd be best 
 off to treat the *entire* period from SpinAcquire to SpinRelease
 as a critical section for the purposes of die(). That is, response
 to SIGTERM will be held off until we have released the spinlock.
 Most of the places where we grab spinlocks would have to make such
 a critical section anyway, at least for large parts of the time that
 they are holding the spinlock, because they are manipulating shared
 data structures and the instantaneous intermediate states aren't always
 self-consistent.  So we might as well follow the KISS principle and
 just do START_CRIT_SECTION in SpinAcquire and END_CRIT_SECTION in
 SpinRelease.
 
 Vadim, any objection?

None for the moment. If we just add XXX_CRIT_SECTION
to the SpinXXX funcs without changing anything else, then it will be easy
to remove them later (in the event we find any problems with this),
so - do it.

Vadim



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Tom Lane

Denis Perchine [EMAIL PROTECTED] writes:
 On Monday 08 January 2001 00:08, Tom Lane wrote:
 FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
 
 Were there any errors before that?

 No... Just a clean log (I redirect the log from stderr/stdout to a file,
 and everything else to syslog).

The error messages would be in the syslog then, not in stderr.

 And the last query was:
 Jan  7 04:27:53 mx postgres[1008]: query: select message_id from pop3 where 
 server_id = 22615

How about the prior queries of other processes?  Keep in mind that the
spinlock could have been left locked by any backend, not only the one
that complained about it.

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Denis Perchine

  FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
 
  Were there any errors before that?
 
  No... Just clean log (I redirect log from stderr/out t file, and all
  other to syslog).

 The error messages would be in the syslog then, not in stderr.

Hmmm... The only strange errors I see are:

Jan  7 04:22:14 mx postgres[679]: query: insert into statistic (date, 
visit_count, variant_id) values (now(), 1, 2)
Jan  7 04:22:14 mx postgres[631]: query: insert into statistic (date, 
visit_count, variant_id) values (now(), 1, 2)
Jan  7 04:22:14 mx postgres[700]: query: insert into statistic (date, 
visit_count, variant_id) values (now(), 1, 2)
Jan  7 04:22:14 mx postgres[665]: query: insert into statistic (date, 
visit_count, variant_id) values (now(), 1, 2)
Jan  7 04:22:14 mx postgres[633]: query: insert into statistic (date, 
visit_count, variant_id) values (now(), 1, 2)
Jan  7 04:22:14 mx postgres[629]: query: insert into statistic (date, 
visit_count, variant_id) values (now(), 1, 2)
Jan  7 04:22:14 mx postgres[736]: query: commit
Jan  7 04:22:14 mx postgres[736]: ProcessUtility: commit
Jan  7 04:22:14 mx postgres[700]: ERROR:  Cannot insert a duplicate key into 
unique index statistic_date_vid_key
Jan  7 04:22:14 mx postgres[700]: query: update users set 
rcpt_ip='213.75.35.129',rcptdate=now() where id=1428067
Jan  7 04:22:14 mx postgres[700]: NOTICE:  current transaction is aborted, 
queries ignored until end of transaction block
Jan  7 04:22:14 mx postgres[679]: query: commit
Jan  7 04:22:14 mx postgres[679]: ProcessUtility: commit
Jan  7 04:22:14 mx postgres[679]: query: update users set 
rcpt_ip='213.75.55.185',rcptdate=now() where id=1430836
Jan  7 04:22:14 mx postgres[665]: ERROR:  Cannot insert a duplicate key into 
unique index statistic_date_vid_key
Jan  7 04:22:14 mx postgres[665]: query: update users set 
rcpt_ip='202.156.121.139',rcptdate=now() where id=1271397
Jan  7 04:22:14 mx postgres[665]: NOTICE:  current transaction is aborted, 
queries ignored until end of transaction block
Jan  7 04:22:14 mx postgres[631]: ERROR:  Cannot insert a duplicate key into 
unique index statistic_date_vid_key
Jan  7 04:22:14 mx postgres[631]: query: update users set 
rcpt_ip='24.20.53.63',rcptdate=now() where id=1451254
Jan  7 04:22:14 mx postgres[631]: NOTICE:  current transaction is aborted, 
queries ignored until end of transaction block
Jan  7 04:22:14 mx postgres[633]: ERROR:  Cannot insert a duplicate key into 
unique index statistic_date_vid_key
Jan  7 04:22:14 mx postgres[633]: query: update users set 
rcpt_ip='213.116.168.173',rcptdate=now() where id=1378049
Jan  7 04:22:14 mx postgres[633]: NOTICE:  current transaction is aborted, 
queries ignored until end of transaction block
Jan  7 04:22:14 mx postgres[630]: query: select id,msg,next from alert
Jan  7 04:22:14 mx postgres[630]: query: select email,type from email where 
variant_id=2
Jan  7 04:22:14 mx postgres[630]: query:
select * from users where senderdate > now()-'10days'::interval AND
variant_id=2 AND crypt='21AN6KRffJdFRFc511'
 
Jan  7 04:22:14 mx postgres[629]: ERROR:  Cannot insert a duplicate key into 
unique index statistic_date_vid_key
Jan  7 04:22:14 mx postgres[629]: query: update users set 
rcpt_ip='213.42.45.81',rcptdate=now() where id=1441046
Jan  7 04:22:14 mx postgres[629]: NOTICE:  current transaction is aborted, 
queries ignored until end of transaction block
Jan  7 04:22:15 mx postgres[711]: query: select message_id from pop3 where 
server_id = 17746
Jan  7 04:22:15 mx postgres[711]: ERROR:  Relation 'pop3' does not exist

They popped up 4 minutes before. And the most interesting thing is that
relation pop3 does exist!

  And the last query was:
  Jan  7 04:27:53 mx postgres[1008]: query: select message_id from pop3
  where server_id = 22615

 How about the prior queries of other processes?

I do not want to flood the mailing list (it would be too much info). I can send
you the complete log file from Jan 7. It is 128Mb uncompressed; with gz it is
8Mb. Maybe it will be smaller with bz2.

  Keep in mind that the
 spinlock could have been left locked by any backend, not only the one
 that complained about it.

Actually you can have a look at the logs yourself. Remember I gave you the
password for the postgres user. This is the same postgres. Logs are in
/var/log/postgres. You will need postgres.log.1.gz.

-- 
Sincerely Yours,
Denis Perchine

--
E-Mail: [EMAIL PROTECTED]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
--



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Tom Lane

Denis Perchine [EMAIL PROTECTED] writes:
 FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
 
 Were there any errors before that?

 Actually you can have a look at the logs yourself.

Well, I found a smoking gun:

Jan  7 04:27:51 mx postgres[2501]: FATAL 1:  The system is shutting down

PID 2501 had been running:

Jan  7 04:25:44 mx postgres[2501]: query: vacuum verbose lazy;

What seems to have happened is that 2501 curled up and died, leaving
one or more buffer spinlocks locked.  Roughly one spinlock timeout
later, at 04:29:07, we have 1008 complaining of a stuck spinlock.
So that fits.

The real question is what happened to 2501?  None of the other backends
reported a SIGTERM signal, so the signal did not come from the
postmaster.

Another interesting datapoint: there is a second place in this logfile
where one single backend reports SIGTERM while its brethren keep running:

Jan  7 04:30:47 mx postgres[4269]: query: vacuum verbose;
...
Jan  7 04:38:16 mx postgres[4269]: FATAL 1:  The system is shutting down

There is something pretty fishy about this.  You aren't by any chance
running the postmaster under a ulimit setting that might cut off
individual backends after a certain amount of CPU time, are you?
What signal does a ulimit violation deliver on your machine, anyway?

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Nathan Myers

On Mon, Jan 08, 2001 at 12:21:38PM -0500, Tom Lane wrote:
 Denis Perchine [EMAIL PROTECTED] writes:
  FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
  
  Were there any errors before that?
 
  Actually you can have a look at the logs yourself.
 
 Well, I found a smoking gun: ...
 What seems to have happened is that 2501 curled up and died, leaving
 one or more buffer spinlocks locked.  ...
 There is something pretty fishy about this.  You aren't by any chance
 running the postmaster under a ulimit setting that might cut off
 individual backends after a certain amount of CPU time, are you?
 What signal does a ulimit violation deliver on your machine, anyway?

It's worth noting here that modern Unixes run around killing user-level
processes more or less at random when free swap space (and sometimes
just RAM) runs low.  AIX was the first such, but would send SIGDANGER
to processes first to try to reclaim some RAM; critical daemons were
expected to explicitly ignore SIGDANGER. Other Unixes picked up the 
idea without picking up the SIGDANGER behavior.

The reason for this common pathological behavior is usually traced
to sloppy resource accounting.  It manifests as the bad policy of 
having malloc() (and sbrk() or mmap() underneath) return a valid 
pointer rather than NULL, on the assumption that most of the memory 
asked for won't be used just yet.  Anyhow, the system doesn't know 
how much memory is really available at that moment.

Usually the problem is explained with the example of a very large
process that forks, suddenly demanding twice as much memory. (Apache
is particularly egregious this way, allocating lots of memory and
then forking several times.)  Instead of failing the fork, the kernel
waits for a process to touch memory it was granted, then sees whether
any RAM/swap has turned up to satisfy it, and kills the process
(or some random other process!) if not.
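
A tiny stand-alone program makes the point: on a box with overcommit enabled
the malloc() calls keep succeeding, and the process only gets into trouble
(possibly an OOM kill) once the memset() forces real pages to be found.
This is just an illustration of the platform behavior, not PostgreSQL code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    size_t chunk = (size_t) 1024 * 1024 * 1024;   /* 1 GB per allocation */
    int i;

    for (i = 0; i < 64; i++)
    {
        char *p = malloc(chunk);

        if (p == NULL)
        {
            /* Without overcommit, the allocation is refused here... */
            printf("malloc refused at %d GB\n", i);
            return 1;
        }
        /* ...with overcommit, trouble starts only when we touch the pages. */
        memset(p, 0xAA, chunk);
        printf("allocated and touched %d GB\n", i + 1);
    }
    return 0;
}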

Now that programs have come to depend on this behavior, it has become
very hard to fix it. The implication for the rest of us is that we 
should expect our processes to be killed at random, just for touching 
memory granted, or for no reason at all. (Kernel people say, "They're 
just user-level programs, restart them;" or, "Maybe we can designate 
some critical processes that don't get killed".)  In Linux they try 
to invent heuristics to avoid killing the X server, because so many 
programs depend on it.  It's a disgraceful mess, really.

The relevance to the issue at hand is that processes dying during 
heavy memory load is a documented feature of our supported platforms.

Nathan Myers 
[EMAIL PROTECTED]



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Denis Perchine

  Well, I found a smoking gun: ...
  What seems to have happened is that 2501 curled up and died, leaving
  one or more buffer spinlocks locked.  ...
  There is something pretty fishy about this.  You aren't by any chance
  running the postmaster under a ulimit setting that might cut off
  individual backends after a certain amount of CPU time, are you?
  What signal does a ulimit violation deliver on your machine, anyway?

 It's worth noting here that modern Unixes run around killing user-level
 processes more or less at random when free swap space (and sometimes
 just RAM) runs low.  AIX was the first such, but would send SIGDANGER
 to processes first to try to reclaim some RAM; critical daemons were
 expected to explicitly ignore SIGDANGER. Other Unixes picked up the
 idea without picking up the SIGDANGER behavior.

That's not the case for sure. There are 512Mb on the machine, and when I had 
this problem it was completely unloaded (300Mb in caches).

-- 
Sincerely Yours,
Denis Perchine

--
E-Mail: [EMAIL PROTECTED]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
--



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Tom Lane

Denis Perchine [EMAIL PROTECTED] writes:
 It's worth noting here that modern Unixes run around killing user-level
 processes more or less at random when free swap space (and sometimes
 just RAM) runs low.

 That's not the case for sure. There are 512Mb on the machine, and when I had 
 this problem it was completely unloaded (300Mb in caches).

The fact that VACUUM processes seemed to be preferential victims
suggests a resource limit of some sort.  I had suggested a CPU-time
limit, but perhaps it could also be disk-pages-written.

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Denis Perchine

On Monday 08 January 2001 23:21, Tom Lane wrote:
 Denis Perchine [EMAIL PROTECTED] writes:
  FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
 
  Were there any errors before that?
 
  Actually you can have a look at the logs yourself.

 Well, I found a smoking gun:

 Jan  7 04:27:51 mx postgres[2501]: FATAL 1:  The system is shutting down

 PID 2501 had been running:

 Jan  7 04:25:44 mx postgres[2501]: query: vacuum verbose lazy;

Hmmm... actually this is a real problem with vacuum lazy. Sometimes it just
does something for an enormous amount of time (I have mailed a sample database
to Vadim, but did not get any response yet). It is possible that it was me who
killed the backend.

 What seems to have happened is that 2501 curled up and died, leaving
 one or more buffer spinlocks locked.  Roughly one spinlock timeout
 later, at 04:29:07, we have 1008 complaining of a stuck spinlock.
 So that fits.

 The real question is what happened to 2501?  None of the other backends
 reported a SIGTERM signal, so the signal did not come from the
 postmaster.

 Another interesting datapoint: there is a second place in this logfile
 where one single backend reports SIGTERM while its brethren keep running:

 Jan  7 04:30:47 mx postgres[4269]: query: vacuum verbose;
 ...
 Jan  7 04:38:16 mx postgres[4269]: FATAL 1:  The system is shutting down

Hmmm... Maybe this also was me... But I am not sure here.

 There is something pretty fishy about this.  You aren't by any chance
 running the postmaster under a ulimit setting that might cut off
 individual backends after a certain amount of CPU time, are you?

[postgres@mx postgres]$ ulimit -a
core file size (blocks)  100
data seg size (kbytes)   unlimited
file size (blocks)   unlimited
max memory size (kbytes) unlimited
stack size (kbytes)  8192
cpu time (seconds)   unlimited
max user processes   2048
pipe size (512 bytes)8
open files   1024
virtual memory (kbytes)  2105343

No, there are no ulimits.

 What signal does a ulimit violation deliver on your machine, anyway?

if (psecs / HZ > p->rlim[RLIMIT_CPU].rlim_cur) {
        /* Send SIGXCPU every second.. */
        if (!(psecs % HZ))
                send_sig(SIGXCPU, p, 1);
        /* and SIGKILL when we go over max.. */
        if (psecs / HZ > p->rlim[RLIMIT_CPU].rlim_max)
                send_sig(SIGKILL, p, 1);
}

This part of the kernel shows the logic. It means that the process will get
SIGXCPU each second while it is above the soft limit, and SIGKILL when it goes
above the hard limit.
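
A small stand-alone test shows the same behavior from user space (assuming a
Linux-like system; the limit values here are made up for the demo, and this is
not PostgreSQL code):

#include <signal.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

static void
on_xcpu(int sig)
{
    /* write() is async-signal-safe, unlike printf() */
    static const char msg[] = "got SIGXCPU: soft CPU limit exceeded\n";
    (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
}

int
main(void)
{
    struct rlimit rl;

    rl.rlim_cur = 1;            /* soft limit: 1 second of CPU time */
    rl.rlim_max = 3;            /* hard limit: 3 seconds, then SIGKILL */

    signal(SIGXCPU, on_xcpu);
    if (setrlimit(RLIMIT_CPU, &rl) != 0)
    {
        perror("setrlimit");
        return 1;
    }

    /* Burn CPU: expect SIGXCPU after ~1s, an uncatchable SIGKILL after ~3s */
    for (;;)
        ;
}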

-- 
Sincerely Yours,
Denis Perchine

--
E-Mail: [EMAIL PROTECTED]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
--



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Tom Lane

Denis Perchine [EMAIL PROTECTED] writes:
 Hmmm... actually this is a real problem with vacuum lazy. Sometimes it
 just does something for an enormous amount of time (I have mailed a sample
 database to Vadim, but did not get any response yet). It is possible
 that it was me who killed the backend.

Killing an individual backend with SIGTERM is bad luck.  The backend
will assume that it's being killed by the postmaster, and will exit
without a whole lot of concern for cleaning up shared memory --- the
expectation is that as soon as all the backends are dead, the postmaster
will reinitialize shared memory.

You can get away with sending SIGINT (QueryCancel) to an individual
backend.  Anything else voids the warranty ;=)

But, having said that --- this VACUUM process had only been running
for two minutes of real time.  Seems unlikely that you'd have chosen
to kill it so quickly.

regards, tom lane



RE: [HACKERS] Quite strange crash

2001-01-08 Thread Mikheev, Vadim

 Killing an individual backend with SIGTERM is bad luck.  The backend
 will assume that it's being killed by the postmaster, and will exit
 without a whole lot of concern for cleaning up shared memory --- the

What code will be returned to postmaster in this case?

Vadim



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 Killing an individual backend with SIGTERM is bad luck.  The backend
 will assume that it's being killed by the postmaster, and will exit
 without a whole lot of concern for cleaning up shared memory --- the

 What code will be returned to postmaster in this case?

Right at the moment, the backend will exit with status 0.  I think you
are thinking the same thing I am: maybe a backend that receives SIGTERM
ought to exit with nonzero status.

That would mean that killing an individual backend would instantly
translate into an installation-wide restart.  I am not sure whether
that's a good idea.  Perhaps this cure is worse than the disease.
Comments anyone?
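
To make the trade-off concrete, here is a toy sketch of the parent-side policy
being discussed: treat any abnormal child exit (nonzero status or death by
signal) as a crash that forces reinitialization. This is only an illustration,
not the real postmaster code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    pid_t pid;
    int   status;

    pid = fork();
    if (pid < 0)
    {
        perror("fork");
        return 1;
    }
    if (pid == 0)
        _exit(1);       /* child: pretend to be a backend dying on SIGTERM */

    waitpid(pid, &status, 0);

    if ((WIFEXITED(status) && WEXITSTATUS(status) != 0) || WIFSIGNALED(status))
    {
        /* the "installation-wide restart" branch */
        printf("Server process (pid %d) exited abnormally\n", (int) pid);
        printf("Terminating any active server processes...\n");
        printf("Reinitializing shared memory and semaphores\n");
    }
    else
        printf("backend exited cleanly; no restart needed\n");

    return 0;
}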

regards, tom lane



RE: [HACKERS] Quite strange crash

2001-01-08 Thread Mikheev, Vadim

  Killing an individual backend with SIGTERM is bad luck.  
  The backend will assume that it's being killed by the postmaster,
  and will exit without a whole lot of concern for cleaning up shared 
  memory --- the

SIGTERM --> die() --> elog(FATAL)

Is it true that elog(FATAL) doesn't clean up shmem etc?
This would be very bad...

  What code will be returned to postmaster in this case?
 
 Right at the moment, the backend will exit with status 0.  I think you
 are thinking the same thing I am: maybe a backend that 
 receives SIGTERM ought to exit with nonzero status.
 
 That would mean that killing an individual backend would instantly
 translate into an installation-wide restart.  I am not sure whether
 that's a good idea.  Perhaps this cure is worse than the disease.

Well, it's not a good idea because SIGTERM is used for ABORT + EXIT
(pg_ctl -m fast stop), but shouldn't ABORT clean up everything?

Vadim



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 Killing an individual backend with SIGTERM is bad luck.  

 SIGTERM --> die() --> elog(FATAL)

 Is it true that elog(FATAL) doesn't clean up shmem etc?
 This would be very bad...

It tries, but I don't think it's possible to make a complete guarantee
without an unreasonable amount of overhead.  The case at hand was a
stuck spinlock because die() --> elog(FATAL) had neglected to release
that particular spinlock before exiting.  To guarantee that all
spinlocks will be released by die(), we'd need something like

START_CRIT_SECTION;
S_LOCK(spinlock);
record that we own spinlock;
END_CRIT_SECTION;

around every existing S_LOCK() call, and the reverse around every
S_UNLOCK.  Are you willing to pay that kind of overhead?  I'm not
sure this'd be enough anyway.  Guaranteeing that you have consistent
state at every instant that an ISR could interrupt you is not easy.
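
As a stand-alone toy, the bookkeeping would look something like the following
(it also matches Vadim's earlier LockedSpinLocks[LockedSpinCounter++]
suggestion; all names are illustrative, not the real s_lock code):

#include <stdio.h>
#include <stdlib.h>

#define MAX_HELD_SPINS 8

typedef volatile int slock_t;

static slock_t *held_spins[MAX_HELD_SPINS];    /* pre-allocated array */
static int      num_held = 0;

static void
s_lock_recorded(slock_t *lock)
{
    while (__sync_lock_test_and_set(lock, 1))
        ;                                      /* spin until acquired */
    held_spins[num_held++] = lock;             /* remember that we own it */
}

static void
s_unlock_recorded(slock_t *lock)
{
    int i;

    __sync_lock_release(lock);
    for (i = num_held - 1; i >= 0; i--)        /* forget that we own it */
        if (held_spins[i] == lock)
        {
            held_spins[i] = held_spins[--num_held];
            break;
        }
}

/* What a die()-style exit path could do: release whatever is still held. */
static void
release_held_spinlocks(void)
{
    while (num_held > 0)
        __sync_lock_release(held_spins[--num_held]);
}

int
main(void)
{
    static slock_t buf_cntx_lock_1, buf_cntx_lock_2;

    atexit(release_held_spinlocks);

    s_lock_recorded(&buf_cntx_lock_1);
    s_lock_recorded(&buf_cntx_lock_2);         /* e.g. UPDATE holding two */
    printf("currently holding %d spinlocks\n", num_held);

    /* Simulate an abnormal exit: the atexit handler releases both locks. */
    exit(1);
}

Even so, releasing the spinlock says nothing about whether the data structure
it protected is still self-consistent, which is the harder part noted above.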

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-08 Thread Alfred Perlstein

* Mikheev, Vadim [EMAIL PROTECTED] [010108 23:08] wrote:
   Killing an individual backend with SIGTERM is bad luck.  
   The backend will assume that it's being killed by the postmaster,
   and will exit without a whole lot of concern for cleaning up shared 
   memory --- the
 
 SIGTERM --> die() --> elog(FATAL)
 
 Is it true that elog(FATAL) doesn't clean up shmem etc?
 This would be very bad...
 
   What code will be returned to postmaster in this case?
  
  Right at the moment, the backend will exit with status 0.  I think you
  are thinking the same thing I am: maybe a backend that 
  receives SIGTERM ought to exit with nonzero status.
  
  That would mean that killing an individual backend would instantly
  translate into an installation-wide restart.  I am not sure whether
  that's a good idea.  Perhaps this cure is worse than the disease.
 
 Well, it's not a good idea because SIGTERM is used for ABORT + EXIT
 (pg_ctl -m fast stop), but shouldn't ABORT clean up everything?

Er, shouldn't ABORT leave the system in the exact state that it's
in so that one can get a crashdump/traceback on a wedged process
without it trying to clean up after itself?

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



[HACKERS] Quite strange crash

2001-01-07 Thread Denis Perchine

Hi,

Does anyone seen this on PostgreSQL 7.0.3?

FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
 
FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
Server process (pid 1008) exited with status 6 at Sun Jan  7 04:29:07 2001
Terminating any active server processes...
Server processes were terminated at Sun Jan  7 04:29:07 2001
Reinitializing shared memory and semaphores

-- 
Sincerely Yours,
Denis Perchine

--
E-Mail: [EMAIL PROTECTED]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
--



Re: [HACKERS] Quite strange crash

2001-01-07 Thread Tom Lane

Denis Perchine [EMAIL PROTECTED] writes:
 Does anyone seen this on PostgreSQL 7.0.3?
 FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.

Were there any errors before that?

I've been suspicious for awhile that the system might neglect to release
buffer cntx_lock spinlocks if an elog() occurs while one is held.  This
looks like it might be such a case, but you're only showing us the end
symptom not what led up to it ...

regards, tom lane



Re: [HACKERS] Quite strange crash

2001-01-07 Thread Denis Perchine

On Monday 08 January 2001 00:08, Tom Lane wrote:
 Denis Perchine [EMAIL PROTECTED] writes:
  Does anyone seen this on PostgreSQL 7.0.3?
  FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.

 Were there any errors before that?

No... Just a clean log (I redirect the log from stderr/stdout to a file, and
everything else to syslog).

Here it is just from the begin:


DEBUG:  Data Base System is starting up at Sun Jan  7 04:22:00 2001
DEBUG:  Data Base System was interrupted being in production at Thu Jan  4 
23:30:22 2001
DEBUG:  Data Base System is in production state at Sun Jan  7 04:22:00 2001
 
FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
 
FATAL: s_lock(401f7435) at bufmgr.c:2350, stuck spinlock. Aborting.
Server process (pid 1008) exited with status 6 at Sun Jan  7 04:29:07 2001
Terminating any active server processes...
Server processes were terminated at Sun Jan  7 04:29:07 2001
Reinitializing shared memory and semaphores
-

As you can see, it happens almost immediately after startup.

I can give you the full list of queries which were made by process 1008. But
basically there were only queries like this:
select message_id from pop3 where server_id = 6214

insert into pop3 (server_id, mailfrom, mailto, subject, message_id, 
sent_date, sent_date_text, recieved_date, state) values (25641, 
'virtualo.com', '[EMAIL PROTECTED]', 'Joao roque Dias I have
tried them allthis one is for real!', 
'[EMAIL PROTECTED]', 
'2001-01-07 04:06:23 -00', 'Sat, 06 Jan 2001 23:06:23 -0500', 'now', 1)

And the last query was:
Jan  7 04:27:53 mx postgres[1008]: query: select message_id from pop3 where 
server_id = 22615

 I've been suspicious for awhile that the system might neglect to release
 buffer cntx_lock spinlocks if an elog() occurs while one is held.  This
 looks like it might be such a case, but you're only showing us the end
 symptom not what led up to it ...

Just tell me what I can do. Unfortunately I cannot reproduce the situation...

-- 
Sincerely Yours,
Denis Perchine

--
E-Mail: [EMAIL PROTECTED]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
--