Re: [ADMIN] Shared memory corrupted?

2003-11-01 Thread Dmitry Morozovsky
On Thu, 30 Oct 2003, Jeff Boes wrote:

JB> We are experiencing the following error, usually during our nightly
JB> delete-and-vacuum cycle (when there are very few other connections to
JB> the database):
JB>
JB> 2003-10-30 01:36:59 [25392]  LOG:  server process (pid 697) was
JB> terminated by signal 14
JB> 2003-10-30 01:36:59 [25392]  LOG:  terminating any other active server
JB> processes
JB> 2003-10-30 01:37:01 [1977]   FATAL:  The database system is in recovery mode
JB> 2003-10-30 01:37:08 [25392]  LOG:  all server processes terminated;
JB> reinitializing shared memory and semaphores
JB> 2003-10-30 01:37:09 [2856]   FATAL:  The database system is starting up
JB> 2003-10-30 01:37:09 [2855]   LOG:  database system was interrupted at
JB> 2003-10-30 01:26:13 EST
JB>
JB> The only clues we have are that the server processes interrupted by
JB> "signal 14" *seem* to be backends connected to Apache processes (on
JB> another server). But even that isn't certain, because of the difficulty
JB> in tracking down which process was doing what at the time.

Signal 14 is SIGALRM. Could it be some kind of misbehaving watchdog?

Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]

*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- [EMAIL PROTECTED] ***


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [ADMIN] Shared memory corrupted?

2003-10-30 Thread Tom Lane
Jeff Boes <[EMAIL PROTECTED]> writes:
> [How would a plperl function that changes the local behavior of SIGALRM 
> affect the backend?]

IIRC, SIGALRM is used for two things: one, to trigger a deadlock check
cycle if we wait too long for a lock (see deadlock_timeout), and two,
to implement statement_timeout.  If you are using statement_timeout then
I think it would be dangerous to mess with SIGALRM at all.  If you are
not, then I think it would be all right to modify the SIGALRM handler
setting locally, as long as you restore it to its original setting when
you are done.  Don't try to run any database access operations while you
have a nonstandard setting of the SIGALRM handler, though, or you risk
problems with deadlock checking.

regards, tom lane
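
The save-and-restore discipline described above can be sketched as
follows. This is only an illustration in Python rather than plperl, and
the helper name temporary_sigalrm is hypothetical (it is not part of
PostgreSQL or Perl). It shows the ordering: remember the previous
handler, do the work, cancel the timer, then put the handler back.

```python
import signal
from contextlib import contextmanager


@contextmanager
def temporary_sigalrm(handler, seconds):
    """Install `handler` for SIGALRM, arm a one-shot timer, and undo both on exit.

    Limitation of this sketch: a timer that was already pending before
    entry is not re-armed afterwards.
    """
    # signal.signal() returns the handler that was installed before.
    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel our timer first so it cannot fire into the restored handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        # Then put the original handler back.
        signal.signal(signal.SIGALRM, previous)
```

Code run inside the "with temporary_sigalrm(...)" block sees the custom
handler; following the advice above, that window should not contain any
database access.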


Re: [ADMIN] Shared memory corrupted?

2003-10-30 Thread Jeff Boes
Tom Lane wrote:
> [ thinks... ]  Another possibility is that you are running some
> non-Postgres code that resets SIGALRM handling to default.  I have
> heard rumors that Perl will do that in some cases, for example.
> Are you using plperl?

Yes, we are. I know there are some places in the code where SIGALRM is 
used, so I'll start looking there. But if you or anyone else thinks of 
anything, let me know ...

[How would a plperl function that changes the local behavior of SIGALRM 
affect the backend?]

--
Jeff Boes  vox 269.226.9550 ext 24
Database Engineer fax 269.349.9076
Nexcerpt, Inc. http://www.nexcerpt.com
  ...Nexcerpt... Extend your Expertise



Re: [ADMIN] Shared memory corrupted?

2003-10-30 Thread Tom Lane
Jeff Boes <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> What's signal 14 on your machine?  (Look in /usr/include/signal.h to
>> be sure.)  Also, what PG version is this?

>  14) SIGALRM   

> This is Pg 7.3.4, running on Linux 7.3 (Kernel 2.4.18-18.7.xsmp on a 
> 2-processor i686).

Hm.  That doesn't make any sense at all, because SIGALRM is either
caught by a handler or ignored everywhere in the Postgres backend.
There is no situation where it could legitimately cause process
termination.  Is it possible you are dealing with a kernel bug?

[ thinks... ]  Another possibility is that you are running some
non-Postgres code that resets SIGALRM handling to default.  I have
heard rumors that Perl will do that in some cases, for example.
Are you using plperl?

regards, tom lane
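
To see why a reset to the default disposition matters: the default
action for SIGALRM is to terminate the process, which is the outcome the
postmaster log reports. A small sketch, assuming a Unix-like system and
using a throwaway Python child process in place of a backend (the names
CHILD and run_child_with_default_sigalrm are made up for this example):

```python
import signal
import subprocess
import sys

# Child program: put SIGALRM back to its default disposition, arm a short
# timer, then sleep long enough for the timer to expire.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGALRM, signal.SIG_DFL)\n"
    "signal.setitimer(signal.ITIMER_REAL, 0.1)\n"
    "time.sleep(5)\n"
)


def run_child_with_default_sigalrm():
    """Run the child; return the signal that killed it, or None on a normal exit."""
    result = subprocess.run([sys.executable, "-c", CHILD])
    # subprocess reports death by signal N as a return code of -N.
    if result.returncode < 0:
        return signal.Signals(-result.returncode)
    return None
```

With no handler installed and the signal not ignored, the child never
finishes its sleep; the parent sees it as killed by SIGALRM.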


Re: [ADMIN] Shared memory corrupted?

2003-10-30 Thread Tom Lane
Jeff Boes <[EMAIL PROTECTED]> writes:
> We are experiencing the following error, usually during our nightly 
> delete-and-vacuum cycle (when there are very few other connections to 
> the database):

> 2003-10-30 01:36:59 [25392]  LOG:  server process (pid 697) was 
> terminated by signal 14

What's signal 14 on your machine?  (Look in /usr/include/signal.h to
be sure.)  Also, what PG version is this?

regards, tom lane


Re: [ADMIN] Shared memory corrupted?

2003-10-30 Thread Jeff Boes
Tom Lane wrote:
> Jeff Boes <[EMAIL PROTECTED]> writes:
>> We are experiencing the following error, usually during our nightly
>> delete-and-vacuum cycle (when there are very few other connections to
>> the database):
>>
>> 2003-10-30 01:36:59 [25392]  LOG:  server process (pid 697) was
>> terminated by signal 14
>
> What's signal 14 on your machine?  (Look in /usr/include/signal.h to
> be sure.)  Also, what PG version is this?
>
> regards, tom lane

signal.h doesn't have any definitions for signal numbers in it;
"kill -l" lists 14 as:

14) SIGALRM   
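
For a cross-check of the number-to-name mapping without reading headers,
a small sketch in Python (the helper signal_name is hypothetical; the
mapping is whatever the local platform defines, so a given number may
name a different signal on another system):

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number on this platform, or None."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Not a signal number known to this platform.
        return None
```

Calling signal_name(14) on the affected machine would be expected to
agree with the "kill -l" output above.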

This is Pg 7.3.4, running on Linux 7.3 (Kernel 2.4.18-18.7.xsmp on a 
2-processor i686).

The system has 4 GB of RAM. Shared memory parameters out of 
/etc/sysctl.conf follow:

kernel.shmall = 1352914698
kernel.shmmax = 1352914698

And here's what I guess are the pertinent data from the postgresql.conf
file:

sort_mem = 65536
vacuum_mem = 262144
effective_cache_size = 196608
shared_buffers = 131072
max_fsm_relations = 200
max_fsm_pages = 35
wal_buffers = 32

We've seen the problem with vacuum_mem = 65536 also.

--
Jeff Boes  vox 269.226.9550 ext 24
Database Engineer fax 269.349.9076
Nexcerpt, Inc. http://www.nexcerpt.com
  ...Nexcerpt... Extend your Expertise


[ADMIN] Shared memory corrupted?

2003-10-30 Thread Jeff Boes
We are experiencing the following error, usually during our nightly 
delete-and-vacuum cycle (when there are very few other connections to 
the database):

2003-10-30 01:36:59 [25392]  LOG:  server process (pid 697) was 
terminated by signal 14
2003-10-30 01:36:59 [25392]  LOG:  terminating any other active server 
processes
2003-10-30 01:37:01 [1977]   FATAL:  The database system is in recovery mode
2003-10-30 01:37:08 [25392]  LOG:  all server processes terminated; 
reinitializing shared memory and semaphores
2003-10-30 01:37:09 [2856]   FATAL:  The database system is starting up
2003-10-30 01:37:09 [2855]   LOG:  database system was interrupted at 
2003-10-30 01:26:13 EST

The only clues we have are that the server processes interrupted by 
"signal 14" *seem* to be backends connected to Apache processes (on 
another server). But even that isn't certain, because of the difficulty 
in tracking down which process was doing what at the time.

--
Jeff Boes  vox 269.226.9550 ext 24
Database Engineer fax 269.349.9076
Nexcerpt, Inc. http://www.nexcerpt.com
   ...Nexcerpt... Extend your Expertise