Re: [ADMIN] Shared memory corrupted?
On Thu, 30 Oct 2003, Jeff Boes wrote:

JB> We are experiencing the following error, usually during our nightly
JB> delete-and-vacuum cycle (when there are very few other connections to
JB> the database):
JB>
JB> 2003-10-30 01:36:59 [25392] LOG: server process (pid 697) was
JB> terminated by signal 14
JB> 2003-10-30 01:36:59 [25392] LOG: terminating any other active server
JB> processes
JB> 2003-10-30 01:37:01 [1977] FATAL: The database system is in recovery mode
JB> 2003-10-30 01:37:08 [25392] LOG: all server processes terminated;
JB> reinitializing shared memory and semaphores
JB> 2003-10-30 01:37:09 [2856] FATAL: The database system is starting up
JB> 2003-10-30 01:37:09 [2855] LOG: database system was interrupted at
JB> 2003-10-30 01:26:13 EST
JB>
JB> The only clues we have are that the server processes interrupted by
JB> "signal 14" *seem* to be backends connected to Apache processes (on
JB> another server). But even that isn't certain, because of the difficulty
JB> in tracking down which process was doing what at the time.

Signal 14 is SIGALRM. Some kind of badly-behaving watchdog?

Sincerely,
D.Marck [DM5020, MCK-RIPE, DM3-RIPN]
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- [EMAIL PROTECTED] ***

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
Re: [ADMIN] Shared memory corrupted?
Jeff Boes <[EMAIL PROTECTED]> writes:
> [How would a plperl function that changes the local behavior of SIGALRM
> affect the backend?]

IIRC, SIGALRM is used for two things: one, to trigger a deadlock check
cycle if we wait too long for a lock (see deadlock_timeout), and two, to
implement statement_timeout.

If you are using statement_timeout then I think it would be dangerous to
mess with SIGALRM at all. If you are not, then I think it would be all
right to modify the SIGALRM handler setting locally, as long as you
restore it to its original setting when you are done. Don't try to run
any database access operations while you have a nonstandard setting of
the SIGALRM handler, though, or you risk problems with deadlock checking.

regards, tom lane
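The save-and-restore discipline described above can be sketched outside the backend. This is a minimal Python illustration only, not plperl and not PostgreSQL code. The names `temporary_sigalrm` and `on_alarm` are hypothetical, and the sketch assumes nothing else in the process has an alarm pending, because `signal.alarm(0)` cancels whatever alarm is currently armed.

```python
import signal
from contextlib import contextmanager


@contextmanager
def temporary_sigalrm(handler):
    """Install `handler` for SIGALRM, then put the previous handler back.

    Hypothetical helper for illustration; assumes it runs in the main
    thread and that no other code has an alarm pending.
    """
    # signal.signal() returns the handler that was installed before.
    previous = signal.signal(signal.SIGALRM, handler)
    try:
        yield
    finally:
        signal.alarm(0)                          # cancel any alarm still armed
        signal.signal(signal.SIGALRM, previous)  # restore the original setting


def on_alarm(signum, frame):
    # Local, nonstandard behavior: turn the alarm into an exception.
    raise TimeoutError("local timeout expired")


if __name__ == "__main__":
    original = signal.getsignal(signal.SIGALRM)
    with temporary_sigalrm(on_alarm):
        pass  # only non-database work belongs inside this block
    # The handler in place before the block is in place again afterwards.
    assert signal.getsignal(signal.SIGALRM) == original
```

The `finally` clause is the point of the sketch: the original handler comes back even if the wrapped code raises.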
Re: [ADMIN] Shared memory corrupted?
Tom Lane wrote:
> [ thinks... ] Another possibility is that you are running some
> non-Postgres code that resets SIGALRM handling to default. I have
> heard rumors that Perl will do that in some cases, for example. Are
> you using plperl?

Yes, we are. I know there are some places in the code where SIGALRM is
used, so I'll start looking there. But if you or anyone else thinks of
anything, let me know ...

[How would a plperl function that changes the local behavior of SIGALRM
affect the backend?]

--
Jeff Boes                          vox 269.226.9550 ext 24
Database Engineer                  fax 269.349.9076
Nexcerpt, Inc.                     http://www.nexcerpt.com
...Nexcerpt... Extend your Expertise
Re: [ADMIN] Shared memory corrupted?
Jeff Boes <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> What's signal 14 on your machine? (Look in /usr/include/signal.h to
>> be sure.) Also, what PG version is this?

> 14) SIGALRM

> This is Pg 7.3.4, running on Linux 7.3 (Kernel 2.4.18-18.7.xsmp on a
> 2-processor i686).

Hm. That doesn't make any sense at all, because SIGALRM is either caught
by a handler or ignored everywhere in the Postgres backend. There is no
situation where it could legitimately cause process termination. Is it
possible you are dealing with a kernel bug?

[ thinks... ] Another possibility is that you are running some
non-Postgres code that resets SIGALRM handling to default. I have heard
rumors that Perl will do that in some cases, for example. Are you using
plperl?

regards, tom lane
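The failure mode suggested above (something resets SIGALRM to its default disposition, and the next alarm then terminates the process) can be reproduced in isolation. The following is a sketch using a throwaway Python child process as a stand-in for a backend; it is not PostgreSQL code, and `CHILD` and `run_child` are hypothetical names. On POSIX systems, `subprocess` reports a child killed by signal N as return code -N.

```python
import signal
import subprocess
import sys

# Source of the child process: it resets SIGALRM to the default action
# (terminate the process) and then arms a short real-time timer.
CHILD = """
import signal, time
signal.signal(signal.SIGALRM, signal.SIG_DFL)  # default action: terminate
signal.setitimer(signal.ITIMER_REAL, 0.1)      # SIGALRM after 0.1 seconds
time.sleep(5)                                  # killed long before this ends
"""


def run_child():
    """Run the child and return its exit status as seen by the parent."""
    proc = subprocess.run([sys.executable, "-c", CHILD])
    return proc.returncode


if __name__ == "__main__":
    code = run_child()
    # A negative return code means "terminated by signal -code".
    if code < 0:
        print("child terminated by", signal.Signals(-code).name)
    else:
        print("child exited normally with status", code)
```

With a handler installed, or with the signal ignored, the same alarm would not kill the process. That is why a report of termination by SIGALRM points at code that changed the disposition.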
Re: [ADMIN] Shared memory corrupted?
Jeff Boes <[EMAIL PROTECTED]> writes:
> We are experiencing the following error, usually during our nightly
> delete-and-vacuum cycle (when there are very few other connections to
> the database):

> 2003-10-30 01:36:59 [25392] LOG: server process (pid 697) was
> terminated by signal 14

What's signal 14 on your machine? (Look in /usr/include/signal.h to
be sure.) Also, what PG version is this?

regards, tom lane
Re: [ADMIN] Shared memory corrupted?
Tom Lane wrote:
> Jeff Boes <[EMAIL PROTECTED]> writes:
>> We are experiencing the following error, usually during our nightly
>> delete-and-vacuum cycle (when there are very few other connections to
>> the database):
>>
>> 2003-10-30 01:36:59 [25392] LOG: server process (pid 697) was
>> terminated by signal 14
>
> What's signal 14 on your machine? (Look in /usr/include/signal.h to
> be sure.) Also, what PG version is this?
>
> regards, tom lane

signal.h doesn't have any definitions for signal numbers in it;
"kill -l" lists 14 as:

    14) SIGALRM

This is Pg 7.3.4, running on Linux 7.3 (Kernel 2.4.18-18.7.xsmp on a
2-processor i686). The system has 4 GB of RAM. Shared memory parameters
out of /etc/sysctl.conf follow:

    kernel.shmall = 1352914698
    kernel.shmmax = 1352914698

And here's what I guess are the pertinent data from the postgresql.conf
file:

    sort_mem = 65536
    vacuum_mem = 262144
    effective_cache_size = 196608
    shared_buffers = 131072
    max_fsm_relations = 200
    max_fsm_pages = 35
    wal_buffers = 32

We've seen the problem with vacuum_mem = 65536 also.

--
Jeff Boes
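Signal numbers differ between platforms, which is why the question was asked. Besides "kill -l", the number-to-name mapping can be checked from a script on the machine in question. The following is a small Python sketch; `signal_name` is a hypothetical helper, not part of any PostgreSQL tooling.

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number on this platform,
    or None if the number is not a known signal here."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return None


if __name__ == "__main__":
    # 14 is the number reported in the server log in this thread.
    print(14, signal_name(14))
```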
[ADMIN] Shared memory corrupted?
We are experiencing the following error, usually during our nightly
delete-and-vacuum cycle (when there are very few other connections to
the database):

2003-10-30 01:36:59 [25392] LOG: server process (pid 697) was terminated by signal 14
2003-10-30 01:36:59 [25392] LOG: terminating any other active server processes
2003-10-30 01:37:01 [1977] FATAL: The database system is in recovery mode
2003-10-30 01:37:08 [25392] LOG: all server processes terminated; reinitializing shared memory and semaphores
2003-10-30 01:37:09 [2856] FATAL: The database system is starting up
2003-10-30 01:37:09 [2855] LOG: database system was interrupted at 2003-10-30 01:26:13 EST

The only clues we have are that the server processes interrupted by
"signal 14" *seem* to be backends connected to Apache processes (on
another server). But even that isn't certain, because of the difficulty
in tracking down which process was doing what at the time.

--
Jeff Boes