First of all, thanks to those who provided input.

This problem is now fixed and I thought I would post this solution so that others might benefit in the future.

For the sake of completeness:

The error was that whenever "show all" was run on this PostgreSQL (version 8.3) server, the backend process would crash and the server would then recover. Otherwise the server seemed healthy.

The postgres log showed:
Sep 10 23:55:36 theconsole postgres[31118]: [4-1] 0: LOG: 00000: server process (PID 31145) was terminated by signal 11: Segmentation fault
Sep 10 23:55:36 theconsole postgres[31118]: [4-2] 0: LOCATION: LogChildExit, postmaster.c:2529
Sep 10 23:55:36 theconsole postgres[31118]: [5-1] 0: LOG: 00000: terminating any other active server processes
Sep 10 23:55:36 theconsole postgres[31118]: [5-2] 0: LOCATION: HandleChildCrash, postmaster.c:2374
Sep 10 23:55:36 theconsole postgres[31118]: [6-1] 0: LOG: 00000: all server processes terminated; reinitializing
Sep 10 23:55:36 theconsole postgres[31118]: [6-2] 0: LOCATION: PostmasterStateMachine, postmaster.c:2690
Sep 10 23:55:36 theconsole postgres[31146]: [7-1] 0: LOG: 00000: database system was interrupted; last known up at 2009-09-10 23:55:14 EST
Sep 10 23:55:36 theconsole postgres[31146]: [7-2] 0: LOCATION: StartupXLOG, xlog.c:4836
Sep 10 23:55:36 theconsole postgres[31147]: [7-1] [local] postgres postgres 0: FATAL: 57P03: the database system is in recovery mode
Sep 10 23:55:36 theconsole postgres[31147]: [7-2] [local] postgres postgres 0: LOCATION: ProcessStartupPacket, postmaster.c:1648
Sep 10 23:55:36 theconsole postgres[31146]: [8-1] 0: LOG: 00000: database system was not properly shut down; automatic recovery in progress
Sep 10 23:55:36 theconsole postgres[31146]: [8-2] 0: LOCATION: StartupXLOG, xlog.c:5003
Sep 10 23:55:36 theconsole postgres[31146]: [9-1] 0: LOG: 00000: record with zero length at 2A/E734761C
Sep 10 23:55:36 theconsole postgres[31146]: [9-2] 0: LOCATION: ReadRecord, xlog.c:3126
Sep 10 23:55:36 theconsole postgres[31146]: [10-1] 0: LOG: 00000: redo is not required
Sep 10 23:55:36 theconsole postgres[31146]: [10-2] 0: LOCATION: StartupXLOG, xlog.c:5146
Sep 10 23:55:36 theconsole postgres[31150]: [7-1] 0: LOG: 00000: autovacuum launcher started
Sep 10 23:55:36 theconsole postgres[31150]: [7-2] 0: LOCATION: AutoVacLauncherMain, autovacuum.c:520
Sep 10 23:55:36 theconsole postgres[31118]: [7-1] 0: LOG: 00000: database system is ready to accept connections

SOLUTION:
        Increase the memory on the server.

WHY:
About a month earlier we had installed Splunk on the server. It had been running OK, but the combination of Splunk and the other tasks on the box pushed memory use too close to the limit. What we did not notice was that swap had been almost completely consumed. Nasty.

RESULT:
We shut everything down, doubled the memory, and voila: problem gone.

It goes to show that when hunting problems we should not ignore the basic environmental elements. It also goes to show that our monitoring system was not looking at this relatively new server.
(this confession is not an invitation for a spanking)
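
For anyone who wants a quick sanity check of the same thing, here is a minimal sketch in Python. It is not what we ran and not part of our monitoring system; it assumes a Linux host with /proc/meminfo, and the function names and the 80% threshold are made up for illustration.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_used_fraction(info):
    """Return the fraction of swap in use (0.0 if no swap is configured)."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


def check_swap(text, threshold=0.8):
    """Return (alarm, fraction); threshold is an arbitrary example value."""
    frac = swap_used_fraction(parse_meminfo(text))
    return frac >= threshold, frac


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # Linux only
    if meminfo.exists():
        alarm, frac = check_swap(meminfo.read_text())
        print(f"swap used: {frac:.0%}" + ("  <-- WARNING" if alarm else ""))
```

Run from cron or by hand, something like this would have flagged the box long before postgres started segfaulting. A real monitoring setup should of course watch memory as well as swap.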

Again, thanks for the help.
Grant


On 11/09/2009, at 9:09 AM, Grant Maxwell wrote:


On 11/09/2009, at 8:36 AM, Tom Lane wrote:

Grant Maxwell <grant.maxw...@maxan.com.au> writes:
On the problem server:
        shared_preload_libraries = 'pgmemcache'
        #local_preload_libraries = ''

On the others both are empty.

Sounds like a smoking gun to me.

For good measure I removed pgmemcache but the problem persists.

Did you restart the postmaster afterwards?  shared_preload_libraries
is only considered at postmaster start.

        yep - full restart.

                        regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
