Richard Huxton wrote:
> Justin Pasher wrote:
>> Hello,
>>
>> I have a server running PostgreSQL 8.1.15-0etch1 (Debian etch) that was
>> recently put into production. Last week a developer started having a problem
>> with his psql connection being terminated every couple of minutes when he
>> was running a query. When I looked through the logs, I noticed this message:
>>
>> 2009-01-09 08:09:46 CST LOG:  autovacuum process (PID 15012) was terminated
>> by signal 11

> Segmentation fault - probably a bug or bad RAM.

It's a relatively new machine, but bad RAM is obviously a possibility with any hardware. I haven't seen any other programs having problems on the box, but Postgres is by far the most heavily used program on it, so the evidence is skewed toward Postgres anyway.

>> I looked through the logs some more and noticed that this was occurring
>> every minute or so. The database is a pretty heavily utilized system
>> (judging by age(datfrozenxid) from pg_database, the system had run
>> approximately 500 million transactions in less than a week). I noticed that
>> right before every autovacuum termination, it tried to autovacuum a database:

>> 2009-01-09 08:09:46 CST LOG:  transaction ID wrap limit is 4563352, limited
>> by database "database_name"

>> It was always showing the same database, so I decided to manually vacuum the
>> database. Once that was done (it succeeded the first time without
>> errors), the problem seemed to go away. I went ahead and manually vacuumed
>> the remaining databases just to take care of the potential XID wraparound
>> issue.
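To put the rate mentioned above in perspective: PostgreSQL transaction IDs are 32 bits wide and compared modulo 2^31, so every database has to be vacuumed before its age(datfrozenxid) approaches roughly two billion. The Python sketch below is a back-of-the-envelope estimate, not anything PostgreSQL provides; the function name is hypothetical, and treating exactly 2**31 as the horizon is an approximation (the server starts refusing new transactions slightly before that point).

```python
# Back-of-the-envelope estimate only; not a PostgreSQL API.
# Assumption: the wraparound horizon is approximated as an age of 2**31 XIDs.
# The server actually stops accepting transactions a little before this.
WRAPAROUND_HORIZON = 2**31


def days_until_wraparound(current_age: int, xids_per_day: float) -> float:
    """Days before age(datfrozenxid) reaches ~2**31 at a constant XID rate."""
    if xids_per_day <= 0:
        raise ValueError("xids_per_day must be positive")
    remaining = max(WRAPAROUND_HORIZON - current_age, 0)
    return remaining / xids_per_day


# Rate taken from the message: about 500 million XIDs in (less than) a week.
rate = 500_000_000 / 7
for age in (500_000_000, 1_500_000_000):
    print(f"age {age:,}: ~{days_until_wraparound(age, rate):.1f} days left")
# age 500,000,000: ~23.1 days left
# age 1,500,000,000: ~9.1 days left
```

Since the 500 million figure covered *less* than a week, the real rate is somewhat higher and these day counts are upper bounds.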

> I'd be suspicious of possible corruption in autovacuum's internal data.
> Can you trace these problems back to a power outage or system crash? It
> doesn't look like the problem is in "database_name" itself, since you
> vacuumed that successfully. If autovacuum is running normally now, that might
> indicate it was something in the way autovacuum was keeping track of "database_name".

The server hasn't been rebooted since it was installed (about 9 months ago, though it has only been in use for the past month), so there haven't been any crashes or power outages. The only abnormal entries I can find in the Postgres logs are the autovacuum segfaults. Looking at today's logs, it's still happening (this time on a different database). I manually vacuumed that one database and the problem went away (for now).

Are there any internal Postgres tables I can look at that may shed some light on this? Any particular maintenance commands that could be run for repair?

> It's also probably worth running some memory tests on the server
> (memtest86 or similar) to see if that shows anything. Was it *always*
> the autovacuum process getting sig11? If not, then it might just be a
> pattern of usage that makes it more likely to use some bad RAM.

I might try the memtest if we can move the databases off the server to allow some downtime. None of the logs show anything else behaving abnormally or being terminated, just the autovacuum daemon. From what I can tell, the segfaults only occur once a database passes the halfway point (when age(datfrozenxid) exceeds around 1,500,000,000). When that is not the case, the logs show no segfaults.
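If that pattern holds, the databases at risk at any moment are the ones whose age(datfrozenxid) is already past about 1.5 billion. A small Python sketch of that check, assuming the rows returned by `SELECT datname, age(datfrozenxid) FROM pg_database;` have been copied into a list of (name, age) pairs. The helper name and the sample values are hypothetical, made up purely for illustration.

```python
# Hypothetical helper. Input rows are assumed to come from:
#   SELECT datname, age(datfrozenxid) FROM pg_database;
# The sample values below are invented for illustration, not taken from the logs.
OBSERVED_THRESHOLD = 1_500_000_000  # age at which the segfaults were observed


def databases_past_threshold(ages, threshold=OBSERVED_THRESHOLD):
    """Return names of databases whose XID age exceeds threshold, oldest first."""
    flagged = [(name, age) for name, age in ages if age > threshold]
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in flagged]


sample = [("db_a", 1_520_000_000), ("db_b", 900_000_000), ("db_c", 1_610_000_000)]
print(databases_past_threshold(sample))  # ['db_c', 'db_a']
```

Sorting oldest first puts the database closest to the wraparound horizon at the top of the list.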


Justin Pasher

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
