[HACKERS] Backends stalled in 'startup' state: index corruption

Greg Sabino Mullane Thu, 24 May 2012 10:34:07 -0700

Yesterday I had a client that experienced a sudden high load on 
one of their servers (8.3.5 - yes, I know. Those of you with 
clients will understand). When I checked, almost all connections 
were in a "startup" state, very similar to this thread:


http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html

Running a strace showed a lot of semop activity, and the logs showed a
successful connection, then a 5 minute plus wait before a query was issued.
So obviously, blocking on something. Unlike the thread above, I *did* find
problems in the system catalogs. For example, both pg_class and pg_index
gave warnings like this for every index during a VACUUM FULL
VERBOSE tablename:

WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions,
  but table contains 9471 row versions
HINT:  Rebuild the index with REINDEX.

A REINDEX did not solve the problem (initially), as a REINDEX followed
by a VAC showed the same warning and hint.

The next step was dropping to standalone mode, but before that could
be done, the REINDEXes fixed the problem (no warnings, no stalled 
connections). So my questions are:

* Why would a REINDEX not fix the problem as the hint suggested?
Why would it then start "working"?

* What exactly is the relationship between bad indexes and shared
memory locks?

* Is there some other emergency fix when it happens, such as killing
all backends and hoping you kill the one that is actually holding
the lock (if any was).

* Did anything in the 8.3 series fix this? I saw nothing relevant in
the release notes for everything up to 8.3.18 (which it will be on soon).

-- 
Greg Sabino Mullane [email protected]
End Point Corporation
PGP Key: 0x14964AC8

pgpwbxRekkZQf.pgp
Description: PGP signature

[HACKERS] Backends stalled in 'startup' state: index corruption

Reply via email to