Yesterday I had a client that experienced a sudden high load on one of their servers (8.3.5 - yes, I know. Those of you with clients will understand). When I checked, almost all connections were in a "startup" state, very similar to this thread:
http://postgresql.1045698.n5.nabble.com/9-1-3-backends-getting-stuck-in-startup-td5670712.html Running a strace showed a lot of semop activity, and the logs showed a successful connection, then a 5 minute plus wait before a query was issued. So obviously, blocking on something. Unlike the thread above, I *did* find problems in the system catalogs. For example, both pg_class and pg_index gave warnings like this for every index during a VACUUM FULL VERBOSE tablename: WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions, but table contains 9471 row versions HINT: Rebuild the index with REINDEX. A REINDEX did not solve the problem (initially), as a REINDEX followed by a VAC showed the same warning and hint. The next step was dropping to standalone mode, but before that could be done, the REINDEXes fixed the problem (no warnings, no stalled connections). So my questions are: * Why would a REINDEX not fix the problem as the hint suggested? Why would it then start "working"? * What exactly is the relationship between bad indexes and shared memory locks? * Is there some other emergency fix when it happens, such as killing all backends and hoping you kill the one that is actually holding the lock (if any was). * Did anything in the 8.3 series fix this? I saw nothing relevant in the release notes for everything up to 8.3.18 (which it will be on soon). -- Greg Sabino Mullane g...@endpoint.com End Point Corporation PGP Key: 0x14964AC8
pgpwbxRekkZQf.pgp
Description: PGP signature