[GENERAL] Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid

Ned Wolpert Thu, 21 Feb 2013 12:00:28 -0800

(I originally posted this to pgsql-admin and was pointed to here instead.)

Folks-


  I'm doing a postmortem on a corruption event we had. I have an idea on
what happened, but not sure. I figure I'd share what happened and see if
I'm close to right here.

  Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files,
and even a nightly pg_dump all. 50G database.  Trying to update or delete a
row in a small (21 row, but heavily used table) would lock up completely.
Never finish. Removed all clients, restarted the db instance, no joy. Check
pg_stat_activity, and nothing that wasn't idle.... run the delete, locked
up.

  Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.relation =
pg_class.oid;) with no clients touching this row, fresh restart of the db,
and saw virtualtransactions against this same table where the values would
be -1/nnnnn were nnnnn was a huge number.  Turns out we had about 159
entries from different tables in the database.  Checked hot-standby and, of
course, no locks or anything.  Switched to hot-standby.

  Hot-standby instantly gained these locks, Also noticed that 2 prepared
transactions migrated to the hot-standby. Binary upgraded to 9.1.8, locks
still existed. Ended up executing the one thing we knew would work. Take
the site down, pg_dumpall to fresh instance. Everything is fine.

A little more background: We were running 9.1.4 back when 9.1.6 came out.
We saw there was possible corruption issues and did a binary upgrade and
reindexing. Everything seemed to be fine, but I wonder if we really had
problems back then. We rebuilt the hot-standby after the binary upgrade via
normal restore and wal-file replays.  I should also note that this row that
had the lock on it that would not go away, was created by an app server
that was killed (via -9) since it was non-responsive, and the row 'create
date' (in db and logs) is the exact time the app server was killed.

I was wondering if a) these virtualtransactions that start with '-1/'
indicate a problem, b) if this could have happened from pre 9.1.6
corruption that was fixed in 9.1.6. Or, could this have occurred when we
killed that app server? Or.... am I looking in the wrong place.

I do still have the old data directories so I can start them up and check
out the dataset. Any advice?

Currently running environment: CentOS 5.x
Used the http://yum.postgresql.org/9.1 repo...
Currently using PostgreSQL 9.1.8 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
but was on 9.1.6 at failure time.
The hardware on both primary and secondary servers are in good condition,
raid5 via a hardware card with no issues.  Each a 2x4x2 (2 hyperthreaded
Xeon E5620 2.4Ghz, 4 cores each, 16 threads total) 32G Ram.

Data from pg_dumpall checked out fine; no data loss occurred that we could
tell. Just this transaction lock and what seemed like invalid backend ids
listed in the virtualtransaction from the pg_locks table/view.
(-1/verylargenumber)






-- 
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..."   --Marlowe

[GENERAL] Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid

Reply via email to