On 12/26/12 7:23 PM, Greg Stark wrote:
> It's also possible it's a bad cpu, not bad memory. If it affects
> decrement or increment in particular it's possible that the pattern of
> usage on LocalRefCount is particularly prone to triggering it.

This looks to be the winning answer. It turns out that under extended
multi-hour loads at high concurrency, something related to CPU
overheating was occasionally flipping a bit. One round of compressed
air for all the fans/vents, a little tweaking of the fan controls, and
now the system goes >24 hours with no problems.
Sorry about all the noise over this. I do think the improved warning
messages that came out of the diagnosis ideas are useful. The reworked
code must slow down the checking by a few cycles, but if you care about
performance these assertions are tacked onto the biggest pig around.
I added the patch to the January CF as "Improve buffer refcount leak
warning messages". The sample I showed with the patch submission was a
simulated one. Here's the output from the last crash before resolving
the issue, where the assertion really triggered:
WARNING: buffer refcount leak: [170583] (rel=base/16384/16578, blockNum=302295, flags=0x106, refcount=0 1073741824)
WARNING: buffers with non-zero refcount is 1
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 1712)
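
The second value printed after "refcount=" in that warning is consistent with the single-bit-flip explanation: 1073741824 is exactly 2^30. Below is a minimal sketch of that check, assuming the counter should have read 0 at that point (the warning itself does not state the expected value). The helper name flipped_bits and the 32-bit width are illustrative assumptions, not anything from bufmgr.c.

```python
def flipped_bits(observed, expected=0, width=32):
    """Return the bit positions where observed differs from expected."""
    # XOR leaves a 1 in every position where the two values disagree.
    diff = (observed ^ expected) & ((1 << width) - 1)
    return [i for i in range(width) if (diff >> i) & 1]


# Second value printed after "refcount=" in the warning above.
observed = 1073741824

# Assumed expected value of 0: a one-element result means a single bit differs.
print(flipped_bits(observed))  # prints [30]
```

A one-element result means the observed value is one bit away from the assumed expected value, which is what a hardware bit flip would produce. A software bug in the increment/decrement paths would more typically leave the count off by a small integer.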
--
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com