On 12/26/12 7:23 PM, Greg Stark wrote:
It's also possible it's a bad cpu, not bad memory. If it affects
decrement or increment in particular it's possible that the pattern of
usage on LocalRefCount is particularly prone to triggering it.

This looks to be the winning answer. It turns out that under extended multi-hour loads at high concurrency, something related to CPU overheating was occasionally flipping a bit. One round of compressed air for all the fans/vents, a little tweaking of the fan controls, and now the system goes >24 hours with no problems.

Sorry about all the noise over this. I do think the improved warning messages that came out of the diagnosis ideas are useful. The reworked code must slow down the checking by a few cycles, but if you care about performance, these assertions are tacked onto the biggest pig around.

I added the patch to the January CF as "Improve buffer refcount leak warning messages". The sample I showed with the patch submission was a simulated one. Here's the output from the last crash before resolving the issue, where the assertion really triggered:

WARNING: buffer refcount leak: [170583] (rel=base/16384/16578, blockNum=302295, flags=0x106, refcount=0 1073741824)
WARNING:  buffers with non-zero refcount is 1
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 1712)
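
As a quick sanity check that the stray value fits the single-bit-flip theory, here is a minimal sketch that lists which bit positions differ between an observed count and the expected one. The helper name flipped_bits is made up for illustration, and the expected value of 0 is an assumption; none of this is part of the patch.

```python
def flipped_bits(observed, expected=0):
    """Return the bit positions where observed differs from expected."""
    diff = observed ^ expected
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Second refcount value from the warning above. Comparing against 0 is an
# assumption: no references should have been left at that point.
print(flipped_bits(1073741824))  # → [30]
```

A single entry in the result means exactly one bit differs, which is what you would expect from an occasional hardware bit flip rather than a counting bug in the code.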

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers