"Gregory Stark" <[EMAIL PROTECTED]> writes: > "Tom Lane" <[EMAIL PROTECTED]> writes: > >> Gregory Stark <[EMAIL PROTECTED]> writes: >>> We're seeing a problem where occasionally a process appears to be granted a >>> lock but miss its semaphore signal. >> >> Kernel bug maybe? What's the platform? > > It does sound like it given the way my description went. I was worried it may > be some code path not setting waitStatus properly or the compiler caching it > incorrectly somehow. > > But now that I check I see it's a pretty old kernel version (Linux 2.6.5)

For what it's worth we've reproduced the problem with 2.6.16.21 which is "only"
about a year old. I want to rerun this with a shiny new 2.6.22 kernel but
really 2.6.16 is recent enough that I don't know of any major bugs fixed in IPC
handling since then (with the exception of hugetlb interaction, which we're not
using on this machine).

So now this is probably either an ongoing kernel bug affecting Postgres or it's
elsewhere -- either in Postgres or GCC.

I'm really concerned about this because while the behaviour with
deadlock_timeout set quite high (we have it set to 60s on this machine) is bad
enough -- the behaviour with it set to the default 1s is far more scary.

With the default 1s timeout, on a machine where lock waits are mostly under 1s,
you will probably never notice anything recognizably similar to this. You'll
occasionally have some lock waits which last a full second for no good reason,
because the lost wakeup is only made up for when the deadlock timer fires, but
you'll never notice that.

*But* if you should have a lock wait which lasts more than 1s before it's
granted, and the semaphore signal gets lost when it's granted, you're in
serious doo doo. The deadlock timeout only fires once and then nothing's going
to wake up that process ever again.

IIRC we've actually gotten a couple of reports of people claiming they've got a
"deadlock" when there was no evidence of a deadlock in pg_locks. We always
chalked it up to a single long-lived process holding the lock and blocking, but
never did much analysis on those reports to see if that was really the case.
It's quite possible we had users already observing this problem.

If it's a real problem then we're in a bit of a bind. Even if we find and fix a
Linux kernel problem we'll still have users on versions of the kernel prior to
2.6.23 or whatever has the bug fixed. We may be best off including an option to
have the deadlock timer refire every deadlock_timeout interval instead of just
firing once.
Then we could print a message any time it occurs and include a HINT
about upgrading to a kernel with the bug fixed.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
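
To make the difference between a one-shot and a refiring deadlock timer
concrete, here is a toy tick-based simulation in C. It is only a sketch under
stated assumptions, not PostgreSQL code: the Scenario struct, the WaitStatus
enum and simulate_wait() are hypothetical names invented for this example,
time is counted in abstract ticks, and the deadlock check that would run when
the timer fires is left out. All it models is a waiter that looks at its
status only when something wakes it, either the grant's signal or the timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a waiter's state; not PostgreSQL's structures. */
typedef enum { WAIT_PENDING, WAIT_GRANTED } WaitStatus;

typedef struct {
    int  deadlock_timeout;  /* ticks between timer firings */
    bool refire;            /* false: timer fires once; true: every interval */
    int  grant_tick;        /* tick at which the lock is granted */
    bool wakeup_lost;       /* true: the signal for the grant never arrives */
} Scenario;

/*
 * Returns the tick at which the waiter notices that it holds the lock,
 * or -1 if it is still asleep after max_ticks.
 */
int simulate_wait(Scenario s, int max_ticks)
{
    WaitStatus status = WAIT_PENDING;
    bool timer_armed = true;
    int next_fire = s.deadlock_timeout;

    assert(s.deadlock_timeout > 0);

    for (int tick = 1; tick <= max_ticks; tick++) {
        bool woken = false;

        if (tick == s.grant_tick) {
            status = WAIT_GRANTED;      /* granter updates the shared state */
            if (!s.wakeup_lost)
                woken = true;           /* and normally signals the waiter */
        }

        if (timer_armed && tick == next_fire) {
            woken = true;               /* timer interrupts the sleep */
            if (s.refire)
                next_fire += s.deadlock_timeout;  /* re-arm for next interval */
            else
                timer_armed = false;    /* one-shot: never fires again */
        }

        /* The waiter only re-checks its status when something wakes it. */
        if (woken && status == WAIT_GRANTED)
            return tick;
    }
    return -1;
}
```

With deadlock_timeout at 10 ticks and the wakeup lost, a grant at tick 5 is
only noticed at tick 10 (the unexplained full-timeout wait), a grant at tick 25
is never noticed with a one-shot timer, and the same grant is noticed at tick
30 once the timer refires. In this model the refiring timer bounds the damage
from a lost wakeup to at most one extra deadlock_timeout interval, and the
point where the timer finds the lock already granted is where a message and
HINT could be emitted.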