Hi Tom, the problem persists even when starting from scratch. I did the following:
# wget ftp://ftp.de.postgresql.org/mirror/postgresql/source/v7.3.4/postgresql-7.3.4.tar.gz
# tar xzf postgresql-7.3.4.tar.gz
# cd postgresql-7.3.4/
# cat ../mypatch
--- src/backend/storage/lmgr/lock.c~	2002-11-01 01:40:23.000000000 +0100
+++ src/backend/storage/lmgr/lock.c	2003-08-29 11:23:02.000000000 +0200
@@ -467,6 +467,8 @@
 	LWLockAcquire(masterLock, LW_EXCLUSIVE);
 
+	printf("lock\n"); fflush(stdout);
+
 	/*
 	 * Find or create a lock with this tag
 	 */
@@ -682,8 +684,13 @@
 	/*
 	 * Sleep till someone wakes me up.
 	 */
+
+	printf("before wait\n"); fflush(stdout);
+
 	status = WaitOnLock(lockmethod, lockmode, lock, holder);
+	printf("after wait\n"); fflush(stdout);
+
 	/*
 	 * NOTE: do not do any material change of state between here and
 	 * return. All required changes in locktable state must have been
# patch -p0 < ../mypatch
# gmake
# gmake install

After running DBT3 with scale factor 0.025 and 8 concurrent processes:

$ wc -l run/dbt3_logfile
51941 run/dbt3_logfile
$ grep lock run/dbt3_logfile | wc -l
51941
$ grep wait run/dbt3_logfile | wc -l
0

Well, I only added three printf() statements; I cannot imagine how that could break PostgreSQL. I repeated the test with the following additional modifications:

# cat ../mypatch2
--- src/backend/storage/lmgr/lock.c~	2003-08-29 11:26:37.000000000 +0200
+++ src/backend/storage/lmgr/lock.c	2003-08-29 11:57:26.000000000 +0200
@@ -39,6 +39,7 @@
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
+#include <sched.h>
 
 /* This configuration variable is used to set the lock table size */
 int	max_locks_per_xact; /* set by guc.c */
@@ -1160,6 +1161,7 @@
 	ProcLockWakeup(lockMethodTable, lock);
 
 	LWLockRelease(masterLock);
+	sched_yield();
 
 	return TRUE;
 }
@@ -1337,6 +1339,8 @@
 	elog(LOG, "LockReleaseAll: done");
 #endif
 
+	sched_yield();
+
 	return TRUE;
 }

This should lead to very frequent rescheduling, so that the processes are interleaved much more aggressively. After running DBT3: same result.
With my other patch producing thorough log output, the sched_yield() calls lead to a higher probability of observing badly granted locks. So it is very unlikely that my printf()s and the postprocessing of the logfile cause the problem. I have even observed cases where the error occurs within the first 10 locks, so that I can compute the lock state by hand and verify that there really are locks of mode 7 granted in parallel to different processes.

Although I cannot be sure that my environment (kernel, libc, compiler, ...) produces this behaviour, I think there remains some probability of a bug in the lock manager. I have repeated the tests on two different machines, one of them a dual-processor Athlon MP-1900+, the other a single-processor Athlon 3000+. Admittedly, both systems are running Red Hat 9, so there remains some chance that something very obscure happens at the OS level which is reproducible on both systems. In order to rule out possible OS effects, the above tests should be repeated by other people on other platforms. Please, if anyone could kindly do that, report the results here.

Tom, it sounds really strange, and I can hardly believe it myself, but I can imagine why this problem (if it really exists) was not detected before. The following is no claim, just an idea of how it could have happened. Please don't take it as a personal attack; I only want to explain that it _could_ be possible for a non-working lock manager not to have led to any noticeable problems. Also, I don't want to start a discussion about whether the following is right or not. It could be wrong.

(1) Most locks are non-conflicting by nature.

(2) If I understand it correctly, read-only txns use time-domain addressing and thus never conflict with any other txns. Only read-write txns can ever produce races on data.

(3) Critical regions often make up only a small percentage of the overall running time of a process.
(4) Rescheduling by the OS occurs not when processes are woken up, but only when a process blocks voluntarily or when a timer interrupt occurs.

(5) Current processors are faster than timer interrupts (typically 100/s) by a factor of about 10 million. When a process does not block voluntarily, it is interrupted only after about 10 million instructions on average. Thus the probability of hitting a critical region in exactly that rare moment is extremely low.

(6) I ran my tests on extremely small databases which fit in the buffer cache of the OS. Real-world applications do much more physical disk I/O, and at disk I/O, rescheduling _always_ occurs at the same place. When processes run for less than 10 ms until the next timer interrupt, there will never be interruptions at unforeseeable places.

In summary, if this theory is right, it _could_ be _possible_ that "unpredictable" behaviour has never been noticed, because it occurs only with extremely low probability. I don't want to claim that this is the reality, just to provide some idea of how it _could_ have happened if the problem really exists.

Tom, please dig into the problem. If the lock manager is really broken, all my measurements are at least questionable, if not void. I have written a paper relying on those measurements and want to submit it to a conference in two weeks. I hope that fixing the problem (if it exists) will not lead to totally different behaviour and render my whole work void. Please help me by investigating the problem, finding out what happens, and fixing it if it should turn out to be a bug.

Cheers,
Thomas