Hi Tom,

The problem persists, even when starting from scratch. I did the following:

# wget ftp://ftp.de.postgresql.org/mirror/postgresql/source/v7.3.4/postgresql-7.3.4.tar.gz
# tar xzf postgresql-7.3.4.tar.gz
# cd postgresql-7.3.4/
# cat ../mypatch
--- src/backend/storage/lmgr/lock.c~    2002-11-01 01:40:23.000000000 +0100
+++ src/backend/storage/lmgr/lock.c     2003-08-29 11:23:02.000000000 +0200
@@ -467,6 +467,8 @@

        LWLockAcquire(masterLock, LW_EXCLUSIVE);

+       printf("lock\n"); fflush(stdout);
+
        /*
         * Find or create a lock with this tag
         */
@@ -682,8 +684,13 @@
                /*
                 * Sleep till someone wakes me up.
                 */
+
+               printf("before wait\n"); fflush(stdout);
+
                status = WaitOnLock(lockmethod, lockmode, lock, holder);

+               printf("after wait\n"); fflush(stdout);
+
                /*
                 * NOTE: do not do any material change of state between here and
                 * return.      All required changes in locktable state must have been
# patch -p0 < ../mypatch
# ./configure
# gmake
# gmake install

After running DBT3 with scale factor 0.025 and 8 concurrent processes:

$ wc -l run/dbt3_logfile
  51941 run/dbt3_logfile
$ grep lock run/dbt3_logfile | wc -l
  51941
$ grep wait run/dbt3_logfile | wc -l
      0

So all 51941 lines of the logfile contain the "lock" marker, and
"before wait" / "after wait" never appear: apparently no backend ever
blocks in WaitOnLock(), even with 8 concurrent processes. Well, I just
added three printf() statements; I cannot imagine how that could break
PostgreSQL.

I repeated the test with the following additional modifications:

# cat ../mypatch2
--- src/backend/storage/lmgr/lock.c~    2003-08-29 11:26:37.000000000 +0200
+++ src/backend/storage/lmgr/lock.c     2003-08-29 11:57:26.000000000 +0200
@@ -39,6 +39,7 @@
 #include "utils/memutils.h"
 #include "utils/ps_status.h"

+#include <sched.h>

 /* This configuration variable is used to set the lock table size */
 int                    max_locks_per_xact; /* set by guc.c */
@@ -1160,6 +1161,7 @@
                ProcLockWakeup(lockMethodTable, lock);

        LWLockRelease(masterLock);
+       sched_yield();
        return TRUE;
 }

@@ -1337,6 +1339,8 @@
                elog(LOG, "LockReleaseAll: done");
 #endif

+       sched_yield();
+
        return TRUE;
 }

This should force very frequent rescheduling, so that the processes
are interleaved much more aggressively. After running DBT3: same
result.

With my other patch, which produces thorough log output, the
sched_yield() calls increase the probability of observing wrongly
granted locks.
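
To illustrate why yielding inside a critical window flushes out races,
here is a standalone toy program (my own illustration, Linux-specific,
nothing to do with the PostgreSQL sources): two processes increment a
shared counter without any locking, and the sched_yield() between the
read and the write-back makes the lost-update race almost certain to
be observed.

/* yield_race.c -- toy demo (Linux), not PostgreSQL code: two
 * processes increment a shared counter without locking; the
 * sched_yield() inside the read-modify-write window makes the
 * lost-update race easy to observe. */
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
        volatile int *counter;
        pid_t   pid;
        int     i;

        /* shared, anonymous mapping visible to parent and child */
        counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *counter = 0;

        pid = fork();
        for (i = 0; i < 100000; i++)
        {
                int tmp = *counter;     /* read                  */
                sched_yield();          /* widen the race window */
                *counter = tmp + 1;     /* write back            */
        }
        if (pid == 0)
                return 0;               /* child is done */
        wait(NULL);
        /* without lost updates this would print 200000 */
        printf("counter = %d (expected 200000)\n", *counter);
        return 0;
}

The same logic applies to the lock manager: if a race window exists,
a sched_yield() placed inside or right after it should dramatically
raise the chance that another backend runs in exactly that window.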

So it is very unlikely that my printf()s and my postprocessing of the
logfile cause the problem. I have even observed cases where the error
occurs within the first 10 locks, so that I can compute the lock state
by hand and verify that there really are locks of mode 7 granted to
different processes in parallel.
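
For reference, this is the kind of conflict check I do by hand. The
following standalone sketch is my own illustration, not PostgreSQL
source: it transcribes the table-level lock conflict matrix as given
in the documentation (mode numbering as in the default lock method,
1 = AccessShareLock ... 8 = AccessExclusiveLock; worth double-checking
against lock.h/lock.c). Mode 7 (ExclusiveLock) conflicts with itself,
so two simultaneous grants of mode 7 on the same lock tag can never
be correct:

/* conflict_check.c -- my own sketch, not PostgreSQL source.
 * Table-level lock conflict matrix as documented; modes numbered
 * 1 = AccessShareLock ... 7 = ExclusiveLock, 8 = AccessExclusiveLock. */
#include <stdio.h>

#define NUM_LOCK_MODES 8

/* conflicts[m] has bit (k-1) set iff mode m conflicts with mode k */
static const unsigned short conflicts[NUM_LOCK_MODES + 1] = {
        0,      /* unused                                     */
        0x80,   /* 1 AccessShare:     conflicts with 8        */
        0xC0,   /* 2 RowShare:        conflicts with 7,8      */
        0xF0,   /* 3 RowExclusive:    conflicts with 5..8     */
        0xF8,   /* 4 ShareUpdateExcl: conflicts with 4..8     */
        0xEC,   /* 5 Share:           conflicts with 3,4,6..8 */
        0xFC,   /* 6 ShareRowExcl:    conflicts with 3..8     */
        0xFE,   /* 7 Exclusive:       conflicts with 2..8     */
        0xFF    /* 8 AccessExclusive: conflicts with 1..8     */
};

static int
modes_conflict(int m1, int m2)
{
        return (conflicts[m1] & (1 << (m2 - 1))) != 0;
}

int
main(void)
{
        printf("7 vs 7: %d\n", modes_conflict(7, 7));   /* 1, conflict   */
        printf("1 vs 8: %d\n", modes_conflict(1, 8));   /* 1, conflict   */
        printf("1 vs 7: %d\n", modes_conflict(1, 7));   /* 0, compatible */
        return 0;
}

Given the per-lock-tag sequence of grants and releases in the log,
replaying a prefix of the trace against this matrix is purely
mechanical, which is what I did for the short cases above.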

Although I cannot rule out that my environment (kernel, libc,
compiler, ...) produces this behaviour, I think there remains some
probability of a bug in the lock manager. I have repeated the tests
on two different machines, one of them a dual-processor Athlon
MP-1900+, the other a single-processor Athlon 3000+. Admittedly, both
systems run Red Hat 9, so there remains a chance that something very
obscure happens at the OS level and is reproducible on both systems.

In order to rule out possible OS effects, the above tests should be
repeated by other people on other platforms. If anyone could kindly
do that, please report the results here.

Tom, it sounds really strange, and I can hardly believe it myself,
but I can imagine why this problem (if it really exists) was not
detected before. The following is not a claim, just an idea of how it
could have happened. Please don't take it as a personal attack; I
only want to explain that it _could_ be possible for a non-working
lock manager to never have caused any noticeable problems. Also, I
don't want to start a discussion about whether the following is right
or not. It could be wrong.

(1) Most of the locks are non-conflicting by nature.
(2) If I understand it right, read-only txns use time-domain addressing
    (MVCC snapshots) and thus never conflict with any other txns. Only
    read-write txns can ever produce races on data.
(3) Critical regions are often only a small percentage of the overall
    running time of a process.
(4) Rescheduling by the OS occurs not when processes are woken up,
    but only when a process blocks by itself or when a timer
    interrupt occurs.
(5) Current processors are faster than the timer interrupt (typically
    100/s) by a factor of about 10 million. When a process does not
    block by itself, it is interrupted only after about 10 million
    instructions on average. Thus the probability of hitting a
    critical region at exactly that rare moment is extremely low;
    see the arithmetic after this list.
(6) I ran my tests on extremely small databases which fit into the
    buffer cache of the OS. Real-world applications do much more
    physical disk IO. On disk IO, rescheduling _always_ occurs at
    the same place. When processes run for less than 10 ms until
    the next timer interrupt, there are never interruptions at
    unforeseeable places.
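
To make the arithmetic in (5) concrete, with hypothetical but typical
numbers: assume a 1 GHz CPU retiring roughly one instruction per cycle
and the standard Linux timer frequency of HZ=100. A process that never
blocks then runs about 10^9 / 100 = 10^7 instructions between two
timer interrupts. If a critical region is, say, 1000 instructions
long, the chance that a given timer interrupt preempts the process
inside that region is only about 1000 / 10^7 = 0.01%.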

In summary, if this theory is right, it _could_ be _possible_ that
"unpredictable" behaviour has never been noticed, because it occurs
only with extremely low probability.

I don't claim that this is what actually happens; I just want to give
some idea of how it _could_ have happened if the problem really
exists.

Tom, please dig into the problem. If the lock manager is really
broken, all my measurements are at least questionable, if not void. I
have written a paper relying on those measurements and want to submit
it to a conference in two weeks. I hope that fixing the problem (if
it exists) will not lead to totally different behaviour and render my
whole work void. Please help me by investigating the problem, finding
out what happens, and fixing it if it turns out to be a bug.

Cheers,

Thomas
