subject:"Re\: \[PATCH\] Re\: Negative scalability by removal of"

Re: [PATCH] Re: Negative scalability by removal of

2000-11-20 Thread lamont



there's already the Linux Scalability Project's wake_one() patch for 2.2.9
(which applies fine to 2.2.18preX):

http://www.citi.umich.edu/projects/linux-scalability/patches/p_accept-2.2.9.diff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-07 Thread Alan Cox


> Anyway, version 2 below uses LIFO for the accept() wakeups.  This
> appears to be a 5%-10% win for Apache.  The browsing loop for
> exclusive tasks will now pull in cachelines 0 and 2, rather
> than the previous 0 and 1.

That makes it much worse for the newest cpus which use 64byte lines (Athlon
and PIV)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-07 Thread Andrew Morton


Linus Torvalds wrote:
> 
> On Tue, 7 Nov 2000, Andrew Morton wrote:
> 
> > Alan Cox wrote:
> > >
> > > > Even 2.2.x can be fixed to do the wake-one for accept(), if required.
> > >
> > > Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
> > > try and backport all the mechanism. I think for 2.2 using the semaphore is a
> > > good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed
> >
> > It's a 16-liner!  I'll cheerfully admit that this patch
> > may be completely broken, but hey, it's free.  I suggest
> > that _something_ has to be done for 2.2 now, because
> > Apache has switched to unserialised accept().
> 
> This is why I'd love to _not_ see silly work-arounds in apache: we
> obviously _can_ fix the places where our performance sucks, but only if we
> don't have other band-aids hiding the true issues.
> 
> For example, with a file-locking apache, we'd have to fix the (noticeably
> harder) file locking thing to be wake-one instead, and even then we'd
> never be able to do as well as something that gets the same wake-one thing
> without the two extra system calls.
> 
> The patch looks superficially fine to me, although it does seem to add
> another cache-line to the wakeup setup - it migth be worth-while to have
> the exclusive state closer. But maybe I just didn't count right.

Your counting's fine.  But I figured the third cachline was OK
because we're going to need that in add_to_runqueue() a few
cycles later.

Anyway, version 2 below uses LIFO for the accept() wakeups.  This
appears to be a 5%-10% win for Apache.  The browsing loop for
exclusive tasks will now pull in cachelines 0 and 2, rather
than the previous 0 and 1.

--- linux-2.2.18-pre19/include/linux/sched.hSun Nov  5 11:46:54 2000
+++ linux-akpm/include/linux/sched.hTue Nov  7 20:20:13 2000
@@ -79,6 +79,7 @@
 #define TASK_ZOMBIE4
 #define TASK_STOPPED   8
 #define TASK_SWAPPING  16
+#define TASK_EXCLUSIVE 32
 
 /*
  * Scheduling policies
@@ -251,6 +252,7 @@
struct task_struct *next_task, *prev_task;
struct task_struct *next_run,  *prev_run;
 
+   unsigned int task_exclusive;/* task wants wake-one semantics in 
+__wake_up() */
 /* task state */
struct linux_binfmt *binfmt;
int exit_code, exit_signal;
@@ -370,6 +372,7 @@
 /* counter */  DEF_PRIORITY,DEF_PRIORITY,0, \
 /* SMP */  0,0,0,-1, \
 /* schedlink */_task,_task, _task, _task, \
+/* task_exclusive */ 0, \
 /* binfmt */   NULL, \
 /* ec,brk... */0,0,0,0,0,0, \
 /* pid etc.. */0,0,0,0,0, \
@@ -496,8 +499,8 @@
signed long timeout));
 extern void FASTCALL(wake_up_process(struct task_struct * tsk));
 
-#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | 
TASK_INTERRUPTIBLE)
-#define wake_up_interruptible(x)   __wake_up((x),TASK_INTERRUPTIBLE)
+#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | 
+TASK_INTERRUPTIBLE | TASK_EXCLUSIVE)
+#define wake_up_interruptible(x)   __wake_up((x),TASK_INTERRUPTIBLE | 
+TASK_EXCLUSIVE)
 
 #define __set_current_state(state_value)   do { current->state = state_value; } 
while (0)
 #ifdef __SMP__
--- linux-2.2.18-pre19/kernel/sched.c   Sun Nov  5 11:46:54 2000
+++ linux-akpm/kernel/sched.c   Tue Nov  7 20:23:25 2000
@@ -890,8 +890,9 @@
  */
 void __wake_up(struct wait_queue **q, unsigned int mode)
 {
-   struct task_struct *p;
+   struct task_struct *p, *last_exclusive;
struct wait_queue *head, *next;
+   unsigned int done_exclusive, do_exclusive;
 
 if (!q)
goto out;
@@ -906,10 +907,17 @@
if (!next)
goto out_unlock;
 
+   last_exclusive = NULL;
+   do_exclusive = mode & TASK_EXCLUSIVE;
while (next != head) {
p = next->task;
next = next->next;
if (p->state & mode) {
+   if (do_exclusive && p->task_exclusive) {
+   last_exclusive = p;
+   continue;
+   }
+
/*
 * We can drop the read-lock early if this
 * is the only/last process.
@@ -922,6 +930,8 @@
wake_up_process(p);
}
}
+   if (last_exclusive)
+   wake_up_process(last_exclusive);
 out_unlock:
read_unlock(_lock);
 out:
--- linux-2.2.18-pre19/net/ipv4/tcp.c   Sun Nov  5 11:46:54 2000
+++ linux-akpm/net/ipv4/tcp.c   Tue Nov  7 20:20:13 2000
@@ -1619,6 +1619,7 @@
struct wait_queue wait = { current, NULL };
struct open_request *req;
 
+   current->task_exclusive = 1;
add_wait_queue(sk->sleep, );
for (;;) {
current->state = TASK_INTERRUPTIBLE;
@@ -1632,6 +1633,8 @@
break;
}
current->state =

Re: [PATCH] Re: Negative scalability by removal of

2000-11-07 Thread dean gaudet


haha, ok!  :)

(well i'm sure you know the history, but for others -- that code entered
apache not specifically for linux... but specifically for handling the
many early-to-mid 90s unixes that just plain broke on multiple accept :)

-dean

On Mon, 6 Nov 2000, David S. Miller wrote:

>Date:  Mon, 6 Nov 2000 21:23:57 -0800 (PST)
>From: dean gaudet <[EMAIL PROTECTED]>
> 
>  apache is about correctness first, and performance second.
> 
> Which is why we say it is "incorrect" for apache to try
> and work around kernel performance problems. :-)))
> 
> Later,
> David S. Miller
> [EMAIL PROTECTED]
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-07 Thread dean gaudet


haha, ok!  :)

(well i'm sure you know the history, but for others -- that code entered
apache not specifically for linux... but specifically for handling the
many early-to-mid 90s unixes that just plain broke on multiple accept :)

-dean

On Mon, 6 Nov 2000, David S. Miller wrote:

Date:  Mon, 6 Nov 2000 21:23:57 -0800 (PST)
From: dean gaudet [EMAIL PROTECTED]
 
  apache is about correctness first, and performance second.
 
 Which is why we say it is "incorrect" for apache to try
 and work around kernel performance problems. :-)))
 
 Later,
 David S. Miller
 [EMAIL PROTECTED]
 -
 To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
 the body of a message to [EMAIL PROTECTED]
 Please read the FAQ at http://www.tux.org/lkml/
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-07 Thread Andrew Morton


Linus Torvalds wrote:
 
 On Tue, 7 Nov 2000, Andrew Morton wrote:
 
  Alan Cox wrote:
  
Even 2.2.x can be fixed to do the wake-one for accept(), if required.
  
   Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
   try and backport all the mechanism. I think for 2.2 using the semaphore is a
   good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed
 
  It's a 16-liner!  I'll cheerfully admit that this patch
  may be completely broken, but hey, it's free.  I suggest
  that _something_ has to be done for 2.2 now, because
  Apache has switched to unserialised accept().
 
 This is why I'd love to _not_ see silly work-arounds in apache: we
 obviously _can_ fix the places where our performance sucks, but only if we
 don't have other band-aids hiding the true issues.
 
 For example, with a file-locking apache, we'd have to fix the (noticeably
 harder) file locking thing to be wake-one instead, and even then we'd
 never be able to do as well as something that gets the same wake-one thing
 without the two extra system calls.
 
 The patch looks superficially fine to me, although it does seem to add
 another cache-line to the wakeup setup - it migth be worth-while to have
 the exclusive state closer. But maybe I just didn't count right.

Your counting's fine.  But I figured the third cachline was OK
because we're going to need that in add_to_runqueue() a few
cycles later.

Anyway, version 2 below uses LIFO for the accept() wakeups.  This
appears to be a 5%-10% win for Apache.  The browsing loop for
exclusive tasks will now pull in cachelines 0 and 2, rather
than the previous 0 and 1.

--- linux-2.2.18-pre19/include/linux/sched.hSun Nov  5 11:46:54 2000
+++ linux-akpm/include/linux/sched.hTue Nov  7 20:20:13 2000
@@ -79,6 +79,7 @@
 #define TASK_ZOMBIE4
 #define TASK_STOPPED   8
 #define TASK_SWAPPING  16
+#define TASK_EXCLUSIVE 32
 
 /*
  * Scheduling policies
@@ -251,6 +252,7 @@
struct task_struct *next_task, *prev_task;
struct task_struct *next_run,  *prev_run;
 
+   unsigned int task_exclusive;/* task wants wake-one semantics in 
+__wake_up() */
 /* task state */
struct linux_binfmt *binfmt;
int exit_code, exit_signal;
@@ -370,6 +372,7 @@
 /* counter */  DEF_PRIORITY,DEF_PRIORITY,0, \
 /* SMP */  0,0,0,-1, \
 /* schedlink */init_task,init_task, init_task, init_task, \
+/* task_exclusive */ 0, \
 /* binfmt */   NULL, \
 /* ec,brk... */0,0,0,0,0,0, \
 /* pid etc.. */0,0,0,0,0, \
@@ -496,8 +499,8 @@
signed long timeout));
 extern void FASTCALL(wake_up_process(struct task_struct * tsk));
 
-#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | 
TASK_INTERRUPTIBLE)
-#define wake_up_interruptible(x)   __wake_up((x),TASK_INTERRUPTIBLE)
+#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | 
+TASK_INTERRUPTIBLE | TASK_EXCLUSIVE)
+#define wake_up_interruptible(x)   __wake_up((x),TASK_INTERRUPTIBLE | 
+TASK_EXCLUSIVE)
 
 #define __set_current_state(state_value)   do { current-state = state_value; } 
while (0)
 #ifdef __SMP__
--- linux-2.2.18-pre19/kernel/sched.c   Sun Nov  5 11:46:54 2000
+++ linux-akpm/kernel/sched.c   Tue Nov  7 20:23:25 2000
@@ -890,8 +890,9 @@
  */
 void __wake_up(struct wait_queue **q, unsigned int mode)
 {
-   struct task_struct *p;
+   struct task_struct *p, *last_exclusive;
struct wait_queue *head, *next;
+   unsigned int done_exclusive, do_exclusive;
 
 if (!q)
goto out;
@@ -906,10 +907,17 @@
if (!next)
goto out_unlock;
 
+   last_exclusive = NULL;
+   do_exclusive = mode  TASK_EXCLUSIVE;
while (next != head) {
p = next-task;
next = next-next;
if (p-state  mode) {
+   if (do_exclusive  p-task_exclusive) {
+   last_exclusive = p;
+   continue;
+   }
+
/*
 * We can drop the read-lock early if this
 * is the only/last process.
@@ -922,6 +930,8 @@
wake_up_process(p);
}
}
+   if (last_exclusive)
+   wake_up_process(last_exclusive);
 out_unlock:
read_unlock(waitqueue_lock);
 out:
--- linux-2.2.18-pre19/net/ipv4/tcp.c   Sun Nov  5 11:46:54 2000
+++ linux-akpm/net/ipv4/tcp.c   Tue Nov  7 20:20:13 2000
@@ -1619,6 +1619,7 @@
struct wait_queue wait = { current, NULL };
struct open_request *req;
 
+   current-task_exclusive = 1;
add_wait_queue(sk-sleep, wait);
for (;;) {
current-state = TASK_INTERRUPTIBLE;
@@ -1632,6 +1633,8 @@
break;
}
current-state = TASK_RUNNING;
+   wmb();
+

Re: [PATCH] Re: Negative scalability by removal of

2000-11-07 Thread Alan Cox


 Anyway, version 2 below uses LIFO for the accept() wakeups.  This
 appears to be a 5%-10% win for Apache.  The browsing loop for
 exclusive tasks will now pull in cachelines 0 and 2, rather
 than the previous 0 and 1.

That makes it much worse for the newest cpus which use 64byte lines (Athlon
and PIV)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread David S. Miller

   Date:Mon, 6 Nov 2000 21:23:57 -0800 (PST)
   From: dean gaudet <[EMAIL PROTECTED]>

 apache is about correctness first, and performance second.

Which is why we say it is "incorrect" for apache to try
and work around kernel performance problems. :-)))

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread dean gaudet


On Mon, 6 Nov 2000, Linus Torvalds wrote:

> This is why I'd love to _not_ see silly work-arounds in apache

hey, maybe it's time for me to repeat something that i'm often quoted as
saying:

  apache is about correctness first, and performance second.

i don't think that's silly personally.  remember most websites can be
served fine off an anemic 486/33 with one ethernet port tied behind its
back while doing a three legged race with a 6502 up a hill in san
francisco during el nino.

don't let the benchmarks fool ya!  it's generally more important that a
server be able to fork perl and parse CGIs fast than it is for it to
accept an extra 1000 conns/s.

apache-1.3.15 defines SINGLE_LISTEN_UNSERIALIZED_ACCEPT on linux 2.2 and
later.  dunno when the release date will be... someone go find a security
flaw and it'll push up the release ;)  (p.s. and rbb promised to forward
the change into 2.0 and rse said he'd forward the change into mm, all of
which were based off the same code.)

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Linus Torvalds

On Tue, 7 Nov 2000, Andrew Morton wrote:

> Alan Cox wrote:
> > 
> > > Even 2.2.x can be fixed to do the wake-one for accept(), if required.
> > 
> > Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
> > try and backport all the mechanism. I think for 2.2 using the semaphore is a
> > good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed
> 
> It's a 16-liner!  I'll cheerfully admit that this patch
> may be completely broken, but hey, it's free.  I suggest
> that _something_ has to be done for 2.2 now, because
> Apache has switched to unserialised accept().

This is why I'd love to _not_ see silly work-arounds in apache: we
obviously _can_ fix the places where our performance sucks, but only if we
don't have other band-aids hiding the true issues.

For example, with a file-locking apache, we'd have to fix the (noticeably
harder) file locking thing to be wake-one instead, and even then we'd
never be able to do as well as something that gets the same wake-one thing
without the two extra system calls.

The patch looks superficially fine to me, although it does seem to add
another cache-line to the wakeup setup - it migth be worth-while to have
the exclusive state closer. But maybe I just didn't count right.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Alan Cox


> It's a 16-liner!  I'll cheerfully admit that this patch
> may be completely broken, but hey, it's free.  I suggest
> that _something_ has to be done for 2.2 now, because
> Apache has switched to unserialised accept().

Interesting

> The fact that the throughput is 3-4 time worse for 2, 3, 4 and 5
> server processes is completely wierd.  Perhaps some strange miss
> pattern, but it doesn't do it on 2.4.  I'll dump this problem
> onto the netdev list, see if anyone has any bright ideas.

That would be consistent with the fact that thttpd is single threaded and
kicks apache for performance in 2.2 (less so 2.4!)

> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Andrea Arcangeli

On Tue, Nov 07, 2000 at 01:27:07AM +1100, Andrew Morton wrote:
> context. For networking, where it is called from softirq context
> it is O(N).

Yes, the heuristic that runs from irqs has an O(N) worst case in wakeup.
But the current implementation is silly because the best case could run 
without browsing the whole queue. This is an untested fix for such heuristic
implementation complexity problem:

--- 2.4.0-test10/kernel/sched.c.~1~ Thu Nov  2 20:59:05 2000
+++ 2.4.0-test10/kernel/sched.c Mon Nov  6 16:44:07 2000
@@ -742,9 +742,10 @@
if (irq && (state & mode & TASK_EXCLUSIVE)) {
if (!best_exclusive)
best_exclusive = p;
-   else if ((p->processor == best_cpu) &&
-   (best_exclusive->processor != best_cpu))
-   best_exclusive = p;
+   if (p->processor == best_cpu) {
+   best_exclusive = p;
+   break;
+   }
} else {
if (sync)
wake_up_process_synchronous(p);

This also ensure a FIFO order between the threads that are affine to the
current CPU (while previously it was first LIFO for the affine threads and then
FIFO if there was none affine thread and that doesn't make sense, since for
seldom cases like accept we want LIFO in both cases and for everything else
like semaphores we need FIFO and just allowing affine tasks to pass the other
threads can generate starvation in theory... (and that heuristic in infact
not fair at all with tasks running in other CPUs while using irq affinity, and
on alpha we are likley to use it most of the time to avoid interrupting all the
CPUs).

> hmm.. Has this been changed in 2.4?  The performance is the same.

It is fixed in 2.4.x. Anyways it probably worth to do the wake-one in 2.2.x
too even if it's a wake-three in the first stage.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Andrew Morton


Alan Cox wrote:
> 
> > Even 2.2.x can be fixed to do the wake-one for accept(), if required.
> 
> Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
> try and backport all the mechanism. I think for 2.2 using the semaphore is a
> good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed

It's a 16-liner!  I'll cheerfully admit that this patch
may be completely broken, but hey, it's free.  I suggest
that _something_ has to be done for 2.2 now, because
Apache has switched to unserialised accept().


More columns of numbers...

Apache connection rate.  2.2.18-pre19-UP, uniprocessor:

Servers Unpatched  (conn/sec)   Patched (conn/sec)
  3   371   375
  10  361   374
  30  361   374
  150 112   351


Same test as the last email - minimal connect/close to
localhost:

Servers Unpatched   Patched
(waiters in accept())   conn/secconn/sec

  1 16702000
  2 670 670
  3 470 470
  4 470 1670
  5 470 470
  6 16701670
  7 16701670
  8 14301670
  9 14301670
  1014301670
  50590 1430
  100   250 1250
  200   117 1110


The fact that the throughput is 3-4 time worse for 2, 3, 4 and 5
server processes is completely wierd.  Perhaps some strange miss
pattern, but it doesn't do it on 2.4.  I'll dump this problem
onto the netdev list, see if anyone has any bright ideas.





--- linux-2.2.18-pre19/include/linux/sched.hSun Nov  5 11:46:54 2000
+++ linux-akpm/include/linux/sched.hMon Nov  6 22:01:51 2000
@@ -79,6 +79,7 @@
 #define TASK_ZOMBIE4
 #define TASK_STOPPED   8
 #define TASK_SWAPPING  16
+#define TASK_EXCLUSIVE 32
 
 /*
  * Scheduling policies
@@ -251,6 +252,7 @@
struct task_struct *next_task, *prev_task;
struct task_struct *next_run,  *prev_run;
 
+   unsigned int task_exclusive;/* task wants wake-one semantics in 
+__wake_up() */
 /* task state */
struct linux_binfmt *binfmt;
int exit_code, exit_signal;
@@ -370,6 +372,7 @@
 /* counter */  DEF_PRIORITY,DEF_PRIORITY,0, \
 /* SMP */  0,0,0,-1, \
 /* schedlink */_task,_task, _task, _task, \
+/* task_exclusive */ 0, \
 /* binfmt */   NULL, \
 /* ec,brk... */0,0,0,0,0,0, \
 /* pid etc.. */0,0,0,0,0, \
@@ -496,8 +499,8 @@
signed long timeout));
 extern void FASTCALL(wake_up_process(struct task_struct * tsk));
 
-#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | 
TASK_INTERRUPTIBLE)
-#define wake_up_interruptible(x)   __wake_up((x),TASK_INTERRUPTIBLE)
+#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | 
+TASK_INTERRUPTIBLE | TASK_EXCLUSIVE)
+#define wake_up_interruptible(x)   __wake_up((x),TASK_INTERRUPTIBLE | 
+TASK_EXCLUSIVE)
 
 #define __set_current_state(state_value)   do { current->state = state_value; } 
while (0)
 #ifdef __SMP__
--- linux-2.2.18-pre19/kernel/sched.c   Sun Nov  5 11:46:54 2000
+++ linux-akpm/kernel/sched.c   Mon Nov  6 21:51:57 2000
@@ -892,6 +892,7 @@
 {
struct task_struct *p;
struct wait_queue *head, *next;
+   unsigned int done_exclusive, do_exclusive;
 
 if (!q)
goto out;
@@ -906,10 +907,18 @@
if (!next)
goto out_unlock;
 
+   done_exclusive = 0;
+   do_exclusive = mode & TASK_EXCLUSIVE;
while (next != head) {
p = next->task;
next = next->next;
if (p->state & mode) {
+   if (do_exclusive && p->task_exclusive) {
+   if (done_exclusive)
+   continue;
+   done_exclusive = 1;
+   }
+
/*
 * We can drop the read-lock early if this
 * is the only/last process.
--- linux-2.2.18-pre19/net/ipv4/tcp.c   Sun Nov  5 11:46:54 2000
+++ linux-akpm/net/ipv4/tcp.c   Mon Nov  6 21:26:48 2000
@@ -1619,6 +1619,7 @@
struct wait_queue wait = { current, NULL };
struct open_request *req;
 
+   current->task_exclusive = 1;
add_wait_queue(sk->sleep, );
for (;;) {
current->state = TASK_INTERRUPTIBLE;
@@ -1632,6 +1633,8 @@

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Andrew Morton

Andrea Arcangeli wrote:
> 
> On Sat, Nov 04, 2000 at 09:22:58AM -0800, Linus Torvalds wrote:
> > We don't need to backport of the full exclusive wait queues: we could do
> > the equivalent of the semaphore inside the kernel around just accept(). It
> > wouldn't be a generic thing, but it would fix the specific case of
> > accept().
> 
> The first wake-one patch floating around was against 2.2.x waitqueues and
> it's a very simple patch and it fixes the problem (it also gives LIFO
> to accept with the downside that it needs to do an O(N) browse on the
> waitqueue before doing the exclusive wakeup compared to 2.4.x that does
> the wake-one task selection in O(1) if everybody is sleeping in accept, but it
> does that FIFO unfortunately).

Andrea,  the 2.4 wakeup code is only O(1) if called from process
context. For networking, where it is called from softirq context
it is O(N).

The waitqueue patch which I did for kumon@fujitsu does O(1) from
all contexts as well as LIFO and the results are fairly clear.

Here the 'server' just forks a bunch of children which sit
in accept().  The 'client' opens and closes the connection
as fast as it can, 10,000 times. All processes run on localhost:

2.4.0-test10-SMP on UP
==

Servers Unpatched   Patched
(secs)  (secs)

  1 6   6
  2 6   6
  3 6   6
  4 6   6
  100   7   6
  200   7   6
  400   9   6
  600   10  6
  1000  13  6.6

2.4.0-test10-SMP on SMP (2 x x86)
=

Servers Unpatched   Patched
  1 1.1 1.1
  2 1.2 1.3
  3 1.2 1.3
  4 1.2 1.3
  100   1.8 1.3
  200   2.3 1.3
  400   3.2 1.3
  600   4.1 1.3
  1000  6.0 1.3

Easy to explain.  5 seconds / 1000 servers / 10,000 iterations
is 500nSecs per loop in __wake_up_common.  I figure that loop
has three misses and probably a TLB reload.

However the Apache performance was increased by 10% irrespective
of the number of servers - from 3 to 150.  Presumably the effect
of the LIFO.

Frankly, the fact that we fall from 9100 conn/sec to 5500 when
there are 100 processes waiting on a socket doesn't seem
particularly important to me - this is a mistuned server.
I suggest that we not sweat it.

But the patch is a better implementation, so expect to hear more
about it when the time is appropriate :)  (It currently does
LIFO for semaphores as well which, as you have explained is
wrong.   Will fix.)

> The real problem that DaveM knows well is that TCP in 2.2.x will end doing
> three wakeups every time the socket moves from LISTEN to ESTABLISHED state, so
> it was really doing a wake-three not a wake-one :). So the brainer part
> is to fix TCP and not the scheduler/waitqueue part.

hmm.. Has this been changed in 2.4?  The performance is the same.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread David S. Miller

   Date:Mon, 6 Nov 2000 21:23:57 -0800 (PST)
   From: dean gaudet [EMAIL PROTECTED]

 apache is about correctness first, and performance second.

Which is why we say it is "incorrect" for apache to try
and work around kernel performance problems. :-)))

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Andrew Morton


Andrea Arcangeli wrote:
 
 On Sat, Nov 04, 2000 at 09:22:58AM -0800, Linus Torvalds wrote:
  We don't need to backport of the full exclusive wait queues: we could do
  the equivalent of the semaphore inside the kernel around just accept(). It
  wouldn't be a generic thing, but it would fix the specific case of
  accept().
 
 The first wake-one patch floating around was against 2.2.x waitqueues and
 it's a very simple patch and it fixes the problem (it also gives LIFO
 to accept with the downside that it needs to do an O(N) browse on the
 waitqueue before doing the exclusive wakeup compared to 2.4.x that does
 the wake-one task selection in O(1) if everybody is sleeping in accept, but it
 does that FIFO unfortunately).

Andrea,  the 2.4 wakeup code is only O(1) if called from process
context. For networking, where it is called from softirq context
it is O(N).

The waitqueue patch which I did for kumon@fujitsu does O(1) from
all contexts as well as LIFO and the results are fairly clear.

Here the 'server' just forks a bunch of children which sit
in accept().  The 'client' opens and closes the connection
as fast as it can, 10,000 times. All processes run on localhost:

2.4.0-test10-SMP on UP
==


Servers Unpatched   Patched
(secs)  (secs)

  1 6   6
  2 6   6
  3 6   6
  4 6   6
  100   7   6
  200   7   6
  400   9   6
  600   10  6
  1000  13  6.6


2.4.0-test10-SMP on SMP (2 x x86)
=

Servers Unpatched   Patched
  1 1.1 1.1
  2 1.2 1.3
  3 1.2 1.3
  4 1.2 1.3
  100   1.8 1.3
  200   2.3 1.3
  400   3.2 1.3
  600   4.1 1.3
  1000  6.0 1.3

Easy to explain.  5 seconds / 1000 servers / 10,000 iterations
is 500nSecs per loop in __wake_up_common.  I figure that loop
has three misses and probably a TLB reload.

However the Apache performance was increased by 10% irrespective
of the number of servers - from 3 to 150.  Presumably the effect
of the LIFO.

Frankly, the fact that we fall from 9100 conn/sec to 5500 when
there are 100 processes waiting on a socket doesn't seem
particularly important to me - this is a mistuned server.
I suggest that we not sweat it.

But the patch is a better implementation, so expect to hear more
about it when the time is appropriate :)  (It currently does
LIFO for semaphores as well which, as you have explained is
wrong.   Will fix.)

 The real problem that DaveM knows well is that TCP in 2.2.x will end doing
 three wakeups every time the socket moves from LISTEN to ESTABLISHED state, so
 it was really doing a wake-three not a wake-one :). So the brainer part
 is to fix TCP and not the scheduler/waitqueue part.

hmm.. Has this been changed in 2.4?  The performance is the same.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Andrew Morton


Alan Cox wrote:
 
  Even 2.2.x can be fixed to do the wake-one for accept(), if required.
 
 Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
 try and backport all the mechanism. I think for 2.2 using the semaphore is a
 good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed

It's a 16-liner!  I'll cheerfully admit that this patch
may be completely broken, but hey, it's free.  I suggest
that _something_ has to be done for 2.2 now, because
Apache has switched to unserialised accept().


More columns of numbers...

Apache connection rate.  2.2.18-pre19-UP, uniprocessor:

Servers Unpatched  (conn/sec)   Patched (conn/sec)
  3   371   375
  10  361   374
  30  361   374
  150 112   351


Same test as the last email - minimal connect/close to
localhost:

Servers Unpatched   Patched
(waiters in accept())   conn/secconn/sec

  1 16702000
  2 670 670
  3 470 470
  4 470 1670
  5 470 470
  6 16701670
  7 16701670
  8 14301670
  9 14301670
  1014301670
  50590 1430
  100   250 1250
  200   117 1110


The fact that the throughput is 3-4 time worse for 2, 3, 4 and 5
server processes is completely wierd.  Perhaps some strange miss
pattern, but it doesn't do it on 2.4.  I'll dump this problem
onto the netdev list, see if anyone has any bright ideas.





--- linux-2.2.18-pre19/include/linux/sched.hSun Nov  5 11:46:54 2000
+++ linux-akpm/include/linux/sched.hMon Nov  6 22:01:51 2000
@@ -79,6 +79,7 @@
 #define TASK_ZOMBIE4
 #define TASK_STOPPED   8
 #define TASK_SWAPPING  16
+#define TASK_EXCLUSIVE 32
 
 /*
  * Scheduling policies
@@ -251,6 +252,7 @@
struct task_struct *next_task, *prev_task;
struct task_struct *next_run,  *prev_run;
 
+   unsigned int task_exclusive;/* task wants wake-one semantics in 
+__wake_up() */
 /* task state */
struct linux_binfmt *binfmt;
int exit_code, exit_signal;
@@ -370,6 +372,7 @@
 /* counter */  DEF_PRIORITY,DEF_PRIORITY,0, \
 /* SMP */  0,0,0,-1, \
 /* schedlink */init_task,init_task, init_task, init_task, \
+/* task_exclusive */ 0, \
 /* binfmt */   NULL, \
 /* ec,brk... */0,0,0,0,0,0, \
 /* pid etc.. */0,0,0,0,0, \
@@ -496,8 +499,8 @@
signed long timeout));
 extern void FASTCALL(wake_up_process(struct task_struct * tsk));
 
-#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | 
TASK_INTERRUPTIBLE)
-#define wake_up_interruptible(x)   __wake_up((x),TASK_INTERRUPTIBLE)
+#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | 
+TASK_INTERRUPTIBLE | TASK_EXCLUSIVE)
+#define wake_up_interruptible(x)   __wake_up((x),TASK_INTERRUPTIBLE | 
+TASK_EXCLUSIVE)
 
 #define __set_current_state(state_value)   do { current-state = state_value; } 
while (0)
 #ifdef __SMP__
--- linux-2.2.18-pre19/kernel/sched.c   Sun Nov  5 11:46:54 2000
+++ linux-akpm/kernel/sched.c   Mon Nov  6 21:51:57 2000
@@ -892,6 +892,7 @@
 {
struct task_struct *p;
struct wait_queue *head, *next;
+   unsigned int done_exclusive, do_exclusive;
 
 if (!q)
goto out;
@@ -906,10 +907,18 @@
if (!next)
goto out_unlock;
 
+   done_exclusive = 0;
+   do_exclusive = mode  TASK_EXCLUSIVE;
while (next != head) {
p = next-task;
next = next-next;
if (p-state  mode) {
+   if (do_exclusive  p-task_exclusive) {
+   if (done_exclusive)
+   continue;
+   done_exclusive = 1;
+   }
+
/*
 * We can drop the read-lock early if this
 * is the only/last process.
--- linux-2.2.18-pre19/net/ipv4/tcp.c   Sun Nov  5 11:46:54 2000
+++ linux-akpm/net/ipv4/tcp.c   Mon Nov  6 21:26:48 2000
@@ -1619,6 +1619,7 @@
struct wait_queue wait = { current, NULL };
struct open_request *req;
 
+   current-task_exclusive = 1;
add_wait_queue(sk-sleep, wait);
for (;;) {
current-state = TASK_INTERRUPTIBLE;
@@ -1632,6 +1633,8 @@

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Andrea Arcangeli


On Tue, Nov 07, 2000 at 01:27:07AM +1100, Andrew Morton wrote:
 context. For networking, where it is called from softirq context
 it is O(N).

Yes, the heuristic that runs from irqs has an O(N) worst case in wakeup.
But the current implementation is silly because the best case could run 
without browsing the whole queue. This is an untested fix for such heuristic
implementation complexity problem:

--- 2.4.0-test10/kernel/sched.c.~1~ Thu Nov  2 20:59:05 2000
+++ 2.4.0-test10/kernel/sched.c Mon Nov  6 16:44:07 2000
@@ -742,9 +742,10 @@
if (irq  (state  mode  TASK_EXCLUSIVE)) {
if (!best_exclusive)
best_exclusive = p;
-   else if ((p-processor == best_cpu) 
-   (best_exclusive-processor != best_cpu))
-   best_exclusive = p;
+   if (p-processor == best_cpu) {
+   best_exclusive = p;
+   break;
+   }
} else {
if (sync)
wake_up_process_synchronous(p);

This also ensure a FIFO order between the threads that are affine to the
current CPU (while previously it was first LIFO for the affine threads and then
FIFO if there was none affine thread and that doesn't make sense, since for
seldom cases like accept we want LIFO in both cases and for everything else
like semaphores we need FIFO and just allowing affine tasks to pass the other
threads can generate starvation in theory... (and that heuristic in infact
not fair at all with tasks running in other CPUs while using irq affinity, and
on alpha we are likley to use it most of the time to avoid interrupting all the
CPUs).

 hmm.. Has this been changed in 2.4?  The performance is the same.

It is fixed in 2.4.x. Anyways it probably worth to do the wake-one in 2.2.x
too even if it's a wake-three in the first stage.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Alan Cox


 It's a 16-liner!  I'll cheerfully admit that this patch
 may be completely broken, but hey, it's free.  I suggest
 that _something_ has to be done for 2.2 now, because
 Apache has switched to unserialised accept().

Interesting

 The fact that the throughput is 3-4 time worse for 2, 3, 4 and 5
 server processes is completely wierd.  Perhaps some strange miss
 pattern, but it doesn't do it on 2.4.  I'll dump this problem
 onto the netdev list, see if anyone has any bright ideas.

That would be consistent with the fact that thttpd is single threaded and
kicks apache for performance in 2.2 (less so 2.4!)

 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-06 Thread Linus Torvalds




On Tue, 7 Nov 2000, Andrew Morton wrote:

 Alan Cox wrote:
  
   Even 2.2.x can be fixed to do the wake-one for accept(), if required.
  
  Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
  try and backport all the mechanism. I think for 2.2 using the semaphore is a
  good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed
 
 It's a 16-liner!  I'll cheerfully admit that this patch
 may be completely broken, but hey, it's free.  I suggest
 that _something_ has to be done for 2.2 now, because
 Apache has switched to unserialised accept().

This is why I'd love to _not_ see silly work-arounds in apache: we
obviously _can_ fix the places where our performance sucks, but only if we
don't have other band-aids hiding the true issues.

For example, with a file-locking apache, we'd have to fix the (noticeably
harder) file locking thing to be wake-one instead, and even then we'd
never be able to do as well as something that gets the same wake-one thing
without the two extra system calls.

The patch looks superficially fine to me, although it does seem to add
another cache-line to the wakeup setup - it migth be worth-while to have
the exclusive state closer. But maybe I just didn't count right.

Linus


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-05 Thread Alan Cox


> oh, someone reminded me of the other reason sysvsems suck:  a cgi can grab
> the semaphore and hold it, causing a DoS.  of course folks could, and
> should use suexec/cgiwrap to avoid this.

The same cgi can killall -STOP httpd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-05 Thread dean gaudet

the numbers didn't look that bad for the small numbers of concurrent
clients on 2.2... a few % slower without the serialisation.  compared to
orders of magnitude slower with large numbers of concurrent client.

oh, someone reminded me of the other reason sysvsems suck:  a cgi can grab
the semaphore and hold it, causing a DoS.  of course folks could, and
should use suexec/cgiwrap to avoid this.

-dean

On Sat, 4 Nov 2000, Alan Cox wrote:

> > Even 2.2.x can be fixed to do the wake-one for accept(), if required. 
> 
> Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
> try and backport all the mechanism. I think for 2.2 using the semaphore is a 
> good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-05 Thread Andrea Arcangeli

On Sat, Nov 04, 2000 at 09:22:58AM -0800, Linus Torvalds wrote:
> We don't need to backport of the full exclusive wait queues: we could do
> the equivalent of the semaphore inside the kernel around just accept(). It
> wouldn't be a generic thing, but it would fix the specific case of
> accept().

The first wake-one patch floating around was against 2.2.x waitqueues and
it's a very simple patch and it fixes the problem (it also gives LIFO
to accept with the downside that it needs to do an O(N) browse on the
waitqueue before doing the exclusive wakeup compared to 2.4.x that does
the wake-one task selection in O(1) if everybody is sleeping in accept, but it
does that FIFO unfortunately).

The real problem that DaveM knows well is that TCP in 2.2.x will end doing
three wakeups every time the socket moves from LISTEN to ESTABLISHED state, so
it was really doing a wake-three not a wake-one :). So the brainer part
is to fix TCP and not the scheduler/waitqueue part.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-05 Thread Andrea Arcangeli


On Sat, Nov 04, 2000 at 09:22:58AM -0800, Linus Torvalds wrote:
 We don't need to backport of the full exclusive wait queues: we could do
 the equivalent of the semaphore inside the kernel around just accept(). It
 wouldn't be a generic thing, but it would fix the specific case of
 accept().

The first wake-one patch floating around was against 2.2.x waitqueues and
it's a very simple patch and it fixes the problem (it also gives LIFO
to accept with the downside that it needs to do an O(N) browse on the
waitqueue before doing the exclusive wakeup compared to 2.4.x that does
the wake-one task selection in O(1) if everybody is sleeping in accept, but it
does that FIFO unfortunately).

The real problem that DaveM knows well is that TCP in 2.2.x will end doing
three wakeups every time the socket moves from LISTEN to ESTABLISHED state, so
it was really doing a wake-three not a wake-one :). So the brainer part
is to fix TCP and not the scheduler/waitqueue part.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-05 Thread dean gaudet


the numbers didn't look that bad for the small numbers of concurrent
clients on 2.2... a few % slower without the serialisation.  compared to
orders of magnitude slower with large numbers of concurrent client.

oh, someone reminded me of the other reason sysvsems suck:  a cgi can grab
the semaphore and hold it, causing a DoS.  of course folks could, and
should use suexec/cgiwrap to avoid this.

-dean

On Sat, 4 Nov 2000, Alan Cox wrote:

  Even 2.2.x can be fixed to do the wake-one for accept(), if required. 
 
 Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
 try and backport all the mechanism. I think for 2.2 using the semaphore is a 
 good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed
 -
 To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
 the body of a message to [EMAIL PROTECTED]
 Please read the FAQ at http://www.tux.org/lkml/
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

2000-11-04 Thread dean gaudet

On Sat, 4 Nov 2000, Alan Cox wrote:

> > sysv semaphores have a very unfortunate negative feature -- if the admin
> > kill -9's the server (impatient admins do this all the time) then you end
> > up leaving a semaphore lying around.  sysvsem don't have the usual unix
> 
> Umm they have SEM_UNDO. Its a case of deeper magic

we use SEM_UNDO, that's not quite what i was worrying about.  i was
worrying about leaving a stale semaphore in the global semaphore table.

IPC_RMID causes the semaphore to be destroyed immediately, rather than
after all the users are done.

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

2000-11-04 Thread Dave Wagner


Linus Torvalds wrote:
>
> No.
>
> Please use unserialized accept() _always_, because we can fix that.
>
> Even 2.2.x can be fixed to do the wake-one for accept(), if required.
> It's not going to be any worse than the current apache config, and
> basically the less games apache plays, the better the kernel can try to
> accomodate what apache _really_ wants done.  When playing games, you
> hide what you really want done, and suddenly kernel profiles etc end up
> being completely useless, because they no longer give the data we needed
> to fix the problem.
>
> Basically, the whole serialization crap is all about the Apache people
> saying the equivalent of "the OS does a bad job on something we consider
> to be incredibly important, so we do something else instead to hide it".
>
> And regardless of _what_ workaround Apache does, whether it is the sucky
> fcntl() thing or using SysV semaphores, it's going to hide the real
> issue and mean that it never gets fixed properly.
>
> And in the end it will result in really really bad performance.
>
> Instead, if apache had just done the thing it wanted to do in the first
> place, the wake-one accept() semantics would have happened a hell of a
> lot earlier.
>
> Now it's there in 2.4.x. Please use it. PLEASE PLEASE PLEASE don't play
> games trying to outsmart the OS, it will just hurt Apache in the long run.
>

But how would you suggest people using 2.2 configure their
Apache?  Will flock/fcntl or semaphores perform better (albeit
"uglier") than unserialized accept()'s in 2.2.  I'm willing
and expecting to rebuild apache when 2.4 is released.  I do
not, though, want to leave performance on the table today,
just so I can say that my apache binary is 2.4-ready.

Do any of the apache serialization methods (flock/fcntl/semops)
have any performance improvement over unserialized accept() with
Apache running on a 2.2 kernel?

Dave Wagner

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

2000-11-04 Thread Alan Cox


> sysv semaphores have a very unfortunate negative feature -- if the admin
> kill -9's the server (impatient admins do this all the time) then you end
> up leaving a semaphore lying around.  sysvsem don't have the usual unix

Umm they have SEM_UNDO. Its a case of deeper magic


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

2000-11-04 Thread Alan Cox


> > Instead, if apache had just done the thing it wanted to do in the first
> > place, the wake-one accept() semantics would have happened a hell of a
> > lot earlier.
> 
> counter-example:  freebsd had wake-one semantics a few years before linux.

And Im sure apache authors can use the utsname() syscall 8)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strangeperformance behavior of 2.4.0-test9)

2000-11-04 Thread dean gaudet

On Sat, 4 Nov 2000, Andrew Morton wrote:

> Dean,
> 
> neither flock() nor fcntl() serialisation are effective
> on linux 2.2 or linux 2.4.

i have to admit the last time i timed any of the methods on linux was in
2.0.x days.  thanks for the updated data!

> For kernel 2.2 I recommend that Apache consider using
> sysv semaphores for serialisation. They use wake-one. 

sysv semaphores have a very unfortunate negative feature -- if the admin
kill -9's the server (impatient admins do this all the time) then you end
up leaving a semaphore lying around.  sysvsem don't have the usual unix
unlink semantics.  actually flock has the same problem... which is why i
generally preferred fcntl whenever it was a performance wash, as it was
back in 2.0.x days.

however given the vast performance difference i think it warrants the
change.  i'll include your results with the commit.

> For kernel 2.4 I recommend that Apache use unserialised
> accept.

per linus' request i'll unserialise 2.2 as well.

i'll leave 2.0.x settings alone.

(oh yeah, and compile-time only detection.)

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strangeperformance behavior of 2.4.0-test9)

2000-11-04 Thread dean gaudet

On Fri, 3 Nov 2000, Linus Torvalds wrote:

> Please use unserialized accept() _always_, because we can fix that.

i can unserialise the single socket case, but the multiple socket case is
not so simple.

the executive summary is that when you've got multiple sockets you have to
use select().  select is necessarily wake-all.  remember there's N
children trying to do select/accept.  if the listening socket is
non-blocking then you spin in N-1 children; if it's blocking then you
starve other sockets.

see http://www.apache.org/docs/misc/perf-tuning.html, search for "multiple
sockets" for my full analysis of the problem.

> Instead, if apache had just done the thing it wanted to do in the first
> place, the wake-one accept() semantics would have happened a hell of a
> lot earlier.

counter-example:  freebsd had wake-one semantics a few years before linux.

revision 1.237
date: 1998/09/29 01:22:57;  author: marc;  state: Exp;  lines: +1 -0
Unserialized accept() should be safe (in all versions) and efficient
(in anything vaguely recent) on FreeBSD.

ok, we done finger pointing? :)

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-04 Thread Linus Torvalds

On Sat, 4 Nov 2000, Alan Cox wrote:
>
> > Even 2.2.x can be fixed to do the wake-one for accept(), if required. 
> 
> Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
> try and backport all the mechanism. I think for 2.2 using the semaphore is a 
> good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed

We don't need to backport of the full exclusive wait queues: we could do
the equivalent of the semaphore inside the kernel around just accept(). It
wouldn't be a generic thing, but it would fix the specific case of
accept().

Otherwise we're going to have old binaries of apache lying around forever
that do the wrong thing..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-04 Thread Alan Cox


> Even 2.2.x can be fixed to do the wake-one for accept(), if required. 

Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
try and backport all the mechanism. I think for 2.2 using the semaphore is a 
good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strangeperformance behavior of 2.4.0-test9)

2000-11-04 Thread dean gaudet


On Sat, 4 Nov 2000, Andrew Morton wrote:

 Dean,
 
 neither flock() nor fcntl() serialisation are effective
 on linux 2.2 or linux 2.4.

i have to admit the last time i timed any of the methods on linux was in
2.0.x days.  thanks for the updated data!

 For kernel 2.2 I recommend that Apache consider using
 sysv semaphores for serialisation. They use wake-one. 

sysv semaphores have a very unfortunate negative feature -- if the admin
kill -9's the server (impatient admins do this all the time) then you end
up leaving a semaphore lying around.  sysvsem don't have the usual unix
unlink semantics.  actually flock has the same problem... which is why i
generally preferred fcntl whenever it was a performance wash, as it was
back in 2.0.x days.

however given the vast performance difference i think it warrants the
change.  i'll include your results with the commit.

 For kernel 2.4 I recommend that Apache use unserialised
 accept.

per linus' request i'll unserialise 2.2 as well.

i'll leave 2.0.x settings alone.

(oh yeah, and compile-time only detection.)

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

2000-11-04 Thread Alan Cox


 sysv semaphores have a very unfortunate negative feature -- if the admin
 kill -9's the server (impatient admins do this all the time) then you end
 up leaving a semaphore lying around.  sysvsem don't have the usual unix

Umm they have SEM_UNDO. Its a case of deeper magic


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

2000-11-04 Thread dean gaudet


On Sat, 4 Nov 2000, Alan Cox wrote:

  sysv semaphores have a very unfortunate negative feature -- if the admin
  kill -9's the server (impatient admins do this all the time) then you end
  up leaving a semaphore lying around.  sysvsem don't have the usual unix
 
 Umm they have SEM_UNDO. Its a case of deeper magic

we use SEM_UNDO, that's not quite what i was worrying about.  i was
worrying about leaving a stale semaphore in the global semaphore table.

IPC_RMID causes the semaphore to be destroyed immediately, rather than
after all the users are done.

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-04 Thread Alan Cox


 Even 2.2.x can be fixed to do the wake-one for accept(), if required. 

Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
try and backport all the mechanism. I think for 2.2 using the semaphore is a 
good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of

2000-11-04 Thread Linus Torvalds




On Sat, 4 Nov 2000, Alan Cox wrote:

  Even 2.2.x can be fixed to do the wake-one for accept(), if required. 
 
 Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
 try and backport all the mechanism. I think for 2.2 using the semaphore is a 
 good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed

We don't need to backport of the full exclusive wait queues: we could do
the equivalent of the semaphore inside the kernel around just accept(). It
wouldn't be a generic thing, but it would fix the specific case of
accept().

Otherwise we're going to have old binaries of apache lying around forever
that do the wrong thing..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strangeperformance behavior of 2.4.0-test9)

2000-11-04 Thread dean gaudet


On Fri, 3 Nov 2000, Linus Torvalds wrote:

 Please use unserialized accept() _always_, because we can fix that.

i can unserialise the single socket case, but the multiple socket case is
not so simple.

the executive summary is that when you've got multiple sockets you have to
use select().  select is necessarily wake-all.  remember there's N
children trying to do select/accept.  if the listening socket is
non-blocking then you spin in N-1 children; if it's blocking then you
starve other sockets.

see http://www.apache.org/docs/misc/perf-tuning.html, search for "multiple
sockets" for my full analysis of the problem.

 Instead, if apache had just done the thing it wanted to do in the first
 place, the wake-one accept() semantics would have happened a hell of a
 lot earlier.

counter-example:  freebsd had wake-one semantics a few years before linux.

revision 1.237
date: 1998/09/29 01:22:57;  author: marc;  state: Exp;  lines: +1 -0
Unserialized accept() should be safe (in all versions) and efficient
(in anything vaguely recent) on FreeBSD.

ok, we done finger pointing? :)

-dean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

2000-11-04 Thread Alan Cox


  Instead, if apache had just done the thing it wanted to do in the first
  place, the wake-one accept() semantics would have happened a hell of a
  lot earlier.
 
 counter-example:  freebsd had wake-one semantics a few years before linux.

And Im sure apache authors can use the utsname() syscall 8)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

2000-11-04 Thread Dave Wagner


Linus Torvalds wrote:

 No.

 Please use unserialized accept() _always_, because we can fix that.

 Even 2.2.x can be fixed to do the wake-one for accept(), if required.
 It's not going to be any worse than the current apache config, and
 basically the less games apache plays, the better the kernel can try to
 accomodate what apache _really_ wants done.  When playing games, you
 hide what you really want done, and suddenly kernel profiles etc end up
 being completely useless, because they no longer give the data we needed
 to fix the problem.

 Basically, the whole serialization crap is all about the Apache people
 saying the equivalent of "the OS does a bad job on something we consider
 to be incredibly important, so we do something else instead to hide it".

 And regardless of _what_ workaround Apache does, whether it is the sucky
 fcntl() thing or using SysV semaphores, it's going to hide the real
 issue and mean that it never gets fixed properly.

 And in the end it will result in really really bad performance.

 Instead, if apache had just done the thing it wanted to do in the first
 place, the wake-one accept() semantics would have happened a hell of a
 lot earlier.

 Now it's there in 2.4.x. Please use it. PLEASE PLEASE PLEASE don't play
 games trying to outsmart the OS, it will just hurt Apache in the long run.


But how would you suggest people using 2.2 configure their
Apache?  Will flock/fcntl or semaphores perform better (albeit
"uglier") than unserialized accept()'s in 2.2.  I'm willing
and expecting to rebuild apache when 2.4 is released.  I do
not, though, want to leave performance on the table today,
just so I can say that my apache binary is 2.4-ready.

Do any of the apache serialization methods (flock/fcntl/semops)
have any performance improvement over unserialized accept() with
Apache running on a 2.2 kernel?

Dave Wagner

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

2000-11-03 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Andrew Morton  <[EMAIL PROTECTED]> wrote:
>
>neither flock() nor fcntl() serialisation are effective
>on linux 2.2 or linux 2.4.  This is because the file
>locking code still wakes up _all_ waiters.  In my testing
>with fcntl serialisation I have seen a single Apache
>instance get woken and put back to sleep 1,500 times
>before the poor thing actually got to service a request.

Indeed.

flock() is the absolute worst case, and always has been.  I guess nobody
every actually bothered to benchmark it.

>For kernel 2.2 I recommend that Apache consider using
>sysv semaphores for serialisation. They use wake-one. 
>
>For kernel 2.4 I recommend that Apache use unserialised
>accept.

No.

Please use unserialized accept() _always_, because we can fix that. 

Even 2.2.x can be fixed to do the wake-one for accept(), if required. 
It's not going to be any worse than the current apache config, and
basically the less games apache plays, the better the kernel can try to
accomodate what apache _really_ wants done.  When playing games, you
hide what you really want done, and suddenly kernel profiles etc end up
being completely useless, because they no longer give the data we needed
to fix the problem. 

Basically, the whole serialization crap is all about the Apache people
saying the equivalent of "the OS does a bad job on something we consider
to be incredibly important, so we do something else instead to hide it".

And regardless of _what_ workaround Apache does, whether it is the sucky
fcntl() thing or using SysV semaphores, it's going to hide the real
issue and mean that it never gets fixed properly.

And in the end it will result in really really bad performance. 

Instead, if apache had just done the thing it wanted to do in the first
place, the wake-one accept() semantics would have happened a hell of a
lot earlier. 

Now it's there in 2.4.x. Please use it. PLEASE PLEASE PLEASE don't play
games trying to outsmart the OS, it will just hurt Apache in the long run.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

2000-11-03 Thread Andrew Morton


dean gaudet wrote:
> 
> On Tue, 31 Oct 2000, Andrew Morton wrote:
> 
> > Dean,  it looks like the same problem will occur with flock()-based
> > serialisation.  Does Apache/Linux ever use that option?
> 
> from apache/src/include/ap_config.h in the linux section there's
> this:
> 
> /* flock is faster ... but hasn't been tested on 1.x systems */
> /* PR#3531 indicates flock() may not be stable, probably depends on
>  * kernel version.  Go back to using fcntl, but provide a way for
>  * folks to tweak their Configuration to get flock.
>  */
> #ifndef USE_FLOCK_SERIALIZED_ACCEPT
> #define USE_FCNTL_SERIALIZED_ACCEPT
> #endif
> 
> so you should be able to -DUSE_FLOCK_SERIALIZED_ACCEPT to try it.
> 

Dean,

neither flock() nor fcntl() serialisation are effective
on linux 2.2 or linux 2.4.  This is because the file
locking code still wakes up _all_ waiters.  In my testing
with fcntl serialisation I have seen a single Apache
instance get woken and put back to sleep 1,500 times
before the poor thing actually got to service a request.

For kernel 2.2 I recommend that Apache consider using
sysv semaphores for serialisation. They use wake-one. 

For kernel 2.4 I recommend that Apache use unserialised
accept.

This means that you'll need to make a runtime decision
on whether to use unserialised, serialised with sysv or
serialised with fcntl (if sysv IPC isn't installed).


In my testing I launched 3, 10, 30 or 150 Apache instances and then used

httperf --num-conns=2000 --num-calls=1 --uri=/index.html

to open, use and close 2000 connections.

Here are the (terrible) results on 2.4 SMP with fcntl
serialisation:

fcntl accept, 3 servers, vanilla: 938.0 req/s
fcntl accept, 30 servers, vanilla: 697.1 req/s
fcntl accept, 150 servers, vanilla: 99.9 req/s (sic)

2.4 SMP with no serialisation:

unserialised accept, 3 servers, vanilla: 1049.0 req/s
unserialised accept, 10 servers, vanilla: 968.8 req/s
unserialised accept, 30 servers, vanilla: 1040.2 req/s
unserialised accept, 150 servers, vanilla: 1091.4 req/s

2.4 SMP with no serialisation and my patch to the
wakeup and waitqueue code:

unserialised accept, 3 servers, task_exclusive: 1117.4 req/s
unserialised accept, 10 servers, task_exclusive: 1118.6 req/s
unserialised accept, 30 servers, task_exclusive: 1105.6 req/s
unserialised accept, 150 servers, task_exclusive: 1077.1 req/s

2.4 SMP with sysv semaphore serialisation:

sysvsem accept, 3 servers: 1001.2 req/s
sysvsem accept, 10 servers: 1061.0 req/s
sysvsem accept, 30 servers: 1021.2 req/s
sysvsem accept, 150 servers: 943.6 req/s

2.2.14 SMP with fcntl serialisation:

fcntl accept, 3 servers: 1053.8 req/s
fcntl accept, 10 servers: 996.2 req/s
fcntl accept, 30 servers: 934.3 req/s
fcntl accept, 150 servers: 141.4 req/s(sic)

2.2.14 SMP with no serialisation:

unserialised accept, 3 servers: 1039.9 req/s
unserialised accept, 10 servers: 983.1 req/s
unserialised accept, 30 servers: 775.7 req/s
unserialised accept, 150 servers: 220.7 req/s (sic)

2.2.14 SMP with sysv sem serialisation:

sysv accept, 3 servers: 932.2 req/s
sysv accept, 10 servers: 910.6 req/s
sysv accept, 30 servers: 1026.6 req/s
sysv accept, 150 servers: 927.2 req/s


Note that the first test (2.4 with fcntl serialisation) was
with an unpatched 2.4.0-test10-pre5.  Once the simple
flock.patch is applied, the performance with 150 servers
doubles.  But it's still sucky.  The flock.patch change
is effective in increasing scalability wiht a large number
of CPUs, not a large number of httpd's.

Here's the silly patch I used to turn on sysv sem serialisation
in Apache.  There's probably a better way than this :)

--- apache_1.3.14.orig/src/main/http_main.c Fri Sep 29 00:32:36 2000
+++ apache_1.3.14/src/main/http_main.c  Sat Nov  4 15:01:41 2000
@@ -172,6 +172,13 @@
 
 #include "explain.h"
 
+/* AKPM */
+#if 1
+#define NEED_UNION_SEMUN
+#define USE_SYSVSEM_SERIALIZED_ACCEPT
+#define USE_FCNTL_SERIALIZED_ACCEPT
+#endif
+
 #if !defined(max)
 #define max(a,b)(a > b ? a : b)
 #endif
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-11-03 Thread Andrew Morton

[EMAIL PROTECTED] wrote:
> 
> Andrew Morton writes:
>  > This patch is a moderate rewrite of __wake_up_common.  I'd be
>  > interested in seeing how much difference it makes to the
>  > performance of Apache when the file-locking serialisation is
>  > disabled.
>  > - It implements last-in/first-out semantics for waking
>  >   TASK_EXCLUSIVE tasks.
>  > - It fixes what was surely a bug wherein __wake_up_common
>  >   scans the entire wait queue even when it has found the
>  >   task which it wants to run.
> 
> We've measured Apache w/ and w/o serialize_accept on
> several kernel configurations.
> 
> Apache compilation settings are those two:
> * without option (conventional setting)
>   (w/ serialize)
> * with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT
>   (w/o serialize)
> 
> We compared the performance of distributed binary and our binary with
> default setting, the performance is almost equivalent. All following
> data are based on our binaries.
> 
> w/ serializew/o serialize
> 2.40-t10-pre5   22375358
> 2.40-t10-pre5+P252535355**
> 2.40-t10-pre5+P3--- NG
> 
> ** with this configuration, once we observed the machine completely
> deadlocked with the following message:
> 
> Unable to handle kernel NULL pointer dereferenceNMI watchdog detected LOCKUP on CPU1.

That's not good.  `P2' just puts a couple of lock_kernel's into
the file locking code, and apache wasn't using that code when
your machine died.  So it looks like dropping the serialisation
has tickled a problem in test10-pre5.

> 
> 2.4.0-test10-pre5 with the LIFO patch (P3), we can't get the values.
> It always deadlock with same manner.  Perhaps, it failed to get
> console lock then deadlock.

mm..  Relying on the task_struct.state value in __wake_up_common() is
tricky - I guess you hit some races.  Sorry.  I ended up having to
rewrite the guts of the waitqueue and wakeup code to be able to
reliably avoid scanning the entire wait queue every time we wake an
exclusive task.

There are two patches attached here.

flock.patch: this simply puts the lock_kernel calls around the
file locking code.  We already have figures for that (it doubles
Apache's connection rate on uniprocessor, but it's still
_terrible_ when you have 100 servers).

task_exclusive.patch:  this is a redesign of the current waitqueue
and wakeup code.  Again, it provides LIFO handling of exclusive
tasks and avoids a scan of the entire waitqueue.  Hopefully it won't
lock up your machine this time!

I'd be interested in seeing the results on both serialised and
unserialised Apache when these are applied, please.

Here are my numbers.  This is with
httperf --num-conns=2000 --num-calls=1 --uri=/index.html
Basically, there's not a lot of difference here because
the test machine only has 2 CPUs.

   #Serversunpatched   patched
   conn/secconn/sec

2xCPU, serialised

3 938.0 956.2
10  943.2
30697.1 737.9
150   99.9  196.2(sic)

2xCPU, unserialised

3 1049.01117.4
10 968.81118.6
301040.21105.6
150   1091.41077.1

--- linux-2.4.0-test10-pre5/fs/fcntl.c  Sun Oct 15 01:27:45 2000
+++ linux-akpm/fs/fcntl.c   Fri Nov  3 20:42:47 2000
@@ -254,11 +254,15 @@
unlock_kernel();
break;
case F_GETLK:
+   lock_kernel();
err = fcntl_getlk(fd, (struct flock *) arg);
+   unlock_kernel();
break;
case F_SETLK:
case F_SETLKW:
+   lock_kernel();
err = fcntl_setlk(fd, cmd, (struct flock *) arg);
+   unlock_kernel();
break;
case F_GETOWN:
/*

--- linux-2.4.0-test10-pre5/include/linux/wait.hSun Oct 15 01:27:46 2000
+++ linux-akpm/include/linux/wait.h Fri Nov  3 21:14:30 2000
@@ -14,45 +14,24 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 

 /*
- * Temporary debugging help until all code is converted to the new
- * waitqueue usage.
+ * Debugging control.  Slow and useful.
  */
 #define WAITQUEUE_DEBUG 0

-#if WAITQUEUE_DEBUG
-extern int printk(const char *fmt, ...);
-#define WQ_BUG() do { \
-   printk("wq bug, forcing oops.\n"); \
-   BUG(); \
-} while (0)
-
-#define CHECK_MAGIC(x) if (x != (long)&(x)) \
-   { printk("bad magic %lx (should be %lx), ", (long)x, (long)&(x)); WQ_BUG(); }
-
-#define CHECK_MAGIC_WQHEAD(x) do { \
-   if (x->__magic != (long)&(x->__magic)) { \
-   printk("bad magic %lx (should be %lx, creator %lx),

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

2000-11-03 Thread Linus Torvalds


In article [EMAIL PROTECTED],
Andrew Morton  [EMAIL PROTECTED] wrote:

neither flock() nor fcntl() serialisation are effective
on linux 2.2 or linux 2.4.  This is because the file
locking code still wakes up _all_ waiters.  In my testing
with fcntl serialisation I have seen a single Apache
instance get woken and put back to sleep 1,500 times
before the poor thing actually got to service a request.

Indeed.

flock() is the absolute worst case, and always has been.  I guess nobody
every actually bothered to benchmark it.

For kernel 2.2 I recommend that Apache consider using
sysv semaphores for serialisation. They use wake-one. 

For kernel 2.4 I recommend that Apache use unserialised
accept.

No.

Please use unserialized accept() _always_, because we can fix that. 

Even 2.2.x can be fixed to do the wake-one for accept(), if required. 
It's not going to be any worse than the current apache config, and
basically the less games apache plays, the better the kernel can try to
accomodate what apache _really_ wants done.  When playing games, you
hide what you really want done, and suddenly kernel profiles etc end up
being completely useless, because they no longer give the data we needed
to fix the problem. 

Basically, the whole serialization crap is all about the Apache people
saying the equivalent of "the OS does a bad job on something we consider
to be incredibly important, so we do something else instead to hide it".

And regardless of _what_ workaround Apache does, whether it is the sucky
fcntl() thing or using SysV semaphores, it's going to hide the real
issue and mean that it never gets fixed properly.

And in the end it will result in really really bad performance. 

Instead, if apache had just done the thing it wanted to do in the first
place, the wake-one accept() semantics would have happened a hell of a
lot earlier. 

Now it's there in 2.4.x. Please use it. PLEASE PLEASE PLEASE don't play
games trying to outsmart the OS, it will just hurt Apache in the long run.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-11-03 Thread Andrew Morton


[EMAIL PROTECTED] wrote:
 
 Andrew Morton writes:
   This patch is a moderate rewrite of __wake_up_common.  I'd be
   interested in seeing how much difference it makes to the
   performance of Apache when the file-locking serialisation is
   disabled.
   - It implements last-in/first-out semantics for waking
 TASK_EXCLUSIVE tasks.
   - It fixes what was surely a bug wherein __wake_up_common
 scans the entire wait queue even when it has found the
 task which it wants to run.
 
 We've measured Apache w/ and w/o serialize_accept on
 several kernel configurations.
 
 Apache compilation settings are those two:
 * without option (conventional setting)
   (w/ serialize)
 * with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT
   (w/o serialize)
 
 We compared the performance of distributed binary and our binary with
 default setting, the performance is almost equivalent. All following
 data are based on our binaries.
 
 w/ serializew/o serialize
 2.40-t10-pre5   22375358
 2.40-t10-pre5+P252535355**
 2.40-t10-pre5+P3--- NG
 
 ** with this configuration, once we observed the machine completely
 deadlocked with the following message:
 
 Unable to handle kernel NULL pointer dereferenceNMI watchdog detected LOCKUP on CPU1.

That's not good.  `P2' just puts a couple of lock_kernel's into
the file locking code, and apache wasn't using that code when
your machine died.  So it looks like dropping the serialisation
has tickled a problem in test10-pre5.

 
 2.4.0-test10-pre5 with the LIFO patch (P3), we can't get the values.
 It always deadlock with same manner.  Perhaps, it failed to get
 console lock then deadlock.

mm..  Relying on the task_struct.state value in __wake_up_common() is
tricky - I guess you hit some races.  Sorry.  I ended up having to
rewrite the guts of the waitqueue and wakeup code to be able to
reliably avoid scanning the entire wait queue every time we wake an
exclusive task.

There are two patches attached here.

flock.patch: this simply puts the lock_kernel calls around the
file locking code.  We already have figures for that (it doubles
Apache's connection rate on uniprocessor, but it's still
_terrible_ when you have 100 servers).

task_exclusive.patch:  this is a redesign of the current waitqueue
and wakeup code.  Again, it provides LIFO handling of exclusive
tasks and avoids a scan of the entire waitqueue.  Hopefully it won't
lock up your machine this time!

I'd be interested in seeing the results on both serialised and
unserialised Apache when these are applied, please.

Here are my numbers.  This is with
httperf --num-conns=2000 --num-calls=1 --uri=/index.html
Basically, there's not a lot of difference here because
the test machine only has 2 CPUs.

   #Serversunpatched   patched
   conn/secconn/sec

2xCPU, serialised

3 938.0 956.2
10  943.2
30697.1 737.9
150   99.9  196.2(sic)

2xCPU, unserialised

3 1049.01117.4
10 968.81118.6
301040.21105.6
150   1091.41077.1

--- linux-2.4.0-test10-pre5/fs/fcntl.c  Sun Oct 15 01:27:45 2000
+++ linux-akpm/fs/fcntl.c   Fri Nov  3 20:42:47 2000
@@ -254,11 +254,15 @@
unlock_kernel();
break;
case F_GETLK:
+   lock_kernel();
err = fcntl_getlk(fd, (struct flock *) arg);
+   unlock_kernel();
break;
case F_SETLK:
case F_SETLKW:
+   lock_kernel();
err = fcntl_setlk(fd, cmd, (struct flock *) arg);
+   unlock_kernel();
break;
case F_GETOWN:
/*





--- linux-2.4.0-test10-pre5/include/linux/wait.hSun Oct 15 01:27:46 2000
+++ linux-akpm/include/linux/wait.h Fri Nov  3 21:14:30 2000
@@ -14,45 +14,24 @@
 #include linux/list.h
 #include linux/stddef.h
 #include linux/spinlock.h
+#include linux/cache.h
 
 #include asm/page.h
 #include asm/processor.h
 
 /*
- * Temporary debugging help until all code is converted to the new
- * waitqueue usage.
+ * Debugging control.  Slow and useful.
  */
 #define WAITQUEUE_DEBUG 0
 
-#if WAITQUEUE_DEBUG
-extern int printk(const char *fmt, ...);
-#define WQ_BUG() do { \
-   printk("wq bug, forcing oops.\n"); \
-   BUG(); \
-} while (0)
-
-#define CHECK_MAGIC(x) if (x != (long)(x)) \
-   { printk("bad magic %lx (should be %lx), ", (long)x, (long)(x)); WQ_BUG(); }
-
-#define CHECK_MAGIC_WQHEAD(x) do { \
-   if (x-__magic != (long)(x-__magic)) { \
-   printk("bad magic %lx

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

2000-11-03 Thread Andrew Morton


dean gaudet wrote:
 
 On Tue, 31 Oct 2000, Andrew Morton wrote:
 
  Dean,  it looks like the same problem will occur with flock()-based
  serialisation.  Does Apache/Linux ever use that option?
 
 from apache/src/include/ap_config.h in the linux section there's
 this:
 
 /* flock is faster ... but hasn't been tested on 1.x systems */
 /* PR#3531 indicates flock() may not be stable, probably depends on
  * kernel version.  Go back to using fcntl, but provide a way for
  * folks to tweak their Configuration to get flock.
  */
 #ifndef USE_FLOCK_SERIALIZED_ACCEPT
 #define USE_FCNTL_SERIALIZED_ACCEPT
 #endif
 
 so you should be able to -DUSE_FLOCK_SERIALIZED_ACCEPT to try it.
 

Dean,

neither flock() nor fcntl() serialisation are effective
on linux 2.2 or linux 2.4.  This is because the file
locking code still wakes up _all_ waiters.  In my testing
with fcntl serialisation I have seen a single Apache
instance get woken and put back to sleep 1,500 times
before the poor thing actually got to service a request.

For kernel 2.2 I recommend that Apache consider using
sysv semaphores for serialisation. They use wake-one. 

For kernel 2.4 I recommend that Apache use unserialised
accept.

This means that you'll need to make a runtime decision
on whether to use unserialised, serialised with sysv or
serialised with fcntl (if sysv IPC isn't installed).


In my testing I launched 3, 10, 30 or 150 Apache instances and then used

httperf --num-conns=2000 --num-calls=1 --uri=/index.html

to open, use and close 2000 connections.

Here are the (terrible) results on 2.4 SMP with fcntl
serialisation:

fcntl accept, 3 servers, vanilla: 938.0 req/s
fcntl accept, 30 servers, vanilla: 697.1 req/s
fcntl accept, 150 servers, vanilla: 99.9 req/s (sic)

2.4 SMP with no serialisation:

unserialised accept, 3 servers, vanilla: 1049.0 req/s
unserialised accept, 10 servers, vanilla: 968.8 req/s
unserialised accept, 30 servers, vanilla: 1040.2 req/s
unserialised accept, 150 servers, vanilla: 1091.4 req/s

2.4 SMP with no serialisation and my patch to the
wakeup and waitqueue code:

unserialised accept, 3 servers, task_exclusive: 1117.4 req/s
unserialised accept, 10 servers, task_exclusive: 1118.6 req/s
unserialised accept, 30 servers, task_exclusive: 1105.6 req/s
unserialised accept, 150 servers, task_exclusive: 1077.1 req/s

2.4 SMP with sysv semaphore serialisation:

sysvsem accept, 3 servers: 1001.2 req/s
sysvsem accept, 10 servers: 1061.0 req/s
sysvsem accept, 30 servers: 1021.2 req/s
sysvsem accept, 150 servers: 943.6 req/s

2.2.14 SMP with fcntl serialisation:

fcntl accept, 3 servers: 1053.8 req/s
fcntl accept, 10 servers: 996.2 req/s
fcntl accept, 30 servers: 934.3 req/s
fcntl accept, 150 servers: 141.4 req/s(sic)

2.2.14 SMP with no serialisation:

unserialised accept, 3 servers: 1039.9 req/s
unserialised accept, 10 servers: 983.1 req/s
unserialised accept, 30 servers: 775.7 req/s
unserialised accept, 150 servers: 220.7 req/s (sic)

2.2.14 SMP with sysv sem serialisation:

sysv accept, 3 servers: 932.2 req/s
sysv accept, 10 servers: 910.6 req/s
sysv accept, 30 servers: 1026.6 req/s
sysv accept, 150 servers: 927.2 req/s


Note that the first test (2.4 with fcntl serialisation) was
with an unpatched 2.4.0-test10-pre5.  Once the simple
flock.patch is applied, the performance with 150 servers
doubles.  But it's still sucky.  The flock.patch change
is effective in increasing scalability wiht a large number
of CPUs, not a large number of httpd's.

Here's the silly patch I used to turn on sysv sem serialisation
in Apache.  There's probably a better way than this :)

--- apache_1.3.14.orig/src/main/http_main.c Fri Sep 29 00:32:36 2000
+++ apache_1.3.14/src/main/http_main.c  Sat Nov  4 15:01:41 2000
@@ -172,6 +172,13 @@
 
 #include "explain.h"
 
+/* AKPM */
+#if 1
+#define NEED_UNION_SEMUN
+#define USE_SYSVSEM_SERIALIZED_ACCEPT
+#define USE_FCNTL_SERIALIZED_ACCEPT
+#endif
+
 #if !defined(max)
 #define max(a,b)(a  b ? a : b)
 #endif
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

2000-10-30 Thread Andrea Arcangeli


On Mon, Oct 30, 2000 at 02:36:39PM -0200, Rik van Riel wrote:
> For stuff like ___wait_on_page(), OTOH, you really want FIFO
> wakeup to avoid starvation (yes, I know we're currently doing

Sure agreed. In my _whole_ previous email I was only talking about accept.
Semaphores file locking etc.. all needs FIFO for fairness as you said.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

2000-10-30 Thread Andrea Arcangeli


On Mon, Oct 30, 2000 at 02:36:39PM -0200, Rik van Riel wrote:
 For stuff like ___wait_on_page(), OTOH, you really want FIFO
 wakeup to avoid starvation (yes, I know we're currently doing

Sure agreed. In my _whole_ previous email I was only talking about accept.
Semaphores file locking etc.. all needs FIFO for fairness as you said.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

2000-10-29 Thread Andi Kleen


On Sun, Oct 29, 2000 at 11:45:49AM -0800, dean gaudet wrote:
> On Sat, 28 Oct 2000, Alan Cox wrote:
> 
> > > The big question is: why is Apache using file locking so
> > > much?  Is this normal behaviour for Apache?
> > 
> > Apache uses file locking to serialize accept on hosts where accept either has
> > bad thundering heard problems or was simply broken with multiple acceptors
> 
> if apache 1.3 is compiled with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT it'll
> avoid the fcntl() serialisation when there is only one listening port.  
> (it still uses it for multiple listeners... you can read all about my
> logic for that at .)
> 
> is it appropriate for this to be defined for newer linux kernels?  i
> haven't kept track, sorry.  tell me what versions to conditionalize it on.

It should not be needed anymore for 2.4, because the accept() wakeup has been
fixed.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

2000-10-29 Thread Andi Kleen


On Sun, Oct 29, 2000 at 11:45:49AM -0800, dean gaudet wrote:
 On Sat, 28 Oct 2000, Alan Cox wrote:
 
   The big question is: why is Apache using file locking so
   much?  Is this normal behaviour for Apache?
  
  Apache uses file locking to serialize accept on hosts where accept either has
  bad thundering heard problems or was simply broken with multiple acceptors
 
 if apache 1.3 is compiled with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT it'll
 avoid the fcntl() serialisation when there is only one listening port.  
 (it still uses it for multiple listeners... you can read all about my
 logic for that at http://www.apache.org/docs/misc/perf-tuning.html.)
 
 is it appropriate for this to be defined for newer linux kernels?  i
 haven't kept track, sorry.  tell me what versions to conditionalize it on.

It should not be needed anymore for 2.4, because the accept() wakeup has been
fixed.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-10-28 Thread Andrew Morton


Andrew Morton wrote:
> 
> I think it's more expedient at this time to convert
> acquire_fl_sem/release_fl_sem into lock_kernel/unlock_kernel
> (so we _can_ sleep) and to fix the above alleged deadlock
> via the creation of __posix_unblock_lock()

I agree with me.  Could you please test the scalability
of this?



--- linux-2.4.0-test10-pre5/fs/locks.c  Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/locks.c   Sun Oct 29 03:34:15 2000
@@ -1706,11 +1706,25 @@
 posix_unblock_lock(struct file_lock *waiter)
 {
acquire_fl_sem();
+   __posix_unblock_lock(waiter);
+   release_fl_sem();
+}
+
+/**
+ * __posix_unblock_lock - stop waiting for a file lock
+ * @waiter: the lock which was waiting
+ *
+ * lockd needs to block waiting for locks.
+ * Like posix_unblock_lock(), except it doesn't
+ * acquire the file lock semaphore.
+ */
+void
+__posix_unblock_lock(struct file_lock *waiter)
+{
if (!list_empty(>fl_block)) {
locks_delete_block(waiter);
wake_up(>fl_wait);
}
-   release_fl_sem();
 }
 
 static void lock_get_status(char* out, struct file_lock *fl, int id, char *pfx)
--- linux-2.4.0-test10-pre5/include/linux/fs.h  Tue Oct 24 21:34:13 2000
+++ linux-akpm/include/linux/fs.h   Sun Oct 29 03:32:00 2000
@@ -564,6 +564,7 @@
 extern struct file_lock *posix_test_lock(struct file *, struct file_lock *);
 extern int posix_lock_file(struct file *, struct file_lock *, unsigned int);
 extern void posix_block_lock(struct file_lock *, struct file_lock *);
+extern void __posix_unblock_lock(struct file_lock *);
 extern void posix_unblock_lock(struct file_lock *);
 extern int __get_lease(struct inode *inode, unsigned int flags);
 extern time_t lease_get_mtime(struct inode *);
--- linux-2.4.0-test10-pre5/kernel/ksyms.c  Sun Oct 15 01:27:46 2000
+++ linux-akpm/kernel/ksyms.c   Sun Oct 29 03:32:38 2000
@@ -221,6 +221,7 @@
 EXPORT_SYMBOL(posix_lock_file);
 EXPORT_SYMBOL(posix_test_lock);
 EXPORT_SYMBOL(posix_block_lock);
+EXPORT_SYMBOL(__posix_unblock_lock);
 EXPORT_SYMBOL(posix_unblock_lock);
 EXPORT_SYMBOL(locks_mandatory_area);
 EXPORT_SYMBOL(dput);
--- linux-2.4.0-test10-pre5/fs/lockd/svclock.c  Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/lockd/svclock.c   Sun Oct 29 03:32:22 2000
@@ -456,7 +456,7 @@
struct nlm_block**bp, *block;
 
dprintk("lockd: VFS unblock notification for block %p\n", fl);
-   posix_unblock_lock(fl);
+   __posix_unblock_lock(fl);
for (bp = _blocked; (block = *bp); bp = >b_next) {
if (nlm_compare_locks(>b_call.a_args.lock.fl, fl)) {
svc_wake_up(block->b_daemon);
--- linux-2.4.0-test10-pre5/fs/fcntl.c  Sun Oct 15 01:27:45 2000
+++ linux-akpm/fs/fcntl.c   Sun Oct 29 03:35:47 2000
@@ -254,11 +254,15 @@
unlock_kernel();
break;
case F_GETLK:
+   lock_kernel();
err = fcntl_getlk(fd, (struct flock *) arg);
+   unlock_kernel();
break;
case F_SETLK:
case F_SETLKW:
+   lock_kernel();
err = fcntl_setlk(fd, cmd, (struct flock *) arg);
+   unlock_kernel();
break;
case F_GETOWN:
/*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-10-28 Thread Jeff Garzik


Andrew Morton wrote:
> --- linux-2.4.0-test10-pre5/fs/locks.c  Tue Oct 24 21:34:13 2000
> +++ linux-akpm/fs/locks.c   Sun Oct 29 02:31:10 2000
> @@ -125,10 +125,9 @@
>  #include 
>  #include 
> 
> -DECLARE_MUTEX(file_lock_sem);
> -
> -#define acquire_fl_sem()   down(_lock_sem)
> -#define release_fl_sem()   up(_lock_sem)
> +spinlock_t file_lock_lock = SPIN_LOCK_UNLOCKED;
> +#define acquire_fl_lock()  spin_lock(_lock_lock);
> +#define release_fl_lock()  spin_unlock(_lock_lock);

It seems like better concurrency could be achieved with reader-writer
locks.  Some of the lock test routines simply scan the list, without
modifying it.

-- 
Jeff Garzik | "Mind if I drive?"  -Sam
Building 1024   | "Not if you don't mind me clawing at the
MandrakeSoft|  dash and screaming like a cheerleader."
|  -Max
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-10-28 Thread Andi Kleen


On Sun, Oct 29, 2000 at 02:46:14AM +1100, Andrew Morton wrote:
> [EMAIL PROTECTED] wrote:
> > 
> > Change the following two macros:
> > acquire_fl_sem()->lock_kernel()
> > release_fl_sem()->unlock_kernel()
> > then
> > 5192 Req/s @8cpu is got. It is same as test8 within fluctuation.
> 
> hmm..  BKL increases scalability.  News at 11.
> 
> The big question is: why is Apache using file locking so
> much?  Is this normal behaviour for Apache?

It serializes accept() to avoid the thundering herd from the wake-all
semantics.

With the 2.4 stack that is probably not needed anymore (it was in 2.2), 
it may just work to remove the file locking (it should always be correct,
just on 2.2 it may be slower to remove it) 

> Because if so, the file locking code will be significantly
> bad for the scalability of Apache on SMP (of all things!).
> It basically grabs a big global lock for _anything_.  It
> looks like it could be a lot more granular. 

iirc everybody who looked at the code agrees that it needs a rewrite
badly.


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-10-28 Thread Andrew Morton


[EMAIL PROTECTED] wrote:
> 
> Change the following two macros:
> acquire_fl_sem()->lock_kernel()
> release_fl_sem()->unlock_kernel()
> then
> 5192 Req/s @8cpu is got. It is same as test8 within fluctuation.

hmm..  BKL increases scalability.  News at 11.

The big question is: why is Apache using file locking so
much?  Is this normal behaviour for Apache?

Because if so, the file locking code will be significantly
bad for the scalability of Apache on SMP (of all things!).
It basically grabs a big global lock for _anything_.  It
looks like it could be a lot more granular. 

> 
> I put the patch to change all occurrence of semaphore "file_lock_sem"
> into spinlock variable "file_lock_lock".

Unfortunately this is deadlocky - in several places functions
which can sleep are called when file_lock_lock is held.

More unfortunately I think we _already_ have a deadlock in
the file locking:

locks_wake_up_blocks() is always called with file_lock_sem held.
locks_wake_up_blocks() calls waiter->fl_notify(), which is really 
nlmsvc_notify_blocked()
nlmsvc_notify_blocked() calls posix_unblock_lock()
posix_unblock_lock() calls acquire_fl_sem()

Can someone please confirm this?

Anyway, here is a modified version of your patch which
fixes everything.

But I'm not sure that it should be applied.  I think it's
more expedient at this time to convert acquire_fl_sem/release_fl_sem
into lock_kernel/unlock_kernel (so we _can_ sleep) and to fix the
above alleged deadlock via the creation of __posix_unblock_lock() as in
this patch.


--- linux-2.4.0-test10-pre5/fs/lockd/clntlock.c Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/lockd/clntlock.c  Sun Oct 29 00:39:15 2000
@@ -168,10 +168,10 @@
 * reclaim is in progress */
lock_kernel();
lockd_up();
-   down(_lock_sem);
 
/* First, reclaim all locks that have been granted previously. */
 restart:
+   spin_lock(_lock_lock);
tmp = file_lock_list.next;
while (tmp != _lock_list) {
struct file_lock *fl = list_entry(tmp, struct file_lock, fl_link);
@@ -181,12 +181,13 @@
fl->fl_u.nfs_fl.state != host->h_state &&
(fl->fl_u.nfs_fl.flags & NFS_LCK_GRANTED)) {
fl->fl_u.nfs_fl.flags &= ~ NFS_LCK_GRANTED;
+   spin_unlock(_lock_lock);
nlmclnt_reclaim(host, fl);
goto restart;
}
tmp = tmp->next;
}
-   up(_lock_sem);
+   spin_unlock(_lock_lock);
 
host->h_reclaiming = 0;
wake_up(>h_gracewait);
--- linux-2.4.0-test10-pre5/fs/locks.c  Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/locks.c   Sun Oct 29 02:31:10 2000
@@ -125,10 +125,9 @@
 #include 
 #include 
 
-DECLARE_MUTEX(file_lock_sem);
-
-#define acquire_fl_sem()   down(_lock_sem)
-#define release_fl_sem()   up(_lock_sem)
+spinlock_t file_lock_lock = SPIN_LOCK_UNLOCKED;
+#define acquire_fl_lock()  spin_lock(_lock_lock);
+#define release_fl_lock()  spin_unlock(_lock_lock);
 
 int leases_enable = 1;
 int lease_break_time = 45;
@@ -352,29 +351,27 @@
 #endif
 
 /* Allocate a file_lock initialised to this type of lease */
-static int lease_alloc(struct file *filp, int type, struct file_lock **flp)
+static struct file_lock *lease_alloc(struct file *filp, int type)
 {
struct file_lock *fl = locks_alloc_lock(1);
-   if (fl == NULL)
-   return -ENOMEM;
-
-   fl->fl_owner = current->files;
-   fl->fl_pid = current->pid;
+   if (fl != NULL) {
+   fl->fl_owner = current->files;
+   fl->fl_pid = current->pid;
 
-   fl->fl_file = filp;
-   fl->fl_flags = FL_LEASE;
-   if (assign_type(fl, type) != 0) {
-   locks_free_lock(fl);
-   return -EINVAL;
+   fl->fl_file = filp;
+   fl->fl_flags = FL_LEASE;
+   if (assign_type(fl, type) != 0) {
+   locks_free_lock(fl);
+   fl = NULL;
+   } else {
+   fl->fl_start = 0;
+   fl->fl_end = OFFSET_MAX;
+   fl->fl_notify = NULL;
+   fl->fl_insert = NULL;
+   fl->fl_remove = NULL;
+   }
}
-   fl->fl_start = 0;
-   fl->fl_end = OFFSET_MAX;
-   fl->fl_notify = NULL;
-   fl->fl_insert = NULL;
-   fl->fl_remove = NULL;
-
-   *flp = fl;
-   return 0;
+   return fl;
 }
 
 /* Check if two locks overlap each other.
@@ -431,6 +428,9 @@
 /* Wake up processes blocked waiting for blocker.
  * If told to wait then schedule the processes until the block list
  * is empty, otherwise empty the block list ourselves.
+ *
+ * waiter->fl_notify() is not allowed to block.  Otherwise
+ * we could deadlock on file_lock_lock - AKPM.
  */
 static void locks_wake_up_blocks(struct file_lock *blocker, unsigned

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-10-28 Thread Andrew Morton


Andrew Morton wrote:
 
 I think it's more expedient at this time to convert
 acquire_fl_sem/release_fl_sem into lock_kernel/unlock_kernel
 (so we _can_ sleep) and to fix the above alleged deadlock
 via the creation of __posix_unblock_lock()

I agree with me.  Could you please test the scalability
of this?



--- linux-2.4.0-test10-pre5/fs/locks.c  Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/locks.c   Sun Oct 29 03:34:15 2000
@@ -1706,11 +1706,25 @@
 posix_unblock_lock(struct file_lock *waiter)
 {
acquire_fl_sem();
+   __posix_unblock_lock(waiter);
+   release_fl_sem();
+}
+
+/**
+ * __posix_unblock_lock - stop waiting for a file lock
+ * @waiter: the lock which was waiting
+ *
+ * lockd needs to block waiting for locks.
+ * Like posix_unblock_lock(), except it doesn't
+ * acquire the file lock semaphore.
+ */
+void
+__posix_unblock_lock(struct file_lock *waiter)
+{
if (!list_empty(waiter-fl_block)) {
locks_delete_block(waiter);
wake_up(waiter-fl_wait);
}
-   release_fl_sem();
 }
 
 static void lock_get_status(char* out, struct file_lock *fl, int id, char *pfx)
--- linux-2.4.0-test10-pre5/include/linux/fs.h  Tue Oct 24 21:34:13 2000
+++ linux-akpm/include/linux/fs.h   Sun Oct 29 03:32:00 2000
@@ -564,6 +564,7 @@
 extern struct file_lock *posix_test_lock(struct file *, struct file_lock *);
 extern int posix_lock_file(struct file *, struct file_lock *, unsigned int);
 extern void posix_block_lock(struct file_lock *, struct file_lock *);
+extern void __posix_unblock_lock(struct file_lock *);
 extern void posix_unblock_lock(struct file_lock *);
 extern int __get_lease(struct inode *inode, unsigned int flags);
 extern time_t lease_get_mtime(struct inode *);
--- linux-2.4.0-test10-pre5/kernel/ksyms.c  Sun Oct 15 01:27:46 2000
+++ linux-akpm/kernel/ksyms.c   Sun Oct 29 03:32:38 2000
@@ -221,6 +221,7 @@
 EXPORT_SYMBOL(posix_lock_file);
 EXPORT_SYMBOL(posix_test_lock);
 EXPORT_SYMBOL(posix_block_lock);
+EXPORT_SYMBOL(__posix_unblock_lock);
 EXPORT_SYMBOL(posix_unblock_lock);
 EXPORT_SYMBOL(locks_mandatory_area);
 EXPORT_SYMBOL(dput);
--- linux-2.4.0-test10-pre5/fs/lockd/svclock.c  Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/lockd/svclock.c   Sun Oct 29 03:32:22 2000
@@ -456,7 +456,7 @@
struct nlm_block**bp, *block;
 
dprintk("lockd: VFS unblock notification for block %p\n", fl);
-   posix_unblock_lock(fl);
+   __posix_unblock_lock(fl);
for (bp = nlm_blocked; (block = *bp); bp = block-b_next) {
if (nlm_compare_locks(block-b_call.a_args.lock.fl, fl)) {
svc_wake_up(block-b_daemon);
--- linux-2.4.0-test10-pre5/fs/fcntl.c  Sun Oct 15 01:27:45 2000
+++ linux-akpm/fs/fcntl.c   Sun Oct 29 03:35:47 2000
@@ -254,11 +254,15 @@
unlock_kernel();
break;
case F_GETLK:
+   lock_kernel();
err = fcntl_getlk(fd, (struct flock *) arg);
+   unlock_kernel();
break;
case F_SETLK:
case F_SETLKW:
+   lock_kernel();
err = fcntl_setlk(fd, cmd, (struct flock *) arg);
+   unlock_kernel();
break;
case F_GETOWN:
/*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-10-28 Thread Andrew Morton


[EMAIL PROTECTED] wrote:
 
 Change the following two macros:
 acquire_fl_sem()-lock_kernel()
 release_fl_sem()-unlock_kernel()
 then
 5192 Req/s @8cpu is got. It is same as test8 within fluctuation.

hmm..  BKL increases scalability.  News at 11.

The big question is: why is Apache using file locking so
much?  Is this normal behaviour for Apache?

Because if so, the file locking code will be significantly
bad for the scalability of Apache on SMP (of all things!).
It basically grabs a big global lock for _anything_.  It
looks like it could be a lot more granular. 

 
 I put the patch to change all occurrence of semaphore "file_lock_sem"
 into spinlock variable "file_lock_lock".

Unfortunately this is deadlocky - in several places functions
which can sleep are called when file_lock_lock is held.

More unfortunately I think we _already_ have a deadlock in
the file locking:

locks_wake_up_blocks() is always called with file_lock_sem held.
locks_wake_up_blocks() calls waiter-fl_notify(), which is really 
nlmsvc_notify_blocked()
nlmsvc_notify_blocked() calls posix_unblock_lock()
posix_unblock_lock() calls acquire_fl_sem()

Can someone please confirm this?

Anyway, here is a modified version of your patch which
fixes everything.

But I'm not sure that it should be applied.  I think it's
more expedient at this time to convert acquire_fl_sem/release_fl_sem
into lock_kernel/unlock_kernel (so we _can_ sleep) and to fix the
above alleged deadlock via the creation of __posix_unblock_lock() as in
this patch.


--- linux-2.4.0-test10-pre5/fs/lockd/clntlock.c Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/lockd/clntlock.c  Sun Oct 29 00:39:15 2000
@@ -168,10 +168,10 @@
 * reclaim is in progress */
lock_kernel();
lockd_up();
-   down(file_lock_sem);
 
/* First, reclaim all locks that have been granted previously. */
 restart:
+   spin_lock(file_lock_lock);
tmp = file_lock_list.next;
while (tmp != file_lock_list) {
struct file_lock *fl = list_entry(tmp, struct file_lock, fl_link);
@@ -181,12 +181,13 @@
fl-fl_u.nfs_fl.state != host-h_state 
(fl-fl_u.nfs_fl.flags  NFS_LCK_GRANTED)) {
fl-fl_u.nfs_fl.flags = ~ NFS_LCK_GRANTED;
+   spin_unlock(file_lock_lock);
nlmclnt_reclaim(host, fl);
goto restart;
}
tmp = tmp-next;
}
-   up(file_lock_sem);
+   spin_unlock(file_lock_lock);
 
host-h_reclaiming = 0;
wake_up(host-h_gracewait);
--- linux-2.4.0-test10-pre5/fs/locks.c  Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/locks.c   Sun Oct 29 02:31:10 2000
@@ -125,10 +125,9 @@
 #include asm/semaphore.h
 #include asm/uaccess.h
 
-DECLARE_MUTEX(file_lock_sem);
-
-#define acquire_fl_sem()   down(file_lock_sem)
-#define release_fl_sem()   up(file_lock_sem)
+spinlock_t file_lock_lock = SPIN_LOCK_UNLOCKED;
+#define acquire_fl_lock()  spin_lock(file_lock_lock);
+#define release_fl_lock()  spin_unlock(file_lock_lock);
 
 int leases_enable = 1;
 int lease_break_time = 45;
@@ -352,29 +351,27 @@
 #endif
 
 /* Allocate a file_lock initialised to this type of lease */
-static int lease_alloc(struct file *filp, int type, struct file_lock **flp)
+static struct file_lock *lease_alloc(struct file *filp, int type)
 {
struct file_lock *fl = locks_alloc_lock(1);
-   if (fl == NULL)
-   return -ENOMEM;
-
-   fl-fl_owner = current-files;
-   fl-fl_pid = current-pid;
+   if (fl != NULL) {
+   fl-fl_owner = current-files;
+   fl-fl_pid = current-pid;
 
-   fl-fl_file = filp;
-   fl-fl_flags = FL_LEASE;
-   if (assign_type(fl, type) != 0) {
-   locks_free_lock(fl);
-   return -EINVAL;
+   fl-fl_file = filp;
+   fl-fl_flags = FL_LEASE;
+   if (assign_type(fl, type) != 0) {
+   locks_free_lock(fl);
+   fl = NULL;
+   } else {
+   fl-fl_start = 0;
+   fl-fl_end = OFFSET_MAX;
+   fl-fl_notify = NULL;
+   fl-fl_insert = NULL;
+   fl-fl_remove = NULL;
+   }
}
-   fl-fl_start = 0;
-   fl-fl_end = OFFSET_MAX;
-   fl-fl_notify = NULL;
-   fl-fl_insert = NULL;
-   fl-fl_remove = NULL;
-
-   *flp = fl;
-   return 0;
+   return fl;
 }
 
 /* Check if two locks overlap each other.
@@ -431,6 +428,9 @@
 /* Wake up processes blocked waiting for blocker.
  * If told to wait then schedule the processes until the block list
  * is empty, otherwise empty the block list ourselves.
+ *
+ * waiter-fl_notify() is not allowed to block.  Otherwise
+ * we could deadlock on file_lock_lock - AKPM.
  */
 static void locks_wake_up_blocks(struct

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-10-28 Thread Andi Kleen


On Sun, Oct 29, 2000 at 02:46:14AM +1100, Andrew Morton wrote:
 [EMAIL PROTECTED] wrote:
  
  Change the following two macros:
  acquire_fl_sem()-lock_kernel()
  release_fl_sem()-unlock_kernel()
  then
  5192 Req/s @8cpu is got. It is same as test8 within fluctuation.
 
 hmm..  BKL increases scalability.  News at 11.
 
 The big question is: why is Apache using file locking so
 much?  Is this normal behaviour for Apache?

It serializes accept() to avoid the thundering herd from the wake-all
semantics.

With the 2.4 stack that is probably not needed anymore (it was in 2.2), 
it may just work to remove the file locking (it should always be correct,
just on 2.2 it may be slower to remove it) 

 Because if so, the file locking code will be significantly
 bad for the scalability of Apache on SMP (of all things!).
 It basically grabs a big global lock for _anything_.  It
 looks like it could be a lot more granular. 

iirc everybody who looked at the code agrees that it needs a rewrite
badly.


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

2000-10-28 Thread Jeff Garzik


Andrew Morton wrote:
 --- linux-2.4.0-test10-pre5/fs/locks.c  Tue Oct 24 21:34:13 2000
 +++ linux-akpm/fs/locks.c   Sun Oct 29 02:31:10 2000
 @@ -125,10 +125,9 @@
  #include asm/semaphore.h
  #include asm/uaccess.h
 
 -DECLARE_MUTEX(file_lock_sem);
 -
 -#define acquire_fl_sem()   down(file_lock_sem)
 -#define release_fl_sem()   up(file_lock_sem)
 +spinlock_t file_lock_lock = SPIN_LOCK_UNLOCKED;
 +#define acquire_fl_lock()  spin_lock(file_lock_lock);
 +#define release_fl_lock()  spin_unlock(file_lock_lock);

It seems like better concurrency could be achieved with reader-writer
locks.  Some of the lock test routines simply scan the list, without
modifying it.

-- 
Jeff Garzik | "Mind if I drive?"  -Sam
Building 1024   | "Not if you don't mind me clawing at the
MandrakeSoft|  dash and screaming like a cheerleader."
|  -Max
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

59 matches

Mail list logo