Re: scheduling problem?
Mike Galbraith wrote:
> On Wed, 3 Jan 2001, Daniel Phillips wrote:
> > Mike Galbraith wrote:
> > > Semaphore timed out during boot, leaving bdflush as zombie.
> >
> > Wait a sec, what do you mean by 'semaphore timed out'? These should
> > wait patiently forever.
>
> IKD has a semaphore deadlock detector.

That was my tentative conclusion.

> Any place you take a semaphore
> and have to wait longer than 5 seconds (what I had it set to because
> with trace buffer set to 300 entries, it can only cover ~8 seconds
> of disk [slowest] load), it triggers and freezes the trace buffer for
> later use. It firing under load may not be of interest. (but it firing
> looks to be very closely coupled to observed stalls with virgin source.
> Linus fixes big stall and deadlock detector mostly shuts up. I fix a
> smaller stall and it shuts up entirely.. for this workload)

But it's entirely legal for a semaphore to wait forever when used in the
way I've used them, a producer/consumer pattern. You should be able to
run happily (at least as happily as before) with the watchdog disabled.

This raises the question of what to do about the 99.99% of cases where
the watchdog is a good thing to have. Shouldn't the watchdog just log
the 'suspicious' event and continue?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: scheduling problem?
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Mike Galbraith wrote:
> > Semaphore timed out during boot, leaving bdflush as zombie.
>
> Wait a sec, what do you mean by 'semaphore timed out'? These should
> wait patiently forever.

IKD has a semaphore deadlock detector. Any place you take a semaphore
and have to wait longer than 5 seconds (what I had it set to because
with trace buffer set to 300 entries, it can only cover ~8 seconds
of disk [slowest] load), it triggers and freezes the trace buffer for
later use. It firing under load may not be of interest. (but it firing
looks to be very closely coupled to observed stalls with virgin source.
Linus fixes big stall and deadlock detector mostly shuts up. I fix a
smaller stall and it shuts up entirely.. for this workload)

	-Mike
Re: scheduling problem?
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Mike Galbraith wrote:
> >
> > On Wed, 3 Jan 2001, Daniel Phillips wrote:
> >
> > > Could you try this patch just to see what happens? It uses semaphores
> > > for the bdflush synchronization instead of banging directly on the task
> > > wait queues. It's supposed to be a drop-in replacement for the bdflush
> > > wakeup/waitfor mechanism, but who knows, it may have subtly different
> > > behaviour in your case.
> >
> > Semaphore timed out during boot, leaving bdflush as zombie.
>
> Hmm, how could that happen? I'm booted and running with that patch
> right now and have beaten on it extensively - it sounds like something
> else is broken. Or maybe we've already established that - let me read
> the thread again.
>
> Which semaphore timed out, bdflush_request or bdflush_waiter?

I didn't watch closely (running virgin prerelease). I can run it again
if you think it's important.

	-Mike
Re: scheduling problem?
Mike Galbraith wrote:
> Semaphore timed out during boot, leaving bdflush as zombie.

Wait a sec, what do you mean by 'semaphore timed out'? These should
wait patiently forever.

--
Daniel
Re: scheduling problem?
Mike Galbraith wrote:
>
> On Wed, 3 Jan 2001, Daniel Phillips wrote:
>
> > Could you try this patch just to see what happens? It uses semaphores
> > for the bdflush synchronization instead of banging directly on the task
> > wait queues. It's supposed to be a drop-in replacement for the bdflush
> > wakeup/waitfor mechanism, but who knows, it may have subtly different
> > behaviour in your case.
>
> Semaphore timed out during boot, leaving bdflush as zombie.

Hmm, how could that happen? I'm booted and running with that patch
right now and have beaten on it extensively - it sounds like something
else is broken. Or maybe we've already established that - let me read
the thread again.

Which semaphore timed out, bdflush_request or bdflush_waiter?

--
Daniel
Re: scheduling problem?
On Wed, 3 Jan 2001, Mike Galbraith wrote:
> Feel is _vastly_ improved.

Except while beating on it, I found a way to turn it into a brick. If I
run Christoph Rohland's swptst proggy, interactivity disappears to the
point that logging in while it is running is impossible. ~15 minutes
later I got 'login timed out after 30 seconds'. ~10 minutes after that,
the prompt came back.

	-Mike

on other vt..
./swptst 1 4800 4 12 100

Script started on Wed Jan 3 11:16:46 2001
[root]:# schedctl -R vmstat 1
[vmstat 1 output: the column alignment was destroyed in the archive. The
recoverable trend: the box starts nearly idle, then spends the run with
several processes blocked (b column) and sustained heavy swap-in/swap-out
(si/so), and returns to idle at the end.]
[root]:# exit
exit
Script done on Wed Jan 3 12:00:31 2001
Re: scheduling problem?
On Tue, 2 Jan 2001, Linus Torvalds wrote:
> On Wed, 3 Jan 2001, Mike Galbraith wrote:
> >
> > No difference (except more context switching as expected)
>
> What about the current prerelease patch in testing? It doesn't switch to
> bdflush at all, but instead does the buffer cleaning by hand.

99% gone. The remaining 1% is refill_freelist(). If I use
flush_dirty_buffers() there instead of waiting, I have no more semaphore
timeouts (so far.. not thoroughly pounded upon). Without that change, I
still take hits. (in my tinker tree, I usually make a 'small flush' mode
for flush_dirty_buffers() to do that)

Feel is _vastly_ improved.

	-Mike
Re: scheduling problem?
On Wed, 3 Jan 2001, Mike Galbraith wrote:
>
> No difference (except more context switching as expected)

What about the current prerelease patch in testing? It doesn't switch to
bdflush at all, but instead does the buffer cleaning by hand.

		Linus
Re: scheduling problem?
On Wed, 3 Jan 2001, Roger Larsson wrote:
> Hi,
>
> I have played around with this code previously.
> This is my current understanding.
> [yield problem?]

Hmm.. this ~could be. I once dove into the VM waters (me=stone) and
changed __alloc_pages() to only yield instead of scheduling. The results
(along with many other strange changes) were.. weirdest feeling kernel I
ever ran. Damn fast, but very very weird ;-)

> Possible (in -prerelease) untested possibilities.
>
> * Be tougher when yielding.
>
> 	wakeup_kswapd(0);
> 	if (gfp_mask & __GFP_WAIT) {
> 		__set_current_state(TASK_RUNNING);
> 		current->policy |= SCHED_YIELD;
> +		current->counter--; /* be faster to let kswapd run */
> or
> +		current->counter = 0; /* too fast? [not tested] */
> 		schedule();
> 	}

That looks a lot like cheating.

> * Move wakeup of bdflush to kswapd. Somewhere after
>   'do_try_to_free_pages(..)' has been run. Before going to sleep...
>   [a variant tested with mixed results - this is likely a better one]

I also did some things along this line.. also with mixed results. :)

The changes I've done that I actually like best kill bdflush graveyard
dead. Did that twice and didn't miss it at all. (next time, I think
I'll erect a headstone)

	-Mike
Re: scheduling problem?
On Tue, 2 Jan 2001, Linus Torvalds wrote:
> On Tue, 2 Jan 2001, Mike Galbraith wrote:
> >
> > Yes and no. I've seen nasty stalls for quite a while now. (I think
> > that there is a wakeup problem lurking)
> >
> > I found the change which triggers my horrid stalls. Nobody is going
> > to believe this...
>
> Hmm.. I can believe it. The code that waits on bdflush in wakeup_bdflush()
> is somewhat suspicious. In particular, if/when that ever triggers, and
> bdflush() is busy in flush_dirty_buffers(), then the process that is
> trying to wake bdflush up is going to wait until flush_dirty_buffers() is
> done.
>
> Which, if there is a process dirtying pages, can basically be
> pretty much forever.
>
> This was probably hidden by the lower limits simply by virtue of bdflush
> never being very active before.
>
> What does the system feel like if you just change the "sleep for bdflush"
> logic in wakeup_bdflush() to something like
>
> 	wake_up_process(bdflush_tsk);
> 	__set_current_state(TASK_RUNNING);
> 	current->policy |= SCHED_YIELD;
> 	schedule();
>
> instead of trying to wait for bdflush to wake us up?

No difference (except more context switching as expected)

	-Mike
Re: scheduling problem?
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Could you try this patch just to see what happens? It uses semaphores
> for the bdflush synchronization instead of banging directly on the task
> wait queues. It's supposed to be a drop-in replacement for the bdflush
> wakeup/waitfor mechanism, but who knows, it may have subtly different
> behaviour in your case.

Semaphore timed out during boot, leaving bdflush as zombie.

	-Mike
Re: scheduling problem?
Hi,

I have played around with this code previously.
This is my current understanding. [yield problem?]

On Tuesday 02 January 2001 09:27, Mike Galbraith wrote:
> Hi,
>
> I am seeing (what I believe is;) severe process CPU starvation in
> 2.4.0-prerelease. At first, I attributed it to semaphore troubles
> as when I enable semaphore deadlock detection in IKD and set it to
> 5 seconds, it triggers 100% of the time on nscd when I do sequential
> I/O (iozone eg). In the meantime, I've done a slew of tracing, and
> I think the holder of the semaphore I'm timing out on just flat isn't
> being scheduled so it can release it. In the usual case of nscd, I
> _think_ it's another nscd holding the semaphore. In no trace can I
> go back far enough to catch the taker of the semaphore or any user
> task other than iozone running between __down() time and timeout 5
> seconds later. (trace buffer covers ~8 seconds of kernel time)
>
> I think the snippet below captures the gist of the problem.
>
> c012f32e nr_free_pages + (0.16) pid(256)
> c012f37a nr_inactive_clean_pages + (0.22) pid(256)

wakeup_bdflush (from beginning of __alloc_pages; page_alloc.c:324)

> c01377f2 wakeup_bdflush +<12/a0> (0.14) pid(256)
> c011620a wake_up_process + (0.29) pid(256)
> c012eea4 __alloc_pages_limit +<10/b8> (0.28) pid(256)
> c012eea4 __alloc_pages_limit +<10/b8> (0.30) pid(256)

Two __alloc_pages_limit.

wakeup_kswapd(0) (from page_alloc.c:392)

> c012e3fa wakeup_kswapd +<12/d4> (0.25) pid(256)
> c0115613 __wake_up +<13/130> (0.41) pid(256)

schedule() (from page_alloc.c:396)

> c011527b schedule +<13/398> (0.66) pid(256->6)
> c01077db __switch_to +<13/d0> (0.70) pid(6)

bdflush is running!!!

> c01893c6 generic_unplug_device + (0.25) pid(6)

bdflush is ready. (but how likely is it that it will run for long enough
to get hit by a tick, i.e. current->counter--? unless it is, it will
continue to be preferred to kswapd, and since only one process is
yielded...)

> c011527b schedule +<13/398> (0.50) pid(6->256)
> c01077db __switch_to +<13/d0> (0.29) pid(256)

Back to the client, not the additionally runnable kswapd... Why not -
nothing remaining of the timeslice. Note that the yield only yields one
process, not all in the runqueue - IMHO. [is this intended?]

3rd __alloc_pages_limit - this time the direct_reclaim tests are fulfilled:

> c012eea4 __alloc_pages_limit +<10/b8> (0.22) pid(256)
> c012d267 reclaim_page +<13/408> (0.54) pid(256)

Possible (in -prerelease) untested possibilities.

* Be tougher when yielding.

	wakeup_kswapd(0);
	if (gfp_mask & __GFP_WAIT) {
		__set_current_state(TASK_RUNNING);
		current->policy |= SCHED_YIELD;
+		current->counter--; /* be faster to let kswapd run */
or
+		current->counter = 0; /* too fast? [not tested] */
		schedule();
	}

  Might be too tough on the client not doing any actual work... think
  dbench...

* Be tougher on bdflush, decrement its counter now and then...
  [naive, not tested]

* Move wakeup of bdflush to kswapd. Somewhere after
  'do_try_to_free_pages(..)' has been run. Before going to sleep...
  [a variant tested with mixed results - this is likely a better one]

	/*
	 * We go to sleep if either the free page shortage
	 * or the inactive page shortage is gone. We do this
	 * because:
	 * 1) we need no more free pages or
	 * 2) the inactive pages need to be flushed to disk,
	 *    it wouldn't help to eat CPU time now ...
	 *
	 * We go to sleep for one second, but if it's needed
	 * we'll be woken up earlier...
	 */
	if (!free_shortage() || !inactive_shortage()) {
		/*
		 * If we are about to get low on free pages and cleaning
		 * the inactive_dirty pages would fix the situation,
		 * wake up bdflush.
		 */
		if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
				&& nr_inactive_dirty_pages >= freepages.high)
			wakeup_bdflush(0);

		interruptible_sleep_on_timeout(&kswapd_wait, HZ);
	}

--
Home page: http://www.norran.net/nra02596/
Re: scheduling problem?
Mike Galbraith wrote:
>
> On Wed, 3 Jan 2001, Anton Blanchard wrote:
>
> > > I am seeing (what I believe is;) severe process CPU starvation in
> > > 2.4.0-prerelease. At first, I attributed it to semaphore troubles
> > > as when I enable semaphore deadlock detection in IKD and set it to
> > > 5 seconds, it triggers 100% of the time on nscd when I do sequential
> > > I/O (iozone eg). In the meantime, I've done a slew of tracing, and
> > > I think the holder of the semaphore I'm timing out on just flat isn't
> > > being scheduled so it can release it. In the usual case of nscd, I
> > > _think_ it's another nscd holding the semaphore. In no trace can I
> > > go back far enough to catch the taker of the semaphore or any user
> > > task other than iozone running between __down() time and timeout 5
> > > seconds later. (trace buffer covers ~8 seconds of kernel time)
> >
> > Did this just appear in recent kernels? Maybe bdflush was hiding the
> > situation in earlier kernels as it would cause io hogs to block when
> > things got only mildly interesting.
>
> Yes and no. I've seen nasty stalls for quite a while now. (I think
> that there is a wakeup problem lurking)

Could you try this patch just to see what happens? It uses semaphores
for the bdflush synchronization instead of banging directly on the task
wait queues. It's supposed to be a drop-in replacement for the bdflush
wakeup/waitfor mechanism, but who knows, it may have subtly different
behaviour in your case.

--- 2.4.0.clean/fs/buffer.c	Sat Dec 30 20:19:13 2000
+++ 2.4.0/fs/buffer.c	Tue Jan 2 23:05:14 2001
@@ -2528,33 +2528,28 @@
  * response to dirty buffers. Once this process is activated, we write back
  * a limited number of buffers to the disks and then go back to sleep again.
  */
-static DECLARE_WAIT_QUEUE_HEAD(bdflush_done);
+
+/* Semaphore wakeups, Daniel Phillips, [EMAIL PROTECTED], 2000/12 */
+
 struct task_struct *bdflush_tsk = 0;
+DECLARE_MUTEX_LOCKED(bdflush_request);
+DECLARE_MUTEX_LOCKED(bdflush_waiter);
+atomic_t bdflush_waiters /*= 0*/;
 
 void wakeup_bdflush(int block)
 {
-	DECLARE_WAITQUEUE(wait, current);
-
 	if (current == bdflush_tsk)
 		return;
 
-	if (!block) {
-		wake_up_process(bdflush_tsk);
+	if (!block)
+	{
+		up(&bdflush_request);
 		return;
 	}
 
-	/* bdflush can wakeup us before we have a chance to
-	   go to sleep so we must be smart in handling
-	   this wakeup event from bdflush to avoid deadlocking in SMP
-	   (we are not holding any lock anymore in these two paths). */
-	__set_current_state(TASK_UNINTERRUPTIBLE);
-	add_wait_queue(&bdflush_done, &wait);
-
-	wake_up_process(bdflush_tsk);
-	schedule();
-
-	remove_wait_queue(&bdflush_done, &wait);
-	__set_current_state(TASK_RUNNING);
+	atomic_inc(&bdflush_waiters);
+	up(&bdflush_request);
+	down(&bdflush_waiter);
 }
 
 /* This is the _only_ function that deals with flushing async writes
@@ -2699,7 +2694,7 @@
 int bdflush(void *sem)
 {
 	struct task_struct *tsk = current;
-	int flushed;
+	int flushed, waiters;
 	/*
 	 * We have a bare-bones task_struct, and really should fill
 	 * in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2727,28 +2722,16 @@
 		if (free_shortage())
 			flushed += page_launder(GFP_BUFFER, 0);
 
-		/* If wakeup_bdflush will wakeup us
-		   after our bdflush_done wakeup, then
-		   we must make sure to not sleep
-		   in schedule_timeout otherwise
-		   wakeup_bdflush may wait for our
-		   bdflush_done wakeup that would never arrive
-		   (as we would be sleeping) and so it would
-		   deadlock in SMP. */
-		__set_current_state(TASK_INTERRUPTIBLE);
-		wake_up_all(&bdflush_done);
-		/*
-		 * If there are still a lot of dirty buffers around,
-		 * skip the sleep and flush some more. Otherwise, we
-		 * go to sleep waiting a wakeup.
-		 */
-		if (!flushed || balance_dirty_state(NODEV) < 0) {
+		waiters = atomic_read(&bdflush_waiters);
+		atomic_sub(waiters, &bdflush_waiters);
+		while (waiters--)
+			up(&bdflush_waiter);
+
+		if (!flushed || balance_dirty_state(NODEV) < 0)
+		{
 			run_task_queue(&tq_disk);
-			schedule();
+			down(&bdflush_request);
 		}
-		/* Remember to mark us as running otherwise
-		   the next schedule will block. */
-		__set_current_state(TASK_RUNNING);
 	}
 }
Re: scheduling problem?
On Tue, 2 Jan 2001, Linus Torvalds wrote:
>
> Right now, the automatic balancing only hurts. The stuff that hasn't been
> converted is probably worse off doing balancing when they don't want to,
> than we would be to leave the balancing altogether.
>
> Which is why I don't like it.

Actually, there is right now another problem with the synchronous
waiting, which is completely different: because bdflush can be waited on
synchronously by various entities that hold various IO locks, bdflush
itself cannot do certain kinds of IO at all. In particular, it has to
use GFP_BUFFER when it calls down to page_launder(), because it cannot
afford to write out dirty pages which might deadlock on the locks that
are held by people waiting for bdflush..

The deadlock issue is the one I dislike the most: bdflush being
synchronously waited on is fundamentally always going to cripple it. In
comparison, the automatic rebalancing is just a latency issue (but the
automatic balancing _is_ the thing that brings on the fact that we call
rebalance with locks held, so they are certainly related).

		Linus
Re: scheduling problem?
On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
>
> > NOTE! I think that throttling writers is fine and good, but as it stands
> > now, the dirty buffer balancing will throttle anybody, not just the
> > writer. That's partly because of the 2.4.x mis-feature of doing the
>
> How can it throttle everybody and not only the writers? _Only_ the
> writers call balance_dirty.

A lot of people call mark_buffer_dirty() on one or two buffers. Things
like file creation etc. Think about inode bitmap blocks that are marked
dirty with the superblock held.. Ugh.

> The right way to avoid blocking with locks held is to replace
> mark_buffer_dirty with __mark_buffer_dirty() and to call balance_dirty()
> later when the locks are released.

The point being that because _everybody_ should do this, we shouldn't
have the "mark_buffer_dirty()" that we have. There are no really valid
uses of the automatic rebalancing: either we're writing meta-data (which
definitely should balance on its own _after_ the fact), or we're writing
normal data (which already _does_ balance after the fact).

Right now, the automatic balancing only hurts. The stuff that hasn't
been converted is probably worse off doing balancing when they don't
want to, than we would be to leave the balancing altogether.

Which is why I don't like it.

		Linus
Re: scheduling problem?
On Tue, Jan 02, 2001 at 01:02:30PM -0800, Linus Torvalds wrote:
> On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
> > On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> > > What does the system feel like if you just change the "sleep for bdflush"
> > > logic in wakeup_bdflush() to something like
> > >
> > > 	wake_up_process(bdflush_tsk);
> > > 	__set_current_state(TASK_RUNNING);
> > > 	current->policy |= SCHED_YIELD;
> > > 	schedule();
> > >
> > > instead of trying to wait for bdflush to wake us up?
> >
> > My bet is a `VM: killing' message.
>
> Maybe in 2.2.x, yes.
>
> > Waiting for the bdflush back-wakeup is mandatory to do write throttling
> > correctly. The above will break write throttling at least, unless
> > something fundamental has changed recently, and that doesn't seem to be
> > the case.
>
> page_launder() should wait for the dirty pages, and that's not something
> 2.2.x ever did.

In late 2.2.x we have sync_page_buffers too, but I'm not sure how well
it behaves when the whole MM is constantly kept totally dirty and we
don't have swap. In fact also the 2.4.x implementation:

	static void sync_page_buffers(struct buffer_head *bh, int wait)
	{
		struct buffer_head * tmp = bh;

		do {
			struct buffer_head *p = tmp;
			tmp = tmp->b_this_page;
			if (buffer_locked(p)) {
				if (wait > 1)
					__wait_on_buffer(p);
			} else if (buffer_dirty(p))
				ll_rw_block(WRITE, 1, &p);
		} while (tmp != bh);
	}

won't cope with the memory totally dirty. It will move the buffers from
dirty to locked, then it will wait for I/O completion at the second
pass, but it won't try again to free the page a third time (when the
page is finally freeable):

	if (wait) {
		sync_page_buffers(bh, wait);

		/* We waited synchronously, so we can free the buffers. */
		if (wait > 1 && !loop) {
			loop = 1;
			goto cleaned_buffers_try_again;
		}

Probably not a big deal. The real point is that even if
try_to_free_buffers deals perfectly with the VM totally dirty, we'll end
up waiting for I/O completion in the wrong place. setiathome will end up
waiting for I/O completion instead of `cp`. It's not setiathome but `cp`
that should do the write throttling. And `cp` will block again very soon
even if setiathome blocks too.

The whole point is that the write throttling must happen in
balance_dirty(), _not_ in sync_page_buffers(). In fact from 2.2.19pre2
there's a wait_io per-bh bitflag that remembers when a dirty bh is very
old and doesn't get flushed away automatically (by either kupdate or
kflushd). So we don't block in sync_page_buffers until it's necessary,
to avoid hurting non-IO apps while I/O is going on.

> NOTE! I think that throttling writers is fine and good, but as it stands
> now, the dirty buffer balancing will throttle anybody, not just the
> writer. That's partly because of the 2.4.x mis-feature of doing the

How can it throttle everybody and not only the writers? _Only_ the
writers call balance_dirty.

> balance_dirty call even for previously dirty buffers (fixed in my tree,
> btw).

Yes, I've seen; people overwriting dirty data were blocking too, which
was not necessary, but they were still writers.

> It's _really_ bad to wait for bdflush to finish if we hold on to things
> like the superblock lock - which _does_ happen right now. That's why I'm
> pretty convinced that we should NOT blindly do the dirty balance in
> "mark_buffer_dirty()", but instead at more well-defined points (in places
> like "generic_file_write()", for example).

The right way to avoid blocking with locks held is to replace
mark_buffer_dirty with __mark_buffer_dirty() and to call balance_dirty()
later when the locks are released. That's why it's exported to modules.
Everybody has always been allowed to optimize away the
mark_buffer_dirty(); it's just that nobody has done it yet. I think it's
useful to keep providing an interface that does the write throttling
automatically.

Andrea
Re: scheduling problem?
On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
> On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> > What does the system feel like if you just change the "sleep for bdflush"
> > logic in wakeup_bdflush() to something like
> >
> > 	wake_up_process(bdflush_tsk);
> > 	__set_current_state(TASK_RUNNING);
> > 	current->policy |= SCHED_YIELD;
> > 	schedule();
> >
> > instead of trying to wait for bdflush to wake us up?
>
> My bet is a `VM: killing' message.

Maybe in 2.2.x, yes.

> Waiting for the bdflush back-wakeup is mandatory to do write throttling
> correctly. The above will break write throttling at least, unless
> something fundamental has changed recently, and that doesn't seem to be
> the case.

page_launder() should wait for the dirty pages, and that's not something
2.2.x ever did. This way, the issue of dirty data in the VM is handled
by the VM pressure, not by trying to artificially throttle writers.

NOTE! I think that throttling writers is fine and good, but as it stands
now, the dirty buffer balancing will throttle anybody, not just the
writer. That's partly because of the 2.4.x mis-feature of doing the
balance_dirty call even for previously dirty buffers (fixed in my tree,
btw).

It's _really_ bad to wait for bdflush to finish if we hold on to things
like the superblock lock - which _does_ happen right now. That's why I'm
pretty convinced that we should NOT blindly do the dirty balance in
"mark_buffer_dirty()", but instead at more well-defined points (in
places like "generic_file_write()", for example).

		Linus
Re: scheduling problem?
On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> What does the system feel like if you just change the "sleep for bdflush"
> logic in wakeup_bdflush() to something like
>
> 	wake_up_process(bdflush_tsk);
> 	__set_current_state(TASK_RUNNING);
> 	current->policy |= SCHED_YIELD;
> 	schedule();
>
> instead of trying to wait for bdflush to wake us up?

My bet is a `VM: killing' message.

Waiting for the bdflush back-wakeup is mandatory to do write throttling
correctly. The above will break write throttling at least, unless
something fundamental has changed recently, and that doesn't seem to be
the case.

What I'd like to do there is make bdflush the same thing that kswapd
_should_ be for memory pressure (I say "should" because it seems that's
not the case anymore in 2.4.x, from some email I read recently; I
haven't checked that myself yet). I implemented that at some point in my
private local tree. I mean: bdflush only does the async writeouts, and
the task context calls something like flush_dirty_buffers itself.

The main reason I was doing that is to fix the case of >bdf_prm.ndirty
tasks all waiting on bdflush at the same time (which will break write
throttling even now in 2.2.x and in current 2.4.x). That's an unlucky
condition very similar to the one in GFP that is fixed correctly in
2.2.19pre2 by putting pages in a per-process freelist during memory
balancing.

Andrea
Re: scheduling problem?
On Tue, 2 Jan 2001, Mike Galbraith wrote:
>
> Yes and no. I've seen nasty stalls for quite a while now. (I think
> that there is a wakeup problem lurking)
>
> I found the change which triggers my horrid stalls. Nobody is going
> to believe this...

Hmm.. I can believe it. The code that waits on bdflush in
wakeup_bdflush() is somewhat suspicious. In particular, if/when that
ever triggers, and bdflush() is busy in flush_dirty_buffers(), then the
process that is trying to wake bdflush up is going to wait until
flush_dirty_buffers() is done.

Which, if there is a process dirtying pages, can basically be
pretty much forever.

This was probably hidden by the lower limits simply by virtue of bdflush
never being very active before.

What does the system feel like if you just change the "sleep for bdflush"
logic in wakeup_bdflush() to something like

	wake_up_process(bdflush_tsk);
	__set_current_state(TASK_RUNNING);
	current->policy |= SCHED_YIELD;
	schedule();

instead of trying to wait for bdflush to wake us up?

		Linus
Re: scheduling problem?
On Wed, 3 Jan 2001, Anton Blanchard wrote:
>
> Hi Mike,
>
> > I am seeing (what I believe is;) severe process CPU starvation in
> > 2.4.0-prerelease.  At first, I attributed it to semaphore troubles
> > as when I enable semaphore deadlock detection in IKD and set it to
> > 5 seconds, it triggers 100% of the time on nscd when I do sequential
> > I/O (iozone eg).  In the meantime, I've done a slew of tracing, and
> > I think the holder of the semaphore I'm timing out on just flat isn't
> > being scheduled so it can release it.  In the usual case of nscd, I
> > _think_ it's another nscd holding the semaphore.  In no trace can I
> > go back far enough to catch the taker of the semaphore or any user
> > task other than iozone running between __down() time and timeout 5
> > seconds later.  (trace buffer covers ~8 seconds of kernel time)
>
> Did this just appear in recent kernels? Maybe bdflush was hiding the
> situation in earlier kernels as it would cause io hogs to block when
> things got only mildly interesting.

Yes and no.  I've seen nasty stalls for quite a while now.  (I think
that there is a wakeup problem lurking)

I found the change which triggers my horrid stalls.  Nobody is going
to believe this...

diff -urN linux-2.4.0-test13-pre6/fs/buffer.c linux-2.4.0-test13-pre7/fs/buffer.c
--- linux-2.4.0-test13-pre6/fs/buffer.c	Sat Dec 30 08:58:56 2000
+++ linux-2.4.0-test13-pre7/fs/buffer.c	Sun Dec 31 06:22:31 2000
@@ -122,16 +122,17 @@
 		   when trying to refill buffers. */
 	int interval;     /* jiffies delay between kupdate flushes */
 	int age_buffer;   /* Time for normal buffer to age before we flush it */
-	int dummy1;       /* unused, was age_super */
+	int nfract_sync;  /* Percentage of buffer cache dirty to
+			     activate bdflush synchronously */
 	int dummy2;       /* unused */
 	int dummy3;       /* unused */
 } b_un;
 unsigned int data[N_PARAM];
-} bdf_prm = {{40, 500, 64, 256, 5*HZ, 30*HZ, 5*HZ, 1884, 2}};
+} bdf_prm = {{40, 500, 64, 256, 5*HZ, 30*HZ, 80, 0, 0}};
 
 /* These are the min and max parameter values that we will allow to be assigned */
-int bdflush_min[N_PARAM] = {  0,    10,     5,    25,      0,    1*HZ,    1*HZ,    1, 1};
-int bdflush_max[N_PARAM] = {100, 50000, 20000, 20000, 600*HZ, 6000*HZ, 6000*HZ, 2047, 5};
+int bdflush_min[N_PARAM] = {  0,    10,     5,    25,      0,    1*HZ,   0, 0, 0};
+int bdflush_max[N_PARAM] = {100, 50000, 20000, 20000, 600*HZ, 6000*HZ, 100, 0, 0};
 
 /*
  * Rewrote the wait-routines to use the "new" wait-queue functionality,
@@ -1032,9 +1034,9 @@
 	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
 	tot = nr_free_buffer_pages();
 
-	dirty *= 200;
+	dirty *= 100;
 	soft_dirty_limit = tot * bdf_prm.b_un.nfract;
-	hard_dirty_limit = soft_dirty_limit * 2;
+	hard_dirty_limit = tot * bdf_prm.b_un.nfract_sync;
 
 	/* First, check for the "real" dirty limit. */
 	if (dirty > soft_dirty_limit) {

...but reversing this cures my semaphore timeouts.  Don't say
impossible :)  I didn't believe it either until I retested several
times.  I wager that if I just fiddle with parameters I'll be able to
make the problem come and go at will.  (means the real problem is
gonna be a weird one:)

	-Mike
Re: scheduling problem?
Hi Mike,

> I am seeing (what I believe is;) severe process CPU starvation in
> 2.4.0-prerelease.  At first, I attributed it to semaphore troubles
> as when I enable semaphore deadlock detection in IKD and set it to
> 5 seconds, it triggers 100% of the time on nscd when I do sequential
> I/O (iozone eg).  In the meantime, I've done a slew of tracing, and
> I think the holder of the semaphore I'm timing out on just flat isn't
> being scheduled so it can release it.  In the usual case of nscd, I
> _think_ it's another nscd holding the semaphore.  In no trace can I
> go back far enough to catch the taker of the semaphore or any user
> task other than iozone running between __down() time and timeout 5
> seconds later.  (trace buffer covers ~8 seconds of kernel time)

Did this just appear in recent kernels? Maybe bdflush was hiding the
situation in earlier kernels as it would cause io hogs to block when
things got only mildly interesting.

You might be able to get some useful information with ps axl and
checking the WCHAN value.  Of course it won't be possible if, like
nscd, you can't get ps to schedule :)

Anton
Re: scheduling problem test13-pre7
On Sun, 31 Dec 2000, Matti Aarnio wrote:

> On Sun, Dec 31, 2000 at 10:42:26AM +0100, Mike Galbraith wrote:
> > Hi,
> >
> > While running iozone, I notice severe stalls of vmstat output
> > despite vmstat running SCHED_RR and mlockall().
>
>   Lets eliminate the obvious:
>
>   - Are you running with IDE disk ?

Yes.

>   - Does  hdparm /dev/hda  (whatever) report:
>
>     /dev/hda:
>      unmaskirq    =  0 (off)
>      using_dma    =  0 (off)

No.

/dev/hda:
 multcount    =  0 (off)
 I/O support  =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  1 (on)
 keepsettings =  1 (on)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 2482/255/63, sectors = 39876480, start = 0

>   The IKD uses local interrupts, so this isn't necessarily true...

I just did a (mondo) trace covering 8 seconds of kernel time, and
vmstat ran twice.  (Those two times were before I noticed the stall
and started counting down toward 'poke the freeze-frame button')

	-Mike
Re: scheduling problem test13-pre7
On Sun, Dec 31, 2000 at 10:42:26AM +0100, Mike Galbraith wrote:
> Hi,
>
> While running iozone, I notice severe stalls of vmstat output
> despite vmstat running SCHED_RR and mlockall().

  Lets eliminate the obvious:

  - Are you running with IDE disk ?
  - Does  hdparm /dev/hda  (whatever) report:

    /dev/hda:
     unmaskirq    =  0 (off)
     using_dma    =  0 (off)

  The IKD uses local interrupts, so this isn't necessarily true...