Re: scheduling problem?
Mike Galbraith wrote:
> On Wed, 3 Jan 2001, Daniel Phillips wrote:
> > Mike Galbraith wrote:
> > > Semaphore timed out during boot, leaving bdflush as zombie.
> >
> > Wait a sec, what do you mean by 'semaphore timed out'? These should
> > wait patiently forever.
>
> IKD has a semaphore deadlock detector.

That was my tentative conclusion.

> Any place you take a semaphore
> and have to wait longer than 5 seconds (what I had it set to because
> with trace buffer set to 300 entries, it can only cover ~8 seconds
> of disk [slowest] load), it triggers and freezes the trace buffer for
> later use. It firing under load may not be of interest. (but it firing
> looks to be very closely coupled to observed stalls with virgin source.
> Linus fixes big stall and deadlock detector mostly shuts up. I fix a
> smaller stall and it shuts up entirely.. for this workload)

But it's entirely legal for a semaphore to wait forever when used in the
way I've used them, a producer/consumer pattern. You should be able to
run happily (at least as happily as before) with the watchdog disabled.

This raises the question of what to do about the 99.99% of cases where
the watchdog is a good thing to have. Shouldn't the watchdog just log
the 'suspicious' event and continue?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: scheduling problem?
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Mike Galbraith wrote:
> > Semaphore timed out during boot, leaving bdflush as zombie.
>
> Wait a sec, what do you mean by 'semaphore timed out'? These should
> wait patiently forever.

IKD has a semaphore deadlock detector. Any place you take a semaphore
and have to wait longer than 5 seconds (what I had it set to because
with trace buffer set to 300 entries, it can only cover ~8 seconds
of disk [slowest] load), it triggers and freezes the trace buffer for
later use. It firing under load may not be of interest. (but it firing
looks to be very closely coupled to observed stalls with virgin source.
Linus fixes big stall and deadlock detector mostly shuts up. I fix a
smaller stall and it shuts up entirely.. for this workload)

	-Mike
Re: scheduling problem?
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Mike Galbraith wrote:
> >
> > On Wed, 3 Jan 2001, Daniel Phillips wrote:
> >
> > > Could you try this patch just to see what happens? It uses semaphores
> > > for the bdflush synchronization instead of banging directly on the task
> > > wait queues. It's supposed to be a drop-in replacement for the bdflush
> > > wakeup/waitfor mechanism, but who knows, it may have subtly different
> > > behaviour in your case.
> >
> > Semaphore timed out during boot, leaving bdflush as zombie.
>
> Hmm, how could that happen? I'm booted and running with that patch
> right now and have beaten on it extensively - it sounds like something
> else is broken. Or maybe we've already established that - let me read
> the thread again.
>
> Which semaphore timed out, bdflush_request or bdflush_waiter?

I didn't watch closely (running virgin prerelease). I can run it again
if you think it's important.

	-Mike
Re: scheduling problem?
Mike Galbraith wrote:
> Semaphore timed out during boot, leaving bdflush as zombie.

Wait a sec, what do you mean by 'semaphore timed out'? These should
wait patiently forever.

--
Daniel
Re: scheduling problem?
Mike Galbraith wrote:
>
> On Wed, 3 Jan 2001, Daniel Phillips wrote:
>
> > Could you try this patch just to see what happens? It uses semaphores
> > for the bdflush synchronization instead of banging directly on the task
> > wait queues. It's supposed to be a drop-in replacement for the bdflush
> > wakeup/waitfor mechanism, but who knows, it may have subtly different
> > behaviour in your case.
>
> Semaphore timed out during boot, leaving bdflush as zombie.

Hmm, how could that happen? I'm booted and running with that patch
right now and have beaten on it extensively - it sounds like something
else is broken. Or maybe we've already established that - let me read
the thread again.

Which semaphore timed out, bdflush_request or bdflush_waiter?

--
Daniel
Re: scheduling problem?
On Wed, 3 Jan 2001, Mike Galbraith wrote:
> Feel is _vastly_ improved.

Except while beating on it, I found a way to turn it into a brick. If I
run Christoph Rohland's swptst proggy, interactivity disappears to the
point that logging in while it is running is impossible. ~15 minutes
later I got 'login timed out after 30 seconds'. ~10 minutes after that,
the prompt came back.

	-Mike

on other vt..
./swptst 1 4800 4 12 100

Script started on Wed Jan 3 11:16:46 2001
[root]:# schedctl -R vmstat 1
[vmstat 1 output: the column alignment was destroyed in the archive. The
recoverable trend: the box starts nearly idle, then spends the run with
several processes blocked (b column) and sustained heavy swap-in/swap-out
(si/so), and returns to idle at the end.]
[root]:# exit
exit
Script done on Wed Jan 3 12:00:31 2001
Re: scheduling problem?
On Tue, 2 Jan 2001, Linus Torvalds wrote:
> On Wed, 3 Jan 2001, Mike Galbraith wrote:
> >
> > No difference (except more context switching as expected)
>
> What about the current prerelease patch in testing? It doesn't switch to
> bdflush at all, but instead does the buffer cleaning by hand.

99% gone. The remaining 1% is refill_freelist(). If I use
flush_dirty_buffers() there instead of waiting, I have no more semaphore
timeouts (so far.. not thoroughly pounded upon). Without that change, I
still take hits. (in my tinker tree, I usually make a 'small flush' mode
for flush_dirty_buffers() to do that)

Feel is _vastly_ improved.

	-Mike
Re: scheduling problem?
On Wed, 3 Jan 2001, Mike Galbraith wrote:
>
> No difference (except more context switching as expected)

What about the current prerelease patch in testing? It doesn't switch to
bdflush at all, but instead does the buffer cleaning by hand.

		Linus
Re: scheduling problem?
On Wed, 3 Jan 2001, Roger Larsson wrote:
> Hi,
>
> I have played around with this code previously.
> This is my current understanding.
> [yield problem?]

Hmm.. this ~could be. I once dove into the VM waters (me=stone) and
changed __alloc_pages() to only yield instead of scheduling. The results
(along with many other strange changes) were.. weirdest feeling kernel I
ever ran. Damn fast, but very very weird ;-)

> Possible (in -prerelease) untested possibilities.
>
> * Be tougher when yielding.
>
> 	wakeup_kswapd(0);
> 	if (gfp_mask & __GFP_WAIT) {
> 		__set_current_state(TASK_RUNNING);
> 		current->policy |= SCHED_YIELD;
> +		current->counter--; /* be faster to let kswapd run */
> or
> +		current->counter = 0; /* too fast? [not tested] */
> 		schedule();
> 	}

That looks a lot like cheating.

> * Move wakeup of bdflush to kswapd. Somewhere after
>   'do_try_to_free_pages(..)' has been run. Before going to sleep...
>   [a variant tested with mixed results - this is likely a better one]

I also did some things along this line.. also with mixed results. :)

The changes I've done that I actually like best kill bdflush graveyard
dead. Did that twice and didn't miss it at all. (next time, I think
I'll erect a headstone)

	-Mike
Re: scheduling problem?
On Tue, 2 Jan 2001, Linus Torvalds wrote:
> On Tue, 2 Jan 2001, Mike Galbraith wrote:
> >
> > Yes and no. I've seen nasty stalls for quite a while now. (I think
> > that there is a wakeup problem lurking)
> >
> > I found the change which triggers my horrid stalls. Nobody is going
> > to believe this...
>
> Hmm.. I can believe it. The code that waits on bdflush in wakeup_bdflush()
> is somewhat suspicious. In particular, if/when that ever triggers, and
> bdflush() is busy in flush_dirty_buffers(), then the process that is
> trying to wake bdflush up is going to wait until flush_dirty_buffers() is
> done.
>
> Which, if there is a process dirtying pages, can basically be
> pretty much forever.
>
> This was probably hidden by the lower limits simply by virtue of bdflush
> never being very active before.
>
> What does the system feel like if you just change the "sleep for bdflush"
> logic in wakeup_bdflush() to something like
>
> 	wake_up_process(bdflush_tsk);
> 	__set_current_state(TASK_RUNNING);
> 	current->policy |= SCHED_YIELD;
> 	schedule();
>
> instead of trying to wait for bdflush to wake us up?

No difference (except more context switching as expected)

	-Mike
Re: scheduling problem?
On Wed, 3 Jan 2001, Daniel Phillips wrote:
> Could you try this patch just to see what happens? It uses semaphores
> for the bdflush synchronization instead of banging directly on the task
> wait queues. It's supposed to be a drop-in replacement for the bdflush
> wakeup/waitfor mechanism, but who knows, it may have subtly different
> behaviour in your case.

Semaphore timed out during boot, leaving bdflush as zombie.

	-Mike
Re: scheduling problem?
Hi,

I have played around with this code previously.
This is my current understanding. [yield problem?]

On Tuesday 02 January 2001 09:27, Mike Galbraith wrote:
> Hi,
>
> I am seeing (what I believe is;) severe process CPU starvation in
> 2.4.0-prerelease. At first, I attributed it to semaphore troubles
> as when I enable semaphore deadlock detection in IKD and set it to
> 5 seconds, it triggers 100% of the time on nscd when I do sequential
> I/O (iozone eg). In the meantime, I've done a slew of tracing, and
> I think the holder of the semaphore I'm timing out on just flat isn't
> being scheduled so it can release it. In the usual case of nscd, I
> _think_ it's another nscd holding the semaphore. In no trace can I
> go back far enough to catch the taker of the semaphore or any user
> task other than iozone running between __down() time and timeout 5
> seconds later. (trace buffer covers ~8 seconds of kernel time)
>
> I think the snippet below captures the gist of the problem.
>
> c012f32e nr_free_pages + (0.16) pid(256)
> c012f37a nr_inactive_clean_pages + (0.22) pid(256)

wakeup_bdflush (from beginning of __alloc_pages; page_alloc.c:324)

> c01377f2 wakeup_bdflush +<12/a0> (0.14) pid(256)
> c011620a wake_up_process + (0.29) pid(256)
> c012eea4 __alloc_pages_limit +<10/b8> (0.28) pid(256)
> c012eea4 __alloc_pages_limit +<10/b8> (0.30) pid(256)

Two __alloc_pages_limit.

wakeup_kswapd(0) (from page_alloc.c:392)

> c012e3fa wakeup_kswapd +<12/d4> (0.25) pid(256)
> c0115613 __wake_up +<13/130> (0.41) pid(256)

schedule() (from page_alloc.c:396)

> c011527b schedule +<13/398> (0.66) pid(256->6)
> c01077db __switch_to +<13/d0> (0.70) pid(6)

bdflush is running!!!

> c01893c6 generic_unplug_device + (0.25) pid(6)

bdflush is ready. (but how likely is it that it will run for long enough
to get hit by a tick, i.e. current->counter--? unless it is, it will
continue to be preferred to kswapd, and since only one process is
yielded...)

> c011527b schedule +<13/398> (0.50) pid(6->256)
> c01077db __switch_to +<13/d0> (0.29) pid(256)

Back to the client, not the additionally runnable kswapd... Why not -
nothing remaining of the timeslice. Note that the yield only yields one
process, not all in the runqueue - IMHO. [is this intended?]

3rd __alloc_pages_limit - this time the direct_reclaim tests are fulfilled:

> c012eea4 __alloc_pages_limit +<10/b8> (0.22) pid(256)
> c012d267 reclaim_page +<13/408> (0.54) pid(256)

Possible (in -prerelease) untested possibilities.

* Be tougher when yielding.

	wakeup_kswapd(0);
	if (gfp_mask & __GFP_WAIT) {
		__set_current_state(TASK_RUNNING);
		current->policy |= SCHED_YIELD;
+		current->counter--; /* be faster to let kswapd run */
or
+		current->counter = 0; /* too fast? [not tested] */
		schedule();
	}

  Might be too tough on the client not doing any actual work... think
  dbench...

* Be tougher on bdflush, decrement its counter now and then...
  [naive, not tested]

* Move wakeup of bdflush to kswapd. Somewhere after
  'do_try_to_free_pages(..)' has been run. Before going to sleep...
  [a variant tested with mixed results - this is likely a better one]

	/*
	 * We go to sleep if either the free page shortage
	 * or the inactive page shortage is gone. We do this
	 * because:
	 * 1) we need no more free pages or
	 * 2) the inactive pages need to be flushed to disk,
	 *    it wouldn't help to eat CPU time now ...
	 *
	 * We go to sleep for one second, but if it's needed
	 * we'll be woken up earlier...
	 */
	if (!free_shortage() || !inactive_shortage()) {
		/*
		 * If we are about to get low on free pages and cleaning
		 * the inactive_dirty pages would fix the situation,
		 * wake up bdflush.
		 */
		if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
				&& nr_inactive_dirty_pages >= freepages.high)
			wakeup_bdflush(0);

		interruptible_sleep_on_timeout(&kswapd_wait, HZ);
	}

--
Home page: http://www.norran.net/nra02596/
Re: scheduling problem?
Mike Galbraith wrote:
>
> On Wed, 3 Jan 2001, Anton Blanchard wrote:
>
> > > I am seeing (what I believe is;) severe process CPU starvation in
> > > 2.4.0-prerelease. At first, I attributed it to semaphore troubles
> > > as when I enable semaphore deadlock detection in IKD and set it to
> > > 5 seconds, it triggers 100% of the time on nscd when I do sequential
> > > I/O (iozone eg). In the meantime, I've done a slew of tracing, and
> > > I think the holder of the semaphore I'm timing out on just flat isn't
> > > being scheduled so it can release it. In the usual case of nscd, I
> > > _think_ it's another nscd holding the semaphore. In no trace can I
> > > go back far enough to catch the taker of the semaphore or any user
> > > task other than iozone running between __down() time and timeout 5
> > > seconds later. (trace buffer covers ~8 seconds of kernel time)
> >
> > Did this just appear in recent kernels? Maybe bdflush was hiding the
> > situation in earlier kernels as it would cause io hogs to block when
> > things got only mildly interesting.
>
> Yes and no. I've seen nasty stalls for quite a while now. (I think
> that there is a wakeup problem lurking)

Could you try this patch just to see what happens? It uses semaphores
for the bdflush synchronization instead of banging directly on the task
wait queues. It's supposed to be a drop-in replacement for the bdflush
wakeup/waitfor mechanism, but who knows, it may have subtly different
behaviour in your case.

--- 2.4.0.clean/fs/buffer.c	Sat Dec 30 20:19:13 2000
+++ 2.4.0/fs/buffer.c	Tue Jan 2 23:05:14 2001
@@ -2528,33 +2528,28 @@
  * response to dirty buffers. Once this process is activated, we write back
  * a limited number of buffers to the disks and then go back to sleep again.
  */
-static DECLARE_WAIT_QUEUE_HEAD(bdflush_done);
+
+/* Semaphore wakeups, Daniel Phillips, [EMAIL PROTECTED], 2000/12 */
+
 struct task_struct *bdflush_tsk = 0;
+DECLARE_MUTEX_LOCKED(bdflush_request);
+DECLARE_MUTEX_LOCKED(bdflush_waiter);
+atomic_t bdflush_waiters /*= 0*/;
 
 void wakeup_bdflush(int block)
 {
-	DECLARE_WAITQUEUE(wait, current);
-
 	if (current == bdflush_tsk)
 		return;
 
-	if (!block) {
-		wake_up_process(bdflush_tsk);
+	if (!block)
+	{
+		up(&bdflush_request);
 		return;
 	}
 
-	/* bdflush can wakeup us before we have a chance to
-	   go to sleep so we must be smart in handling
-	   this wakeup event from bdflush to avoid deadlocking in SMP
-	   (we are not holding any lock anymore in these two paths). */
-	__set_current_state(TASK_UNINTERRUPTIBLE);
-	add_wait_queue(&bdflush_done, &wait);
-
-	wake_up_process(bdflush_tsk);
-	schedule();
-
-	remove_wait_queue(&bdflush_done, &wait);
-	__set_current_state(TASK_RUNNING);
+	atomic_inc(&bdflush_waiters);
+	up(&bdflush_request);
+	down(&bdflush_waiter);
 }
 
 /* This is the _only_ function that deals with flushing async writes
@@ -2699,7 +2694,7 @@
 int bdflush(void *sem)
 {
 	struct task_struct *tsk = current;
-	int flushed;
+	int flushed, waiters;
 	/*
 	 * We have a bare-bones task_struct, and really should fill
 	 * in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2727,28 +2722,16 @@
 		if (free_shortage())
 			flushed += page_launder(GFP_BUFFER, 0);
 
-		/* If wakeup_bdflush will wakeup us
-		   after our bdflush_done wakeup, then
-		   we must make sure to not sleep
-		   in schedule_timeout otherwise
-		   wakeup_bdflush may wait for our
-		   bdflush_done wakeup that would never arrive
-		   (as we would be sleeping) and so it would
-		   deadlock in SMP. */
-		__set_current_state(TASK_INTERRUPTIBLE);
-		wake_up_all(&bdflush_done);
-		/*
-		 * If there are still a lot of dirty buffers around,
-		 * skip the sleep and flush some more. Otherwise, we
-		 * go to sleep waiting a wakeup.
-		 */
-		if (!flushed || balance_dirty_state(NODEV) < 0) {
+		waiters = atomic_read(&bdflush_waiters);
+		atomic_sub(waiters, &bdflush_waiters);
+		while (waiters--)
+			up(&bdflush_waiter);
+
+		if (!flushed || balance_dirty_state(NODEV) < 0)
+		{
 			run_task_queue(&tq_disk);
-			schedule();
+			down(&bdflush_request);
 		}
-		/* Remember to mark us as running otherwise
-		   the next schedule will block. */
-		__set_current_state(TASK_RUNNING);
 	}
 }
Re: scheduling problem?
On Tue, 2 Jan 2001, Linus Torvalds wrote:
>
> Right now, the automatic balancing only hurts. The stuff that hasn't been
> converted is probably worse off doing balancing when they don't want to,
> than we would be to leave the balancing altogether.
>
> Which is why I don't like it.

Actually, there is right now another problem with the synchronous
waiting, which is completely different: because bdflush can be waited on
synchronously by various entities that hold various IO locks, bdflush
itself cannot do certain kinds of IO at all. In particular, it has to
use GFP_BUFFER when it calls down to page_launder(), because it cannot
afford to write out dirty pages which might deadlock on the locks that
are held by people waiting for bdflush..

The deadlock issue is the one I dislike the most: bdflush being
synchronously waited on is fundamentally always going to cripple it. In
comparison, the automatic rebalancing is just a latency issue (but the
automatic balancing _is_ the thing that brings on the fact that we call
rebalance with locks held, so they are certainly related).

		Linus
Re: scheduling problem?
On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
>
> > NOTE! I think that throttling writers is fine and good, but as it stands
> > now, the dirty buffer balancing will throttle anybody, not just the
> > writer. That's partly because of the 2.4.x mis-feature of doing the
>
> How can it throttle everybody and not only the writers? _Only_ the
> writers call balance_dirty.

A lot of people call mark_buffer_dirty() on one or two buffers. Things
like file creation etc. Think about inode bitmap blocks that are marked
dirty with the superblock held.. Ugh.

> The right way to avoid blocking with locks held is to replace
> mark_buffer_dirty with __mark_buffer_dirty() and to call balance_dirty()
> later when the locks are released.

The point being that because _everybody_ should do this, we shouldn't
have the "mark_buffer_dirty()" that we have. There are no really valid
uses of the automatic rebalancing: either we're writing meta-data (which
definitely should balance on its own _after_ the fact), or we're writing
normal data (which already _does_ balance after the fact).

Right now, the automatic balancing only hurts. The stuff that hasn't
been converted is probably worse off doing balancing when they don't
want to, than we would be to leave the balancing altogether.

Which is why I don't like it.

		Linus
Re: scheduling problem?
On Tue, Jan 02, 2001 at 01:02:30PM -0800, Linus Torvalds wrote:
> On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
> > On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> > > What does the system feel like if you just change the "sleep for bdflush"
> > > logic in wakeup_bdflush() to something like
> > >
> > > 	wake_up_process(bdflush_tsk);
> > > 	__set_current_state(TASK_RUNNING);
> > > 	current->policy |= SCHED_YIELD;
> > > 	schedule();
> > >
> > > instead of trying to wait for bdflush to wake us up?
> >
> > My bet is a `VM: killing' message.
>
> Maybe in 2.2.x, yes.
>
> > Waiting for the bdflush back-wakeup is mandatory to do write throttling
> > correctly. The above will break write throttling at least, unless
> > something fundamental has changed recently, and that doesn't seem to be
> > the case.
>
> page_launder() should wait for the dirty pages, and that's not something
> 2.2.x ever did.

In late 2.2.x we have sync_page_buffers too, but I'm not sure how well
it behaves when the whole MM is constantly kept totally dirty and we
don't have swap. In fact also the 2.4.x implementation:

	static void sync_page_buffers(struct buffer_head *bh, int wait)
	{
		struct buffer_head * tmp = bh;

		do {
			struct buffer_head *p = tmp;
			tmp = tmp->b_this_page;
			if (buffer_locked(p)) {
				if (wait > 1)
					__wait_on_buffer(p);
			} else if (buffer_dirty(p))
				ll_rw_block(WRITE, 1, &p);
		} while (tmp != bh);
	}

won't cope with the memory totally dirty. It will move the buffers from
dirty to locked, then it will wait for I/O completion at the second
pass, but it won't try again to free the page a third time (when the
page is finally freeable):

	if (wait) {
		sync_page_buffers(bh, wait);

		/* We waited synchronously, so we can free the buffers. */
		if (wait > 1 && !loop) {
			loop = 1;
			goto cleaned_buffers_try_again;
		}

Probably not a big deal. The real point is that even if
try_to_free_buffers deals perfectly with the VM totally dirty, we'll end
up waiting for I/O completion in the wrong place. setiathome will end up
waiting for I/O completion instead of `cp`. It's not setiathome but `cp`
that should do the write throttling. And `cp` will block again very soon
even if setiathome blocks too.

The whole point is that the write throttling must happen in
balance_dirty(), _not_ in sync_page_buffers(). In fact from 2.2.19pre2
there's a wait_io per-bh bitflag that remembers when a dirty bh is very
old and doesn't get flushed away automatically (by either kupdate or
kflushd). So we don't block in sync_page_buffers until it's necessary,
to avoid hurting non-IO apps while I/O is going on.

> NOTE! I think that throttling writers is fine and good, but as it stands
> now, the dirty buffer balancing will throttle anybody, not just the
> writer. That's partly because of the 2.4.x mis-feature of doing the

How can it throttle everybody and not only the writers? _Only_ the
writers call balance_dirty.

> balance_dirty call even for previously dirty buffers (fixed in my tree,
> btw).

Yes, I've seen; people overwriting dirty data were blocking too, which
was not necessary, but they were still writers.

> It's _really_ bad to wait for bdflush to finish if we hold on to things
> like the superblock lock - which _does_ happen right now. That's why I'm
> pretty convinced that we should NOT blindly do the dirty balance in
> "mark_buffer_dirty()", but instead at more well-defined points (in places
> like "generic_file_write()", for example).

The right way to avoid blocking with locks held is to replace
mark_buffer_dirty with __mark_buffer_dirty() and to call balance_dirty()
later when the locks are released. That's why it's exported to modules.
Everybody has always been allowed to optimize away the
mark_buffer_dirty(); it's just that nobody has done it yet. I think it's
useful to keep providing an interface that does the write throttling
automatically.

Andrea
Re: scheduling problem?
On Tue, 2 Jan 2001, Andrea Arcangeli wrote:
> On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> > What does the system feel like if you just change the "sleep for bdflush"
> > logic in wakeup_bdflush() to something like
> >
> > 	wake_up_process(bdflush_tsk);
> > 	__set_current_state(TASK_RUNNING);
> > 	current->policy |= SCHED_YIELD;
> > 	schedule();
> >
> > instead of trying to wait for bdflush to wake us up?
>
> My bet is a `VM: killing' message.

Maybe in 2.2.x, yes.

> Waiting for the bdflush back-wakeup is mandatory to do write throttling
> correctly. The above will break write throttling at least, unless
> something fundamental has changed recently, and that doesn't seem to be
> the case.

page_launder() should wait for the dirty pages, and that's not something
2.2.x ever did. This way, the issue of dirty data in the VM is handled
by the VM pressure, not by trying to artificially throttle writers.

NOTE! I think that throttling writers is fine and good, but as it stands
now, the dirty buffer balancing will throttle anybody, not just the
writer. That's partly because of the 2.4.x mis-feature of doing the
balance_dirty call even for previously dirty buffers (fixed in my tree,
btw).

It's _really_ bad to wait for bdflush to finish if we hold on to things
like the superblock lock - which _does_ happen right now. That's why I'm
pretty convinced that we should NOT blindly do the dirty balance in
"mark_buffer_dirty()", but instead at more well-defined points (in
places like "generic_file_write()", for example).

		Linus
Re: scheduling problem?
On Tue, Jan 02, 2001 at 11:02:41AM -0800, Linus Torvalds wrote:
> What does the system feel like if you just change the "sleep for bdflush"
> logic in wakeup_bdflush() to something like
>
> 	wake_up_process(bdflush_tsk);
> 	__set_current_state(TASK_RUNNING);
> 	current->policy |= SCHED_YIELD;
> 	schedule();
>
> instead of trying to wait for bdflush to wake us up?

My bet is a `VM: killing' message.

Waiting for the bdflush back-wakeup is mandatory to do write throttling
correctly. The above will break write throttling at least, unless
something fundamental has changed recently, and that doesn't seem to be
the case.

What I'd like to do there is make bdflush the same thing that kswapd
_should_ be for memory pressure (I say "should" because it seems that's
not the case anymore in 2.4.x, from some email I read recently; I
haven't checked that myself yet). I implemented that at some point in my
private local tree. I mean: bdflush only does the async writeouts, and
the task context calls something like flush_dirty_buffers itself.

The main reason I was doing that is to fix the case of >bdf_prm.ndirty
tasks all waiting on bdflush at the same time (which will break write
throttling even now in 2.2.x and in current 2.4.x). That's an unlucky
condition very similar to the one in GFP that is fixed correctly in
2.2.19pre2 by putting pages in a per-process freelist during memory
balancing.

Andrea
Re: scheduling problem?
On Tue, 2 Jan 2001, Mike Galbraith wrote:
>
> Yes and no. I've seen nasty stalls for quite a while now. (I think
> that there is a wakeup problem lurking)
>
> I found the change which triggers my horrid stalls. Nobody is going
> to believe this...

Hmm.. I can believe it. The code that waits on bdflush in
wakeup_bdflush() is somewhat suspicious. In particular, if/when that
ever triggers, and bdflush() is busy in flush_dirty_buffers(), then the
process that is trying to wake bdflush up is going to wait until
flush_dirty_buffers() is done.

Which, if there is a process dirtying pages, can basically be
pretty much forever.

This was probably hidden by the lower limits simply by virtue of bdflush
never being very active before.

What does the system feel like if you just change the "sleep for bdflush"
logic in wakeup_bdflush() to something like

	wake_up_process(bdflush_tsk);
	__set_current_state(TASK_RUNNING);
	current->policy |= SCHED_YIELD;
	schedule();

instead of trying to wait for bdflush to wake us up?

		Linus
Re: scheduling problem?
On Wed, 3 Jan 2001, Anton Blanchard wrote:
>
> Hi Mike,
>
> > I am seeing (what I believe is;) severe process CPU starvation in
> > 2.4.0-prerelease.  At first, I attributed it to semaphore troubles
> > as when I enable semaphore deadlock detection in IKD and set it to
> > 5 seconds, it triggers 100% of the time on nscd when I do sequential
> > I/O (iozone eg).  In the meantime, I've done a slew of tracing, and
> > I think the holder of the semaphore I'm timing out on just flat isn't
> > being scheduled so it can release it.  In the usual case of nscd, I
> > _think_ it's another nscd holding the semaphore.  In no trace can I
> > go back far enough to catch the taker of the semaphore or any user
> > task other than iozone running between __down() time and timeout 5
> > seconds later.  (trace buffer covers ~8 seconds of kernel time)
>
> Did this just appear in recent kernels? Maybe bdflush was hiding the
> situation in earlier kernels as it would cause io hogs to block when
> things got only mildly interesting.

Yes and no.  I've seen nasty stalls for quite a while now.  (I think
that there is a wakeup problem lurking)

I found the change which triggers my horrid stalls.  Nobody is going
to believe this...

diff -urN linux-2.4.0-test13-pre6/fs/buffer.c linux-2.4.0-test13-pre7/fs/buffer.c
--- linux-2.4.0-test13-pre6/fs/buffer.c	Sat Dec 30 08:58:56 2000
+++ linux-2.4.0-test13-pre7/fs/buffer.c	Sun Dec 31 06:22:31 2000
@@ -122,16 +122,17 @@
 		   when trying to refill buffers. */
 	int interval;     /* jiffies delay between kupdate flushes */
 	int age_buffer;   /* Time for normal buffer to age before we flush it */
-	int dummy1;       /* unused, was age_super */
+	int nfract_sync;  /* Percentage of buffer cache dirty to
+			     activate bdflush synchronously */
 	int dummy2;       /* unused */
 	int dummy3;       /* unused */
 } b_un;
 unsigned int data[N_PARAM];
-} bdf_prm = {{40, 500, 64, 256, 5*HZ, 30*HZ, 5*HZ, 1884, 2}};
+} bdf_prm = {{40, 500, 64, 256, 5*HZ, 30*HZ, 80, 0, 0}};
 
 /* These are the min and max parameter values that we will allow to be assigned */
-int bdflush_min[N_PARAM] = {  0,    10,     5,    25,      0,    1*HZ,    1*HZ,    1, 1};
-int bdflush_max[N_PARAM] = {100, 50000, 20000, 20000, 600*HZ, 6000*HZ, 6000*HZ, 2047, 5};
+int bdflush_min[N_PARAM] = {  0,    10,     5,    25,      0,    1*HZ,   0, 0, 0};
+int bdflush_max[N_PARAM] = {100, 50000, 20000, 20000, 600*HZ, 6000*HZ, 100, 0, 0};
 
 /*
  * Rewrote the wait-routines to use the "new" wait-queue functionality,
@@ -1032,9 +1034,9 @@
 	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
 	tot = nr_free_buffer_pages();
 
-	dirty *= 200;
+	dirty *= 100;
 	soft_dirty_limit = tot * bdf_prm.b_un.nfract;
-	hard_dirty_limit = soft_dirty_limit * 2;
+	hard_dirty_limit = tot * bdf_prm.b_un.nfract_sync;
 
 	/* First, check for the "real" dirty limit. */
 	if (dirty > soft_dirty_limit) {

...but reversing this cures my semaphore timeouts.  Don't say
impossible :)  I didn't believe it either until I retested several
times.  I wager that if I just fiddle with parameters I'll be able to
make the problem come and go at will.  (means the real problem is
gonna be a weird one:)

	-Mike
Re: scheduling problem?
Hi Mike,

> I am seeing (what I believe is;) severe process CPU starvation in
> 2.4.0-prerelease.  At first, I attributed it to semaphore troubles
> as when I enable semaphore deadlock detection in IKD and set it to
> 5 seconds, it triggers 100% of the time on nscd when I do sequential
> I/O (iozone eg).  In the meantime, I've done a slew of tracing, and
> I think the holder of the semaphore I'm timing out on just flat isn't
> being scheduled so it can release it.  In the usual case of nscd, I
> _think_ it's another nscd holding the semaphore.  In no trace can I
> go back far enough to catch the taker of the semaphore or any user
> task other than iozone running between __down() time and timeout 5
> seconds later.  (trace buffer covers ~8 seconds of kernel time)

Did this just appear in recent kernels? Maybe bdflush was hiding the
situation in earlier kernels as it would cause io hogs to block when
things got only mildly interesting.

You might be able to get some useful information with ps axl and
checking the WCHAN value.  Of course it won't be possible if, like
nscd, you can't get ps to schedule :)

Anton
Re: scheduling problem test13-pre7
On Sun, 31 Dec 2000, Matti Aarnio wrote:

> On Sun, Dec 31, 2000 at 10:42:26AM +0100, Mike Galbraith wrote:
> > Hi,
> >
> > While running iozone, I notice severe stalls of vmstat output
> > despite vmstat running SCHED_RR and mlockall().
>
>   Lets eliminate the obvious:
>
>   - Are you running with IDE disk ?

Yes.

>   - Does  hdparm /dev/hda  (whatever) report:
>
>     /dev/hda:
>      unmaskirq    =  0 (off)
>      using_dma    =  0 (off)

No.

/dev/hda:
 multcount    =  0 (off)
 I/O support  =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  1 (on)
 keepsettings =  1 (on)
 nowerr       =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 2482/255/63, sectors = 39876480, start = 0

>   The IKD uses local interrupts, so this isn't necessarily true...

I just did a (mondo) trace covering 8 seconds of kernel time, and
vmstat ran twice.  (Those two times were before I noticed the stall
and started counting down toward 'poke the freeze-frame button')

	-Mike
Re: scheduling problem test13-pre7
On Sun, Dec 31, 2000 at 10:42:26AM +0100, Mike Galbraith wrote:
> Hi,
>
> While running iozone, I notice severe stalls of vmstat output
> despite vmstat running SCHED_RR and mlockall().

  Lets eliminate the obvious:

  - Are you running with IDE disk ?
  - Does  hdparm /dev/hda  (whatever) report:

    /dev/hda:
     unmaskirq    =  0 (off)
     using_dma    =  0 (off)

  The IKD uses local interrupts, so this isn't necessarily true...