Re: ps hang in 241-pre10
Linus Torvalds wrote:
>
> In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]> wrote:
> >
> >We've narrowed it down to "we're all running xmms" when it happened.
>
> Does anybody have a clue about what is different with xmms?
>
> Does it use KNI if it can, for example? We used to have a problem with

I'm seeing this too, so I'll add my report here. Last week I was burning an audio CD, and the system locked up while I was moving the slider. I think the kernel version was -ac7. I then switched to pre8 and played a divx file while burning four more CDs with no problem.

My system is an SMP BP6 with an SBLive, using the kernel's emu driver.

--
There are three types of people in the world: those who can count, and those who can't.
Zdenek Kabelac http://i.am/kabi/ [EMAIL PROTECTED] {debian.org; fi.muni.cz}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.1-pre10 deadlock (Re: ps hang in 241-pre10)
On Sun, Jan 28 2001, Linus Torvalds wrote:
> On Sun, 28 Jan 2001, Jens Axboe wrote:
> >
> > How about this instead?
>
> I really don't like this one. It will basically re-introduce the old
> behaviour of waking people up in a trickle, as far as I can tell. The
> reason we want the batching is to make people have more requests to sort
> in the elevator, and as far as I can tell this will just hurt that.
>
> Are there any downsides to just _always_ batching, regardless of whether
> the request freelist is empty or not? Sure, it will make the "effective"
> size of the freelist a bit smaller, but that's probably not actually
> noticeable under any load except for the one that empties the freelist (in
> which case the old code would have triggered the batching anyway).

The problem with removing the !list_empty test like you suggested is that batching is no longer controlled. If we start batching once the lists are empty and start wakeups once batch_requests has been reached, we know we'll give the elevator enough to work with to be effective. With !list_empty removed, batch_requests is no longer a measure of how many requests we want to batch. Always batching is not a problem in itself; the effect of the smaller effective freelist should be negligible.

The patch I sent will only trickle wakeups when batching is already in effect but the batch_requests wakeups were not enough to deplete the freelist again. At least that was the intended effect :-)

> Performance numbers?

Don't have any right now, will test a bit later.

--
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
Re: 2.4.1-pre10 deadlock (Re: ps hang in 241-pre10)
On Sun, 28 Jan 2001, Jens Axboe wrote:
>
> How about this instead?

I really don't like this one. It will basically re-introduce the old behaviour of waking people up in a trickle, as far as I can tell. The reason we want the batching is to make people have more requests to sort in the elevator, and as far as I can tell this will just hurt that.

Are there any downsides to just _always_ batching, regardless of whether the request freelist is empty or not? Sure, it will make the "effective" size of the freelist a bit smaller, but that's probably not actually noticeable under any load except for the one that empties the freelist (in which case the old code would have triggered the batching anyway).

Performance numbers?

Linus
Re: 2.4.1-pre10 deadlock (Re: ps hang in 241-pre10)
On Sun, Jan 28 2001, Lorenzo Allegrucci wrote:
> >Ho humm. Jens: imagine that you have more people waiting for requests than
> >"batchcount". Further, imagine that you have multiple requests finishing
> >at the same time. Not unlikely. Now, imagine that one request finishes,
> >and causes "batchcount" users to wake up, and immediately another request
> >finishes but THAT one doesn't wake anybody up because it notices that the
> >freelist isn't empty - so it thinks that it doesn't need to wake anybody.
> >
> >Lorenzo, does the problem go away for you if you remove the
> >
> >    if (!list_empty(&q->request_freelist[rw])) {
> >        ...
> >    }
> >
> >code from blkdev_release_request() in drivers/block/ll_rw_blk.c?
>
> Yes, it does.

How about this instead?

--- /opt/kernel/linux-2.4.1-pre10/drivers/block/ll_rw_blk.c Thu Jan 25 19:15:12 2001
+++ drivers/block/ll_rw_blk.c Sun Jan 28 19:22:20 2001
@@ -633,6 +634,8 @@
         if (!list_empty(&q->request_freelist[rw])) {
             blk_refill_freelist(q, rw);
             list_add(&req->table, &q->request_freelist[rw]);
+            if (waitqueue_active(&q->wait_for_request))
+                wake_up_nr(&q->wait_for_request, 2);
             return;
         }

--
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
Re: ps hang in 241-pre10
On Sun, 28 Jan 2001, Marcelo Tosatti wrote:
>
> Why don't you just put set_page_dirty() back in page_launder() in case
> writepage() fails?

Because an EIO or similar should _not_ be re-tried or kept dirty. Imagine a bad user that goes over his quota on purpose, and then every single write will always return an error. What should we do? Let him eat all physical memory? I don't think so.

Write-out errors will be ignored. We _might_ send a signal or something, but considering the fact that we don't even know who caused the dirty page in the first place, even that is kind of hard.

Shared memory and out-of-swap is special - the shared memory code is supposed to check that we have enough memory before it even allocates anything.

Linus
Re: 2.4.1-pre10 deadlock (Re: ps hang in 241-pre10)
At 15.40 27/01/01 -0800, you wrote:
>
>On Sat, 27 Jan 2001, Lorenzo Allegrucci wrote:
>>
>> A trivial "while(1) fork()" is enough to trigger it.
>> "mem=32M" by lilo, ulimit -u is 1024.
>
>Hmm.. This does not look like a VM deadlock - it looks like some IO
>request is waiting forever on "__get_request_wait()". In fact, it looks
>like a _lot_ of people are waiting for requests.
>
>So what happens is that somebody takes a page fault (and gets the mm
>lock), tries to read something in, and never gets anything back, thus
>leaving the MM locked.
>
>Jens: this looks suspiciously like somebody isn't waking things up when
>they add requests back to the request lists. Alternatively, maybe the
>unplugging isn't properly done, so that we have a lot of pending IO that
>doesn't get started..
>
>Ho humm. Jens: imagine that you have more people waiting for requests than
>"batchcount". Further, imagine that you have multiple requests finishing
>at the same time. Not unlikely. Now, imagine that one request finishes,
>and causes "batchcount" users to wake up, and immediately another request
>finishes but THAT one doesn't wake anybody up because it notices that the
>freelist isn't empty - so it thinks that it doesn't need to wake anybody.
>
>Lorenzo, does the problem go away for you if you remove the
>
>    if (!list_empty(&q->request_freelist[rw])) {
>        ...
>    }
>
>code from blkdev_release_request() in drivers/block/ll_rw_blk.c?

Yes, it does.

--
Lorenzo
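The missed wake-up described in the quoted text can be reproduced in a small user-space model. The sketch below is not kernel code: the struct, the function names, the batch size of 4, and the starting numbers (8 requests in flight, 10 sleepers) are all assumptions made for illustration, and it assumes every completion in a round happens before any woken task gets to run. With always_batch set to 0 it models the early return taken when the free list is not empty; with 1 it models removing that test.

```c
#include <assert.h>

#define BATCH_REQUESTS 4   /* hypothetical batch size for this model */

/* Toy model of one request queue; every name here is illustrative. */
struct queue_model {
    int free;       /* requests on the free list */
    int pending;    /* released requests held back for batching */
    int in_flight;  /* requests currently owned by the device */
    int sleepers;   /* tasks asleep waiting for a free request */
    int woken;      /* tasks woken up but not yet scheduled */
};

/* Wake at most n sleepers, in the spirit of wake_up_nr(). */
static void wake_sleepers(struct queue_model *q, int n)
{
    if (n > q->sleepers)
        n = q->sleepers;
    q->sleepers -= n;
    q->woken += n;
}

/* One request completes. always_batch == 0 keeps the early return. */
static void release_request(struct queue_model *q, int always_batch)
{
    assert(q->in_flight > 0);
    q->in_flight--;

    if (!always_batch && q->free > 0) {
        /* Free list not empty: refill it, add the request, wake nobody. */
        q->free += q->pending + 1;
        q->pending = 0;
        return;
    }

    q->pending++;
    if (q->pending >= BATCH_REQUESTS) {
        int n = q->pending;

        q->free += n;
        q->pending = 0;
        wake_sleepers(q, n);
    }
}

/* Woken tasks run: each takes a request and submits it, or sleeps again. */
static void run_woken(struct queue_model *q)
{
    while (q->woken > 0) {
        q->woken--;
        if (q->free > 0) {
            q->free--;
            q->in_flight++;
        } else {
            q->sleepers++;
        }
    }
}

/* Returns how many tasks are still asleep once all I/O has drained. */
int stuck_sleepers(int always_batch)
{
    struct queue_model q;

    q.free = 0;
    q.pending = 0;
    q.in_flight = 8;
    q.sleepers = 10;
    q.woken = 0;

    while (q.in_flight > 0) {
        int completing = q.in_flight;

        for (int i = 0; i < completing; i++)
            release_request(&q, always_batch);
        run_woken(&q);
    }
    return q.sleepers;
}
```

In this model stuck_sleepers(0) returns 6: the fourth completion wakes four tasks, the next four completions see a non-empty free list and wake nobody, and once the I/O has drained there are free requests but six tasks still asleep. stuck_sleepers(1) returns 0, which is consistent with the report that removing the test makes the hang go away.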
Re: ps hang in 241-pre10
On 27 Jan 2001, Linus Torvalds wrote:

> In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]> wrote:
> >
> >We've narrowed it down to "we're all running xmms" when it happened.
>
> Does anybody have a clue about what is different with xmms?
>
> Does it use KNI if it can, for example? We used to have a problem with
> KNI+Athlons, for example.

Not KNI, I don't think, but 1.2.4 did add support for 3dnow!, with auto-detection of CPU type. Disabled by default, but available. Are there any 3dnow! issues?

> It might also be that it's threading-related, and that XMMS is one of
> the few things that uses threads. Things like that. I'm not an XMMS
> user, can somebody who knows XMMS comment on things that it does that
> are unusual?

Always uses threads, can use 3dnow!, DGA and realtime priority. Can also do direct hardware access to some graphics cards (inc SB16), but I haven't looked at that one closely.

James.
Re: ps hang in 241-pre10
Patch appears to work,

    for i in [0-9]*; do echo $i; cat $i/stat > /dev/null; done

completes successfully with xmms running in "real-time" priority.

Shawn.

Marcelo Tosatti wrote:
> On Sat, 27 Jan 2001, Linus Torvalds wrote:
> >
> > On Sun, 28 Jan 2001, Marcelo Tosatti wrote:
> > > >
> > > > This is the smoking gun here, I bet, but I'd like to make sure I see the
> > > > whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(),
> > > > but I think this is the thread that hangs on to the mm semaphore.
> > >
> > > I was able to reproduce it here with dbench.
> > >
> > > Nothing is locked except this dbench thread (the only dbench thread):
> > >
> > > dbench D C1C9FE64 5200 1013 1 (L-TLB) 1370 785
> > > Call Trace: [___wait_on_page+130/160] [truncate_list_pages+100/404] [truncate_inode_pages+93/128] [iput+162/360] [dput+262/356] [fput+121/232] [exit_mmap+218/292]
> > > [mmput+56/80] [do_exit+208/680] [do_signal+566/656] [dput+25/356] [path_release+13/60] [sys_newstat+100/112] [sys_read+188/196] [signal_return+20/24]
> >
> > Ok, this definitely seems to be the pattern.
> >
> > I don't see _what_ is going on, though.
> >
> > I know of one "known bug" in pre10: if you run out of swap-space with
> > shared memory segments, it will do the wrong thing (return 1 without
> > unlocking the page). xmms might trigger this, but I didn't think that
> > dbench used shared memory?
>
> It does. Bingo.
>
> I'm not able to reproduce the problem here with your patch.
>
> Btw, there is another bug in shm_writepage() where it does not set the
> page dirty in case of failure...
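For anyone who wants the same check without a shell, the loop can be written as a short C function. This is only a sketch of the test described in the message: the function name read_all_stat and the choice to pass the directory as a parameter are assumptions made here, and the message itself does not say which directory the loop was run in (passing "/proc" is the presumed use).

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

/*
 * Read <proc_dir>/<name>/stat for every entry whose name starts with a
 * digit, like the shell glob [0-9]*. Returns the number of stat files
 * read, or -1 if proc_dir cannot be opened. A hung task would show up
 * as this function never returning.
 */
int read_all_stat(const char *proc_dir)
{
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(proc_dir != NULL);

    dir = opendir(proc_dir);
    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        char path[512];
        char buf[4096];
        FILE *fp;

        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;

        snprintf(path, sizeof(path), "%s/%s/stat", proc_dir, ent->d_name);
        fp = fopen(path, "r");
        if (fp == NULL)
            continue;   /* entry vanished or is unreadable; skip it */

        /* Drain the file and discard the data, like cat > /dev/null. */
        while (fread(buf, 1, sizeof(buf), fp) > 0)
            ;
        fclose(fp);
        count++;
    }

    closedir(dir);
    return count;
}
```

A caller would print or check the return value of read_all_stat("/proc"); on a kernel with the bug, the read of the stuck process's stat file is where the call would be expected to block.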
Re: ps hang in 241-pre10
On Sun, 28 Jan 2001, Marcelo Tosatti wrote:

> On Sat, 27 Jan 2001, Linus Torvalds wrote:
> >
> > On Sun, 28 Jan 2001, Marcelo Tosatti wrote:
> > > >
> > > > This is the smoking gun here, I bet, but I'd like to make sure I see the
> > > > whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(),
> > > > but I think this is the thread that hangs on to the mm semaphore.
> > >
> > > I was able to reproduce it here with dbench.
> > >
> > > Nothing is locked except this dbench thread (the only dbench thread):
> > >
> > > dbench D C1C9FE64 5200 1013 1 (L-TLB) 1370 785
> > > Call Trace: [___wait_on_page+130/160] [truncate_list_pages+100/404] [truncate_inode_pages+93/128] [iput+162/360] [dput+262/356] [fput+121/232] [exit_mmap+218/292]
> > > [mmput+56/80] [do_exit+208/680] [do_signal+566/656] [dput+25/356] [path_release+13/60] [sys_newstat+100/112] [sys_read+188/196] [signal_return+20/24]
> >
> > Ok, this definitely seems to be the pattern.
> >
> > I don't see _what_ is going on, though.
> >
> > I know of one "known bug" in pre10: if you run out of swap-space with
> > shared memory segments, it will do the wrong thing (return 1 without
> > unlocking the page). xmms might trigger this, but I didn't think that
> > dbench used shared memory?
>
> It does. Bingo.
>
> I'm not able to reproduce the problem here with your patch.
>
> Btw, there is another bug in shm_writepage() where it does not set the
> page dirty in case of failure...

Why don't you just put set_page_dirty() back in page_launder() in case writepage() fails?

Otherwise you'll have to do it in every specific implementation of writepage().
Re: ps hang in 241-pre10
On Sat, 27 Jan 2001, Linus Torvalds wrote:

>
> On Sun, 28 Jan 2001, Marcelo Tosatti wrote:
> > >
> > > This is the smoking gun here, I bet, but I'd like to make sure I see the
> > > whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(),
> > > but I think this is the thread that hangs on to the mm semaphore.
> >
> > I was able to reproduce it here with dbench.
> >
> > Nothing is locked except this dbench thread (the only dbench thread):
> >
> > dbench D C1C9FE64 5200 1013 1 (L-TLB) 1370 785
> > Call Trace: [___wait_on_page+130/160] [truncate_list_pages+100/404] [truncate_inode_pages+93/128] [iput+162/360] [dput+262/356] [fput+121/232] [exit_mmap+218/292]
> > [mmput+56/80] [do_exit+208/680] [do_signal+566/656] [dput+25/356] [path_release+13/60] [sys_newstat+100/112] [sys_read+188/196] [signal_return+20/24]
>
> Ok, this definitely seems to be the pattern.
>
> I don't see _what_ is going on, though.
>
> I know of one "known bug" in pre10: if you run out of swap-space with
> shared memory segments, it will do the wrong thing (return 1 without
> unlocking the page). xmms might trigger this, but I didn't think that
> dbench used shared memory?

It does. Bingo.

I'm not able to reproduce the problem here with your patch.

Btw, there is another bug in shm_writepage() where it does not set the page dirty in case of failure...
Re: ps hang in 241-pre10
On Sun, 28 Jan 2001, Marcelo Tosatti wrote:
> >
> > This is the smoking gun here, I bet, but I'd like to make sure I see the
> > whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(),
> > but I think this is the thread that hangs on to the mm semaphore.
>
> I was able to reproduce it here with dbench.
>
> Nothing is locked except this dbench thread (the only dbench thread):
>
> dbench D C1C9FE64 5200 1013 1 (L-TLB) 1370 785
> Call Trace: [___wait_on_page+130/160] [truncate_list_pages+100/404] [truncate_inode_pages+93/128] [iput+162/360] [dput+262/356] [fput+121/232] [exit_mmap+218/292]
> [mmput+56/80] [do_exit+208/680] [do_signal+566/656] [dput+25/356] [path_release+13/60] [sys_newstat+100/112] [sys_read+188/196] [signal_return+20/24]

Ok, this definitely seems to be the pattern.

I don't see _what_ is going on, though.

I know of one "known bug" in pre10: if you run out of swap-space with shared memory segments, it will do the wrong thing (return 1 without unlocking the page). xmms might trigger this, but I didn't think that dbench used shared memory?

There's also an ugliness in the truncate ordering. I don't think it should matter, but I do believe it's conceptually wrong as-is.

Does this patch make any difference at all?

Linus

-
diff -u --recursive --new-file pre10/linux/mm/memory.c linux/mm/memory.c
--- pre10/linux/mm/memory.c Sat Jan 27 10:53:39 2001
+++ linux/mm/memory.c Sat Jan 27 19:12:35 2001
@@ -945,7 +945,6 @@
     if (inode->i_size < offset)
         goto do_expand;
     inode->i_size = offset;
-    truncate_inode_pages(mapping, offset);
     spin_lock(&mapping->i_shared_lock);
     if (!mapping->i_mmap && !mapping->i_mmap_shared)
         goto out_unlock;
@@ -960,8 +959,7 @@
 out_unlock:
     spin_unlock(&mapping->i_shared_lock);
-    /* this should go into ->truncate */
-    inode->i_size = offset;
+    truncate_inode_pages(mapping, offset);
     if (inode->i_op && inode->i_op->truncate)
         inode->i_op->truncate(inode);
     return;
diff -u --recursive --new-file pre10/linux/mm/shmem.c linux/mm/shmem.c
--- pre10/linux/mm/shmem.c Sat Jan 27 10:53:39 2001
+++ linux/mm/shmem.c Sat Jan 27 19:50:08 2001
@@ -217,8 +217,11 @@
     info = &page->mapping->host->u.shmem_i;
     swap = __get_swap_page(2);
-    if (!swap.val)
-        return 1;
+    if (!swap.val) {
+        set_page_dirty(page);
+        UnlockPage(page);
+        return -ENOMEM;
+    }
     spin_lock(&info->lock);
     shmem_recalc_inode(page->mapping->host);
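The shmem.c hunk is about a locking contract: a writepage path that fails must still unlock the page, otherwise anything that later waits on that page (the ___wait_on_page frame in the dbench trace) never wakes up. A minimal user-space sketch of the two behaviours follows. The struct and all function names are made up for illustration and model only the two page flags involved. They are not the kernel's types, and the model assumes nobody else ever unlocks the page.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for a page; only the two flags that matter here. */
struct page_model {
    int locked;
    int dirty;
};

/* Behaviour described for pre10: fail without unlocking the page. */
int writepage_no_swap_old(struct page_model *page)
{
    assert(page->locked);   /* the caller hands over a locked page */
    return 1;
}

/* Behaviour after the patch: keep the data dirty, unlock, report error. */
int writepage_no_swap_patched(struct page_model *page)
{
    assert(page->locked);
    page->dirty = 1;        /* models set_page_dirty(page) */
    page->locked = 0;       /* models UnlockPage(page) */
    return -ENOMEM;
}

/* Would a task that waits for this page to be unlocked block forever? */
int waiter_would_hang(const struct page_model *page)
{
    return page->locked;
}
```

Calling writepage_no_swap_old() on a locked page leaves waiter_would_hang() true, while writepage_no_swap_patched() leaves the page unlocked and dirty, so a later truncate can wait on it and proceed.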
Re: ps hang in 241-pre10
(ugh, sorry about last mail)

On 27 Jan 2001, Linus Torvalds wrote:

> In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]> wrote:
> >Unfortunately klogd reads /procerg.
> >
> >So the following is a painstakingly slow hand translation, I'll only print
> >the D state entries unless someone asks otherwise.
>
> You seem to be pretty much able to reproduce this at will, right?
>
> I'd really like to see the raw System.map and dmesg output if your
> syslogd doesn't do a proper job of getting the symbols interpreted: just
> send the things by email, and I'll put something together. It's too
> hard to interpret your half-way decoded thing, and I really want to see
> what this xmms thing is doing..
>
> >xmms D CACC5EA8 4116 713155 715 (NOTLB)1493 674
> >Call Trace: [] [] [] [] []
> >[] []
> > [] [] [] [] []
> >
> >c01248e4 T ___wait_on_page
> >c0124984 t __lock_page
> >
> >c01240dc t truncate_list_pages
> >c0124268 T truncate_inode_pages
> >c01242d4 t writeout_one_page
>
> This is the smoking gun here, I bet, but I'd like to make sure I see the
> whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(),
> but I think this is the thread that hangs on to the mm semaphore.

I was able to reproduce it here with dbench.

Nothing is locked except this dbench thread (the only dbench thread):

dbench D C1C9FE64 5200 1013 1 (L-TLB) 1370 785
Call Trace: [___wait_on_page+130/160] [truncate_list_pages+100/404] [truncate_inode_pages+93/128] [iput+162/360] [dput+262/356] [fput+121/232] [exit_mmap+218/292]
[mmput+56/80] [do_exit+208/680] [do_signal+566/656] [dput+25/356] [path_release+13/60] [sys_newstat+100/112] [sys_read+188/196] [signal_return+20/24]
Re: ps hang in 241-pre10
> Does anybody have a clue about what is different with xmms?
>
> Does it use KNI if it can, for example? We used to have a problem
> with KNI+Athlons, for example.

No, it doesn't.

> It might also be that it's threading-related, and that XMMS is one
> of the few things that uses threads. Things like that. I'm not an
> XMMS user, can somebody who knows XMMS comment on things that it
> does that are unusual?

Yes, threads could be the thing that makes a difference. I can't think of anything else that is special about XMMS.

--
Håvard Kvålen
Re: 2.4.1-pre10 deadlock (Re: ps hang in 241-pre10)
On Sat, Jan 27 2001, Linus Torvalds wrote:
> > What was the trace of this? Just curious, the below case outlined by
> > Linus should be pretty generic, but I'd still like to know what
> > can lead to this condition.
>
> It was posted on linux-kernel - I don't save the dang things because I
> have too much in my "archives" as is ;)

Ok I see it now, confused wrt the different threads...

> > Good spotting. Actually I see one more problem with it too. If
> > we've started batching (under heavy I/O of course), we could
> > splice the pending list and wake up X number of sleepers, but
> > there's a) no guarantee that these sleepers will actually get
> > the requests if new ones keep flooding in
>
> (a) is ok. They'll go back to sleep - it's a loop waiting for requests..

My point is not that it's broken, but it will favor newcomers over tasks that have already blocked on a free slot. So it would still be nice to get right.

> > and b) no guarantee
> > that X sleepers require X request slots.
>
> Well, IF they are sleeping (and thus, if the wake_up_nr() will trigger on
> them), they _will_ use a request. I don't think we have to worry about
> that. At most we will wake up "too many" - we'll wake up processes even
> though they end up not being able to get a request anyway because somebody
> else got to it first. And that's ok. It's the "wake up too few" that
> causes trouble, and I think that will be fixed by my suggestion.

Yes, they may end up sleeping again right away, as in case a) above for instance. The logic now is 'we have X free slots now, wake up X sleepers' where it instead should be 'we have X free slots now, wake up people until the free list is exhausted'.

> Now, I'd be worried if somebody wants several requests at the same time, and
> doesn't feed them to the IO layer until it has gotten all of them. In that
> case, you can get starvation with many people having "reserved" their
> requests, and there not be enough free requests around to actually ever
> wake anybody up again. But the regular IO paths do not do this: they will
> all allocate a request and just submit it immediately, no "reservation".

Right, the I/O path doesn't do this and it would seem more appropriate to have such users use their own requests instead of eating from the internal pool.

--
* Jens Axboe <[EMAIL PROTECTED]>
* SuSE Labs
Re: 2.4.1-pre10 deadlock (Re: ps hang in 241-pre10)
On Sun, 28 Jan 2001, Jens Axboe wrote:
> >
> > So what happens is that somebody takes a page fault (and gets the mm
> > lock), tries to read something in, and never gets anything back, thus
> > leaving the MM locked.
>
> What was the trace of this? Just curious, the below case outlined by
> Linus should be pretty generic, but I'd still like to know what
> can lead to this condition.

It was posted on linux-kernel - I don't save the dang things because I
have too much in my "archives" as is ;)

> > Lorenzo, does the problem go away for you if you remove the
> >
> >	if (!list_empty(&q->request_freelist[rw])) {
> >		...
> >	}
> >
> > code from blkdev_release_request() in drivers/block/ll_rw_blk.c?
>
> Good spotting. Actually I see one more problem with it too. If
> we've started batching (under heavy I/O of course), we could
> splice the pending list and wake up X number of sleepers, but
> there's a) no guarantee that these sleepers will actually get
> the requests if new ones keep flooding in

(a) is ok. They'll go back to sleep - it's a loop waiting for requests..

> and b) no guarantee
> that X sleepers require X request slots.

Well, IF they are sleeping (and thus, if the wake_up_nr() will trigger on
them), they _will_ use a request. I don't think we have to worry about
that. At most we will wake up "too many" - we'll wake up processes even
though they end up not being able to get a request anyway because somebody
else got to it first. And that's ok. It's the "wake up too few" that
causes trouble, and I think that will be fixed by my suggestion.

Now, I'd worry if somebody wants several requests at the same time, and
doesn't feed them to the IO layer until it has gotten all of them. In that
case, you can get starvation with many people having "reserved" their
requests, and there not being enough free requests around to actually ever
wake anybody up again. But the regular IO paths do not do this: they will
all allocate a request and just submit it immediately, no "reservation".

(Obviously, _submitting_ the request doesn't mean that we'd actually start
processing it, but if somebody ends up waiting for requests they'll do the
unplug that does start it all going, so effectively we can think of it as
a logical "start this request now" thing even if it gets delayed in order
to coalesce IO).

		Linus
Re: ps hang in 241-pre10
In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]> wrote:
>Unfortunately klogd reads /procerg.
>
>So the following is a painstakingly slow hand translation, I'll only print
>the D state entries unless someone asks otherwise.

You seem to be pretty much able to reproduce this at will, right?

I'd really like to see the raw System.map and dmesg output if your syslogd
doesn't do a proper job of getting the symbols interpreted: just send the
things by email, and I'll put something together. It's too hard to
interpret your half-way decoded thing, and I really want to see what this
xmms thing is doing..

>xmms D CACC5EA8 4116 713155 715 (NOTLB)1493 674
>Call Trace: [c0124966] [c012412f] [c01242b8] [c0144138] [c014238e]
>[c0131cd0] [c01236b2]
> [c01239f2] [c01ac5ca] [c010d1f6] [c0108e7c] [c0108d5f]
>
>c01248e4 T ___wait_on_page
>c0124984 t __lock_page
>
>c01240dc t truncate_list_pages
>c0124268 T truncate_inode_pages
>c01242d4 t writeout_one_page

This is the smoking gun here, I bet, but I'd like to make sure I see the
whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(),
but I think this is the thread that hangs on to the mm semaphore.

		Linus
Re: ps hang in 241-pre10
yes, I should also mention I have also a SoundBlaster 32AWE (0MB on the daughterboard). J Sloan wrote: > OK, here's the details you asked about: > > Soundblaster Awe 32 sound card > Voodoo 3 pci video card > Running Xfree86-4.0.0 (rpms from 3dfx.com) > Playing unreal tournament, no special game > options, just 800x600 graphics @ 16 bits. > > To recap, the symptoms (hung ps, etc) occurred > on kernel 2.4.1-pre8 + low latency patches. (but > I don't think the low latency patches had anything > to do with it, based on the other reports) > > Hope this helps > > jjs > > David Ford wrote: > > > On 2.4.0-ac12, I played music for about 30 minutes without any problems. I >started up an mpeg in xmms and it > > locked in short order. I'm sure now that it has something to do with the >graphics. What DGA or other config > > options do you have enabled for your game? > > > > What video and sound card? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
OK, here's the details you asked about: Soundblaster Awe 32 sound card Voodoo 3 pci video card Running Xfree86-4.0.0 (rpms from 3dfx.com) Playing unreal tournament, no special game options, just 800x600 graphics @ 16 bits. To recap, the symptoms (hung ps, etc) occurred on kernel 2.4.1-pre8 + low latency patches. (but I don't think the low latency patches had anything to do with it, based on the other reports) Hope this helps jjs David Ford wrote: > On 2.4.0-ac12, I played music for about 30 minutes without any problems. I started >up an mpeg in xmms and it > locked in short order. I'm sure now that it has something to do with the graphics. >What DGA or other config > options do you have enabled for your game? > > What video and sound card? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
It is important to note that when I hit the magic key and rebooted (SUB), a split second before it rebooted, a stalled 'lspci' snapped back to life and printed out my expected data. -d -- There is a natural aristocracy among men. The grounds of this are virtue and talents. Thomas Jefferson The good thing about standards is that there are so many to choose from. Andrew S. Tanenbaum - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
On 2.4.0-ac12, I played music for about 30 minutes without any problems. I started up an mpeg in xmms and it locked in short order. I'm sure now that it has something to do with the graphics. What DGA or other config options do you have enabled for your game? What video and sound card? I have an ATI Rage LT Pro AGP-133 according to lspci. -d J Sloan wrote: > Sorry, there was no xmms involved here - > > The behavior occurred while playing unreal tournament. > > But at least the sound card was in use, FWIW - > > jjs > > David Ford wrote: > > > We've narrowed it down to "we're all running xmms" when it happend. > > > > -d > > > > J Sloan wrote: > > > > > Just for the record, the system where I saw the problem > > > has only ext2 - > > > > -- > > There is a natural aristocracy among men. The grounds of this are virtue and >talents. Thomas Jefferson > > The good thing about standards is that there are so many to choose from. Andrew >S. Tanenbaum > > > > - > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to [EMAIL PROTECTED] > > Please read the FAQ at http://www.tux.org/lkml/ -- There is a natural aristocracy among men. The grounds of this are virtue and talents. Thomas Jefferson The good thing about standards is that there are so many to choose from. Andrew S. Tanenbaum - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
On Sat, Jan 27, 2001 at 04:42:45PM -0800, J Sloan wrote: > But at least the sound card was in use, FWIW - Not for me. My xmms was sitting idle when it froze. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
Sorry, there was no xmms involved here - The behavior occurred while playing unreal tournament. But at least the sound card was in use, FWIW - jjs David Ford wrote: > We've narrowed it down to "we're all running xmms" when it happend. > > -d > > J Sloan wrote: > > > Just for the record, the system where I saw the problem > > has only ext2 - > > -- > There is a natural aristocracy among men. The grounds of this are virtue and >talents. Thomas Jefferson > The good thing about standards is that there are so many to choose from. Andrew S. >Tanenbaum > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
Linus Torvalds wrote:

> In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]>
> wrote:
> >
> >We've narrowed it down to "we're all running xmms" when it happened.
>
> Does anybody have a clue about what is different with xmms?

Not sure.

> Does it use KNI if it can, for example? We used to have a problem with
> KNI+Athlons, for example.
>
> It might also be that it's threading-related, and that XMMS is one of
> the few things that uses threads. Things like that. I'm not an XMMS
> user, can somebody who knows XMMS comment on things that it does that
> are unusual?

If I was clued enough to know KNI, I could say for a certainty. I am
assuming it's a form of MMX or related. My notebook is a mobile pII 366.

I'm stress testing it now with ac12. I originally had pre9 on it. There is
one difference other than that, I have Marcelo's bg aging patch on here
which seems to have improved responsiveness significantly but I'll save
that for another story.

I've triggered it, report follows in next email.

-d

-- 
There is a natural aristocracy among men. The grounds of this are virtue and talents. Thomas Jefferson
The good thing about standards is that there are so many to choose from. Andrew S. Tanenbaum
Re: ps hang in 241-pre10
This system is the following: AcerOPEN AP53/AX Motherboard, Intel Pentium 200Mhz w/o MMX (1996-1997) Chipsets: 430HX, PIIX3 (EIDE) 64MB RAM EDO 60ns (Kingston brand) Linus Torvalds wrote: > In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]> > wrote: > > > >We've narrowed it down to "we're all running xmms" when it happend. > > Does anybody have a clue about what is different with xmms? > > Does it use KNI if it can, for example? We used to have a problem with > KNI+Athlons, for example. > > It might also be that it's threading-related, and that XMMS is one of > the few things that uses threads. Things like that. I'm not an XMMS > user, can somebody who knows XMMS comment on things that it does that > are unusual? > > Linus > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]> wrote:
>
>We've narrowed it down to "we're all running xmms" when it happened.

Does anybody have a clue about what is different with xmms?

Does it use KNI if it can, for example? We used to have a problem with
KNI+Athlons, for example.

It might also be that it's threading-related, and that XMMS is one of
the few things that uses threads. Things like that. I'm not an XMMS
user, can somebody who knows XMMS comment on things that it does that
are unusual?

		Linus
Re: ps hang in 241-pre10
We've narrowed it down to "we're all running xmms" when it happend. -d J Sloan wrote: > Just for the record, the system where I saw the problem > has only ext2 - -- There is a natural aristocracy among men. The grounds of this are virtue and talents. Thomas Jefferson The good thing about standards is that there are so many to choose from. Andrew S. Tanenbaum - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
On Sat, Jan 27, 2001 at 04:33:42AM -0500, Shawn Starr wrote: > Yes, I have ReiserFS as well...hrm... I don't. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
Just for the record, the system where I saw the problem has only ext2 - jjs Shawn Starr wrote: > Yes, I have ReiserFS as well...hrm... > > David Ford wrote: > > > I can quickly and easily duplicate it on my notebook by playing music or > > mpegs in xmms. It may take a few minutes but it's guaranteed. > > > > xmms stalls flat on it's face and anything accessing /proc stalls. If I get > > the time to do it, I'll take a gander at it with kdb. > > > > I have no patches applied to p10, I have reiserfs onboard but I highly doubt > > it's reiserfs. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]> wrote:
>I can quickly and easily duplicate it on my notebook by playing music or
>mpegs in xmms. It may take a few minutes but it's guaranteed.
>
>xmms stalls flat on its face and anything accessing /proc stalls. If I get
>the time to do it, I'll take a gander at it with kdb.

Please, if you see something like this, just do a simple Alt+ScrollLock
followed by Ctrl+ScrollLock while in text-mode. The magic keystrokes will
give a stack trace of the currently running process and all processes
respectively.

Then, just look in your /var/log/messages, and if you have everything set
up correctly the system should have done the conversion to symbolic kernel
addresses for you - so you can see directly where the different processes
are sleeping.

Sanity-check that your System.map information (and thus the symbolic
conversion) looks to be ok: the processes that hang should show up in the
trace as being in __down_failed() or something like that. The only reason
for a hang with /proc/<pid>/ tends to be that some process would have
deadlocked on its MM semaphore or is somehow stuck inside its critical
region on something else.

Finally, try to pinpoint _which_ process it is. Usually most easily done by
simply seeing where it is that the /proc accesses get stuck, with
something simple like

	cd /proc
	for i in [0-9]*; do
		echo $i
		cat $i/stat > /dev/null
	done

and see what the last pid it printed out was (not that the above
guarantees that you found the thing, because there might be several
things. But it's one more piece to the puzzle).

And send the information to the kernel mailing list, along with anything
else you might think of.

		Linus
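The loop above stops at the first pid whose /proc entry blocks, so it can only ever name one suspect. A rough sketch of a variant that keeps going: probe every entry in the background, wait a moment, and report the entries whose probe never finished. The function name find_stuck_pids, its two arguments and the use of mktemp -d are assumptions made for this illustration, not anything from the thread. Probes stuck in D state cannot be killed, so the sketch never waits on them and simply leaves them behind.

```sh
# find_stuck_pids [PROCDIR] [DELAY]
# Hypothetical helper: print every numeric entry under PROCDIR whose
# "stat" file could not be read within DELAY seconds.
find_stuck_pids() {
    procdir=${1:-/proc}
    delay=${2:-2}
    tmp=$(mktemp -d) || return 1    # assumes mktemp -d is available
    pids=
    for d in "$procdir"/[0-9]*; do
        [ -d "$d" ] || continue
        pid=${d##*/}
        pids="$pids $pid"
        # Probe in the background and drop a marker file when the read
        # returns (even if it fails because the process has exited).
        # Output is detached so a stuck probe cannot hold up a caller
        # that captures our stdout.
        ( cat "$d/stat" > /dev/null 2>&1; : > "$tmp/$pid.done" ) > /dev/null 2>&1 &
    done
    sleep "$delay"
    # Anything without a marker is still blocked in the read.
    for pid in $pids; do
        [ -e "$tmp/$pid.done" ] || echo "$pid"
    done
    rm -rf "$tmp"
}
```

Called with no arguments it scans /proc with a two second grace period. A heavily loaded but healthy box may need a longer delay to avoid false positives, and as with the original loop this is only one more piece of the puzzle.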
Re: ps hang in 241-pre10
I have not compiled or used reiserfs here yet. compiling Mikes semaphore debug patch now and adding sysrq - but this took three days to happen just once here. ..john Shawn Starr wrote: > > Yes, I have ReiserFS as well...hrm... > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
Yes, I have ReiserFS as well...hrm... David Ford wrote: > I can quickly and easily duplicate it on my notebook by playing music or > mpegs in xmms. It may take a few minutes but it's guaranteed. > > xmms stalls flat on it's face and anything accessing /proc stalls. If I get > the time to do it, I'll take a gander at it with kdb. > > I have no patches applied to p10, I have reiserfs onboard but I highly doubt > it's reiserfs. > > -d > > J Sloan wrote: > > > OK, It's official now, I didn't know if it was some > > weird hardware fluke or something, but one of > > the computers here exhibited the same problem - > > -- > There is a natural aristocracy among men. The grounds of this are virtue and >talents. Thomas Jefferson > The good thing about standards is that there are so many to choose from. Andrew S. >Tanenbaum > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
I can quickly and easily duplicate it on my notebook by playing music or mpegs in xmms. It may take a few minutes but it's guaranteed. xmms stalls flat on it's face and anything accessing /proc stalls. If I get the time to do it, I'll take a gander at it with kdb. I have no patches applied to p10, I have reiserfs onboard but I highly doubt it's reiserfs. -d J Sloan wrote: > OK, It's official now, I didn't know if it was some > weird hardware fluke or something, but one of > the computers here exhibited the same problem - -- There is a natural aristocracy among men. The grounds of this are virtue and talents. Thomas Jefferson The good thing about standards is that there are so many to choose from. Andrew S. Tanenbaum - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: ps hang in 241-pre10
At the time I had temporary access to my notebook and had a mismatched System.map file :S

-d

Linus Torvalds wrote:

> In article <[EMAIL PROTECTED]>, David Ford <[EMAIL PROTECTED]> wrote:
> >I can quickly and easily duplicate it on my notebook by playing music or
> >mpegs in xmms. It may take a few minutes but it's guaranteed.
> >
> >xmms stalls flat on its face and anything accessing /proc stalls. If I get
> >the time to do it, I'll take a gander at it with kdb.
>
> Please, if you see something like this, just do a simple Alt+ScrollLock
> followed by Ctrl+ScrollLock while in text-mode. The
> magic keystrokes will give a stack trace of the currently running
> process and all processes respectively.

-- 
There is a natural aristocracy among men. The grounds of this are virtue and talents. Thomas Jefferson
The good thing about standards is that there are so many to choose from. Andrew S. Tanenbaum
Re: ps hang in 241-pre10
Unfortunately klogd reads /procerg.

So the following is a painstakingly slow hand translation, I'll only print
the D state entries unless someone asks otherwise.

Prior to this:

XMMS is running playing star wars mpeg. (regular user) (frozen)
TOP is running (regular user) (frozen)
while [ 1 ]; do ls -laR /proc ; done (regular user) (frozen)
skill -9 xmms (root) (frozen)
X 4.0.2 running, scp of 600meg file over pegasus usb ethernet (10Mbit).

syslog caught:

Jan 27 16:42:26 nifty kernel: SysRq: Show State
Jan 27 16:42:26 nifty kernel:
Jan 27 16:42:26 nifty kernel: free sibling
Jan 27 16:42:26 nifty kernel: task PC stack pid father child younger older
Jan 27 16:42:26 nifty kernel: init S CBFEBF2C 3184 1 0 187 (NOTLB)
end

dmesg shows (only D state for brevity):

top D CA98B3DC 4440 219158(NOTLB)
Call Trace: [c010791d] [c0107a68] [c02f73dd] [c014b5cb] [c01311e6] [c0108d5f] [c010002b]

c01078c8 T __down
c0107964 T __down_interruptible
c0107a28 T __down_trylock
c0107a60 T __down_failed
c0107a6c T __down_failed_interruptible
c02f6a00 T stext_lock
c02f827e A _etext
c014b578 t proc_info_read
c014b688 t mem_read
c0131150 T sys_read
c013121c T sys_write
c0108d2c T system_call
c0108d64 T ret_from_sys_call
c010 t startup_32
c0100139 t is486

xmms D CACC5EA8 4116 713155 715 (NOTLB)1493 674
Call Trace: [c0124966] [c012412f] [c01242b8] [c0144138] [c014238e] [c0131cd0] [c01236b2]
 [c01239f2] [c01ac5ca] [c010d1f6] [c0108e7c] [c0108d5f]

c01248e4 T ___wait_on_page
c0124984 t __lock_page
c01240dc t truncate_list_pages
c0124268 T truncate_inode_pages
c01242d4 t writeout_one_page
c0144094 T remove_inode_hash
c01440a8 T iput
c01441fc T force_delete
c01422a0 T dput
c01423e4 T d_invalidate
c0131c58 T fput
c0131d28 T fget
c012365c t unmap_fixup
c0123788 t free_pgtables
c012380c T do_munmap
c0123a5c T sys_munmap
...ask if you want more

xmms S C2979F30 0 715713 725 (NOTLB)
Call Trace: [c01142fb] [c0114240] [c013f95e] [c013fb53] [c0119fff] [c0108d5f]
xmms S C2B75F2C 1156 716715(NOTLB) 718
Call Trace: [c01142fb] [c0114240] [c013f341] [c013f6e0] [c0108d5f]
xmms S 7FFF 0 718715(NOTLB) 719 716
Call Trace: [c011429f] [c013f341] [c013f6e0] [c0108d5f]
xmms S C2975F88 832 719715(NOTLB) 725 718
Call Trace: [c01142fb] [c0114240] [c011d468] [c0108d5f] [c010002b]
xmms S CA8D7F88 2672 725715(NOTLB) 719
Call Trace: [c01142fb] [c0114240] [c011d468] [c0108d5f]

c0114240 t process_timeout
c0114288 T schedule_timeout
c011431c T schedule_tail
c0113d70 t remap_area_pages
c0114020 T __ioremap
c0108d2c T system_call
c0108d64 T ret_from_sys_call

ls D CA98B3DC 0 1896222(NOTLB)
Call Trace: [c010791d] [c0107a68] [c02f73b5] [c014b95a] [c01389a2] [c0108d5f]
skill D CA98B3DC 0 1897187(NOTLB)
Call Trace: [c010791d] [c0107a68] [c02f73dd] [c014b5cb] [c01311e6] [c0108d5f]

c0107964 T __down_interruptible
c0107a28 T __down_trylock
c0107a60 T __down_failed
c0107a6c T __down_failed_interruptible
c02f6a00 T stext_lock
c02f827e A _etext
...

SysRq: Show Memory
Mem-info:
Free pages:2240kB ( 0kB HighMem)
( Active: 4153, inactive_dirty: 198, inactive_clean: 1077, free: 560 (383 766 1149) )
31*4kB 1*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB = 660kB)
125*4kB 5*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB = 1580kB)
= 0kB)
Swap cache: add 3165, delete 547, find 25/124
Free swap:53104kB
49136 pages of RAM
0 pages of HIGHMEM
1798 reserved pages
2619 pages shared
2618 pages swap cached
0 pages in page table cache
Buffer memory: 1276kB

-d

-- 
There is a natural aristocracy among men. The grounds of this are virtue and talents. Thomas Jefferson
The good thing about standards is that there are so many to choose from. Andrew S. Tanenbaum
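The hand translation above amounts to this: for each address in a Call Trace, find the System.map entry with the highest address that is still at or below it. A small sketch of that lookup follows. The function name resolve_addr and its argument order are made up for this example, and it assumes the map file has the usual "address type name" layout with 8-digit hex addresses. Shorter entries, such as the truncated "c010 t startup_32" line above, are skipped rather than guessed at.

```sh
# resolve_addr ADDR MAPFILE
# Hypothetical helper: print "symbol+0xoffset" for the System.map entry
# with the highest address that is <= ADDR; return 1 if there is none.
resolve_addr() {
    target=$((0x$1))
    best_name=
    best_addr=-1
    while read -r addr type name; do
        # Only accept full 8-digit hex addresses.
        [ "${#addr}" -eq 8 ] || continue
        case $addr in
            *[!0-9a-fA-F]*) continue ;;
        esac
        val=$((0x$addr))
        if [ "$val" -le "$target" ] && [ "$val" -gt "$best_addr" ]; then
            best_addr=$val
            best_name=$name
        fi
    done < "$2"
    [ -n "$best_name" ] || return 1
    printf '%s+0x%x\n' "$best_name" "$((target - best_addr))"
}
```

With the full System.map of the running kernel, "resolve_addr c010791d System.map" should land in __down, in line with the symbols listed for top. With only an excerpt of the map the nearest lower symbol can be the wrong function, and a map that does not match the running kernel (as happened earlier in this thread) gives plausible-looking nonsense.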
Re: ps hang in 241-pre10
On 2.4.0-ac12, I played music for about 30 minutes without any problems. I started up an mpeg in xmms and it locked in short order. I'm sure now that it has something to do with the graphics.

What DGA or other config options do you have enabled for your game? What video and sound card? I have an ATI Rage LT Pro AGP-133 according to lspci.

-d

J Sloan wrote:
> Sorry, there was no xmms involved here - The behavior occurred while playing unreal tournament. But at least the sound card was in use, FWIW -
>
> jjs
>
> David Ford wrote:
> > We've narrowed it down to "we're all running xmms" when it happened.
> >
> > -d
> >
> > J Sloan wrote:
> > > Just for the record, the system where I saw the problem has only ext2 -
Re: ps hang in 241-pre10
It is important to note that when I hit the magic key and rebooted (SUB), a split second before it rebooted, a stalled 'lspci' snapped back to life and printed out my expected data.

-d
Re: ps hang in 241-pre10
OK, here's the details you asked about:

Soundblaster Awe 32 sound card
Voodoo 3 pci video card
Running Xfree86-4.0.0 (rpms from 3dfx.com)
Playing unreal tournament, no special game options, just 800x600 graphics @ 16 bits.

To recap, the symptoms (hung ps, etc) occurred on kernel 2.4.1-pre8 + low latency patches. (but I don't think the low latency patches had anything to do with it, based on the other reports)

Hope this helps

jjs

David Ford wrote:
> On 2.4.0-ac12, I played music for about 30 minutes without any problems. I started up an mpeg in xmms and it locked in short order. I'm sure now that it has something to do with the graphics.
>
> What DGA or other config options do you have enabled for your game? What video and sound card?
Re: ps hang in 241-pre10
Yes, I should also mention that I have a SoundBlaster 32AWE too (0MB on the daughterboard).
Re: ps hang in 241-pre10
In article [EMAIL PROTECTED], David Ford [EMAIL PROTECTED] wrote:
> Unfortunately klogd reads /procerg. So the following is a painstakingly slow hand translation, I'll only print the D state entries unless someone asks otherwise.

You seem to be pretty much able to reproduce this at will, right?

I'd really like to see the raw System.map and dmesg output if your syslogd doesn't do a proper job of getting the symbols interpreted: just send the things by email, and I'll put something together. It's too hard to interpret your half-way decoded thing, and I really want to see what this xmms thing is doing..

> xmms D CACC5EA8 4116 713155 715 (NOTLB)1493 674
> Call Trace: [c0124966] [c012412f] [c01242b8] [c0144138] [c014238e] [c0131cd0] [c01236b2] [c01239f2] [c01ac5ca] [c010d1f6] [c0108e7c] [c0108d5f]
>
> c01248e4 T ___wait_on_page
> c0124984 t __lock_page
> c01240dc t truncate_list_pages
> c0124268 T truncate_inode_pages
> c01242d4 t writeout_one_page

This is the smoking gun here, I bet, but I'd like to make sure I see the whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(), but I think this is the thread that hangs on to the mm semaphore.

		Linus
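For anyone decoding such a trace by hand, the translation amounts to finding, for each bracketed address, the nearest System.map symbol at or below it. The following is a minimal sketch of that lookup, not the tool anyone in this thread used; the function names (`load_symbols`, `resolve`, `decode_trace`) are invented for illustration, and the map excerpt contains only the five symbols quoted in the report, so addresses outside that range would resolve wrongly against it.

```python
import bisect
import re

# System.map-style lines copied from the report (a tiny excerpt, not a full map).
SYSTEM_MAP = """\
c01240dc t truncate_list_pages
c0124268 T truncate_inode_pages
c01242d4 t writeout_one_page
c01248e4 T ___wait_on_page
c0124984 t __lock_page
"""

def load_symbols(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies below every known symbol
    start, name = symbols[index]
    return f"{name}+0x{address - start:x}"

def decode_trace(symbols, trace_line):
    """Resolve every [hexaddr] token found in a 'Call Trace:' line."""
    addresses = re.findall(r"\[([0-9a-fA-F]{8})\]", trace_line)
    return [resolve(symbols, int(addr, 16)) for addr in addresses]

symbols = load_symbols(SYSTEM_MAP)
decoded = decode_trace(symbols, "Call Trace: [c0124966] [c012412f] [c01242b8]")
print(decoded)
# ['___wait_on_page+0x82', 'truncate_list_pages+0x53', 'truncate_inode_pages+0x50']
```

With a complete System.map the same lookup covers the whole trace; with a partial excerpt like this one, only addresses that fall inside the listed functions can be trusted.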
Re: 2.4.1-pre10 deadlock (Re: ps hang in 241-pre10)
On Sun, 28 Jan 2001, Jens Axboe wrote:
> > So what happens is that somebody takes a page fault (and gets the mm lock), tries to read something in, and never gets anything back, thus leaving the MM locked.
>
> What was the trace of this? Just curious, the below case outlined by Linus should be pretty generic, but I'd still like to know what can lead to this condition.

It was posted on linux-kernel - I don't save the dang things because I have too much in my "archives" as is ;)

> > Lorenzo, does the problem go away for you if you remove the
> >
> >	if (!list_empty(&q->request_freelist[rw])) {
> >		...
> >	}
> >
> > code from blkdev_release_request() in drivers/block/ll_rw_block.c?
>
> Good spotting. Actually I see one more problem with it too. If we've started batching (under heavy I/O of course), we could splice the pending list and wake up X number of sleepers, but there's a) no guarantee that these sleepers will actually get the requests if new ones keep flooding in

(a) is ok. They'll go back to sleep - it's a loop waiting for requests..

> and b) no guarantee that X sleepers require X request slots.

Well, IF they are sleeping (and thus, if the wake_up_nr() will trigger on them), they _will_ use a request. I don't think we have to worry about that. At most we will wake up "too many" - we'll wake up processes even though they end up not being able to get a request anyway because somebody else got to it first. And that's ok. It's the "wake up too few" that causes trouble, and I think that will be fixed by my suggestion.

Now, I'd be worried if somebody wants several requests at the same time, and doesn't feed them to the IO layer until it has gotten all of them. In that case, you can get starvation with many people having "reserved" their requests, and there not be enough free requests around to actually ever wake anybody up again. But the regular IO paths do not do this: they will all allocate a request and just submit it immediately, no "reservation".

(Obviously, _submitting_ the request doesn't mean that we'd actually start processing it, but if somebody ends up waiting for requests they'll do the unplug that does start it all going, so effectively we can think of it as a logical "start this request now" thing even if it gets delayed in order to coalesce IO).

		Linus
Re: 2.4.1-pre10 deadlock (Re: ps hang in 241-pre10)
On Sat, Jan 27 2001, Linus Torvalds wrote:
> > What was the trace of this? Just curious, the below case outlined by Linus should be pretty generic, but I'd still like to know what can lead to this condition.
>
> It was posted on linux-kernel - I don't save the dang things because I have too much in my "archives" as is ;)

Ok I see it now, confused wrt the different threads...

> > Good spotting. Actually I see one more problem with it too. If we've started batching (under heavy I/O of course), we could splice the pending list and wake up X number of sleepers, but there's a) no guarantee that these sleepers will actually get the requests if new ones keep flooding in
>
> (a) is ok. They'll go back to sleep - it's a loop waiting for requests..

My point is not that it's broken, but it will favor newcomers instead of tasks having blocked on a free slot already. So it would still be nice to get right.

> > and b) no guarantee that X sleepers require X request slots.
>
> Well, IF they are sleeping (and thus, if the wake_up_nr() will trigger on them), they _will_ use a request. I don't think we have to worry about that. At most we will wake up "too many" - we'll wake up processes even though they end up not being able to get a request anyway because somebody else got to it first. And that's ok. It's the "wake up too few" that causes trouble, and I think that will be fixed by my suggestion.

Yes they may end up sleeping right away again as per the above a) case for instance. The logic now is 'we have X free slots now, wake up X sleepers' where it instead should be 'we have X free slots now, wake up people until the free list is exhausted'.

> Now, I'd be worried if somebody wants several requests at the same time, and doesn't feed them to the IO layer until it has gotten all of them. In that case, you can get starvation with many people having "reserved" their requests, and there not be enough free requests around to actually ever wake anybody up again. But the regular IO paths do not do this: they will all allocate a request and just submit it immediately, no "reservation".

Right, the I/O path doesn't do this and it would seem more appropriate to have such users use their own requests instead of eating from the internal pool.

--
* Jens Axboe [EMAIL PROTECTED]
* SuSE Labs
Re: ps hang in 241-pre10
> Does anybody have a clue about what is different with xmms?
>
> Does it use KNI if it can, for example? We used to have a problem with KNI+Athlons, for example.

No, it doesn't.

> It might also be that it's threading-related, and that XMMS is one of the few things that uses threads. Things like that. I'm not an XMMS user, can somebody who knows XMMS comment on things that it does that are unusual?

Yes, threads could be the thing that makes a difference. I can't think of anything else that is special about XMMS.

--
Håvard Kvålen
Re: ps hang in 241-pre10
(ugh, sorry about last mail)

On 27 Jan 2001, Linus Torvalds wrote:
> This is the smoking gun here, I bet, but I'd like to make sure I see the whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(), but I think this is the thread that hangs on to the mm semaphore.

I was able to reproduce it here with dbench.

Nothing is locked except this dbench thread (the only dbench thread):

dbench D C1C9FE64 5200 1013 1(L-TLB)1370 785
Call Trace: [___wait_on_page+130/160] [truncate_list_pages+100/404] [truncate_inode_pages+93/128] [iput+162/360] [dput+262/356] [fput+121/232] [exit_mmap+218/292] [mmput+56/80] [do_exit+208/680] [do_signal+566/656] [dput+25/356] [path_release+13/60] [sys_newstat+100/112] [sys_read+188/196] [signal_return+20/24]
Re: ps hang in 241-pre10
On Sun, 28 Jan 2001, Marcelo Tosatti wrote:
> > This is the smoking gun here, I bet, but I'd like to make sure I see the whole thing. I don't see _why_ we'd have deadlocked on __wait_on_page(), but I think this is the thread that hangs on to the mm semaphore.
>
> I was able to reproduce it here with dbench.
>
> Nothing is locked except this dbench thread (the only dbench thread):
>
> dbench D C1C9FE64 5200 1013 1(L-TLB)1370 785
> Call Trace: [___wait_on_page+130/160] [truncate_list_pages+100/404] [truncate_inode_pages+93/128] [iput+162/360] [dput+262/356] [fput+121/232] [exit_mmap+218/292] [mmput+56/80] [do_exit+208/680] [do_signal+566/656] [dput+25/356] [path_release+13/60] [sys_newstat+100/112] [sys_read+188/196] [signal_return+20/24]

Ok, this definitely seems to be the pattern. I don't see _what_ is going on, though.

I know of one "known bug" in pre10: if you run out of swap-space with shared memory segments, it will do the wrong thing (return 1 without unlocking the page). xmms might trigger this, but I didn't think that dbench used shared memory?

There's also an ugliness in the truncate ordering. I don't think it should matter, but I do believe it's conceptually wrong as-is. Does this patch make any difference at all?

		Linus

diff -u --recursive --new-file pre10/linux/mm/memory.c linux/mm/memory.c
--- pre10/linux/mm/memory.c	Sat Jan 27 10:53:39 2001
+++ linux/mm/memory.c	Sat Jan 27 19:12:35 2001
@@ -945,7 +945,6 @@
 	if (inode->i_size < offset)
 		goto do_expand;
 	inode->i_size = offset;
-	truncate_inode_pages(mapping, offset);
 	spin_lock(&mapping->i_shared_lock);
 	if (!mapping->i_mmap && !mapping->i_mmap_shared)
 		goto out_unlock;
@@ -960,8 +959,7 @@
 
 out_unlock:
 	spin_unlock(&mapping->i_shared_lock);
-	/* this should go into ->truncate */
-	inode->i_size = offset;
+	truncate_inode_pages(mapping, offset);
 	if (inode->i_op && inode->i_op->truncate)
 		inode->i_op->truncate(inode);
 	return;
diff -u --recursive --new-file pre10/linux/mm/shmem.c linux/mm/shmem.c
--- pre10/linux/mm/shmem.c	Sat Jan 27 10:53:39 2001
+++ linux/mm/shmem.c	Sat Jan 27 19:50:08 2001
@@ -217,8 +217,11 @@
 
 	info = &page->mapping->host->u.shmem_i;
 	swap = __get_swap_page(2);
-	if (!swap.val)
-		return 1;
+	if (!swap.val) {
+		set_page_dirty(page);
+		UnlockPage(page);
+		return -ENOMEM;
+	}
 
 	spin_lock(&info->lock);
 	shmem_recalc_inode(page->mapping->host);
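The shmem.c hunk is the classic "error path returns with a lock still held" fix. As a rough userspace analogy only (this is not kernel code, and `Page`, `writepage_old`, `writepage_patched` and `later_waiter_gets_page` are names invented for this sketch), the difference can be shown with an ordinary lock: the old path returns on failure without unlocking, so anything that later waits on the page blocks; the patched path marks the page dirty and unlocks before returning an error. The success path here is simplified to a plain unlock.

```python
import errno
import threading

class Page:
    """Toy stand-in for a lockable page; not the kernel's struct page."""
    def __init__(self):
        self.lock = threading.Lock()
        self.dirty = False

def writepage_old(page, have_swap):
    """Pre-patch behaviour: on failure, return 1 with the page still locked."""
    if not have_swap:
        return 1
    page.lock.release()
    return 0

def writepage_patched(page, have_swap):
    """Patched behaviour: redirty and unlock the page before failing."""
    if not have_swap:
        page.dirty = True
        page.lock.release()
        return -errno.ENOMEM
    page.lock.release()
    return 0

def later_waiter_gets_page(writepage):
    """Lock a page, fail the writeout, then see whether a later waiter can lock it."""
    page = Page()
    page.lock.acquire()                    # the writeout path holds the page lock
    writepage(page, have_swap=False)       # simulate running out of swap
    # A real waiter would sleep forever; a timeout keeps the demo from hanging.
    acquired = page.lock.acquire(timeout=0.1)
    if acquired:
        page.lock.release()
    return acquired

print(later_waiter_gets_page(writepage_old))      # False: the lock was leaked
print(later_waiter_gets_page(writepage_patched))  # True: the error path unlocked
```

In the reports above the "later waiter" corresponds to the task stuck in ___wait_on_page during truncate, with everything that then needs the same mm semaphore (ps, top, killall reading /proc) queuing up behind it.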
Re: ps hang in 241-pre10
On Sat, 27 Jan 2001, Linus Torvalds wrote:
> I know of one "known bug" in pre10: if you run out of swap-space with shared memory segments, it will do the wrong thing (return 1 without unlocking the page). xmms might trigger this, but I didn't think that dbench used shared memory?

It does.

Bingo. I'm not able to reproduce the problem here with your patch.

Btw, there is another bug in shm_writepage() where it does not set the page dirty in case of failure...
Re: ps hang in 241-pre10
On Sun, 28 Jan 2001, Marcelo Tosatti wrote:
> Bingo. I'm not able to reproduce the problem here with your patch.
>
> Btw, there is another bug in shm_writepage() where it does not set the page dirty in case of failure...

Why don't you just put set_page_dirty() back in page_launder() in case writepage() fails?

Otherwise you'll have to do it in every specific implementation of writepage().
Re: ps hang in 241-pre10
Patch appears to work,

for i in [0-9]*; do echo $i; cat $i/stat > /dev/null; done

completes successfully with xmms running in "real-time" priority.

Shawn.

Marcelo Tosatti wrote:
> Bingo. I'm not able to reproduce the problem here with your patch.
>
> Btw, there is another bug in shm_writepage() where it does not set the page dirty in case of failure...
Re: ps hang in 241-pre10
OK, It's official now, I didn't know if it was some weird hardware fluke or something, but one of the computers here exhibited the same problem -

The system in question is a Pentium II 400, scsi only (aic7xxx), running 2.4.1-pre8 plus Andrew Morton's low latency patches.

The user was playing unreal tournament at the time and reported that it "got weird all of a sudden". I logged in and tried to do a ps, but the ps froze after listing a few lines. weird, never saw that one before.

The user rebooted, so there was no further opportunity to investigate, but I thought I ought to mention it after seeing these reports!

jjs

Aaron Lehmann wrote:
> I had the same problem with 2.4.1-pre10 and the zerocopy patchset.
> I came home one day and xmms was frozen. Attempting to determine
> whether it was stuck in an odd state, I ran ps aux. At a certain
> point (presumably just when it started trying to print info about the
> xmms process), ps froze up too. And any attempts to killall -9 these
> processes made the killall freeze!
Re: ps hang in 241-pre10
I noticed this problem in 2.4.1-pre8. Odd, that's EXACTLY what happened to me. I had to do a hard restart as killall locked when I tried to kill ps. Any word on why this is happening?

Aaron Lehmann wrote:
> I had the same problem with 2.4.1-pre10 and the zerocopy patchset.
> I came home one day and xmms was frozen. Attempting to determine
> whether it was stuck in an odd state, I ran ps aux. At a certain
> point (presumably just when it started trying to print info about the
> xmms process), ps froze up too. And any attempts to killall -9 these
> processes made the killall freeze!
Re: ps hang in 241-pre10
On Sat, Jan 27, 2001 at 03:34:26PM +1100, John Sheahan wrote:
> Hi
> my box has been running 2.4.1-pre10 for three days.
> This morning I noticed odd behaviour - ps and top would freeze
> with no output.

I had the same problem with 2.4.1-pre10 and the zerocopy patchset. I came home one day and xmms was frozen. Attempting to determine whether it was stuck in an odd state, I ran ps aux. At a certain point (presumably just when it started trying to print info about the xmms process), ps froze up too. And any attempts to killall -9 these processes made the killall freeze!

I'm not sure what made xmms freeze up in the first place. My first thought was a problem in the zerocopy patchset -- most of my mp3s are played over NFS. However, XMMS was completely idle during the time I was away from the computer, so I'm not sure what caused it. It seemed clear, however, that the problem was contagious between processes.

I reverted back to 2.4.0-ac7 and have not had any more problems of this nature.
ps hang in 241-pre10
Hi

my box has been running 2.4.1-pre10 for three days. This morning I noticed odd behaviour - ps and top would freeze with no output.

running strace on 'ps'

open("/proc/669/environ", O_RDONLY) = 7
read(7, "INIT_VERSION=sysvinit-2.78\0previ"..., 2047) = 254
close(7) = 0
stat("/proc/683", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/683/stat", O_RDONLY) = 7
read(7,

--- and things just stop in that window --

I cannot read from /proc/683/anything

process 683 does not show up in /var/log/messages. How do I find what it is? Any suggestions on how to debug?

Kernel 2.4.1-pre10 on a 2-processor i686. The box has run various 240-test kernels with no unusual issues.

john