Re: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> >
> > Hint: "ptep_mkdirty()".

<g> rather obvious oopsie.. once spotted.

> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGV's.
>
> One: we didn't mark the page table entry dirty like we were supposed to.
>
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
>
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.

The terminal OOM problem is now gone and I haven't seen a SIGSEGV yet
running virgin source.

IOU 5 bogo$$

	-Mike

(I still see something with IKD that _could_ be timing related troubles.
There are a couple of grubby fingerprints I need to wipe off, and some
churn/burn hours to be sure)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Mike Galbraith wrote:
> >
> > Not in my test tree. Same fault, and same trace leading up to it. no
>
> Ok.
>
> It definitely looks like a swapoff() problem.
>
> Have you ever seen the behaviour without running swapoff?

No.

> Also, can you re-create it without running swapon() (if it's something
> like a lost dirty bit, it should be possible to trigger even without the
> swapon, and I'd like to hear if that can happen - if it only happens with
> swapon() and you can't trigger it with just a swapoff() it might be a
> question of re-using some swap file stuff and delaying the writeout or
> whatever).

I'll try loading up swap, swapoff and then doing jobs that fit in ram.

(hmm.. what about inactive_clean list when you do swapoff.. might there
be pages sitting there that are [were] swap cache? reclaim_page=kaboom?)

	-Mike
RE: Signal 11 - the continuing saga
Err, for those of us who aren't up to our elbows in the kernel code, is
there a patch for this? Presumably this will be rolled into 2.4.0test13 but
I'd like to try it out.

Also, can someone summarize the fix in English along with the expected,
improved behavior (e.g. Linux will never have a signal 11 again and will
never, ever crash ;-)

Finally, as soon as there is a patch, can other people who have seen this
problem test it? My problem is so random that I'd need at least a few days
to gain some confidence this is fixed.

Thanks all.

--Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> Sent: Thursday, December 14, 2000 5:19 AM
> To: Mike Galbraith
> Cc: Kernel Mailing List
> Subject: Re: Signal 11 - the continuing saga
>
>
> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> >
> > Hint: "ptep_mkdirty()".
>
> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGV's.
>
> One: we didn't mark the page table entry dirty like we were supposed to.
>
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
>
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.
>
> 		Linus
Re: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Linus Torvalds wrote:

> Ehh, I think I found it.
>
> Hint: "ptep_mkdirty()".
>
> Oops.
>
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that

Poor European Gérard as slim as 1.84 meter - 78 Kg these days.
What about old days poor European Linus versus these days American Linus
on these points ? ;-)

> this explains it.

Really ? :o)

> 		Linus

  Gérard.
Re: Signal 11 - the continuing saga
On Wed, Dec 13, 2000 at 11:35:57AM -0800, Linus Torvalds wrote:
>
> Ehh, I think I found it.
>
> Hint: "ptep_mkdirty()".
>
> Oops.
>
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that
> this explains it.
>
> 		Linus

Good. Sounds like you guys have a handle on it now.

:-)

Jeff
Re: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Linus Torvalds wrote:
>
> Hint: "ptep_mkdirty()".

In case you wonder why the bug was so insidious, what this caused was two
separate problems, both of them able to cause SIGSEGV's.

One: we didn't mark the page table entry dirty like we were supposed to.

Two: by making it writable, we also made the page shared, even if it
wasn't supposed to be shared (so when the next process wrote to the page,
if the swap page was shared with somebody else, the changes would show up
even in the process that _didn't_ write to it).

And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
this. Which was why it hadn't been immediately obvious that anything was
broken.

		Linus
Re: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Mike Galbraith wrote:
>
> Not in my test tree. Same fault, and same trace leading up to it. no

Ok.

It definitely looks like a swapoff() problem.

Have you ever seen the behaviour without running swapoff?

Also, can you re-create it without running swapon() (if it's something
like a lost dirty bit, it should be possible to trigger even without the
swapon, and I'd like to hear if that can happen - if it only happens with
swapon() and you can't trigger it with just a swapoff() it might be a
question of re-using some swap file stuff and delaying the writeout or
whatever).

		Linus
Re: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> >
> > Looking at "swapoff()", it could easily be something like
> >
> >  - swapoff walks through the processes, marking the pages dirty
> >    (correctly)
> >  - swapoff goes on to the next swap entry, and because it needs memory for
> >    this, the VM layer will swap out old entries by marking them dirty in
> >    the "struct page".
> >  - final stages of swapoff() removes the swap cache entry, never minding
> >    the fact that it is marked dirty again in "struct page", and clean in
> >    various VM page tables.
> >
> > Ho humm.. I don't think that is it exactly, but something along those
> > lines.
>
> Actually, having thought about it for five more minutes, I actually think
> that that _is_ it.
>
> If so, the fix looks like it could be really simple. The whole problem
> arises from the fact that we remove the page from the swap cache only
> _after_ we've walked the page-tables to look at it. It looks like the
> fairly trivial fix is simply to remove it from the swap cache before,
> getting rid of all such races in swapoff().
>
> Mind trying out this patch?
>
> NOTE! It's untested. It might not work. It might trigger some sanity-test
> somewhere else. But it looks like it should do the right thing (the page
> might be moved to _another_ swap device early, if there are multiple swap
> areas, but even that should be fine - the unuse_process() stuff doesn't
> care about what swapcache this actually is any more).
>
> Does this patch make a difference (I moved the delete seven lines upwards,
> and removed the test - the test looks extraneous).

Not in my test tree. Same fault, and same trace leading up to it.

I'll run virgin source hard tomorrow to be sure. (No message means
no change)

	-Mike
Re: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Linus Torvalds wrote:
>
> Looking at "swapoff()", it could easily be something like
>
>  - swapoff walks through the processes, marking the pages dirty
>    (correctly)
>  - swapoff goes on to the next swap entry, and because it needs memory for
>    this, the VM layer will swap out old entries by marking them dirty in
>    the "struct page".
>  - final stages of swapoff() removes the swap cache entry, never minding
>    the fact that it is marked dirty again in "struct page", and clean in
>    various VM page tables.
>
> Ho humm.. I don't think that is it exactly, but something along those
> lines.

Actually, having thought about it for five more minutes, I actually think
that that _is_ it.

If so, the fix looks like it could be really simple. The whole problem
arises from the fact that we remove the page from the swap cache only
_after_ we've walked the page-tables to look at it. It looks like the
fairly trivial fix is simply to remove it from the swap cache before,
getting rid of all such races in swapoff().

Mind trying out this patch?

NOTE! It's untested. It might not work. It might trigger some sanity-test
somewhere else. But it looks like it should do the right thing (the page
might be moved to _another_ swap device early, if there are multiple swap
areas, but even that should be fine - the unuse_process() stuff doesn't
care about what swapcache this actually is any more).

Does this patch make a difference (I moved the delete seven lines upwards,
and removed the test - the test looks extraneous).

		Linus

--- v2.4.0-test12/linux/mm/swapfile.c	Tue Oct 31 12:42:27 2000
+++ linux/mm/swapfile.c	Wed Dec 13 09:17:51 2000
@@ -370,6 +370,7 @@
 			swap_free(entry);
 			return -ENOMEM;
 		}
+		delete_from_swap_cache(page);
 		read_lock(&tasklist_lock);
 		for_each_task(p)
 			unuse_process(p->mm, entry, page);
@@ -377,8 +378,6 @@
 		shm_unuse(entry, page);
 		/* Now get rid of the extra reference to the temporary
 		   page we've been using. */
-		if (PageSwapCache(page))
-			delete_from_swap_cache(page);
 		page_cache_release(page);
 		/*
 		 * Check for and clear any overflowed swap map counts.
Re: Signal 11 - the continuing saga
On Tue, Dec 12, 2000 at 07:17:41PM -0800, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Jeff V. Merkey <[EMAIL PROTECTED]> wrote:
> >On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> >>	I have a tiny bash script that launches a Java swing app. If I run my
> >> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> >> If, however, I try to launch it from my gnome taskbar's menu then it dies
> >> with signal 11 (the Java log is available upon request). This seems to be
> >> 100% consistent, since I noticed it yesterday, even across reboots.
> >> Interestingly, the same behavior occurs if I try to run the program from
> >> within JBuilder 4.
> >>	So, is this related to the larger signal 11 problems?
> >
> >There's a corruption bug in the page cache somewhere, and it's 100%
> >reproducible. Finding it will be tough
>
> Unlikely. If the actual program data was corrupted, it would SIGSEGV
> regardless of how it's executed.
>
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
>
> 		Linus

Linus,

I agree that there may be some problem in the code above -- the question is
what has changed to make this behavior emerge? I see it with a host of
programs (ssh, make, netscape) -- true all are userspace. Time permitting,
I may attempt to track this down in ssh and make in jobserver mode. It may
be related to some interaction that changed underneath.

Jeff
RE: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Rainer Mager wrote:

> Mike et al,
>
> 	I have no idea what IKD is and I don't know what to do with any results I
> might find BUT I'd be happy to do this if it will help. Please pass on the
> info with the instructions. Who should I report the results to?

IKD is a debugging toolkit. The trap I have set up freezes the kernel
trace buffer at SIGSEGV time. From there you have to read it backward
looking for problems (which isn't particularly easy).

I was thinking you wanted to roll your shirt sleeves up and maybe this
would help ;-) If you want it, and do a trace, I'd be very interested
in the last couple of schedules to compare to my traces. It's not
something you can just run and report though.

	-Mike
RE: Signal 11 - the continuing saga
Mike et al,

	I have no idea what IKD is and I don't know what to do with any results I
might find BUT I'd be happy to do this if it will help. Please pass on the
info with the instructions. Who should I report the results to?

--Rainer

> [mailto:[EMAIL PROTECTED]]On Behalf Of Mike Galbraith
> If you want, I can extract IKD.. which happens to have a trap in place
> for this (because I have a 100% reproducible swap related SIGSEGV that
> I'm trying to figure out).
>
> If you're interested, let me know and I'll extract it (quite large) and
> send it along with instructions on how to do the trap.
>
> 	-Mike
RE: Signal 11 - the continuing saga
Give that man a cigar... it was an env var (not LOCALE but LANG). I'd
actually checked this but I didn't think that made a difference in my case.

Thanks Linus, now can you fix the larger signal 11 problem?

--Rainer

> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
>
> 		Linus
RE: Signal 11 - the continuing saga
On Wed, 13 Dec 2000, Rainer Mager wrote:

> Thanks for the info...
>
> > [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff V. Merkey
> > > 	So, is this related to the larger signal 11 problems?
> >
> > There's a corruption bug in the page cache somewhere, and it's 100%
> > reproducible. Finding it will be tough
>
> Ok, granted this will be tough but is anyone even actively working on it?
> What can I do to help?

If you want, I can extract IKD.. which happens to have a trap in place
for this (because I have a 100% reproducible swap related SIGSEGV that
I'm trying to figure out).

If you're interested, let me know and I'll extract it (quite large) and
send it along with instructions on how to do the trap.

	-Mike
Re: Signal 11 - the continuing saga
In article <[EMAIL PROTECTED]>,
Jeff V. Merkey <[EMAIL PROTECTED]> wrote:
>On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
>> I have a tiny bash script that launches a Java swing app. If I run my
>> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
>> If, however, I try to launch it from my gnome taskbar's menu then it dies
>> with signal 11 (the Java log is available upon request). This seems to be
>> 100% consistent, since I noticed it yesterday, even across reboots.
>> Interestingly, the same behavior occurs if I try to run the program from
>> within JBuilder 4.
>> So, is this related to the larger signal 11 problems?
>
>There's a corruption bug in the page cache somewhere, and it's 100%
>reproducible. Finding it will be tough

Unlikely. If the actual program data was corrupted, it would SIGSEGV regardless of how it's executed.

I'd guess that the program has a bug, and depending on the arguments and environment (especially the latter will be different), it shows up or not. Things like not having a LOCALE set in either case or similar.

		Linus
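One way to test the environment theory is to dump the environment from both launch contexts (xterm and the GNOME menu) and diff the results. The following is a minimal sketch, not something posted in the thread; the helper names env_count() and dump_sorted_env() are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Count the entries of a NULL-terminated environment array. */
static size_t env_count(char *const *envp)
{
    size_t n = 0;
    while (envp[n] != NULL)
        n++;
    return n;
}

/*
 * Write the environment to `out`, one NAME=value entry per line, sorted
 * so that two dumps taken in different contexts diff cleanly.
 * Returns the number of entries written, or (size_t)-1 if the
 * temporary array cannot be allocated.
 */
static size_t dump_sorted_env(FILE *out, char *const *envp)
{
    assert(out != NULL && envp != NULL);

    size_t n = env_count(envp);
    /* n + 1 avoids a zero-sized allocation for an empty environment. */
    const char **copy = malloc((n + 1) * sizeof *copy);
    if (copy == NULL)
        return (size_t)-1;

    for (size_t i = 0; i < n; i++)
        copy[i] = envp[i];
    qsort(copy, n, sizeof *copy, cmp_str);

    for (size_t i = 0; i < n; i++)
        fprintf(out, "%s\n", copy[i]);

    free(copy);
    return n;
}
```

Calling dump_sorted_env(stdout, environ) (with `extern char **environ;` declared) from a small program started by the same launcher script, once from the terminal and once from the menu, gives two files to compare. Running `env | sort` from the script itself would serve the same purpose.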
RE: Signal 11 - the continuing saga
Thanks for the info...

> [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff V. Merkey
> > So, is this related to the larger signal 11 problems?
>
> There's a corruption bug in the page cache somewhere, and it's 100%
> reproducible. Finding it will be tough

Ok, granted this will be tough but is anyone even actively working on it? What can I do to help?

> > Anyone know how to do [disable L1 and L2 caches]?
>
> Usually this is performed in the BIOS setup. You can also disable L1
> with a sequence of instructions that write to the CR0 register on Intel
> and flip a bit, but in doing this you have to execute a WBINVD (write
> back and invalidate) instruction to flush out the cache. BIOS setup is
> probably simpler. Disabling Level 1 will make the machine slower
> than molasses, BTW, and if this bug is race related (they always
> are) it won't help much in running it down.

Aha, just as I suspected. My BIOS doesn't appear to support this. You seem to be saying that doing so won't really contribute anything anyway so I will hold off for now.

--Rainer
Re: Signal 11 - the continuing saga
On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> Hi again,
>
> Ok, I just upgraded to 2.4.0test12 (although I don't think there was any
> work in 12 that directly addresses this signal 11 problem). When compiling
> the new kernel I chose to disable AGPGart and RDM as suggested by
> [EMAIL PROTECTED] I will report later if this makes any difference.
>
> On another, possibly related note, I'm getting some really weird behavior
> with a Java program. The only reason I mention it here is because it dies
> with our old friend Signal 11. Anyway, please bear with the description
> below.
> I have a tiny bash script that launches a Java swing app. If I run my
> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> If, however, I try to launch it from my gnome taskbar's menu then it dies
> with signal 11 (the Java log is available upon request). This seems to be
> 100% consistent, since I noticed it yesterday, even across reboots.
> Interestingly, the same behavior occurs if I try to run the program from
> within JBuilder 4.
> So, is this related to the larger signal 11 problems?

There's a corruption bug in the page cache somewhere, and it's 100% reproducible. Finding it will be tough.

> What else can I do regarding these issues to help fix it? Would a core dump
> help anyone? I'd really like to contribute somehow but I need some
> direction.
>
> --Rainer
>
> > From: CMA [mailto:[EMAIL PROTECTED]]
> > Did you already try to selectively disable L1 and L2 caches (if
> > your box has both) and see what happens?
>
> Anyone know how to do this?

Usually this is performed in the BIOS setup. You can also disable L1 with a sequence of instructions that write to the CR0 register on Intel and flip a bit, but in doing this you have to execute a WBINVD (write back and invalidate) instruction to flush out the cache. BIOS setup is probably simpler.

Disabling Level 1 will make the machine slower than molasses, BTW, and if this bug is race related (they always are) it won't help much in running it down.

Jeff
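For reference, the "flip a bit" step can be sketched as plain bit arithmetic on a CR0 value. This is only an illustration with made-up helper names (cr0_cache_off(), cr0_cache_on()); it computes the register value and does not touch the hardware. Actually loading CR0 and executing WBINVD needs ring 0, so it would have to be done from kernel code, not from a user program. The bit positions (CD = bit 30, NW = bit 29) are taken from the Intel architecture manuals.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_NW (UINT32_C(1) << 29)  /* Not Write-through */
#define CR0_CD (UINT32_C(1) << 30)  /* Cache Disable */

/*
 * Value to load into CR0 to stop the caches from being filled:
 * CD = 1, NW = 0.  The privileged sequence would then be roughly:
 *   1. read CR0
 *   2. write back cr0_cache_off(cr0)
 *   3. execute WBINVD to write back and invalidate what is already cached
 */
static uint32_t cr0_cache_off(uint32_t cr0)
{
    return (cr0 | CR0_CD) & ~CR0_NW;
}

/* Value to load into CR0 to return to normal caching: CD = 0, NW = 0. */
static uint32_t cr0_cache_on(uint32_t cr0)
{
    return cr0 & ~(CR0_CD | CR0_NW);
}

/* Round trip: every bit other than CD and NW is left untouched. */
static void cr0_sketch_check(uint32_t cr0)
{
    uint32_t others = cr0 & ~(CR0_CD | CR0_NW);

    assert(cr0_cache_off(cr0) == (others | CR0_CD));
    assert(cr0_cache_on(cr0_cache_off(cr0)) == others);
}
```

Note that, as far as the Intel documentation describes it, the CD bit governs caching for the processor as a whole, so this is a blunter tool than a BIOS option that switches L1 and L2 individually.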
RE: Signal 11 - the continuing saga
Hi again,

Ok, I just upgraded to 2.4.0test12 (although I don't think there was any work in 12 that directly addresses this signal 11 problem). When compiling the new kernel I chose to disable AGPGart and RDM as suggested by [EMAIL PROTECTED] I will report later if this makes any difference.

On another, possibly related note, I'm getting some really weird behavior with a Java program. The only reason I mention it here is because it dies with our old friend Signal 11. Anyway, please bear with the description below.

I have a tiny bash script that launches a Java swing app. If I run my script from an xterm (or gnome-terminal or whatever) then it starts up fine. If, however, I try to launch it from my gnome taskbar's menu then it dies with signal 11 (the Java log is available upon request). This seems to be 100% consistent, since I noticed it yesterday, even across reboots. Interestingly, the same behavior occurs if I try to run the program from within JBuilder 4.

So, is this related to the larger signal 11 problems?

What else can I do regarding these issues to help fix it? Would a core dump help anyone? I'd really like to contribute somehow but I need some direction.

--Rainer

> From: CMA [mailto:[EMAIL PROTECTED]]
> Did you already try to selectively disable L1 and L2 caches (if
> your box has both) and see what happens?

Anyone know how to do this?