Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> > 
> > Hint: "ptep_mkdirty()".

<g> rather obvious oopsie.. once spotted.

> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGVs.
> 
> One: we didn't mark the page table entry dirty like we were supposed to.
> 
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
> 
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.

The terminal OOM problem is now gone and I haven't seen a SIGSEGV yet
running virgin source.

IOU 5 bogo$$

-Mike

(I still see something with IKD that _could_ be timing related troubles.
There are a couple of grubby fingerprints I need to wipe off, and some
churn/burn hours to be sure)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Mike Galbraith wrote:
> > 
> > Not in my test tree.  Same fault, and same trace leading up to it. no
> 
> Ok.
> 
> It definitely looks like a swapoff() problem.
> 
> Have you ever seen the behaviour without running swapoff?

No.

> Also, can you re-create it without running swapon() (if it's something
> like a lost dirty bit, it should be possible to trigger even without the
> swapon, and I'd like to hear if that can happen - if it only happens with
> swapon() and you can't trigger it with just a swapoff() it might be a
> question of re-using some swap file stuff and delaying the writeout or
> whatever).

I'll try loading up swap, swapoff and then doing jobs that fit in ram.

(hmm.. what about inactive_clean list when you do swapoff.. might there
be pages sitting there that are [were] swap cache? reclaim_page=kaboom?)

-Mike




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Err, for those of us who aren't up to our elbows in the kernel code, is
there a patch for this? Presumably this will be rolled into 2.4.0-test13, but
I'd like to try it out now. Also, can someone summarize the fix in English along
with the expected, improved behavior (e.g. Linux will never have a signal 11
again and will never, ever crash ;-)

Finally, as soon as there is a patch, can other people who have seen this
problem test it? My problem is so random that I'd need at least a few days
to gain some confidence this is fixed.


Thanks all.

--Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> Sent: Thursday, December 14, 2000 5:19 AM
> To: Mike Galbraith
> Cc: Kernel Mailing List
> Subject: Re: Signal 11 - the continuing saga
>
>
> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> >
> > Hint: "ptep_mkdirty()".
>
> In case you wonder why the bug was so insidious, what this caused was two
> separate problems, both of them able to cause SIGSEGVs.
>
> One: we didn't mark the page table entry dirty like we were supposed to.
>
> Two: by making it writable, we also made the page shared, even if it
> wasn't supposed to be shared (so when the next process wrote to the page,
> if the swap page was shared with somebody else, the changes would show up
> even in the process that _didn't_ write to it).
>
> And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
> this. Which was why it hadn't been immediately obvious that anything was
> broken.
>
>   Linus
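
As a rough illustration of the behaviour that item Two says was violated: a page that is private to a process must not show another process's writes. The user-space sketch below is not kernel code and is not the fix; `parent_value_after_child_write` is a hypothetical helper name chosen for this example. It forks, lets the child write to a private variable, and reports what the parent still sees.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the value the parent sees after the child has written to its
 * own copy of a private variable, or -1 if fork/waitpid fails. */
static inline int parent_value_after_child_write(void)
{
    volatile int value = 1;   /* lives in private (copy-on-write) memory */
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        value = 2;            /* the write lands in the child's own copy */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return value;             /* the parent's copy should be untouched */
}
```

On a kernel where private pages behave correctly the function returns 1. A return of 2 would be the "changes show up in the process that didn't write" symptom described above.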




Re: Signal 11 - the continuing saga

2000-12-13 Thread Gérard Roudier



On Wed, 13 Dec 2000, Linus Torvalds wrote:

> 
> 
> Ehh, I think I found it.
> 
> Hint: "ptep_mkdirty()".
> 
> Oops.
> 
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that

Poor European Gérard as slim as 1.84 meter - 78 Kg these days.
What about old days poor European Linus versus these days American Linus
on these points ? ;-)

> this explains it.

Really ? :o)

>   Linus

  Gérard.




Re: Signal 11 - the continuing saga

2000-12-13 Thread Jeff V. Merkey

On Wed, Dec 13, 2000 at 11:35:57AM -0800, Linus Torvalds wrote:
> 
> 
> Ehh, I think I found it.
> 
> Hint: "ptep_mkdirty()".
> 
> Oops.
> 
> I'll bet you $5 USD (and these days, that's about a gadzillion Euros) that
> this explains it.
> 
>   Linus

Good.  Sounds like you guys have a handle on it now.

:-)

Jeff




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Linus Torvalds wrote:
> 
> Hint: "ptep_mkdirty()".

In case you wonder why the bug was so insidious, what this caused was two
separate problems, both of them able to cause SIGSEGVs.

One: we didn't mark the page table entry dirty like we were supposed to.

Two: by making it writable, we also made the page shared, even if it
wasn't supposed to be shared (so when the next process wrote to the page,
if the swap page was shared with somebody else, the changes would show up
even in the process that _didn't_ write to it).

And "ptep_mkdirty()" is only used by swapoff, so nothing else would show
this. Which was why it hadn't been immediately obvious that anything was
broken.

Linus
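
To make the two failure modes concrete, here is a toy user-space model of the mix-up as described above: a helper meant to set the dirty bit sets the writable bit instead. It is only a sketch, not the kernel source. The `toy_` names are invented for illustration, and the bit positions are the x86 ones (bit 1 writable, bit 6 dirty).

```c
#include <stdint.h>

/* x86 page table entry flag bits: bit 1 = writable, bit 6 = dirty. */
#define PTE_RW    (UINT32_C(1) << 1)
#define PTE_DIRTY (UINT32_C(1) << 6)

typedef uint32_t toy_pte_t;   /* hypothetical stand-in for a real pte_t */

/* Buggy variant as described in the message: sets the writable bit
 * where the dirty bit was intended. */
static inline void toy_mkdirty_buggy(toy_pte_t *pte) { *pte |= PTE_RW; }

/* Intended behaviour: set only the dirty bit. */
static inline void toy_mkdirty_fixed(toy_pte_t *pte) { *pte |= PTE_DIRTY; }

static inline int toy_pte_dirty(toy_pte_t pte)    { return (pte & PTE_DIRTY) != 0; }
static inline int toy_pte_writable(toy_pte_t pte) { return (pte & PTE_RW) != 0; }
```

Starting from a zeroed entry, the buggy helper leaves it writable but still clean. That matches both symptoms: the dirty state is lost, and a page that should have stayed read-only can now be written in place.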




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Mike Galbraith wrote:
> 
> Not in my test tree.  Same fault, and same trace leading up to it. no

Ok.

It definitely looks like a swapoff() problem.

Have you ever seen the behaviour without running swapoff?

Also, can you re-create it without running swapon() (if it's something
like a lost dirty bit, it should be possible to trigger even without the
swapon, and I'd like to hear if that can happen - if it only happens with
swapon() and you can't trigger it with just a swapoff() it might be a
question of re-using some swap file stuff and delaying the writeout or
whatever).

Linus




Re: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Linus Torvalds wrote:

> On Wed, 13 Dec 2000, Linus Torvalds wrote:
> > 
> > Looking at "swapoff()", it could easily be something like
> > 
> >  - swapoff walks through the processes, marking the pages dirty
> >    (correctly)
> >  - swapoff goes on to the next swap entry, and because it needs memory for
> >    this, the VM layer will swap out old entries by marking them dirty in
> >    the "struct page".
> >  - final stages of swapoff() removes the swap cache entry, never minding
> >    the fact that it is marked dirty again in "struct page", and clean in
> >    various VM page tables.
> > 
> > Ho humm.. I don't think that is it exactly, but something along those
> > lines.
> 
> Actually, having thought about it for five more minutes, I actually think
> that that _is_ it.
> 
> If so, the fix looks like it could be really simple. The whole problem
> arises from the fact that we remove the page from the swap cache only
> _after_ we've walked the page-tables to look at it. It looks like the
> fairly trivial fix is simply to remove it from the swap cache before,
> getting rid of all such races in swapoff().
> 
> Mind trying out this patch?
> 
> NOTE! It's untested. It might not work. It might trigger some sanity-test
> somewhere else. But it looks like it should do the right thing (the page
> might be moved to _another_ swap device early, if there are multiple swap
> areas, but even that should be fine - the unuse_process() stuff doesn't
> care about what swapcache this actually is any more).
> 
> Does this patch make a difference? (I moved the delete seven lines upwards,
> and removed the test - the test looks extraneous.)

Not in my test tree.  Same fault, and same trace leading up to it.
I'll run virgin source hard tomorrow to be sure. (No message means
no change)

-Mike




Re: Signal 11 - the continuing saga

2000-12-13 Thread Linus Torvalds



On Wed, 13 Dec 2000, Linus Torvalds wrote:
> 
> Looking at "swapoff()", it could easily be something like
> 
>  - swapoff walks through the processes, marking the pages dirty
>    (correctly)
>  - swapoff goes on to the next swap entry, and because it needs memory for
>    this, the VM layer will swap out old entries by marking them dirty in
>    the "struct page".
>  - final stages of swapoff() removes the swap cache entry, never minding
>    the fact that it is marked dirty again in "struct page", and clean in
>    various VM page tables.
> 
> Ho humm.. I don't think that is it exactly, but something along those
> lines.

Actually, having thought about it for five more minutes, I actually think
that that _is_ it.

If so, the fix looks like it could be really simple. The whole problem
arises from the fact that we remove the page from the swap cache only
_after_ we've walked the page-tables to look at it. It looks like the
fairly trivial fix is simply to remove it from the swap cache before,
getting rid of all such races in swapoff().

Mind trying out this patch?

NOTE! It's untested. It might not work. It might trigger some sanity-test
somewhere else. But it looks like it should do the right thing (the page
might be moved to _another_ swap device early, if there are multiple swap
areas, but even that should be fine - the unuse_process() stuff doesn't
care about what swapcache this actually is any more).

Does this patch make a difference? (I moved the delete seven lines upwards,
and removed the test - the test looks extraneous.)

Linus


--- v2.4.0-test12/linux/mm/swapfile.c   Tue Oct 31 12:42:27 2000
+++ linux/mm/swapfile.c Wed Dec 13 09:17:51 2000
@@ -370,6 +370,7 @@
                         swap_free(entry);
                         return -ENOMEM;
                 }
+                delete_from_swap_cache(page);
                 read_lock(&tasklist_lock);
                 for_each_task(p)
                         unuse_process(p->mm, entry, page);
@@ -377,8 +378,6 @@
                 shm_unuse(entry, page);
                 /* Now get rid of the extra reference to the temporary
                    page we've been using. */
-                if (PageSwapCache(page))
-                        delete_from_swap_cache(page);
                 page_cache_release(page);
                 /*
                  * Check for and clear any overflowed swap map counts.




Re: Signal 11 - the continuing saga

2000-12-13 Thread Jeff V. Merkey

On Tue, Dec 12, 2000 at 07:17:41PM -0800, Linus Torvalds wrote:
> In article <[EMAIL PROTECTED]>,
> Jeff V. Merkey <[EMAIL PROTECTED]> wrote:
> >On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> >>I have a tiny bash script that launches a Java swing app. If I run my
> >> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> >> If, however, I try to launch it from my gnome taskbar's menu then it dies
> >> with signal 11 (the Java log is available upon request). This seems to be
> >> 100% consistent, since I noticed it yesterday, even across reboots.
> >> Interestingly, the same behavior occurs if I try to run the program from
> >> within JBuilder 4.
> >>So, is this related to the larger signal 11 problems?
> >
> >There's a corruption bug in the page cache somewhere, and it's 100%
> >reproducible.  Finding it will be tough
> 
> Unlikely. If the actual program data was corrupted, it would SIGSEGV
> regardless of how it's executed.
> 
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
> 
>   Linus

Linus,

I agree that there may be some problem in the code above -- the question is
what has changed to make this behavior emerge?  I see it with a host of
programs (ssh, make, netscape) -- true, all are userspace.  Time permitting,
I may attempt to track this down in ssh and make in jobserver mode.  It
may be related to some interaction that changed underneath.

Jeff





RE: Signal 11 - the continuing saga

2000-12-13 Thread Mike Galbraith

On Wed, 13 Dec 2000, Rainer Mager wrote:

> Mike et al,
> 
>   I have no idea what IKD is and I don't know what to do with any results I
> might find BUT I'd be happy to do this if it will help. Please pass on the
> info with the instructions. Who should I report the results to?

IKD is a debugging toolkit.  The trap I have set up freezes the kernel
trace buffer at SIGSEGV time.  From there you have to read it backward
looking for problems. (which isn't particularly easy).  I was thinking
you wanted to roll your shirt sleeves up and maybe this would help ;-)  

If you want it, and do a trace, I'd be very interested in the last
couple of schedules to compare to my traces.  It's not something you
can just run and report though.

-Mike
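
The trap described above (freeze the trace buffer when the event of interest fires, then read the history backward) can be sketched as a small ring buffer. This is a toy illustration only, not IKD's implementation; every name here (`toy_trace`, `TRACE_SLOTS`, and so on) is made up for the example.

```c
#include <stddef.h>

#define TRACE_SLOTS 4   /* tiny on purpose; a real trace buffer is far larger */

struct toy_trace {
    int    events[TRACE_SLOTS];
    size_t count;    /* total events accepted so far */
    int    frozen;   /* once set, new events are dropped */
};

/* Append an event, overwriting the oldest slot when the buffer is full. */
static inline void toy_trace_record(struct toy_trace *t, int event)
{
    if (t->frozen)
        return;
    t->events[t->count % TRACE_SLOTS] = event;
    t->count++;
}

/* Stop recording so the history leading up to the trigger is preserved. */
static inline void toy_trace_freeze(struct toy_trace *t) { t->frozen = 1; }

/* Fetch the event `back` steps before the most recent one (0 = newest).
 * Returns 0 if that event was never recorded or has been overwritten. */
static inline int toy_trace_read_back(const struct toy_trace *t, size_t back, int *out)
{
    size_t held = t->count < TRACE_SLOTS ? t->count : TRACE_SLOTS;

    if (back >= held)
        return 0;
    *out = t->events[(t->count - 1 - back) % TRACE_SLOTS];
    return 1;
}
```

A signal handler or fault hook would call `toy_trace_freeze()`, after which `toy_trace_read_back()` with increasing `back` walks from the newest event toward the oldest one still held.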




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Mike et al,

I have no idea what IKD is and I don't know what to do with any results I
might find BUT I'd be happy to do this if it will help. Please pass on the
info with the instructions. Who should I report the results to?



--Rainer

> [mailto:[EMAIL PROTECTED]]On Behalf Of Mike Galbraith
> If you want, I can extract IKD.. which happens to have a trap in place
> for this (because I have a 100% reproducible swap related SIGSEGV that
> I'm trying to figure out).
>
> If you're interested, let me know and I'll extract it (quite large) and
> send it along with instructions on how to do the trap.
>
>   -Mike




RE: Signal 11 - the continuing saga

2000-12-13 Thread Rainer Mager

Give that man a cigar... it was an env var (not LOCALE but LANG). I'd
actually checked this but I didn't think that made a difference in my case.

Thanks Linus, now can you fix the larger signal 11 problem?

--Rainer


> [mailto:[EMAIL PROTECTED]]On Behalf Of Linus Torvalds
> I'd guess that the program has a bug, and depending on the arguments and
> environment (especially the latter will be different), it shows up or
> not. Things like not having a LOCALE set in either case or similar.
>
>   Linus
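
One common way an environment difference like this turns into a signal 11 is a program that assumes a variable such as LANG is always set. That is an assumption about this class of bug, not a diagnosis of the Java app in question. A minimal defensive sketch in C, with hypothetical helper names:

```c
#include <stdlib.h>
#include <string.h>

/* Return a usable locale name: the given value if it is set and
 * non-empty, otherwise the portable default "C". */
static inline const char *toy_locale_or_default(const char *value)
{
    return (value != NULL && value[0] != '\0') ? value : "C";
}

/* Length of the effective LANG setting; safe even when LANG is unset.
 * Calling strlen(getenv("LANG")) directly would pass NULL to strlen in
 * that case, which typically ends in SIGSEGV. */
static inline size_t toy_lang_length(void)
{
    return strlen(toy_locale_or_default(getenv("LANG")));
}
```

Launching the same binary from a terminal and from a desktop menu can hand it different environments, so a check like this only matters on one of the two paths, which fits the "works from xterm, dies from the taskbar" pattern.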




RE: Signal 11 - the continuing saga

2000-12-12 Thread Mike Galbraith

On Wed, 13 Dec 2000, Rainer Mager wrote:

> Thanks for the info...
> 
> > [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff V. Merkey
> > >   So, is this related to the larger signal 11 problems?
> >
> > There's a corruption bug in the page cache somewhere, and it's 100%
> > reproducible.  Finding it will be tough.
> 
> Ok, granted this will be tough but is anyone even actively working on it?
> What can I do to help?

If you want, I can extract IKD, which happens to have a trap in place
for this (because I have a 100% reproducible swap-related SIGSEGV that
I'm trying to figure out).

If you're interested, let me know and I'll extract it (quite large) and
send it along with instructions on how to set the trap.

-Mike




Re: Signal 11 - the continuing saga

2000-12-12 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>,
Jeff V. Merkey <[EMAIL PROTECTED]> wrote:
>On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
>>  I have a tiny bash script that launches a Java swing app. If I run my
>> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
>> If, however, I try to launch it from my gnome taskbar's menu then it dies
>> with signal 11 (the Java log is available upon request). This seems to be
>> 100% consistent, since I noticed it yesterday, even across reboots.
>> Interestingly, the same behavior occurs if I try to run the program from
>> within JBuilder 4.
>>  So, is this related to the larger signal 11 problems?
>
>There's a corruption bug in the page cache somewhere, and it's 100%
>reproducible.  Finding it will be tough.

Unlikely. If the actual program data was corrupted, it would SIGSEGV
regardless of how it's executed.

I'd guess that the program has a bug, and depending on the arguments and
environment (especially the latter will be different), it shows up or
not. Things like not having a LOCALE set in either case or similar.

Linus
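
A minimal C sketch of the kind of environment-dependent bug guessed at above.
It is not taken from the Java program in question. The helper names and the
choice of the LANG variable are assumptions made purely for illustration.
getenv() returns NULL when a variable is unset, and passing that NULL straight
to strlen() is undefined behaviour that typically ends in SIGSEGV, so a
program can run fine from a shell that exports the variable and die when
started from a launcher that does not.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of an environment value, 0 when it is unset. */
size_t checked_length(const char *value)
{
    if (value == NULL)
        return 0;          /* the buggy variant would call strlen(NULL) here */
    return strlen(value);
}

/* Hypothetical caller: LANG stands in for whatever the program reads. */
size_t lang_length(void)
{
    return checked_length(getenv("LANG"));
}
```

Running the failing program under a stripped environment (for example with
env -i) is one way to check whether the launcher's environment is the
difference.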



RE: Signal 11 - the continuing saga

2000-12-12 Thread Rainer Mager

Thanks for the info...

> [mailto:[EMAIL PROTECTED]]On Behalf Of Jeff V. Merkey
> > So, is this related to the larger signal 11 problems?
>
> There's a corruption bug in the page cache somewhere, and it's 100%
> reproducible.  Finding it will be tough.

Ok, granted this will be tough but is anyone even actively working on it?
What can I do to help?



> > Anyone know how to do [disable L1 and L2 caches]?
>
> Usually this is performed in the BIOS setup.  You can also disable L1
> with a sequence of instructions that write to the CR0 register on Intel
> and flip a bit, but in doing this you have to execute a WBINVD (write
> back and invalidate) instruction to flush out the cache.  BIOS setup is
> probably simpler.  Disabling L1 will make the machine slower
> than molasses, BTW, and if this bug is race related (they always
> are) it won't help much in running it down.

Aha, just as I suspected. My BIOS doesn't appear to support this. You seem
to be saying that doing so won't really contribute anything anyway so I will
hold off for now.



--Rainer




Re: Signal 11 - the continuing saga

2000-12-12 Thread Jeff V. Merkey

On Wed, Dec 13, 2000 at 09:22:55AM +0900, Rainer Mager wrote:
> Hi again,
> 
>   Ok, I just upgraded to 2.4.0-test12 (although I don't think there was any
> work in 12 that directly addresses this signal 11 problem). When compiling
> the new kernel I chose to disable AGPGart and RDM as suggested by
> [EMAIL PROTECTED] I will report later if this makes any difference.
> 
>   On another, possibly related note, I'm getting some really weird behavior
> with a Java program. The only reason I mention it here is because it dies
> with our old friend Signal 11. Anyway, please bear with the description
> below.
>   I have a tiny bash script that launches a Java swing app. If I run my
> script from an xterm (or gnome-terminal or whatever) then it starts up fine.
> If, however, I try to launch it from my gnome taskbar's menu then it dies
> with signal 11 (the Java log is available upon request). This seems to be
> 100% consistent, since I noticed it yesterday, even across reboots.
> Interestingly, the same behavior occurs if I try to run the program from
> within JBuilder 4.
>   So, is this related to the larger signal 11 problems?

There's a corruption bug in the page cache somewhere, and it's 100%
reproducible.  Finding it will be tough.

> 
> 
>   What else can I do regarding these issues to help fix it? Would a core dump
> help anyone? I'd really like to contribute somehow but I need some
> direction.
> 
> 
> --Rainer
> 
> > From: CMA [mailto:[EMAIL PROTECTED]]
> > Did you already try to selectively disable L1 and L2 caches (if
> > your box has both) and see what happens?
> 
> Anyone know how to do this?

Usually this is performed in the BIOS setup.  You can also disable L1
with a sequence of instructions that write to the CR0 register on Intel
and flip a bit, but in doing this you have to execute a WBINVD (write
back and invalidate) instruction to flush out the cache.  BIOS setup is
probably simpler.  Disabling L1 will make the machine slower
than molasses, BTW, and if this bug is race related (they always
are) it won't help much in running it down.

Jeff
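
A rough C sketch of the CR0 sequence described above, offered as an
illustration only and not as code from any kernel tree. It assumes the usual
x86 layout in which CR0 bit 30 is CD (cache disable) and bit 29 is NW (not
write-through). The names cr0_cache_disabled() and disable_cpu_cache() are
hypothetical. The privileged part runs only in ring 0, so it is guarded and
compiled only in a kernel build; the bit arithmetic is plain C and can be
checked from user space. On these CPUs the CD bit affects caching as a whole
rather than L1 alone.

```c
#define CR0_NW (1UL << 29)   /* Not Write-through */
#define CR0_CD (1UL << 30)   /* Cache Disable */

/* Compute the CR0 value that requests no-fill mode: CD set, NW cleared. */
unsigned long cr0_cache_disabled(unsigned long cr0)
{
    return (cr0 | CR0_CD) & ~CR0_NW;
}

#if defined(__KERNEL__) && (defined(__i386__) || defined(__x86_64__))
/*
 * Ring 0 only. Assumes the caller has interrupts disabled and runs this
 * on every CPU whose cache should be turned off.
 */
static inline void disable_cpu_cache(void)
{
    unsigned long cr0;

    __asm__ __volatile__("mov %%cr0, %0" : "=r"(cr0));
    cr0 = cr0_cache_disabled(cr0);
    __asm__ __volatile__("mov %0, %%cr0" : : "r"(cr0) : "memory");
    /* Write back dirty lines and invalidate the caches. */
    __asm__ __volatile__("wbinvd" : : : "memory");
}
#endif
```

The MTRRs may also influence caching, so treat this as a starting point for
experiments rather than a complete recipe.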




RE: Signal 11 - the continuing saga

2000-12-12 Thread Rainer Mager

Hi again,

Ok, I just upgraded to 2.4.0-test12 (although I don't think there was any
work in 12 that directly addresses this signal 11 problem). When compiling
the new kernel I chose to disable AGPGart and RDM as suggested by
[EMAIL PROTECTED] I will report later if this makes any difference.

On another, possibly related note, I'm getting some really weird behavior
with a Java program. The only reason I mention it here is because it dies
with our old friend Signal 11. Anyway, please bear with the description
below.
I have a tiny bash script that launches a Java swing app. If I run my
script from an xterm (or gnome-terminal or whatever) then it starts up fine.
If, however, I try to launch it from my gnome taskbar's menu then it dies
with signal 11 (the Java log is available upon request). This seems to be
100% consistent, since I noticed it yesterday, even across reboots.
Interestingly, the same behavior occurs if I try to run the program from
within JBuilder 4.
So, is this related to the larger signal 11 problems?


What else can I do regarding these issues to help fix it? Would a core dump
help anyone? I'd really like to contribute somehow but I need some
direction.


--Rainer

> From: CMA [mailto:[EMAIL PROTECTED]]
> Did you already try to selectively disable L1 and L2 caches (if
> your box has both) and see what happens?

Anyone know how to do this?



