Re: BUG: Global FPU corruption in 2.2
On Tue, Apr 24, 2001 at 06:56:32PM +0200, Christian Ehrhardt wrote:
> On Tue, Apr 24, 2001 at 09:10:07AM -0700, Linus Torvalds wrote:
> > ptrace only operates on processes that are stopped. So there are no
> > locking issues - we've synchronized on a much higher level than a
> > spinlock or semaphore.
>
> This is only true for requests other than PTRACE_ATTACH and
> PTRACE_ATTACH is exactly what I'm worried about.

May I remind everybody that at the beginning of this thread I posted
another example, from an SMP Alpha, of FPU problems. It certainly was
not exactly like the one under discussion, but it looked as if it had a
similar "smell" to it. It looks like reproducing this Alpha example
needs processors with a rather fast clock, and this hardware version
is not yet very widely available.

  Michal

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: BUG: Global FPU corruption in 2.2
> Alan Cox <[EMAIL PROTECTED]> writes:
> > The preferable one for performance is certainly to backport the 2.4 changes
>
> Is it any more substantial than changing all uses of the ptrace flags
> to the new variable?

It affects asm blocks and offsets on some ports. It's not too bad, though.
Re: BUG: Global FPU corruption in 2.2
Alan Cox <[EMAIL PROTECTED]> writes:
> The preferable one for performance is certainly to backport the 2.4 changes

Is it any more substantial than changing all uses of the ptrace flags
to the new variable?
Re: BUG: Global FPU corruption in 2.2
> > 	child->flags |= PF_PTRACED;
> >
> > without waiting for the child to have stopped.
>
> I can see how this could cause PF_USEDFPU to be cleared inadvertently,
> but I do not have any ideas for testing this. Is it clear that this
> is the source of the problem?

There is no guarantee that |= is implemented atomically - in fact it's quite
likely to compile to

	get child->flags
	or PF_PTRACED
	write child->flags

and a PF_USEDFPU set on another processor at the same instant -would- end up
being lost.

There are two fixes:

1.	Make all the ops atomic (foo_bit())
2.	Split the flags

The preferable one for performance is certainly to backport the 2.4 changes.
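A minimal sketch of the lost update described above, replayed step by step in a single thread instead of on two CPUs so that the outcome is deterministic. The flag values and the function names are assumptions made for illustration; they are not taken from the 2.2 sources. The second function shows fix 1 using C11 atomics as a stand-in for the kernel's own atomic bit operations.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative bit values only; assumed, not copied from the 2.2 headers. */
#define PF_PTRACED 0x00000010u
#define PF_USEDFPU 0x00100000u

/*
 * Replay of the bad interleaving with a plain |=:
 * CPU A (ptrace attach) loads flags, CPU B (the traced task) sets
 * PF_USEDFPU, then CPU A stores its stale copy back.
 */
unsigned int racy_interleaving(unsigned int flags)
{
    unsigned int cpu_a_copy = flags;   /* A: get child->flags    */
    flags |= PF_USEDFPU;               /* B: complete update     */
    cpu_a_copy |= PF_PTRACED;          /* A: or PF_PTRACED       */
    flags = cpu_a_copy;                /* A: write child->flags  */
    return flags;                      /* B's bit has been lost  */
}

/*
 * With an atomic read-modify-write there is no window between the
 * load and the store, so the two updates can only happen one after
 * the other, in either order.
 */
unsigned int atomic_interleaving(unsigned int initial)
{
    atomic_uint flags;

    atomic_init(&flags, initial);
    atomic_fetch_or(&flags, PF_USEDFPU);   /* B */
    atomic_fetch_or(&flags, PF_PTRACED);   /* A */
    return atomic_load(&flags);
}

/* Encodes the expected outcome of both replays. */
void check_interleavings(void)
{
    assert(racy_interleaving(0) == PF_PTRACED);
    assert(atomic_interleaving(0) == (PF_PTRACED | PF_USEDFPU));
}
```

Calling check_interleavings() from a small main() is enough to exercise both cases. The real race needs two processors and the right timing, which would fit the reports that only SMP machines are affected.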
Re: BUG: Global FPU corruption in 2.2
Linus Torvalds writes:
> Ahh.. This actually _does_ look like a race on "current->flags":
> PTRACE_ATTACH will do a
>
> 	child->flags |= PF_PTRACED;
>
> without waiting for the child to have stopped.

I can see how this could cause PF_USEDFPU to be cleared inadvertently,
but I do not have any ideas for testing this. Is it clear that this
is the source of the problem?

What would be involved in backporting the split ptrace flags to 2.2?
Are there other solutions?

Vic
Re: BUG: Global FPU corruption in 2.2
"Christian Ehrhardt" <[EMAIL PROTECTED]> writes:
> Victor: Could you try to reproduce the system wide corruption if you
> add an explicit call to stts(); at the very end of __switch_to?
> This should prevent the FPU corruption from spreading.

After adding this call, I cannot reproduce the global corruption.
There is still occasional local corruption of individual pi processes
while pt is running.

Vic
Re: BUG: Global FPU corruption in 2.2
On Tue, Apr 24, 2001 at 09:10:07AM -0700, Linus Torvalds wrote:
> ptrace only operates on processes that are stopped. So there are no
> locking issues - we've synchronized on a much higher level than a
> spinlock or semaphore.

This is only true for requests other than PTRACE_ATTACH and
PTRACE_ATTACH is exactly what I'm worried about.

    regards   Christian

--
THAT'S ALL FOLKS!
Re: BUG: Global FPU corruption in 2.2
On Tue, Apr 24, 2001 at 08:05:15AM -0500, Victor Zandy wrote:
>
> He found that PF_USEDFPU was always set before the machine was broken.
> After he found that it was set about 70% of the time.

If I'm not mistaken this actually can cause GLOBAL FPU corruption. Here's why:

Assume for a moment that we lose the PF_USEDFPU flag of one process.
This not only means that the current process won't have its state saved,
it also means that the next process won't have the TS bit set. This in
turn means that this new process won't get PF_USEDFPU set and suddenly
we have a second process with a corrupted FPU state.

Victor: Could you try to reproduce the system wide corruption if you
add an explicit call to stts(); at the very end of __switch_to?
This should prevent the FPU corruption from spreading.

NOTE: This is just to prove my theory; it is not, and isn't meant to be,
a fix for the actual problem.

    regards   Christian Ehrhardt

--
THAT'S ALL FOLKS!
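The chain of events above can be followed in a toy model. This is only a sketch under stated assumptions: one CPU, one double standing in for the whole FPU register file, and struct and function names invented for the illustration. It is not the 2.2 __switch_to code; it only mirrors the logic described in this thread (save and set TS only when PF_USEDFPU is set, restore and set PF_USEDFPU on the trap).

```c
#include <assert.h>
#include <stdbool.h>

#define PF_USEDFPU 0x1u          /* illustrative value, not the 2.2 one */

struct task {
    unsigned int flags;
    double saved_fpu;            /* stands in for tss.i387 */
};

struct cpu {
    bool ts;                     /* models CR0.TS: next FP op traps */
    double fpu;                  /* stands in for the live registers */
};

/* Switch away from prev; always_stts models the diagnostic stts(). */
static void switch_from(struct cpu *c, struct task *prev, bool always_stts)
{
    if (prev->flags & PF_USEDFPU) {
        prev->saved_fpu = c->fpu;        /* fnsave */
        prev->flags &= ~PF_USEDFPU;
        c->ts = true;                    /* stts() */
    }
    if (always_stts)
        c->ts = true;
}

/* Task t executes one FP instruction: add 1.0 to its accumulator. */
static void fp_add_one(struct cpu *c, struct task *t)
{
    if (c->ts) {                         /* device-not-available trap */
        c->ts = false;                   /* clts */
        c->fpu = t->saved_fpu;           /* restore (first-use init folded in) */
        t->flags |= PF_USEDFPU;
    }
    c->fpu += 1.0;
}

/* Returns what task B computes for 0.0 + 1.0 after A's flag was lost. */
double b_result_after_lost_flag(bool always_stts)
{
    struct cpu c = { .ts = true, .fpu = 0.0 };
    struct task a = { 0, 0.0 }, b = { 0, 0.0 };

    fp_add_one(&c, &a);                  /* A owns the FPU and holds 1.0 */
    a.flags &= ~PF_USEDFPU;              /* the race clears A's flag */
    switch_from(&c, &a, always_stts);    /* nothing is saved for A */
    fp_add_one(&c, &b);                  /* B's first FP instruction */
    return c.fpu;
}

void check_fpu_model(void)
{
    assert(b_result_after_lost_flag(false) == 2.0);  /* B inherits A's 1.0 */
    assert(b_result_after_lost_flag(true) == 1.0);   /* B starts clean */
}
```

In the model, forcing TS on every switch stops B from computing on A's leftover registers, but A's own state is still never saved. That matches the distinction drawn in this thread between corruption that spreads and corruption that stays local to the traced process.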
Re: BUG: Global FPU corruption in 2.2
[ Alan, I'm lazy and only have 2.2.14 sources on-line. Maybe this has
  been fixed already and there's something else going on. Worth a look ]

In article <[EMAIL PROTECTED]>,
Victor Zandy <[EMAIL PROTECTED]> wrote:
>
>Someone else here traced the process flags of a FP-intensive program
>on a machine before and after it is put in the faulty FPU state. He
>periodically sampled /proc/pid/stat while the program was running.
>
>He found that PF_USEDFPU was always set before the machine was broken.
>After he found that it was set about 70% of the time.

[ Looks closer at the ptrace synchronization ]

Ahh.. This actually _does_ look like a race on "current->flags":
PTRACE_ATTACH will do a

	child->flags |= PF_PTRACED;

without waiting for the child to have stopped.

(Aside: thinking more about the stopping logic - I'm not actually sure
the ptrace synchronization is complete wrt scheduling, as there will be
a window when the process has set the task state to TASK_STOPPED but
hasn't actually yet scheduled away. Oh, well).

All other ptrace operations (not counting killing the child) will check
that the child is quiescent. But PTRACE_ATTACH will not, as we're just
setting up the stopping.

In 2.4.x, this bug doesn't happen because "flags" was split up into
"current->ptrace" and "current->flags". Exactly because of locking
concerns.

		Linus
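A small sketch of why the split helps, assuming only what is said above: the tracer writes one word and the running task writes another. The struct layout, the PT_PTRACED name and the bit values are placeholders for illustration and are not taken from the 2.4 headers. The same unlucky interleaving is replayed in one thread.

```c
#include <assert.h>

/* Placeholder bit values; assumed for the sketch only. */
#define PF_USEDFPU 0x00100000ul
#define PT_PTRACED 0x00000001ul

struct task_split {
    unsigned long flags;    /* written by the task itself (PF_USEDFPU) */
    unsigned long ptrace;   /* written by the tracer */
};

/* Tracer's non-atomic |= interleaved with the task's FPU flag update. */
void replay_split(struct task_split *t)
{
    unsigned long tracer_copy = t->ptrace;  /* tracer: read              */
    t->flags |= PF_USEDFPU;                 /* task: update on other CPU */
    tracer_copy |= PT_PTRACED;              /* tracer: or                */
    t->ptrace = tracer_copy;                /* tracer: write             */
}

void check_split(void)
{
    struct task_split t = { 0, 0 };

    replay_split(&t);
    assert(t.flags == PF_USEDFPU);   /* not overwritten by the tracer */
    assert(t.ptrace == PT_PTRACED);
}
```

Because the stale copy is written back to a different word, the task's own update survives without any atomic operation on either side.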
Re: BUG: Global FPU corruption in 2.2
> >1.) If I'm not mistaken switch_to changes current->flags without
> >atomic operations and without any locks and sys_ptrace changes
> >child->flags only protected by the big kernel lock.
>
> ptrace only operates on processes that are stopped. So there are no
> locking issues - we've synchronized on a much higher level than a
> spinlock or semaphore.

In the 2.2 case the ptrace flags themselves are in the same flag set as the
PF_ flags. In 2.4 that was fixed. That means there are some bizarre cases
where current->flags might not be handled perfectly.
Re: BUG: Global FPU corruption in 2.2
In article <[EMAIL PROTECTED]>,
Christian Ehrhardt <[EMAIL PROTECTED]> wrote:
>
>1.) If I'm not mistaken switch_to changes current->flags without
>atomic operations and without any locks and sys_ptrace changes
>child->flags only protected by the big kernel lock.

ptrace only operates on processes that are stopped. So there are no
locking issues - we've synchronized on a much higher level than a
spinlock or semaphore.

That said, it does look like 2.2.x has a real bug, and maybe the ptrace
task stopping synchronization is broken..

		Linus
Re: BUG: Global FPU corruption in 2.2
Someone else here traced the process flags of a FP-intensive program
on a machine before and after it is put in the faulty FPU state. He
periodically sampled /proc/pid/stat while the program was running.

He found that PF_USEDFPU was always set before the machine was broken.
After he found that it was set about 70% of the time.

Vic
Re: BUG: Global FPU corruption in 2.2
Hi,

I want to look into this problem. It seems to be very interesting. But I
was not following the thread from the beginning (and I mistakenly deleted
all these mails :( .. ).. I hope you won't mind answering the following
questions...

1) Are you doing this on an MP or a uniprocessor?

2) I want to know how you are calling sys_ptrace(attach) and
   sys_ptrace(detach).. i.e. is it something like the following

	for (;;) {
		sys_ptrace(attach to process);
		sys_wait4();
		sys_ptrace(detach from process);
	}

   In short, the sequence of system calls you are using for attaching to
   and detaching from the process.

3) Have you tried doing attach and detach only once? If not.. can you
   please try this and let me know whether doing attach and detach one
   time also results in global FPU corruption. Please do not fork in the
   above process.

- Whenever process A calls sys_ptrace(attach) on process B, sys_ptrace
  sends SIGSTOP to process B. Now process B, in do_signal, checks that it
  is being traced and then it does the following

	current->state = TASK_STOPPED;
	notify_parent(current, SIGCHLD);
	schedule();

  so now in schedule() --> __switch_to --> unlazy_fpu() we do the
  following

	if (current->flags & PF_USEDFPU)
		save_fpu();

  In save_fpu() we do the following

	fnsave current->tss.i387
	fwait;

I want to ask a question... is it possible that 'somehow' we were not able
to save the complete floating point state with fnsave, i.e. that
current->tss.i387 is 'invalid' after

	fnsave current->tss.i387
	fwait;

Thanks
Amol

David Konerding <[EMAIL PROTECTED]> on 04/23/2001 01:09:27 AM

To:       Ulrich Drepper <[EMAIL PROTECTED]>
cc:       [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Amol Lad/HSS)
Subject:  Re: BUG: Global FPU corruption in 2.2

Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
>
> > The kernel doesn't know if a process is going to use the FPU when
> > a new process is created. Only the user's code, i.e., the 'C' runtime
> > library knows.
>
> Maybe you should try to understand the kernel code and the features of
> the processor first. The kernel can detect when the FPU is used for
> the first time.

OK, regardless of how the linux kernel actually manages the FPU for
user-space programs, does anybody have any comments on the original
bugreport?

>We have found that one of our programs can cause system-wide
>corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
>run this program, the FPU gives bad results to all subsequent
>processes.

>We see this problem on dual 550MHz Xeons with 1GB RAM. We have 64 of
>these things, and we see the problem on every node we try (dozens).
>We don't have other SMPs handy. Uniprocessors, including other PIIIs,
>don't seem to be affected.

>Below are two programs we use to produce the behavior. The first
>program, pi, repeatedly spawns 10 parallel computations of pi. When
>all is well, each process prints pi as it completes.

>The second program, pt, repeatedly attaches to and detaches from
>another process. Run pt against the root pi process until the output
>of pi begins to look wrong. Then kill everything and run pi by itself
>again. It will no longer produce good results. We find that the FPU
>persistently gives bad results until we reboot.

I tried this on my dual PIII-600 running 2.2.19 and got exactly the
behavior described. If it is a bug in the linux kernel (I can see nothing
wrong with the source code provided), I would suspect problems with SMP
and ptrace, somehow causing the wrong FP registers to be returned to a
process after the scheduler restarted it. It's very interesting that the
PI program works fine until you run PT, but after you run PT, PI is
screwed until reboot.
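The attach / wait / detach sequence asked about in question 2 could look roughly like this from user space. This is a hedged sketch, not the original pt program (whose source is not shown here); the function name and the cycle counting are assumptions. It uses the standard ptrace(2) requests PTRACE_ATTACH and PTRACE_DETACH together with waitpid(2).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Attach to and detach from `pid` up to `n` times.
 * Returns the number of completed attach/wait/detach cycles,
 * which is 0 if the very first attach fails.
 */
int attach_detach_cycles(pid_t pid, int n)
{
    int done = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int status;

        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
            break;
        /* Wait for the stop caused by the SIGSTOP that attach sends. */
        if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status))
            break;
        if (ptrace(PTRACE_DETACH, pid, NULL, NULL) == -1)
            break;
        done++;
    }
    return done;
}
```

A caller would pass the pid of the process under test, for example attach_detach_cycles(pid, 1) for the single attach and detach of question 3, or a larger count for the repeated case.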
Re: BUG: Global FPU corruption in 2.2
Erik Paulson <[EMAIL PROTECTED]> on 04/24/2001 01:14:27 AM

To:       Christian Ehrhardt <[EMAIL PROTECTED]>
cc:       [EMAIL PROTECTED], [EMAIL PROTECTED] (bcc: Amol Lad/HSS)
Subject:  Re: BUG: Global FPU corruption in 2.2

On 23 Apr 2001 18:11:48 +0200, Christian Ehrhardt wrote:
> On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> >
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
> <...>
>
> 3.) It might be interesting to know if the problem can be triggered:
>     a) If pi doesn't fork, i.e. just one process calculating pi and
>        another one doing the attach/detach.

Yes, we are still able to reproduce it without calling fork (the new
program just calls do_pi() a bunch of times, and then we attach and
detach to that process)

>     b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.
>

You seem to need to attach and detach to a program using the fpu -
running pt on a process that is just busy-looping over some integer adds
does not seem to trigger it, and running pi on the machine at the same
time, but not attaching to it, does not seem to affect the floating
point state.

>>>> well... during context switching.. the call to unlazy_fpu() does the
following

	if (current->flags & PF_USEDFPU)
		save_fpu();

Somebody earlier pointed out the possible race in sys_ptrace, where at the
time of attach we modify child->flags. It really looks strange again that
it is software that is causing the problem, as the code to handle the FPU
looks pretty clean. Still, can we check current->flags when the problem
occurs?

Amol

-Erik
Re: BUG: Global FPU corruption in 2.2
On 23 Apr 2001 18:11:48 +0200, Christian Ehrhardt wrote:
> On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
> >
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
> <...>
>
> 3.) It might be interesting to know if the problem can be triggered:
>     a) If pi doesn't fork, i.e. just one process calculating pi and
>        another one doing the attach/detach.

Yes, we are still able to reproduce it without calling fork (the new
program just calls do_pi() a bunch of times, and then we attach and
detach to that process)

>     b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.
>

You seem to need to attach and detach to a program using the fpu -
running pt on a process that is just busy-looping over some integer adds
does not seem to trigger it, and running pi on the machine at the same
time, but not attaching to it, does not seem to affect the floating
point state.

-Erik
Re: BUG: Global FPU corruption in 2.2
On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:
>
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
> run this program, the FPU gives bad results to all subsequent
> processes.

A few comments, not sure if they will help very much:

1.) If I'm not mistaken switch_to changes current->flags without
    atomic operations and without any locks and sys_ptrace changes
    child->flags only protected by the big kernel lock.
    I could imagine that this causes local corruption on an SMP machine
    and this is something that changed in 2.4 kernels, but I don't see
    how this can corrupt FPU state globally. Maybe there is something
    else.

2.) I guess a single finit (as proposed by someone else in this thread)
    won't assure that both FPUs are in a sane state.

3.) It might be interesting to know if the problem can be triggered:
    a) If pi doesn't fork, i.e. just one process calculating pi and
       another one doing the attach/detach.
    b) If pi doesn't do FPU Operations, i.e. only the children call do_pi.

    regards   Christian

--
THAT'S ALL FOLKS!
Re: BUG: Global FPU corruption in 2.2
Hello,

Linux 2.2.19 SMP, confirm report. Even games are going weird after
running this test, (my wife is complaining :-)) Have to reboot.

Kees
Re: BUG: Global FPU corruption in 2.2
> OK, regardless of how the linux kernel actually manages the FPU for user-space
> programs, does anybody have any comments on the original bugreport?

Complete mystification.

> >of pi begins to look wrong. Then kill everything and run pi by itself
> >again. It will no longer produce good results. We find that the FPU
> >persistently gives bad results until we reboot.
>
> I tried this on my dual PIII-600 running 2.2.19 and got exactly the behavior
> described.

This is the most odd bit of all. The processor state for the FPU is per task
private and each task initializes its own FPU state. In terms of FPU state
itself I don't currently see what there is that can be left behind.
Re: BUG: Global FPU corruption in 2.2
Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
>
> > The kernel doesn't know if a process is going to use the FPU when
> > a new process is created. Only the user's code, i.e., the 'C' runtime
> > library knows.
>
> Maybe you should try to understand the kernel code and the features of
> the processor first. The kernel can detect when the FPU is used for
> the first time.

OK, regardless of how the linux kernel actually manages the FPU for
user-space programs, does anybody have any comments on the original
bugreport?

>We have found that one of our programs can cause system-wide
>corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
>run this program, the FPU gives bad results to all subsequent
>processes.

>We see this problem on dual 550MHz Xeons with 1GB RAM. We have 64 of
>these things, and we see the problem on every node we try (dozens).
>We don't have other SMPs handy. Uniprocessors, including other PIIIs,
>don't seem to be affected.

>Below are two programs we use to produce the behavior. The first
>program, pi, repeatedly spawns 10 parallel computations of pi. When
>all is well, each process prints pi as it completes.

>The second program, pt, repeatedly attaches to and detaches from
>another process. Run pt against the root pi process until the output
>of pi begins to look wrong. Then kill everything and run pi by itself
>again. It will no longer produce good results. We find that the FPU
>persistently gives bad results until we reboot.

I tried this on my dual PIII-600 running 2.2.19 and got exactly the
behavior described. If it is a bug in the linux kernel (I can see nothing
wrong with the source code provided), I would suspect problems with SMP
and ptrace, somehow causing the wrong FP registers to be returned to a
process after the scheduler restarted it. It's very interesting that the
PI program works fine until you run PT, but after you run PT, PI is
screwed until reboot.
Re: BUG: Global FPU corruption in 2.2
"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows.

Maybe you should try to understand the kernel code and the features of
the processor first. The kernel can detect when the FPU is used for
the first time.

--
Ulrich Drepper, Red Hat
1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA
drepper at redhat.com
Re: BUG: Global FPU corruption in 2.2
It looks to me like the kernel sets a trap for FP operations when a
process is switched in. Then when the process executes an FP op, the
kernel clears the trap and either loads the FP context or initializes
it, depending on whether it is the process' first FP operation.

So no help is needed from anything in user space.

Vic

"Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> On 20 Apr 2001, Ulrich Drepper wrote:
>
> > "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
> >
> > > If it "fixes" it, there is no problem with the FPU, but with the
> > > 'C' runtime library which doesn't initialize the FPU to a known
> > > state before it uses it.
> >
> > It's the kernel which initializes the FPU. This was always the case
> > and necessary to implement the fast lazy FPU saving/restoring.
> > Processes which never use the FPU never initialize it.
>
> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows. If the user is using 'asm' or whatever, the user must
Re: BUG: Global FPU corruption in 2.2
On 20 Apr 2001, Victor Zandy wrote:

>
> No dice. Your program does not fix the problem.
>
> If it were a hardware problem, I would expect the problem to occur
> under 2.4.2 as well as 2.2.*, and I would be surprised that we can
> consistently produce the behavior across our 64 node cluster. But we
> are keeping the possibility in mind.
>
> Thanks for your suggestions.
>
> Vic
>

Then, if the FPU is fine, you have just proven that the storage
where the FPU context is saved gets overwritten. Further, once
the initial write occurs, all subsequent fnsave/frstor operations
also encounter the same spurious write.

--OR some continuously-running floating-point code has sneaked into
the kernel.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.
Re: BUG: Global FPU corruption in 2.2
On 20 Apr 2001, Ulrich Drepper wrote:

> "Richard B. Johnson" <[EMAIL PROTECTED]> writes:
>
> > If it "fixes" it, there is no problem with the FPU, but with the
> > 'C' runtime library which doesn't initialize the FPU to a known
> > state before it uses it.
>
> It's the kernel which initializes the FPU. This was always the case
> and necessary to implement the fast lazy FPU saving/restoring.
> Processes which never use the FPU never initialize it.

The kernel doesn't know if a process is going to use the FPU when
a new process is created. Only the user's code, i.e., the 'C' runtime
library knows. If the user is using 'asm' or whatever, the user must
initialize the FPU before using it, otherwise, the user doesn't know
anything about its state and the results ... (let's see, what was at
TOS, errm, is this a NAN?). The results are indeterminate.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.
Re: BUG: Global FPU corruption in 2.2
"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> If it "fixes" it, there is no problem with the FPU, but with the
> 'C' runtime library which doesn't initialize the FPU to a known
> state before it uses it.

It's the kernel which initializes the FPU. This was always the case
and necessary to implement the fast lazy FPU saving/restoring.
Processes which never use the FPU never initialize it.

--
Ulrich Drepper, Red Hat
1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA
drepper at redhat.com
Re: BUG: Global FPU corruption in 2.2
No dice. Your program does not fix the problem.

If it were a hardware problem, I would expect the problem to occur under 2.4.2 as well as 2.2.*, and I would be surprised that we can consistently produce the behavior across our 64-node cluster. But we are keeping the possibility in mind.

Thanks for your suggestions.

Vic
Re: BUG: Global FPU corruption in 2.2
On 20 Apr 2001, Victor Zandy wrote:

>
> Victor Zandy <[EMAIL PROTECTED]> writes:
> > We have found that one of our programs can cause system-wide
> > corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
> > run this program, the FPU gives bad results to all subsequent
> > processes.
>
> We have now tested 2.4.2 and 2.2.19.
>
> 2.2.19 has the same problem.
>
> 2.4.3 does not seem to be affected. Unfortunately, we really need a
> working 2.2 kernel at this time.
>
> We also patched the 2.2.19 kernel with the PIII patch found in
> /pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre13/PIII-10.bz2
> on ftp.kernel.org. Same problem.
>
> Does anyone have any ideas for us?
>
> Thanks.
>
> Vic

Just for kicks, do whatever is necessary to "break" the FPU. Then run this program:

    int main()
    {
        __asm__("finit\n");    /* reset the x87 FPU to its default state */
        return 0;
    }

If it "fixes" it, there is no problem with the FPU, but with the 'C' runtime library which doesn't initialize the FPU to a known state before it uses it.

It is possible for the kernel to work around the 'C' library problem by clearing the FPU after every fork(). The last time I checked (years ago), 'finit' was executed during the fork. Maybe it isn't anymore because it takes many machine cycles to complete.

If this doesn't "fix" it, then your hardware may have a problem such as overheating (loose heatsink?).

Cheers,
Dick Johnson
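To tell whether the FPU is "broken" before and after a test like the one above, a small probe that needs no math library can help. The following is only a sketch under stated assumptions: `fpu_sanity_failures` is a hypothetical helper, not something posted in this thread, and it assumes IEEE 754 double arithmetic with the default round-to-nearest mode. It performs a few operations whose results are exactly representable, plus one halfway case that comes out as 1.0 only under round-to-nearest, and counts mismatches.

```c
#include <float.h>

/*
 * Hypothetical helper (not from the thread): returns how many basic
 * floating-point checks failed. Assumes IEEE 754 doubles and the
 * default round-to-nearest mode.
 */
int fpu_sanity_failures(void)
{
    /* volatile keeps the compiler from folding the arithmetic at build time */
    volatile double a = 1.5, b = 2.25;
    volatile double seven = 7.0, two = 2.0;
    volatile double one = 1.0, half_ulp = DBL_EPSILON / 2.0;
    volatile double sum, prod, quot, tie;
    int failures = 0;

    sum = a + b;          /* exactly 3.75 */
    prod = a * b;         /* exactly 3.375 */
    quot = seven / two;   /* exactly 3.5 */
    tie = one + half_ulp; /* halfway case: rounds to 1.0 under round-to-nearest */

    if (sum != 3.75)      /* also catches NaN, which compares unequal */
        failures++;
    if (prod != 3.375)
        failures++;
    if (quot != 3.5)
        failures++;
    if (tie != 1.0)       /* fails if the rounding mode was left altered */
        failures++;

    return failures;
}
```

Calling this from a trivial `main` in a fresh process, once before and once after running the suspect program, gives a quick indication of whether new processes are inheriting bad FPU state. A count of zero does not prove the FPU state is intact; it only means these particular checks passed.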
Re: BUG: Global FPU corruption in 2.2
Victor Zandy <[EMAIL PROTECTED]> writes:

> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17. That is, after we
> run this program, the FPU gives bad results to all subsequent
> processes.

We have now tested 2.4.2 and 2.2.19.

2.2.19 has the same problem.

2.4.3 does not seem to be affected. Unfortunately, we really need a working 2.2 kernel at this time.

We also patched the 2.2.19 kernel with the PIII patch found in /pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre13/PIII-10.bz2 on ftp.kernel.org. Same problem.

Does anyone have any ideas for us?

Thanks.

Vic
Re: BUG: Global FPU corruption in 2.2
On Thu, Apr 19, 2001 at 11:05:03AM -0500, Victor Zandy wrote:

>
> We have found that one of our programs can cause system-wide
> corruption of the x86 FPU under 2.2.16 and 2.2.17.
>
> We see this problem on dual 550MHz Xeons with 1GB RAM.

Hm, I started to wonder whether this is somehow related to a recent report I got. "The victim" was running 2.2.19 (basically) on an SMP Alpha UP2000+ with two 800 MHz processors. He managed to reduce the problem to a rather small test case, and I attach sources, a Makefile and a "loop.sh" driver as a shar archive if you want to have a closer look.

This "loop.sh" simply fires off triplets of "harry" processes in a loop. The guy hit by this gets apparently random floating point exceptions starting with roughly the sixth process, and after that the intervals between bombs vary. I also have 'strace' outputs from failing processes but they are not telling very much. 'gdb' is also not very illuminating:

    Program received signal SIGFPE, Arithmetic exception.
    0x1200010a8 in vadd_ (a=0x11fff21e4, ia=0x120003294, b=0x11fff7004,
        ib=0x120003294, c=0x11fffbe20, ic=0x120003294, n=0x11c70) at vadd.f:99
    99            C(CI) = A(AI) + B(BI)
    Current language:  auto; currently fortran
    (gdb) p *ia
    $10 = 1
    (gdb) p *ib
    $11 = 1
    (gdb) p *ic
    $12 = 1
    (gdb) p *n
    Cannot access memory at address 0x4
    (gdb) p *(0x11c70)
    $13 = 1024
    (gdb) info locals
    n = (PTR TO -> ( integer )) 0x4
    __g77_expr_0 = 10

He tells me that he is getting this on two different machines he has around. The trouble is that I tried to reproduce it with different hardware, kernels, compilers and libraries and I failed, even on SMP; but I only had access to a box with 667 MHz processors.

OTOH he is right now running 2.4.3-ac9 plus Andrea Arcangeli's patches for rw semaphores on Alpha and he reports that the problem went away (and, hopefully, nothing else will crop up :-).

Can anybody offer an insight into what that may really be? It may be, of course, totally unrelated to this report from Victor Zandy.
Michal
[EMAIL PROTECTED]

Attachment: fpbomb.shar